Making Sense of Big Data

A thorough comparison between DL algorithms and LightGBM on tabular data, for both classification and regression problems

Here we go with yet another post in the series; this is the fourth. The previous three posts, and the original version of this post, are hosted on my own blog, just in case.

I started planning this post a few months ago, as soon as I released what was then the latest beta version (0.4.8) of the library pytorch-widedeep. However, since then a few other things took priority, which meant that running the large number of experiments involved (probably over 1500) took me considerably longer than I expected. Nevertheless, here we are.

All the…


This is the third of a series of posts introducing pytorch-widedeep, a flexible package to combine tabular data with text and images (that could also be used for “standard” tabular data alone). The previous two posts, and the original version of this post, are hosted on my own blog, just in case.

While writing this post I will assume that the reader is not familiar with the previous two posts. Of course, reading them would help, but it is not a requirement for understanding the content of this post, or for using pytorch-widedeep on tabular data.
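To give a flavour of what using the package on tabular data looks like, here is a minimal sketch based on a recent version of the library's API (class and argument names such as TabPreprocessor, TabMlp and Trainer have changed between releases, and the dataset below is synthetic, purely for illustration):

```python
import numpy as np
import pandas as pd
from pytorch_widedeep import Trainer
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor

# a tiny synthetic tabular dataset, just to show the moving parts
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "city": rng.choice(["london", "madrid", "paris"], 1000),
        "gender": rng.choice(["male", "female"], 1000),
        "age": rng.integers(18, 80, 1000),
        "income": rng.normal(50000, 10000, 1000),
        "target": rng.integers(0, 2, 1000),
    }
)

# prepare the categorical (to-be-embedded) and continuous columns
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=["city", "gender"], continuous_cols=["age", "income"]
)
X_tab = tab_preprocessor.fit_transform(df)

# an MLP over the categorical embeddings and the continuous features
tab_mlp = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=["age", "income"],
)
model = WideDeep(deeptabular=tab_mlp)

trainer = Trainer(model, objective="binary")
trainer.fit(X_tab=X_tab, target=df["target"].values, n_epochs=2, batch_size=256)
```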

To…


This post and the code here are part of a larger repo called RecoTour, where I normally explore and implement recommendation algorithms that I consider interesting and/or useful (see RecoTour and RecoTourII). In every directory I have included a README file and a series of explanatory notebooks that I hope help explain the code. I keep adding algorithms from time to time, so stay tuned if you are interested.

As always, let me first acknowledge the relevant people who did the hard work. This post and the companion repo are based on the papers “Variational Autoencoders for Collaborative Filtering”…


This post and the code here are part of a larger repo that I have (very creatively) called “NLP-stuff”. As the name indicates, that repo includes NLP-related projects that I do and/or ideas that I have, as long as there is code associated with them. In every directory I have included a README file and a series of explanatory notebooks that I hope help explain the code. I intend to keep adding projects throughout 2020, not necessarily the latest and/or most popular releases, but simply papers or algorithms I find interesting…


The Focal loss (hereafter FL) was introduced by Tsung-Yi Lin et al. in their 2018 paper “Focal Loss for Dense Object Detection” [1]. It is designed to address scenarios with extremely imbalanced classes, such as one-stage object detection, where the imbalance between foreground and background classes can be, for example, 1:1000.

In this post I will show how to code the FL for LightGBM [2] (hereafter LGB) and illustrate how to use it. The companion GitHub repo can be found here. Unfortunately, I could not find truly challenging imbalanced datasets that are public. The datasets that come with the imbalanced-learn [3] package in Python…
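To give a taste of what follows: the binary FL is FL(pt) = -αt (1 - pt)^γ log(pt), where pt is the probability the model assigns to the true class, and α and γ are the paper's weighting and focusing parameters. Below is a sketch of it as a LightGBM custom objective, with the gradient and hessian with respect to the raw scores obtained numerically. Note that scipy.misc.derivative has been removed in recent SciPy releases, and LightGBM >= 4 expects the callable via params["objective"] rather than fobj, so treat this as illustrative:

```python
import numpy as np
from scipy.misc import derivative  # removed in SciPy >= 1.12; illustrative only
import lightgbm as lgb

def focal_loss_lgb(y_pred, dtrain, alpha=0.25, gamma=2.0):
    """Binary focal loss as a LightGBM custom objective.
    y_pred are raw (pre-sigmoid) scores; returns gradient and hessian."""
    y_true = dtrain.get_label()

    def fl(x, t):
        p = 1.0 / (1.0 + np.exp(-x))
        # pt is the probability assigned to the true class
        pt = t * p + (1 - t) * (1 - p)
        at = t * alpha + (1 - t) * (1 - alpha)
        return -at * (1 - pt) ** gamma * np.log(pt)

    partial_fl = lambda x: fl(x, y_true)
    grad = derivative(partial_fl, y_pred, n=1, dx=1e-6)
    hess = derivative(partial_fl, y_pred, n=2, dx=1e-6)
    return grad, hess

# usage with the (older) fobj argument:
# bst = lgb.train(params, dtrain, fobj=focal_loss_lgb)
```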


This is the second of a series of posts on recommendation algorithms in Python. In the first of the series, which I wrote quite a while ago, I quickly went through a number of algorithms that I implemented and tried using Kaggle’s Ponpare dataset. You can find all the related code in the repo. In this post I will use the Amazon Reviews dataset [1][2], in particular the 5-core Movies and TV reviews, to illustrate the use of two DL-based recommendation algorithms: Neural Collaborative Filtering [3] and the more recent Neural Graph Collaborative Filtering [4].
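To give a flavour of the simplest of these models, here is a minimal PyTorch sketch of GMF (Generalized Matrix Factorization), the basic building block of the NCF framework: an element-wise product of user and item embeddings followed by a linear layer and a sigmoid. Class and argument names here are mine, for illustration only:

```python
import torch
import torch.nn as nn

class GMF(nn.Module):
    """Generalized Matrix Factorization: element-wise product of user
    and item embeddings, followed by a linear layer and a sigmoid."""
    def __init__(self, n_users, n_items, n_factors=8):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, n_factors)
        self.item_emb = nn.Embedding(n_items, n_factors)
        self.out = nn.Linear(n_factors, 1)

    def forward(self, users, items):
        prod = self.user_emb(users) * self.item_emb(items)
        return torch.sigmoid(self.out(prod)).squeeze(-1)

# scoring a few hypothetical user/item pairs
model = GMF(n_users=1000, n_items=5000)
scores = model(torch.tensor([0, 1, 2]), torch.tensor([10, 20, 30]))
```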

Before I move forward, let…


In our previous post we showed how one could use Apache Kafka’s Python API (kafka-python) to productionise an algorithm in real time. In this post we will focus more on the ML aspects; more specifically, on how to log information during the (re)training process and monitor the results of the experiments. To that end we will use MLflow along with Hyperopt or HyperparameterHunter.
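As a preview of the pattern, the sketch below logs each Hyperopt evaluation as a nested MLflow run; train_and_evaluate is a placeholder for whatever (re)training routine is being tuned, and the search space is illustrative:

```python
import mlflow
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

def train_and_evaluate(params):
    # placeholder for the actual (re)training routine; returns a loss
    return (params["learning_rate"] - 0.1) ** 2

def objective(params):
    # one nested MLflow run per hyperparameter evaluation
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        loss = train_and_evaluate(params)
        mlflow.log_metric("loss", loss)
    return {"loss": loss, "status": STATUS_OK}

space = {
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "num_leaves": hp.quniform("num_leaves", 16, 256, 4),
}

trials = Trials()
with mlflow.start_run(run_name="hyperopt_search"):
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=50, trials=trials)
```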

Scenario and Solution

A detailed description of both the scenario and the solution can be found in the post mentioned before.

In summary, we would like to run an algorithm in real time, and some immediate action needs to…


This is the first of two posts in which we will illustrate how one could use a series of tools (mostly Kafka and MLflow) to help productionise ML. To that end we will set up a simple scenario that we hope resembles some real-world use cases, and then describe a potential solution. The companion repo with all the code can be found here.

The scenario

A company collects data using a series of services that generate events as the users/customers interact with the company’s website or app. As these interactions happen, an algorithm needs to run in real time and some immediate action needs…
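As a flavour of how those interaction events might reach Kafka, here is a minimal kafka-python producer sketch; the broker address, topic name and message schema are made up for illustration:

```python
import json
from kafka import KafkaProducer

# serialise messages as JSON; the broker address assumes a local setup
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# a hypothetical interaction event emitted as the user browses the site
producer.send("app_messages", {"user_id": 123, "event": "add_to_basket"})
producer.flush()
```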


Some of my current music recommendations on Amazon. Not bad, but not perfect, based on the music I have been listening to on Amazon Music and/or purchasing lately on their site.

A while ago a friend of mine asked me about approaches that would be useful when optimising GBMs. I had been asked this question a few times in the past, so I thought I could share some code and write a post about it, just in case someone finds it useful. While re-writing the code, I thought I could give it some context and illustrate how one could optimise a GBM in the process of building a recommendation algorithm. I then finished the script and thought that I could also include some other algorithms and techniques used…
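As a flavour of one such approach, here is a minimal sketch of a randomised hyperparameter search over a LightGBM model using scikit-learn; the search space is illustrative and the data is synthetic, standing in for whatever features and target one is modelling:

```python
import lightgbm as lgb
import numpy as np
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in for the real features/target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 2.0 + rng.normal(size=1000)

param_dist = {
    "num_leaves": randint(16, 256),
    "learning_rate": uniform(0.01, 0.3),
    "min_child_samples": randint(5, 100),
    "subsample": uniform(0.5, 0.5),
    "colsample_bytree": uniform(0.5, 0.5),
}

search = RandomizedSearchCV(
    lgb.LGBMRegressor(n_estimators=200),
    param_distributions=param_dist,
    n_iter=20,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```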

Javier Rodriguez Zaurin

Scientist
