Not long after finishing writing my lessons learned from building a Hello World Neural Network, I thought I could move on from a simple MLP to a more sophisticated neural net. It was probably because of Karpathy’s blog post about the unreasonable effectiveness of Recurrent Neural Networks that I chose to continue with an RNN.
To be honest, this wasn’t my only motive. Those who read my posts will already know that one of the problems I have studied extensively during the past few months is mailing list churn prediction using data from MailChimp. It was a series of posts in which I covered:
- How to Predict Churn: When Do Email Recipients Unsubscribe?
- How to Predict Churn: A model can get you as far as your data goes
- Predicting Email Churn with NBD/Pareto
- Recurrent Neural Networks for Email List Churn Prediction (This post)
TIP: If you want to have the series of posts in a PDF you can always refer to, get our free ebook on how to predict email churn.
So, churn prediction boils down to timeseries analysis, and RNNs do great at such tasks.
What a huge coincidence, the choice was made!
Recurrent Neural Networks (TL;DR)
I’m not going to dive into a lot of technical details about what RNNs are and how they work, as there are plenty of sources online explaining them in detail.
I’m just going to mention that the distinguishing feature of RNNs, compared to other types of neural networks, is that only RNNs have a feedback loop. That means that at each time step a recurrent neuron receives the current input as well as its own output from the previous step. That’s exactly why they are suitable for time series predictions.
A simple diagram representing exactly this process is the following:
Due to the feedback loop mentioned above, we can claim that recurrent neurons have a kind of memory cell, so when making a decision they can take previous states into consideration too.
A specific type of neuron with memory cells is the so-called LSTM (Long Short-Term Memory) cell. No need to worry much about what they are or how they work. Just keep in mind that they are ordinary recurrent neurons with some extra properties that allow them to perform even better.
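To make the feedback loop a bit more concrete, here is a toy NumPy sketch of a plain (non-LSTM) recurrent step; the weight shapes and the tanh activation are only illustrative assumptions, not part of the model built later in this post.

```python
import numpy as np

# A toy recurrent step: the hidden state h carries information from previous time steps.
def rnn_step(x_t, h_prev, W_x, W_h, b):
    # The new state depends on the current input AND the previous state (the "feedback loop").
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(3, 5)), rng.normal(size=(5, 5)), np.zeros(5)

h = np.zeros(5)
for x_t in rng.normal(size=(4, 3)):   # a sequence of 4 time steps, 3 features each
    h = rnn_step(x_t, h, W_x, W_h, b)
```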
Collecting & Transforming data
The first step I had to take was data collection and preprocessing.
Although collecting the data wasn’t that difficult, figuring out how it had to be formatted in order to be fed into the RNN was the most painful part of the process.
As in all previous posts, the data I chose to work with came directly from mailing campaigns launched through MailChimp. I used Blendo, a data integration tool, to sync MailChimp data with PostgreSQL in a few minutes, but I only used the subset of the tables provided by the service that describes the actions (opens or clicks) a recipient performed in a particular month. Here is a sample data model of the MailChimp schema that was built for me in my database.
Based on this, my output would be a prediction of who is going to perform at least one action in the next month, so that those who aren’t going to open an email or click a link can be considered high risk.
What troubled me the most was that, in my case, I wanted the input to be a large number of uncorrelated (?) timeseries, each corresponding to a recipient.
Well, a correlation might actually exist, but in any case I didn’t want to use one recipient’s timeseries as a feature for predicting another recipient’s behavior.
After a lot of googling, multiple experiments, and just as many failures, I ended up with the following format. Each column of my input represents a recipient and each row a month. The value for a given month is set to 1 when the recipient opened an email or clicked a link at least once during that month.
| | John | Helen | Paul | Martha |
|---|---|---|---|---|
| January 2017 | 1 | 1 | | |
| February 2017 | 1 | 1 | | |
| March 2017 | 1 | 1 | | |
| April 2017 | 1 | | | |
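Here is a rough sketch of how such a matrix can be derived from a raw activity log with pandas; the `events` DataFrame and its column names are hypothetical, standing in for the opens/clicks tables synced from MailChimp.

```python
import pandas as pd

# Hypothetical activity log: one row per open/click event (column names are assumptions).
events = pd.DataFrame({
    "recipient": ["John", "Helen", "John", "Paul"],
    "month":     ["2017-01", "2017-01", "2017-02", "2017-03"],
})

# 1 if the recipient performed at least one action in that month, 0 otherwise.
activity = (
    events.assign(active=1)
          .pivot_table(index="month", columns="recipient",
                       values="active", aggfunc="max", fill_value=0)
)
print(activity)
```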
Another problem I had to face was that the timeseries were not all the same length, as not all recipients were subscribed during the same calendar months. I overcame this by padding the shorter timeseries with zeros.
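A minimal sketch of the padding step, using Keras’ pad_sequences helper; whether the zeros go at the beginning or the end of each series is an assumption here, since the post only states that zeros were used.

```python
from keras.preprocessing.sequence import pad_sequences

# Recipients subscribed at different times, so their monthly activity series differ in length.
series = [
    [1, 0, 1, 1],   # subscribed for 4 months
    [1, 1],         # subscribed for 2 months
]

# Pad the shorter series with zeros so every recipient has the same length.
padded = pad_sequences(series, maxlen=4, padding="pre", value=0)
print(padded)
```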
But how many months will you have available? 3 years? 5? 10? Even 10 years amount to only 120 months, i.e. 120 data points. Studying the data at a more granular level, for example days, seemed a better option.
Even then, the clean data I had at my disposal covered about 2.5 years, resulting in fewer than 150 data points.
Although training an RNN with so little data seemed like mission impossible, I gave it a try, and the results are presented below.
Building the network
For building the RNN, Keras was my choice, as it is very easy and straightforward to use. I built a linear stack of three layer instances:
- LSTM: The main recurrent layer of the constructed network. For more information, refer here.
- Dense: A regular densely-connected layer. For more information, refer here.
- Activation: Specifies the activation function that will be applied to the output. I chose ‘linear’; it is probably the simplest and seemed to behave quite well.
The number of input neurons was set to 1349, the size of the dataset, and the number of neurons in the hidden layer to 300.
To sum up, the whole constructed model is the following:
| Layer (type) | Output Shape |
|---|---|
| lstm (LSTM) | (batch size, 300) |
| dense (Dense) | (batch size, 1349) |
| activation (Activation) | (batch size, 1349) |
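Putting this together, a sketch of the model in Keras could look like the following; the number of time steps per input window is a placeholder, since only the 1349 input neurons, the 300 hidden units, and the linear activation are stated above.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

n_recipients = 1349   # one column per recipient, as described above
n_timesteps = 12      # hypothetical number of time steps per input window

model = Sequential()
model.add(LSTM(300, input_shape=(n_timesteps, n_recipients)))  # recurrent layer with 300 units
model.add(Dense(n_recipients))                                 # one output per recipient
model.add(Activation('linear'))                                # linear output activation
model.summary()
```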
Training & Evaluating
Before moving to the actual training, I had to split the data into train and test sets. This is an important step for evaluating the model’s performance and cannot be omitted, even though the dataset is already quite small.
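One way to perform the split, keeping the chronological order of the windows intact, is sketched below; the array shapes are hypothetical and only serve to illustrate the idea.

```python
import numpy as np

# Hypothetical dataset: X holds input windows of shape (samples, timesteps, recipients),
# y holds the activity of each recipient in the following period.
X = np.random.rand(30, 12, 1349)
y = np.random.randint(0, 2, size=(30, 1349))

# Use the first 80% of the windows for training and the rest for testing,
# so the test set always lies later in time than the training set.
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```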
The chosen loss function was the mean squared error, which minimizes the average of the squared errors during backpropagation. More information about the available loss functions can be found in the Keras documentation.
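Compiling and training the model then looks roughly like this, reusing the `model` and the train/test arrays from the sketches above; the optimizer and batch size are assumptions, as the post only specifies the loss function and the number of epochs.

```python
# Mean squared error as the loss; RMSprop is an assumed optimizer choice.
model.compile(loss='mean_squared_error', optimizer='rmsprop')

history = model.fit(
    X_train, y_train,
    epochs=15,                          # the model was trained for 15 epochs
    batch_size=32,                      # assumed batch size
    validation_data=(X_test, y_test),   # also track the loss on the test set
)
```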
After training the model for 15 epochs the following graphs were generated for the train and the test dataset.
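For reference, graphs like these can be produced from the `history` object returned by the fit call above, for example with matplotlib:

```python
import matplotlib.pyplot as plt

# Plot the loss recorded at the end of every epoch for both datasets.
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='test loss')
plt.xlabel('epoch')
plt.ylabel('mean squared error')
plt.legend()
plt.show()
```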
The good news is that the loss on the train set decreases as the epochs go by. The bad news is that, on the test dataset, the loss does not follow a steady downward trend; instead, it has its ups and downs.
This behavior was quite expected and indicates that the network isn’t actually learning. Instead, it “memorizes” the training points and cannot generalize to new, unseen data. This is a consequence of the insufficient amount of data I had available for this problem. Given a larger amount of data, the network could possibly generalize and produce better predictions on the test dataset.
On the bright side, according to my intuition rather than some super scientific explanation, the fact that the network overfits the training data as expected indicates that the code is more or less sane and can probably be used, with a few modifications, for other similar problems.
Conclusions
In this post, I tried to tackle a churn prediction task, made mistakes, and learned from them. I also had the chance to play with (or rather, scratch the surface of) Recurrent Neural Networks, a technique of immense value to intelligent systems, and to see how they work.
From the results, it seemed that the amount of data available was insufficient, so in order to move forward, one suggestion would be data enrichment. Data from MailChimp can be combined with behavioral data from other sources like Mixpanel and Intercom. This way, the timeseries will be expanded to include more data points for each user, and the model should perform better and eventually be able to predict churn.
But again, success is not guaranteed, as training neural networks does indeed require a large amount of data.
In any case, I would encourage you to get the code, which is available on GitHub, and try it on your own data. I would love to hear how the model performed on different churn tasks, so please share your experience in the comments!