…but still you may think you have a heck of a model.
In the previous post of our Predicting Churn series, we sliced and diced the data from Mailchimp to gain some insight and to predict which users are likely to churn. In principle, defining churn is a difficult problem; it was even the subject of a lawsuit against Netflix [1].
However, in the case of email marketing, the task is seemingly easier, as a user can be considered churned when they unsubscribe from the list. With a clear definition of churn for our case, we can start working with the available data. In the previous post, after a long process, we ended up with a satisfactory result.
In data science it’s imperative to have a feedback loop in place, where we try something -> get feedback & results -> learn from feedback & results and then try something new.
This is post #2 in the series on mailing list churn prediction using data from Mailchimp, where I cover:
- How to Predict Churn: When Do Email Recipients Unsubscribe?
- How to Predict Churn: A model can get you as far as your data goes (This post)
- Predicting Email Churn with NBD/Pareto
- Recurrent Neural Networks for Email List Churn Prediction
TIP: If you want to have the series of posts in a PDF you can always refer to, get our free ebook on how to predict email churn.
Like many other problems in data science, there is no silver bullet method for predicting churn. My feedback loop was the awesome Redditors in the Data Science subreddit. Their feedback pointed out that we had not taken into account that the data are serially correlated. That means the data may have an internal structure such as autocorrelation, trend or seasonal variation.
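To make "serially correlated" concrete, here is a minimal Python sketch of the sample autocorrelation of a series at a given lag. It is only an illustration: the function name and the weekly counts are invented for this example and are not part of our Mailchimp analysis. A value close to 1 or -1 at some lag suggests that neighbouring observations are not independent.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    denominator = sum(d * d for d in deviations)
    if denominator == 0:
        return 0.0  # a constant series has no variation to correlate
    # Products of deviations that are `lag` steps apart.
    numerator = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
    return numerator / denominator


# Hypothetical weekly unsubscribe counts, invented for illustration only.
weekly_unsubscribes = [1, 2, 3, 4, 5]
lag1 = autocorrelation(weekly_unsubscribes, 1)
```

A steadily rising series like the one above gives a positive lag-1 value, which is the kind of trend a model that treats rows as independent would miss.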
There is a crack in everything. That’s how the light gets in. [2]
After this realization, we will try a different approach, one that helps us understand the underlying forces and structures that produced the observed data and lets us fit a model for forecasting.
To reflect the sequential nature of the data in an email campaign, each subscriber must be represented by multiple data points in the dataset, each one corresponding to a certain value of the time indicator we choose.
Monthly or yearly intervals, days of subscription, or the "serial number" of each email received can all serve as time indicators. The appropriate indicator depends on the data we have.
For example, in a case where we have only two years of data from Mailchimp, the yearly intervals may be too broad.
Based on this, the selected indicators are:
- Days of subscription
- Email serial number: It is an ordinal number where 1 corresponds to the 1st email, 2 to the 2nd, etc.
For this analysis, we consider the email serial number more suitable: in the context of an email marketing campaign, a subscriber is more likely to churn right after receiving an email they consider irrelevant than on a random day.
Moreover, this choice may lead to more actionable insights, as it tells a marketer which subscribers are more likely to churn at the next email.
Of course, we encourage you to make different choices and evaluate which indicator works best for you.
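As a rough illustration of what "multiple data points per subscriber" looks like with the email serial number as the time indicator, the Python sketch below expands one subscriber into one row per email received. The function name, the field names and the labelling rule (churn marked only on the last email before unsubscribing) are assumptions made for this example, not the exact preparation we ran.

```python
def expand_to_sequence(subscriber_id, emails_received, churned):
    """Return one row per email received, numbered from 1.

    Assumed labelling rule: the churn label is 1 only on the last email
    of a subscriber who unsubscribed; every other row is labelled 0.
    """
    rows = []
    for serial in range(1, emails_received + 1):
        is_last = serial == emails_received
        rows.append({
            "subscriber_id": subscriber_id,
            "mail_serial": serial,
            "churned": 1 if (churned and is_last) else 0,
        })
    return rows


# Hypothetical subscriber who unsubscribed after the third email.
rows = expand_to_sequence("a@example.com", 3, churned=True)
```

With this layout, a prediction for row `mail_serial = k` answers the question a marketer actually cares about: is this person likely to leave after email number k?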
The first steps of our analysis are much the same as in our previous post, so we will run through them in less detail.
The data from Mailchimp we utilize are the same as before. We also included the serial number of the emails each member of a mailing list received before churning, along with the corresponding sign-out timestamp from the `Unsubscribes` table. The `Unsubscribes` table keeps track of the email of each churned recipient, the time at which they churned, the reason and some other information.
We used Blendo, an ETL-as-a-service platform, to load our email marketing data from Mailchimp into a PostgreSQL database in a consistent way, so we can move forward and join the two tables together.
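The join itself might look roughly like the sketch below. To keep it runnable anywhere, it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL. The `members` table, its `emails_received` column and the sample rows are hypothetical; only `unsubscribes`, `email_address`, `timestamp_out` and the unsubscribe reason come from the description above. A `LEFT JOIN` is used so that recipients who never unsubscribed are kept, with an empty sign-out timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
conn.executescript("""
    -- Hypothetical schema and rows, for illustration only.
    CREATE TABLE members (email_address TEXT PRIMARY KEY, emails_received INTEGER);
    CREATE TABLE unsubscribes (email_address TEXT, timestamp_out TEXT, reason TEXT);
    INSERT INTO members VALUES ('ann@example.com', 12), ('bob@example.com', 3);
    INSERT INTO unsubscribes VALUES
        ('bob@example.com', '2017-03-01 10:15:00', 'not interested');
""")

# Keep every member; unsubscribe columns stay NULL for those who have not churned.
joined = conn.execute("""
    SELECT m.email_address, m.emails_received, u.timestamp_out, u.reason
    FROM members AS m
    LEFT JOIN unsubscribes AS u ON u.email_address = m.email_address
    ORDER BY m.email_address
""").fetchall()
conn.close()
```

The same `SELECT ... LEFT JOIN` statement is standard SQL, so it should carry over to PostgreSQL with the real table and column names substituted.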
Now, we are going to evaluate the underlying structure of the data through the lens of our new variables. The conclusions we drew in the previous post still hold, and in addition to those, we will investigate the new variables to learn as much as we can from them.
The new measure we introduced, `days_since`, reveals that there are certain time periods during which members are more likely to churn. For example, the "high risk" periods seem to be:
- The early days of subscription: During these days a subscriber is likely to realize that they are not interested in the content of the emails they receive, and so unsubscribe.
- After a year of subscription: Perhaps these are users who have already interacted with the company and are unhappy with its services.
As far as the serial number of the email is concerned, we can see that the maximum density is reached at fewer than five emails, both for those who have churned and for those who have not. That means there is a significant number of recipients who subscribed quite recently and thus haven’t yet received many emails from the campaigns.
This last observation may prove to be a problem in our analysis, as it means our time series have very few data points, so predicting the future behavior of customers based on them will be difficult.
On the other hand, having many years of good data makes predictions more accurate and reveals the actual predictive ability of each feature.
Feature engineering is about combining existing features into new ones that are more meaningful. It is a crucial part of our analysis, as the quantity and quality of the features we use will influence the results we achieve. The way in which the existing features can be combined depends entirely on the problem we are trying to solve.
| Feature | Description |
|---|---|
| `mailCount` | The serial number of each email for each recipient. |
| `personalMail` | Detection of personal/business email. If the recipient's email provider belongs to a list of the most common ones, the address is classified as personal, otherwise as business. |
| `daysSub` | The number of days the recipient remained subscribed. |
| `totalpractions` | The total number of actions the user has performed until now. |
| `avgactions` | The average number of actions the user performs per email until now. |
| `days_since` | The number of days since the last email the recipient received. |
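To show how features like these could be derived, here is a hedged Python sketch that builds one feature row per email for a single recipient. The provider list, the input layout (`history` as ordered `(sent_on, actions)` pairs) and the choice to measure `daysSub` up to each email are assumptions made for illustration; the real features were computed on the joined Mailchimp tables.

```python
from datetime import date

# Hypothetical list of common personal email providers (an assumption).
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}


def build_features(email_address, subscribed_on, history):
    """Build one feature row per email received.

    `history` is a list of (sent_on, actions) tuples ordered by send date,
    where `actions` is the number of interactions with that email.
    """
    domain = email_address.rsplit("@", 1)[-1].lower()
    personal = domain in PERSONAL_DOMAINS
    rows, total_actions, previous_sent = [], 0, None
    for mail_count, (sent_on, actions) in enumerate(history, start=1):
        total_actions += actions
        rows.append({
            "mailCount": mail_count,
            "personalMail": personal,
            # Assumption: days subscribed so far, measured at this email.
            "daysSub": (sent_on - subscribed_on).days,
            "totalpractions": total_actions,
            "avgactions": total_actions / mail_count,
            # Days since the previous email; None for the first email.
            "days_since": None if previous_sent is None
            else (sent_on - previous_sent).days,
        })
        previous_sent = sent_on
    return rows


# Invented example: three emails sent to one recipient.
rows = build_features(
    "ann@gmail.com",
    date(2017, 1, 1),
    [(date(2017, 1, 3), 2), (date(2017, 1, 10), 0), (date(2017, 1, 24), 1)],
)
```

Because every value is computed only from emails sent up to the current one, no row leaks information from the recipient's future behavior.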
At this stage, we are going to transform the data so that we can handle them in R. The two steps we followed are:
Dealing with NAs
The existence of NAs is a very common problem in every dataset in the real world. Especially when the data come from forms the users fill in, blank fields are in most cases present.
Depending on the feature, NAs are handled differently.
For example, the `timestamp_out` variable will have NAs in cases where the recipient has not churned. The same applies to the variable `totalpractions` for users who have not interacted with any email.
On the other hand, a missing `email_address`, which is the primary identifier for the users, cannot be handled, so the record is removed, as we do not know who it belongs to.
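These per-feature rules can be sketched as follows. This is a Python illustration rather than the R code we ran, and the concrete choices (filling a missing `totalpractions` with 0 and deriving a `churned` flag from `timestamp_out`) are assumptions about one reasonable way to apply them.

```python
def clean_records(records):
    """Apply per-feature NA handling to a list of dict records."""
    cleaned = []
    for record in records:
        # Without an email address we cannot tell who the record belongs to.
        if not record.get("email_address"):
            continue
        row = dict(record)  # copy, so the input is left untouched
        # Assumption: a missing action count means no interactions at all.
        if row.get("totalpractions") is None:
            row["totalpractions"] = 0
        # Assumption: a missing sign-out timestamp means "has not churned".
        row["churned"] = row.get("timestamp_out") is not None
        cleaned.append(row)
    return cleaned


# Invented records: one active, one churned, one without an identifier.
raw = [
    {"email_address": "ann@example.com", "timestamp_out": None, "totalpractions": None},
    {"email_address": "bob@example.com", "timestamp_out": "2017-03-01", "totalpractions": 4},
    {"email_address": None, "timestamp_out": None, "totalpractions": 1},
]
cleaned = clean_records(raw)
```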
Splitting in train and test dataset
At this stage, the initial dataset is split into train and test datasets. The train dataset is used for model construction and the test dataset for model evaluation. Evaluating the model’s performance on unseen data is necessary before deploying the model, in order to assess the quality and trustworthiness of our results.
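One possible way to do the split is sketched below in Python. Since each subscriber now contributes several rows, this version splits by subscriber, so that all rows of one person end up on the same side. That grouping, the 80/20 ratio and the fixed seed are assumptions made for this example, not necessarily the split we used.

```python
import random


def split_by_subscriber(rows, test_fraction=0.2, seed=42):
    """Split rows into train and test sets, keeping each subscriber whole."""
    ids = sorted({row["email_address"] for row in rows})
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    n_test = max(1, round(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = [r for r in rows if r["email_address"] not in test_ids]
    test = [r for r in rows if r["email_address"] in test_ids]
    return train, test


# Invented data: 10 subscribers with 3 emails each.
rows = [
    {"email_address": f"user{i}@example.com", "mailCount": k}
    for i in range(10)
    for k in range(1, 4)
]
train, test = split_by_subscriber(rows)
```

Splitting row by row instead would let the model see some emails of a subscriber during training and be tested on others from the same person, which can make the evaluation look better than it is.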