Can I predict churn? Having an email list and being able to predict my churn, is a valuable tool in the hands of any marketer. The idea of predictive analysis and its application in email marketing is not new. The aim is to formulate a more effective strategy by modeling customers’ or consumers’ behavior based on historical data, using statistical algorithms and machine learning techniques.
Update: for the next part of Predicting Churn series post check the How to Predict Churn: A model can get you as far your data goes…but still you may think you have a heck of a model.
In email marketing, a marketer may be particularly interested in evaluating the performance of past campaigns and predict the outcome of future ones. One common metric for this is the propensity for an email recipient to unsubscribe or churn. Armed with this knowledge, companies can adjust the frequency of which they sent emails or consider using other communication channels such as social media.
In the following example, we attempt to build a model to classify the email subscribers in those who will probably churn and those who won’t, based on data collected from an email marketing campaign launched through MailChimp.
This is post #1 in the series of mailing list churn prediction using data from MailChimp where I cover:
- How to Predict Churn: When Do Email Recipients Unsubscribe? (This post)
- How to Predict Churn: A model can get you as far as your data goes
- Predicting Email Churn with NBD/Pareto
- Recurrent Neural Networks for Email List Churn Prediction
TIP: If you want to have the series of posts in a PDF you can always refer to, get our free ebook on how to predict email churn.
First, we need to get access to the Mailchimp email marketing data that we’ll use for our analysis. To do that, we used Blendo an data management platform, to load our email marketing data from Mailchimp, into a PostgreSQL database.
From the data that Mailchimp provides, we selected the following:
- The List members: Mailchimp organizes the recipients into lists. Each mailing list contains information about the location of recipients (latitude, longitude, GMT offset, time zone, country code), email address and client in use , statistics about average open and click rate as well as his current subscription status (unsubscribed, subscribed, pending or cleaned).
- Email activity: A set of reports that contain all the actions the users perform, like opening an email or clicking a hyperlink along with the corresponding timestamp. That can be perceived as a time series that holds information on how the email recipient has interacted with the email marketing campaigns over time.
With our data stored consistently into a database like PostgreSQL, we can move forward and join the two tables together.
Exploratory Data Analysis
Exploratory Data Analysis is a visual approach to data investigation to maximize our insights, reveal the underlying structure and construct testable hypotheses.
The tools that we’ll use include the following:
- The so-called Five-Number Summary
- Histograms graphs
- Pairwise Scatterplots
Before we start exploring our data, we define the variable
churn as the output variable with the following possible values and their interpretation:
- 0 for those recipients that will not churn.
- 1 for those who will churn by discontinuing their mailing list subscription or being “cleaned” for any reason. For example, if someone is bouncing permanently, like an invalid or blocked email, MailChimp automatically removes them from the mailing list.
Of course, we do not expect all variables to have predictive power, but part of this exploratory phase is to figure out which ones do have and try to generate the intuition about why this is happening.
So let’s start.
Taking a look at the data above we can draw the following conclusions:
- All users have VIP status equal to 0. This means that this variable is useless for our analysis and thus can be excluded as it gives us no information.
- Over 75% of the recipients, has an average click rate equal to 0 (3rd Quantile: 0). Given that those who churn are significantly less, we are suspicious that even for indifferent recipients the click rate is probably not a good indicator about whether they will churn.
- The distribution of both locationlatitude and locationlongtitude indicates that the recipients are not spread equally around the globe. In fact, the majority is located in the north hemisphere.
The map verifies the assumption we made that the recipients are not spread equally around the globe. Green dots indicate those who will probably remain subscribed while blue ones those who won’t.
An interesting conclusion was extracted about whether the mailing list in which a recipient belongs affects his propensity to churn. It seems like members of the mailing list on the left tend to churn more easily.
This can be a consequence of a wrong mailing list segmentation resulting in irrelevant content being sent to target audiences.
Member Rating Scatterplot
As for the member_rating field according to Mailchimp documentation, new subscribers start out with 2 stars when they get added to a mailing list. The rating falls only because of soft or hard bounces and spam complaints.
Based on this no wonder why the majority of users with score 1 have discontinued -or Mailchimp has for them- their newsletter subscription.
On the other hand, positive scores are assigned to users who seem to be more engaged. According to Mailchimp documentation again, the engagement is measured using open and click data against the marketer’s sending frequency. The interesting fact is that from the recipients who churned, the number of those who have rating 2 is almost equal to those who have rating 3, in contrast to the recipients who remain subscribed the majority of whom has rating 2.
That leads us to the conclusion that the email campaign probably failed to maintain the interest of recipients who at some point seemed engaged.
Correlation Plot of Churn predictor variables
Another type of plot that is proved very useful is the correlation plot which gives us information about whether there is an underlying linear relationship between variables or not. The different color codes indicate whether the existing relationship is positive or negative. From the correlation plot above we can conclude that the variable
churn seems to have a significant linear correlation with variable
location_gmtoff the linear correlation is close to zero(0).
Feature Engineering is about combining features in the most imaginative way to squeeze just a little more information out of them. It is maybe the most difficult part of a data science problem as it depends on the problem we are trying to solve and how well we understand the data.
Our dataset contains columns with time information all expressed in UTC timezone. At that point, we are going to express each timestamp to the correct timezone. In that way, our timestamp becomes more meaningful for the certain email recipient.
Detection of business or personal emails
At that step, we tried to classify the emails of customers as personal or business. The implementation is based on a list of the most common email service providers. If the service provider of a recipient belongs to this list, then his email is classified as personal, otherwise as business. The implementation at this point is quite naive but gives as a good place to start. The results of this classification are stored in the
Aggregated action fields
Additionally, we computed a number of aggregated fields to supplement our data. These fields are the following:
- the total times a recipient opened an email.
- the total numbers a recipient clicked a hyperlink.
- the total numbers email bounced per recipient.
Dealing with missing values
Digging into each variable, we realized that many of them contain a large number of NAs. NAs in R correspond to “Not Available”, somehow close to the definition of Null found in other programming languages.
Having NAs in a dataset is very common, and they can be produced in many ways. For example, when a recipient lets a field in his profile blank, NA value is assigned to this field.
Although very common, missing values can cause plenty of problems, that can be quite difficult to deal with at times. To eliminate this issue, we can either choose to impute the missing values or exclude the whole column if the number of NAs per column is high compared to the total observations.
In our case, we chose to remove columns with a percentage of NAs above 50%. After the removal, the NAs were eliminated entirely and so no imputation was needed.
Splitting dataset in training and test
It is important to be able to assess the quality of a predictive model by simulating its performance on “unknown” data. This can be done by splitting the initial data. In our case, we chose the 25%, in order to utilize it for assessing the so-called out-of-sample performance of a model.
Modelling with Random Forest
As stated above, our output variable
churn, is binary, taking the value 1 when a recipient churns and 0 the other way.
The method we chose to utilize for modeling, is a Random Forest as it demonstrates certain advantages compared to other algorithms, including resistance to overfitting. Dependency between variables is also not a problem as each of the decision trees will take any split and will continue splitting further, finally finding a perfect separation.
The power of random forests lies in the fact that although a single decision tree may be imperfect, the most common response among many trees shall give a superior prediction. Each tree can be visually represented as:
The different criterions correspond to different variables of the dataset used for training, with “Criterion 1” being the most important according to the algorithm. At each node, the algorithm decides which is the “perfect” split and continues the execution. The criterion for choosing the perfect split is the maximum decrease of gini.
To make these trees different, random forest algorithm does the following:
Bagging: The algorithm selects a randomized training set so that each tree is trained differently.
Different variables: For the construction of each tree, the random forest algorithm uses a different subset of available variables (typically approximately the square root). In that way, “strong” features do not dominate the decisions in all trees as in some of them these features may not be selected.
In our case, the dataset wasn’t really big, so we chose to generate 2000 trees, a quite large number. In the case of larger datasets, computational limitations would force us to reduce the number of trees at least for the initial exploration.
Modelling with Random Forest – Mean Decrease Gini
For assessing the predictive ability of a variable we used the MeanDecreaseGini measure. Every time a split is made on a variable the gini impurity criterion for the two children nodes is less than for the parent node. Summing the gini decrease for each variable from all trees in the forest we get the variable importance order, as shown below.
Based on this we reran the random forest only with the previously selected variables. According to model’s output the OOB estimate of error rate is 4.5%, and so it performed pretty well on the OOB observations, the test dataset random forest algorithm evaluates on its own.
After building a model, it is important to evaluate how well this model behaves. This constructive feedback is the main principle of predictive modeling. Based on the output of various performance metrics we can further improve our model until we get the desired results.
As there is nothing like “perfect” metric, it is important to take into consideration various metrics, each one of them evaluates the result from a different aspect. In our case, we used the most common evaluation measures for classifiers, and the results are presented below.
Accuracy is the most intuitive metric as it represents the percentage of correct out of the total guesses. In our example, accuracy is equal to 0.97. But don’t get too excited! In many cases, like ours, where the event we are modeling is quite rare, high accuracy doesn’t always imply good predictive ability. That is why we need to compute some other metrics too.
Recall or Sensitivity
The recall derived from the above model is equal to 0.86. This value is equal to the fraction of relevant instances that were retrieved. In simple words, the percentage of true positive results the algorithm retrieved out of all positive results. As a percentage, this value varies between 0 and 1 with 0.86 being a good score.
The precision derived from the above model is equal to 0.96. Precision, like recall, is also a measure of relevance and is equal to the fraction of retrieved instances that are relevant. Simply put, the percentage of true positive guesses out of all positive guesses.
F1 is a measure less intuitive that those presented before which takes into consideration recall and precision and is defined as their weighted average. F1 score reaches its best value at 1 and its worst at 0. In our case, F1 was computed as 0.91, which is a very satisfying score.
Area Under the ROC Curve (AUC)
Maybe the less intuitive measure if all is the Area Under the ROC Curve (AUC). AUC computed as the area under the ROC curve taking its maximum value at 1. In our case, the AUC was computed as 0.93, a very high score.
Generally, we can evaluate the model as excellent both on the OOB points the random forest algorithms uses on the fly as a test dataset and on the main test dataset, we constructed at the beginning.
Being able to predict if a mailing list member will churn is an important tool in the hands of any marketer. What we’ve seen, is that it is possible to do it, by using the right tools that data science is offering. Some things to keep in mind if you decide to use this or similar techniques for churn prediction.
First, you need to keep in mind that predictions are an iterative process. Your models, no matter how good, they need to be updated as new data is generated. So, use the model evaluation techniques that are mentioned and always test with the new data to see if you should reconsider your approach.
Then, you need to understand that what might work for one dataset might fail completely for another, even if we are working on the same data source. The way that list members of a mailing list about electronics behave might differ completely on how teenagers who care about snickers do. This difference in behavior should be hidden in the data and will significantly affect your work in trying to predict it.
No matter how good your data is, it most certainly is not complete. Always try to enrich your datasets with new sources. This is an area that a data platform like Blendo really shine. As you can easily pull more data into your data warehouse and all you have to do is to focus on experimenting with your models.
Get the code
Finally, do not forget to get the code from Github that generates the above models from Github and use it on your own data. If you find it useful, please share the love and leave a comment!