## The Intro

Being in a company where everyone is an engineer means that sooner or later you will have to deal with things that are completely out of your comfort zone. For me, this happened when I had to start working on marketing for **Blendo**. I’m quite happy to learn new things no matter the domain, but with marketing it was always like hitting a wall. Not only did I not know how to do marketing, it seems I was also completely lacking the learning mechanisms that could help me understand this domain. Now add to this the following:

1. There’s a huge amount of information out there about marketing, but after a while I concluded that 99% of it is marketing material about marketing. Fluffy stuff that sounds serious (depending mainly on who’s talking) but in the end leaves you at the same level of ignorance. Really frustrating, I would say, as I was blaming myself for not understanding for quite a while.
2. Most of the people I have access to who have some knowledge on the matter usually just tell you what you want to achieve, but too little about how to achieve it. This is also a characteristic of (1).

To be fair, it might have to do with me, the way I have learned to learn so far and how I communicate with people. Nevertheless, I needed to figure out a different way of learning how to do marketing, something that would probably require me to abandon my comfort zone but would at least feel closer to my way of thinking.

That’s how I decided to experiment with statistics and data science as a way of learning about marketing. But first, I should define better what I mean by learning.

As I see it, in order to be able to do marketing I need to achieve the following:

1. Understand the purpose of it. This one is easy and I believe it was clear to me from the beginning. Marketing is the process of communicating with others (your customers) to raise their awareness about a very specific subject (e.g. your product).
2. Figure out how to establish communication channels.
3. Measure how your communication performs. Do others get your message? Do they understand it?
4. Find what parameters affect the communication process and how you can manipulate them to perform better.

At this early stage, I aimed to use statistical analysis for (3) and (4). I consider (1) a lifelong process, where my understanding of what marketing is will change as I gain experience, and my impression for (2) is that using data science there requires a lot of data, something that I’m lacking right now.

Some more background information and a disclaimer before I close this quite lengthy introduction. I haven’t touched statistics for quite a while, and although I consulted some really smart people who are awesome data scientists, I delved into this mostly on my own, hoping that as I learn about a subject that I feel more comfortable with (statistics), I would also gain insights about something that I do not (marketing). So I take full responsibility for any mistreatment of the techniques and tools that I used for this.

**Let’s start.**

## Problem Definition

Marketing is a broad field, so we need to narrow our focus a bit in order to build something meaningful. At **Blendo**, and I believe at many early stage companies, email marketing was one of the first marketing tools that we used. We keep using it heavily and we’d love to figure out a way to make it work better. We are using Mailchimp, and although it offers statistics and metrics for the performance of your campaigns, it doesn’t offer the depth that we’d like in order to better understand what’s going on. Also, e-mail marketing is a field with great potential for statistical analysis: it includes many different parameters, ranging from free text to typical demographic information, and it is possible to get access to the data that you will need for the job.

## Setting the goal

*Our goal is to understand what affects the open rate of a campaign and if possible to “predict” with some certainty if a recipient will open an email or not based on some parameters.*

The first goal when sending an e-mail, and the most obvious one, is to make sure that your recipient will open it. If she doesn’t open it, she certainly won’t read it, right? 🙂 Of course, we can define much more complex goals, like:

- Click on the links that we have inside e-mail
- Do a specific action on a landing page when the recipient lands there from an e-mail
- Let your imagination run wild…

But opening the e-mail is the most basic one and a prerequisite for anything else that might happen after this event, so we will start with this.

## Make some hypotheses

Now that we have found our “outcome variable”, which is whether an e-mail is opened or not, we need to build a set of “independent variables” and see how they affect it. Practically, we’ll be making some hypotheses and figuring out how to represent them using the data we have available. At this point, navigating the data that I was extracting from Mailchimp helped me a lot in forming some first hypotheses. As I was doing some dogfooding and using Blendo to pull the data from Mailchimp, I had it in a relational database, so it was quite easy both to observe the schema of the data and to run some queries (mainly aggregations) to see what kind of data I was dealing with. My first hypotheses were the following:

- **What characteristics of the recipient affect the opening of an email?** At this point, we’ll be using only one for our analysis, the e-mail address of the recipient, and more specifically whether she subscribed using her work email or not. The intuition here is that someone who has used her work email is more serious about the message that we want to communicate.
- **What parameters of our campaign affect the opening of an email?** Again we’ll start with just one parameter, the time that we decided to send the campaign. Is it better to send the campaign during working hours or outside them?
- **How important is the type of the campaign?** This can be part of the above hypothesis, but I believe it deserves to be on its own, mainly because it is related to the content, and when we work with free text the possibilities are endless. For now, we’ll just try to figure out a way to use the structured data we already have to characterize the type of the campaign.

## Lessons learned

- I don’t know if the best strategy is to start from your data to find your hypotheses, but in my case it worked. It helped to frame the problem and see which dimensions of it I would like to investigate. I’m probably biased here as an engineer who works mainly with data; for someone coming from a different domain a different approach might be better.
- Defining hypotheses is different from defining variables. To be more precise, a hypothesis might contain many different variables. For example, the characteristics of the recipient might be all the demographics we can gather about her by combining the data from Mailchimp with our CRM. I most certainly will not come up with the best model by having just one variable for each of the hypotheses that I made, but it is a good starting point.

## The Methodology

At this point, I will probably disappoint you. I’d love to tell you that I used deep neural networks, that I crunched data on a cluster of GTX Titan X GPUs, and that I also trained an RNN to write me an essay at the end on how to maximize my campaign open rate, but I didn’t. First of all, I didn’t have the data for something like this. Second, although I still remember how to differentiate, it would require a lot of effort to work with and understand these methodologies. Third, even if I tried to use something like that, I most probably wouldn’t understand if and why it worked or not. So I needed to find a statistical methodology that would be simple enough for me to understand and use, and which would have the following characteristics:

1. It could be used to get some insights on how the open rate is affected by the variables I have defined.
2. It could also be used to “predict” how a recipient would behave, that is, whether she would open an email or not. Here we have a classification problem.
3. It could potentially be used to update a model online and, by using (2), to decide automatically when to send an email to someone using a service for transactional emails (ok, here we have an engineer daydreaming).
4. It could be used as the foundation for learning more complex stuff in the future.

As I was working with a binary variable (an email will either be opened or not), I decided to go with Logistic Regression. By building a model using this technique, I can predict whether an email will be opened or not depending on the parameters that I have defined. There’s also a lot of literature out there on how to examine the contribution of each coefficient of the model, which helps me understand how each variable affects the marketing campaign. Also, it is possible to train a Logistic Regression model “online” using Stochastic Gradient Descent, so requirements (3) and (4) are also covered for the future.
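To make the “online” idea a bit more concrete, here is a minimal sketch of a logistic regression that is updated one observation at a time with Stochastic Gradient Descent. It is not the code used for this post: the class name, the learning rate and the synthetic stream of opens are all assumptions made for illustration.

```python
import numpy as np


class OnlineLogisticRegression:
    """Logistic regression updated one observation at a time with SGD."""

    def __init__(self, n_features: int, learning_rate: float = 0.05):
        self.weights = np.zeros(n_features)
        self.intercept = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x: np.ndarray) -> float:
        """Estimated probability that the email is opened."""
        z = self.intercept + self.weights @ x
        return float(1.0 / (1.0 + np.exp(-z)))

    def update(self, x: np.ndarray, opened: int) -> None:
        """One SGD step; the log-loss gradient for one observation is (p - y) * x."""
        error = self.predict_proba(x) - opened
        self.weights -= self.learning_rate * error * x
        self.intercept -= self.learning_rate * error


# Synthetic stream (an assumption): one binary predictor that raises
# the open probability from 0.2 to 0.5.
rng = np.random.default_rng(0)
model = OnlineLogisticRegression(n_features=1)
for _ in range(5000):
    x = rng.integers(0, 2, size=1).astype(float)
    p_open = 0.5 if x[0] == 1 else 0.2
    model.update(x, int(rng.random() < p_open))

print(model.weights, model.intercept)
```

Each new open or non-open nudges the coefficients a little, so the model never has to be refitted from scratch. A constant learning rate keeps the sketch short; a decaying one would give steadier estimates.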

## Lessons learned

Start with simple techniques. To be honest, I could have started with something even simpler, like a Student’s t-test, but I had a flashback to my past as a student trying to use some tables to pass the Statistics exams, so an emotional bias kicked in and I decided to use something different. But seriously, it’s much better to start with something simple and build on top of it as you better understand the tools you are using. Simple techniques can get you a lot of mileage.

## Data Preparation

I already had my *Mailchimp* data in a PostgreSQL database and I had to figure out how to create my variables from the raw *Mailchimp* data. As in most cases, the data we have available implicitly contains the information that we need, so we have to extract it and represent it explicitly in a form that can be used by our model. Because I had the data already in a relational database, I preferred to do any join-like operations there, end up with the smallest possible dataset, and then pull the data out and use Python, *Pandas* and whatever other libraries Python has to offer to build my model.

The *Mailchimp* data model is built around the concept of a Campaign. Whenever we want to send an email to a number of recipients we create:

- A campaign which includes the email we want to send, and
- A list of recipients that is attached to the campaign and who will receive the email.

*Mailchimp* also offers reports about a campaign. One of these reports, the “Email activity”, contains something similar to a log for each one of the recipients, with events that correspond to how the recipient interacted with the email of the campaign. So for example, if recipient A has opened the email, from the report we will get an event of type “open” together with a timestamp of when that happened.

**Important**: *The way to understand if someone hasn’t opened an email is by observing that the event list associated with her e-mail is empty. So here we have some implicit information that we need to turn into explicit.*

Of course, all these entities contain information that we do not need and also information that we need to join across different tables/entities into one table containing the observations/variables that we will use for our model. Ideally, this table of observations will contain the following information for each one of our recipients across all our campaigns:

- A variable which will indicate if the email was opened or not by the recipient. We will call this column “opened”.
- A variable indicating the type of the email that was sent. More on this later. We will call this “isBlog”, true means that the email sent was about a blog post while when false the mail is about a product announcement.
- A variable holding information about the time the email was sent. More specifically if the email was sent during working time or not. The name of this variable is “duringWork”.
- And finally a variable holding information about the email address of the recipient, if it is a personal email account or a work related one. The name of the variable will be “businessAddress”.

None of the above information is explicitly declared inside the dataset coming from *Mailchimp* and thus we need to do some preparation to extract it. First of all, we’ll need to join data coming from the following entities/tables from *Mailchimp*:

- **Email activity reports**. This is our event log.
- **Campaign table**. Information related to our campaigns.
- **List members**. Information about the recipients of each campaign email.

The purpose of joining these tables together is to end up with a table that will contain all the information needed to calculate the variables we mentioned earlier and it should have the following columns:

- **campaign_id**
- **list_id**
- **email_address**
- **send-time**
- **title**
- **action**

The event logs we get from the Email activity reports contain information only about recipients who have interacted with the campaign. For this reason it’s not enough to just join these events with the campaign and list tables. We also need to find the people from the lists that are not present in the event logs and add them, as these are the people who haven’t opened our emails.
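One way to keep the recipients without any events is a LEFT JOIN that starts from every (campaign, recipient) pair. The sketch below is only an illustration of that idea: an in-memory SQLite database stands in for PostgreSQL, and all table and column names are assumptions rather than the actual schema that Mailchimp or Blendo produce.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the PostgreSQL database; the tables,
# columns and rows below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE campaigns (campaign_id TEXT, list_id TEXT, send_time TEXT, title TEXT);
CREATE TABLE list_members (list_id TEXT, email_address TEXT);
CREATE TABLE email_activity (campaign_id TEXT, email_address TEXT, action TEXT);

INSERT INTO campaigns VALUES ('c1', 'l1', '2016-05-10 10:30:00', 'New Post: Hello');
INSERT INTO list_members VALUES ('l1', 'ann@gmail.com'), ('l1', 'bob@acme.io');
INSERT INTO email_activity VALUES ('c1', 'ann@gmail.com', 'open');
""")

# Start from every (campaign, recipient) pair, then LEFT JOIN the events so
# that recipients with no activity are kept, with action = NULL.
query = """
SELECT c.campaign_id, c.list_id, m.email_address, c.send_time, c.title, a.action
FROM campaigns c
JOIN list_members m ON m.list_id = c.list_id
LEFT JOIN email_activity a
       ON a.campaign_id = c.campaign_id
      AND a.email_address = m.email_address
ORDER BY m.email_address
"""
events = pd.read_sql_query(query, conn)
print(events)
```

The recipient with no events still shows up in the result, with an empty `action`, which is exactly the implicit “did not open” information made explicit.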

The title is used to infer the content type. In our case, we use specific wording in the campaign titles depending on the type, so we can use that to figure out if it is a blog post or a product announcement. An even better approach would be to create a custom field in Mailchimp where you would add the type yourself and use this information directly. We didn’t do it, so I had to infer the type from the title. Also, in my case I know that I have only two different types, so it is easy to make this variable a binary one. If you had more than two types, you would have to introduce dummy variables using something like the “get_dummies” function of Pandas, with which you can easily dummify such categorical variables.
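As a small illustration of the dummy-variable case, here is how “get_dummies” could be applied to a campaign type with three values. The column name and the third type are hypothetical, since our data only has two types.

```python
import pandas as pd

# Hypothetical campaign types; the data in this post only has two.
campaigns = pd.DataFrame(
    {"campaign_type": ["blog", "product", "newsletter", "blog"]}
)

# One 0/1 column per type. drop_first removes a redundant column: the
# dropped type ("blog") becomes the baseline the others are compared to.
dummies = pd.get_dummies(
    campaigns["campaign_type"], prefix="type", drop_first=True, dtype=int
)
print(dummies)
```

Dropping one level matters for regression models with an intercept, because keeping all the dummy columns would make them perfectly collinear with the constant.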

At the end, we have a table which is a complete and enhanced event log of how our recipients have interacted with our campaigns. Now it’s possible to derive the variables we need.

- The “opened” variable is populated with true or false by simply observing whether the “action” column has the value “open” or not. If it is completely empty, we know by definition that the recipient hasn’t opened the email.
- The “isBlog” variable is populated with true or false by checking if the sub-string “New Post” is included in the campaign title. By definition again, we know that if it’s not about a blog post then it’s a product announcement.
- The “businessAddress” variable is derived by checking whether the email address contains one of some pre-defined strings. So we create a small authority file containing the email domains that people usually use for personal accounts, e.g. Gmail and Yahoo.
- The “duringWork” variable is derived by checking whether the email was sent between 9:00 and 17:00. If it was, the variable gets the value true, otherwise false.
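The four rules above can be sketched in Pandas roughly as follows. This is not the original script: the function name, the column names, the list of personal providers and the sample rows are assumptions, and the send time is taken as-is, without any timezone handling.

```python
import pandas as pd

# Small "authority file" of personal email providers (illustrative only).
PERSONAL_PATTERN = "gmail|yahoo|hotmail"


def derive_variables(events: pd.DataFrame) -> pd.DataFrame:
    """Build one observation per (campaign, recipient) from the event log."""
    df = events.copy()
    df["opened"] = df["action"].eq("open")  # a missing action counts as not opened
    # Keep a single row per pair, preferring a row where an open was seen.
    df = (
        df.sort_values("opened", ascending=False)
        .drop_duplicates(["campaign_id", "email_address"])
        .sort_values(["campaign_id", "email_address"])
        .reset_index(drop=True)
    )
    sent_hour = pd.to_datetime(df["send_time"]).dt.hour
    domain = df["email_address"].str.split("@").str[-1]
    return pd.DataFrame(
        {
            "opened": df["opened"],
            "isBlog": df["title"].str.contains("New Post", regex=False),
            "businessAddress": ~domain.str.contains(PERSONAL_PATTERN, case=False),
            "duringWork": sent_hour.between(9, 16),  # 9:00 up to 16:59
        }
    )


# Made-up rows in the shape of the enhanced event log.
events = pd.DataFrame(
    {
        "campaign_id": ["c1", "c1", "c1", "c2"],
        "email_address": ["ann@gmail.com", "ann@gmail.com", "bob@acme.io", "bob@acme.io"],
        "send_time": ["2016-05-10 10:30:00"] * 3 + ["2016-05-12 20:00:00"],
        "title": ["New Post: Hello"] * 3 + ["Product update"],
        "action": ["open", "click", None, "open"],
    }
)
variables = derive_variables(events)
print(variables)
```

Collapsing to one row per campaign and recipient matters because a recipient who opens and then clicks produces several events, and the model should count her once.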

The query that generates the enhanced event log is executed in Python and populates a Pandas DataFrame. From there we define some simple functions to bring our data to its final form, including only the variables we need. The data preparation also includes some additional steps for dealing with missing values, dropping columns that we do not need at the end, etc. These steps are omitted here as they are quite trivial and will differ anyway depending on the data you have.

## Lessons learned

First lesson: everyone models the data differently depending on how they plan to use it. Mailchimp has its own data model, which is perfect for their application, but if you plan to do something different you will most probably have to bring the data into the context of the problem you are trying to solve or the question you are trying to answer.

Second, I made a “mistake” with the “duringWork” variable. After I created it, a quick look at the resulting table made it obvious that all the campaigns were sent during working hours. There are two issues here:

1. I should have done some exploratory analysis before I decided to use this variable. If I had, I would have known that all of my campaigns fall in this time interval.
2. The way I was calculating this variable, I was considering the time the email was sent with reference to my own timezone. That also explains why all of the campaigns were sent during working hours. The correct way to do this is to figure out the timezone of each recipient and, based on that, see if the email was sent during their working hours or not.

Although what happened with the “duringWork” variable helped me to better define it, it wasted quite a lot of time that I could have avoided by doing the exploratory analysis that I mentioned in (1). So, always have a look at your data.
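A timezone-aware version of the check could look like the sketch below. It assumes that the send time is available in UTC and that we somehow know an IANA timezone name for each recipient; neither is something I have verified in the Mailchimp data, and the function name is made up. Weekends are also ignored.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def sent_during_work(sent_utc: datetime, recipient_tz: str) -> bool:
    """True if the send time falls within 9:00-17:00 in the recipient's local time."""
    local = sent_utc.astimezone(ZoneInfo(recipient_tz))
    return 9 <= local.hour < 17


sent = datetime(2016, 5, 10, 13, 0, tzinfo=timezone.utc)  # 13:00 UTC
print(sent_during_work(sent, "Europe/Athens"))        # 16:00 local time
print(sent_during_work(sent, "America/Los_Angeles"))  # 06:00 local time
```

The same campaign is inside working hours for one recipient and outside them for another, which is exactly the variation that was missing when everything was measured against my own clock.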

Finally, there’s no perfect way of *ETLing* your data from its source to your database. For this reason, it’s probably better to keep a version of your data that is as raw as possible and work from there to bring it into the form that you need, based on the analysis you want to perform.

## Generating and interpreting the model

Generating the model is the easiest part. It’s just one line of code where you define which one of the variables is the dependent one, together with the data you have prepared for the training. After the process is done you get a fitted model and you can print a summary of it that includes, for each predictor, the estimated coefficient and its P-value.

The information provided above will help us understand the relationship between someone opening an email we send and the variables we have defined. By using Logistic Regression we have assumed that there’s a linear relationship between the log-odds of the outcome variable and the predictors, which is why we can use the concept of odds to interpret the results. But first, let’s start with the P-value. The P-value of each coefficient tests the null hypothesis that the coefficient is zero, i.e. that the predictor has no effect. The rule of thumb here is that if P is smaller than 0.05 the predictor is likely to be a meaningful addition to our model, while a predictor with P > 0.05 can be considered insignificant and we can remove it from the model. The results of our model indicate that businessAddress is likely to be insignificant and we can omit it. So we managed to get a first insight:

*It’s quite likely that it doesn’t matter if someone has been subscribed using her business or personal email when it comes to opening the emails we send.*

We can get insights for the rest of the predictors by taking the exponential of the coefficient value to generate odds ratios. For the “isBlog” variable the odds ratio is 1.81 which can be interpreted as,

*The odds for someone to open an email are 1.81 times greater when the email is about a blog post than when it is for a product announcement.*
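Keep in mind that odds are not probabilities, so “1.81 times the odds” does not mean “1.81 times more likely”. The sketch below converts the odds ratio into a probability for a hypothetical baseline: the 20% open rate for product announcements is an assumed number, not one taken from our data.

```python
import math


def apply_odds_ratio(baseline_probability: float, odds_ratio: float) -> float:
    """Probability obtained after multiplying the baseline odds by odds_ratio."""
    odds = baseline_probability / (1 - baseline_probability)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


coefficient = math.log(1.81)        # the coefficient whose exponential is 1.81
odds_ratio = math.exp(coefficient)  # back to the odds ratio

# Hypothetical baseline: 20% of product announcement emails get opened.
p_blog = apply_odds_ratio(0.20, odds_ratio)
print(round(p_blog, 3))
```

With that assumed baseline, the open probability for a blog post email comes out at roughly 31%, i.e. about 1.56 times the baseline probability rather than 1.81 times.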

Based on the above result we probably need to sit down and think about what we do wrong when we communicate something directly about our product 🙂

Also, because of the negative sign of the “duringWork” coefficient, it is likely that there’s an inverse relationship between this predictor and opening an email, which suggests that we should send the e-mails during non-working hours.

Finally, there’s the mysterious intercept that we somehow need to interpret. The best explanation that I found is that it acts like a “garbage collector”, which was music to my ears as an engineer: finally something that I could relate to. But to be honest, regardless of my enthusiasm, knowing about the Java Garbage Collector will not help that much in understanding the intercept. Formally, the intercept is the log-odds of an open when all the predictors are false (zero). It absorbs any bias that is not accounted for by the predictors we have defined, so in my understanding the constant might indicate that we should seek additional or different predictors for our model.

We can use the above model to make new predictions, but as it is quite a simple model we can just interpret the predictors and, based on the results, inform our next choices about what email to send and when.

## Lessons learned

Statistics is a process where you first have to codify your reality into numbers, throw the numbers into a model and then figure out how to interpret the resulting numbers into something closer to reality again. This is exactly where a data scientist shines. I’m pretty sure that there are numerous details that I’m missing in the results I get from the model. It can be a completely useless model, and it might take just a few seconds for a data scientist to tell you this by just reading the value of the LLR p-value.

It’s also an iterative process, you make some assumptions, you build your model, you interpret it, you take some actions and then you go back re-build the model, re-interpret etc. etc. I feel that no matter how good you are with statistics, creating a model that will guide your interactions with reality will take quite a few iterations.

There’s also something important that I haven’t touched so far which is the interaction between the predictors. I have assumed so far that there’s no interaction between the predictors, I could introduce this in the model but I avoided it mainly because as I kept searching for interactions and their meaning it got really philosophical and complex so I decided to not get into it for now.

## The outro & future work

It was a lengthy post, but there were many things to go through. If you managed to reach this point, you are my hero! Here are my final thoughts before I close this post.

I’m pretty sure that anyone with competencies in statistics and data science will find what I’ve done so far really elementary, and they are probably right, but for me it achieved my initial goal. It helped me frame my problem of how to approach marketing in a way that is easier for me to comprehend. I’d love to have someone tell me that when you run an email campaign you have to check this and that, but at least for now I don’t have anyone. With this approach, I got stimulated to ask questions about what I’m doing and I also have a guide on how to iteratively work on better understanding and running e-mail marketing campaigns. Also, I might have made terrible mistakes. If you happen to find any, please let me know; you will help at least one person on earth become better (that’s me) 🙂

Did I learn data science? Well, I’m definitely not becoming a data scientist any time soon, but the whole process I went through has a number of benefits even if I never become really proficient in statistics:

- I learned a lot about my data and what information I can find hidden in it
- Even if I do not become a data scientist, having messed around with the tools they use and the data will help in the future to cooperate better with them and understand their work.
- Not having the funds to get a data scientist to work with you is not an excuse when you are coming from an engineering background, you can help yourself by using simpler methodologies and of course you can always ask when you don’t know something 🙂

You can find the script I used for the analysis **here**, it assumes that your data are stored in a database delivered there by **Blendo**. I understand that you might not want to give **Blendo** a try and that you would like to pull your data directly, although it makes me sad, so if this is the case let me know and I’ll come up with a small script to pull the data directly from Mailchimp.

There are still many things that I do not understand, like interactions for example but I plan to learn more as I iterate on the model. I would also like to create a much richer model with more predictors but this is going to be future work. If you have in your mind predictors that would make sense to include, again drop me a line!

I’m also really interested in the implementation part of generating these models. Creating the model was more like using a black box: I mainly had to work on preparing the data, and then magic happened by invoking one line of code. I’d love to learn more about what’s happening behind the scenes, and hopefully I’ll do that while implementing an online version of logistic regression, where I’ll have to dive deeper into the implementation.

**If you work with data, give Blendo a try and let us know what you think!** Reach out to us and we’ll show you around!
