Supercharge your Lead Scoring with Data Science

Eleni Markou

Lead scoring is not a new idea, and computing it is common practice in many companies. The reasons vary, but they all come down to this: every company loves high conversion rates. Not only because they mean many new customers were attracted, but also because they imply that your sales team did a great job reaching the people who were genuinely more likely to convert than others.

Since not all leads are equal, separating the most willing from the rest calls for a ranking system that helps the company prioritize existing leads according to their probability of converting, and then segment them into those who are ready to become customers, those who need further nurturing, and those who are unlikely to buy soon.
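The segmentation described above can be sketched in a few lines of Python. The thresholds and segment names here are hypothetical; each company would set its own.

```python
# Hypothetical thresholds for mapping a lead score to a follow-up segment.
def segment(score):
    """Map a numeric lead score to one of three follow-up segments."""
    if score >= 70:
        return "sales-ready"
    elif score >= 30:
        return "needs nurturing"
    return "unlikely to buy soon"

print([segment(s) for s in [85, 40, 10]])
# ['sales-ready', 'needs nurturing', 'unlikely to buy soon']
```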

But how can we practically separate the qualified leads from the "less qualified" ones? That's the million-dollar question.

What is Lead Scoring?

This ranking is called lead scoring. For each lead, a score is calculated that takes into account a wide variety of information, including personal details filled out on the company's website, behavioral data, social information, and demographics.

The sources from which you can get these types of data are numerous. For example, Intercom, Segment, or Mixpanel can be used to obtain behavioral data on how, and why, a potential customer has interacted with your company.

Another source worth mentioning, mainly for demographics and other social data, is HubSpot. Of course, the available sources are not limited to those mentioned; they can be expanded to include support systems like Zendesk or accounting software like QuickBooks and Xero. Finally, the collected data can be further enriched with services like Clearbit.

Getting back to the lead scoring calculation process: all this information is evaluated, a weight is assigned to each type of event or action, and the weighted sum constitutes the final score for each lead. The higher the score, the stronger the lead.
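A manual, rule-based version of this weighting can be sketched as follows. The event names and weights are purely hypothetical, for illustration only:

```python
# Hypothetical weights for a manual lead scoring formula.
WEIGHTS = {
    "visited_pricing_page": 20,
    "downloaded_ebook": 10,
    "attended_webinar": 15,
    "opened_email": 5,
}

def manual_lead_score(events):
    """Sum the weight of every scored action the lead has performed."""
    return sum(WEIGHTS.get(event, 0) for event in events)

lead = ["visited_pricing_page", "opened_email", "opened_email"]
print(manual_lead_score(lead))  # 20 + 5 + 5 = 30
```

This is exactly the kind of formula that, as discussed next, is hard to get right and expensive to maintain by hand.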

After defining the way the score is calculated, the process is supposed to be easy: just repeat the computation for each and every lead. While this workflow sounds perfect, it's often easier said than done.

The need for predictive Lead Scoring

Selecting the right properties and their corresponding weights is not at all easy. It is highly specific to each company; there is no such thing as one-size-fits-all. Instead, it takes a lot of time, effort, and constant revision to arrive, through trial and error, at the best formula. And even then you need to maintain your lead scoring formula: as your company evolves, so do your customers, and the calculation needs constant adjustment.

That’s, of course, inefficient and time-consuming.

Here is where data science comes in to save the day, by automating the lead scoring computation based on the characteristics of leads that previously proved to be strong, i.e. eventually became customers. The beauty of it lies in the fact that with predictive lead scoring you no longer have to figure out which attributes to include or how much weight to give each one.

Of course, the model will need retraining at regular intervals, but the effort required is far less than maintaining an Excel formula by hand.

Choosing the data

In order to deploy your own predictive lead scoring, you need to make sure your data satisfies the following prerequisites:

  • You have sufficient total data. Admittedly, what counts as sufficient is difficult to specify. The more data you have, the better your model will get; having data for very few customers, e.g. 10 or 20, will lead to unreliable judgments. Personally, I would want at least 3 or 4 months of good data, in terms of completeness.
  • You have sufficient data for each category. Having a lot of data is not enough if, for example, none of the leads have ever converted; how would a model know what it takes to convert if no one has been observed doing so? Having very little data in either category causes the same problem. Ideally, I would want roughly equal numbers of leads who converted and leads who didn't, in order to produce trustworthy predictions.
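The class-balance check in the second prerequisite is quick to sketch with pandas. The tiny table below is made-up data, with `qualified` standing in for the conversion label:

```python
import pandas as pd

# Hypothetical leads table with a binary 'qualified' column (1 = converted).
leads = pd.DataFrame({
    "session_count": [3, 12, 7, 1, 25, 9],
    "qualified":     [0, 1, 0, 0, 1, 1],
})

# How many leads fall in each class?
counts = leads["qualified"].value_counts()
print(counts.to_dict())  # e.g. {0: 3, 1: 3}

# A ratio near 1.0 means the classes are roughly balanced.
ratio = counts.min() / counts.max()
print(f"class balance ratio: {ratio:.2f}")
```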

As the source of data, I would recommend a combination of whatever is available that falls into the categories mentioned before: personal details filled out on the company's website, behavioral data, social information, and demographics.

In this post, for demonstration purposes, we are going to limit ourselves to behavioral data from Mixpanel.

Each data point corresponds to a lead and can be of one of two types:

  • Qualified: those who are known to have converted already. For these, we take their last snapshot right before they converted, in the sense that at that moment they met all the conditions for becoming customers.
  • Unqualified (or not yet qualified): all the rest, assuming they haven't yet reached the point of being ready to convert and need further nurturing. Here we need to pay special attention to data censoring.

Short parenthesis…

Briefly, we say that data is censored when, after a certain point in time, we have not observed any important event. In our context, we assume that every action a customer takes is an important event (e.g. visiting the landing page, downloading a guide, logging into their account, etc.).

Suppose a user performed an action last Friday and we haven't heard from them since. Any data collected about them from last Friday until now is censored and is not safe to use, as it is presumably incomplete. This means that for the unqualified leads we are going to use their "snapshot" at the time of their last important event.
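Extracting that snapshot can be sketched with pandas. The event log below is hypothetical: one row per lead per important event, carrying the lead's feature values at that moment.

```python
import pandas as pd

# Hypothetical event log: one row per (lead, event), with a snapshot of the
# lead's feature values at the moment of the event.
events = pd.DataFrame({
    "lead_id": [1, 1, 2, 2, 2],
    "event_time": pd.to_datetime(
        ["2023-01-02", "2023-01-06", "2023-01-03", "2023-01-04", "2023-01-08"]),
    "session_count": [1, 4, 2, 3, 7],
})

# Keep only each lead's snapshot at the last observed important event;
# everything after that point is censored and is discarded.
snapshots = (events.sort_values("event_time")
                   .groupby("lead_id")
                   .tail(1))
print(snapshots)
```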

Going back…

To summarize, the data we are going to use comes from Mixpanel, as mentioned before, and includes events of the following types:

  • The number of days subscribed to the company’s service before the lead converted into a customer, in the case of qualified leads, or the days from first subscription until the last important event, for unqualified leads.
  • The total number of visits to the web page.
  • Certain events that, in a certain context, trigger specific behavior linked to the conversion of a user into a paying customer.

In general, the type of information included is subjective and depends largely on the type of company.

Commonly used data includes demographics, like age, profession, or location, and behavioral engagement events like:

  • Landing page visits
  • Downloads of brochures or e-books and webinar attendance
  • Visits to pricing pages
  • Number of days since last visit or interaction

Even if you don’t know what type of data to use, you don’t have to worry.

Since modeling is an iterative process, you can always start with what is available and experiment with different types of data until you find what works best for your business case.

So, having selected the type of data we want to use, based on the company’s particular needs, and collected a sufficient amount of it, we can move forward.

Gaining insights

Before selecting the models we are going to deploy, it is recommended to perform some exploratory analysis in order to gain more insight into the data we are working with.

This is actually one of my favorite parts: through a series of visualizations, we develop intuition about the lead scoring problem that will prove very useful when evaluating the outcomes.

My choices were a correlation plot and scatter matrices, like those below.

[Figure: scatter matrices]

[Figure: correlation plot]

A quick look at the plots shows that although there seems to be quite a strong correlation between the number of deletions and the number of creations (see the correlation plot), the relationship does not appear to be linear (see the scatter matrix).

Regarding the ‘qualified’ variable, which is our target, it appears positively correlated with session_count and negatively correlated with the days passed since subscription. This means that the more visits to our homepage, the more likely a user is to be interested in becoming a customer; conversely, as the days since first subscription pass, the probability of conversion drops.

Although these observations may seem obvious, imagine having 20 or 30 variables instead of 5. In that case, identifying all the relationships between them is no easy task, and quantifying them is even harder.

What did come as a surprise is that event_1 and event_2 do not seem very informative. Although a positive correlation with the ‘qualified’ target variable does exist, it is not significant. In this case, we should perhaps consider adding more features in order to capture more information.
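This kind of exploration is easy to reproduce with pandas. Since the original Mixpanel data is not public, the sketch below uses synthetic data; the column names follow the ones discussed above, and the dependence of `qualified` on `session_count` is fabricated for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for the behavioral features discussed above.
df = pd.DataFrame({
    "session_count": rng.integers(1, 50, 200),
    "days_since_subscription": rng.integers(1, 365, 200),
})
# Fabricate a target that loosely depends on session_count, as in the text.
df["qualified"] = (df["session_count"] + rng.normal(0, 10, 200) > 25).astype(int)

# Pearson correlation matrix: the numeric counterpart of the correlation plot.
corr = df.corr()
print(corr.round(2))

# A scatter matrix reveals non-linear relationships the numbers hide:
# pd.plotting.scatter_matrix(df, figsize=(6, 6))
```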

Modeling Process

There are many models to choose from, certainly including everything that falls under the umbrella of survival analysis. Given that we have taken care of the censored data, we can also work with simpler models such as logistic regression or any other appropriate classifier.

Personally, I chose logistic regression, as its output was very convenient: a score interpretable as a probability is ideal for lead scoring. The higher the score, i.e. the predicted probability, the higher the chance that the specific lead is ready to convert. Translating this into actionable insight, we can recommend that the sales team contact them immediately and present an offer.

In Python the implementation is fairly easy, as shown below:

# Load necessary packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Load data and split into train and test sets
lead_score = pd.read_csv('PATH_TO_CSV', sep=",")
X = lead_score.drop(columns='qualified')
y = lead_score['qualified']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Fit the logistic regression model
logistic = LogisticRegression(), y_train)

# Make predictions on the test set
predictions = logistic.predict(X_test)
prediction_proba = logistic.predict_proba(X_test)

# Compute evaluation metrics on the held-out test set
print("Accuracy: %.2f" % metrics.accuracy_score(y_test, predictions))
print("Precision: %.2f" % metrics.precision_score(y_test, predictions, average="macro"))
print("F1: %.2f" % metrics.f1_score(y_test, predictions, average="macro"))
print(metrics.confusion_matrix(y_test, predictions))
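The probabilities returned by predict_proba can then drive the actionable insight mentioned earlier: ranking leads so that the sales team contacts the most likely converters first. A minimal sketch with hypothetical probabilities and lead IDs:

```python
import numpy as np

# Hypothetical P(qualified) for five leads, e.g. from predict_proba(...)[:, 1].
proba_qualified = np.array([0.12, 0.93, 0.48, 0.77, 0.05])
lead_ids = np.array([101, 102, 103, 104, 105])

# Rank leads from most to least likely to convert.
order = np.argsort(proba_qualified)[::-1]
print(list(lead_ids[order]))  # [102, 104, 103, 101, 105]
```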


Final Results

With learning curves, we can compare the performance of a model on training and test data over a varying number of training instances.

The constructed learning curves show that the training score starts around 0.94 for 100 training examples and then drops slightly as the number of training points increases. The cross-validation score follows exactly the opposite course. In the end, both the training score and the cross-validation score stabilize around the same value, namely 0.92.

For cross-validation, we performed 100 iterations in order to get smoother mean train and test scores; each time, 20% of the data was randomly selected as the validation set.

The above observations imply that the constructed model generalizes well to new data (the cross-validation score increases). The smaller the gap between the two curves, the better the model generalizes; in our case, it performs almost equally well on the train and cross-validation sets, without overfitting the training set.
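The cross-validation setup described above (100 iterations, 20% held out each time) can be reproduced with scikit-learn's learning_curve. Since the original data is not public, this sketch uses synthetic classification data as a stand-in:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic stand-in for the Mixpanel features (the real data is not public).
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# 100 random splits, each holding out 20% as a validation set, as in the text.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=cv,
    train_sizes=np.linspace(0.1, 1.0, 5))

# Averaging over the 100 iterations smooths both curves.
print(train_scores.mean(axis=1).round(3))
print(val_scores.mean(axis=1).round(3))
```

Plotting the two mean curves against train_sizes yields the learning curves discussed above.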

[Figure: final results]
A summary of the most important metrics shows that the simple logistic regression model did very well at the lead scoring task, achieving an accuracy of over 90%.


  Evaluation Metric    Value
  Accuracy             0.92
  Precision            0.87
  F1                   0.79



It is generally accepted that a lead scoring system can be very advantageous for a company: on the one hand, it boosts the effectiveness of both sales and marketing, and on the other, it helps increase revenue. As for how it can be implemented, traditional approaches involve complex, overwhelming computations and a lot of engineering, and they are difficult to maintain, since the weighting of events has to be adjusted manually each time.

Furthermore, simple machine learning models, such as logistic regression, can be a good starting point for experimenting with your data toward automating the lead scoring computation, which can save a lot of manual work and effort.

Other methodologies, such as the ones suggested in this post, are of course worth trying too. We would be more than happy to discuss any interesting conclusions you come up with, along with an evaluation of your model’s performance!
