Clickstream Analysis and Data Mining Techniques 101: An Introduction

Eleni Markou

The elusive clickstream data. Many platforms, like Facebook and Google Ads, rely on these generated data from what a user clicks and what doesn’t. To start analyzing clickstream data, we need first to be able to capture step by step a user’s activity across a web page or application. And that is of great value in the hands of any internet marketer. Getting a 360-degree view of a customer by knowing what he is clicking and what he is not, can get you a huge improvement in both your products and your customers’ experience.

Data Collection

Either you have your data in your data warehouse, or you need to enrich it with more data sources you need to have a way to collect and store data consistently into a database.

Data Preparation

Raw data is like a rough diamond; It requires some refinement before being truly valuable.

In the data world, refinement includes data processing, cleaning, and transformation of the initial data into something convenient for the analysis you are going to carry out.

In this case, we would like to have our data grouped into sessions. It would be good too, f we could arrange the events of each session in time order before moving to actual analysis.

In the above description, it is important to define what do we mean by the term “session”.  As a session we can consider:

  • The time between two consequent application start events in the case of an application or
  • The time from entry until logout or timeout (e.g., 20 minutes of no activity) in the case of a web page.


In contrast to other data sequences, clickstream data can have varying lengths for every different session and different users.

In order to transform the initially collected event log into clickstream data we need to:

  • Identify events/actions performed by the same user and group them together
  • Split them further into subgroups of events based on which of those were performed during the same session according to the session’s definition given above.

At this point the dataset we are going to use for the rest of the analysis should look like this:

Session1  A8
Session2  A14 A4  A8 A11 A12
Session3  A14 A4  A8 A11 A12
Session4  A14 A4  A9  A8  A9  A8 A11 A12
Session5  A14 A4  A9  A8 A11 A24  A9  A9  A8  A1 A14 A4  A8 A11 A12

In this representation, each line corresponds to a session. The first field is the session’s name while the next fields the actions performed by the user during this session.

Model Construction

As in most cases, the methods we can deploy for solving this problem are many. In this post we are going to evaluate two of them as they are widely used and easy to understand:

Markov Chains

The type of data Markov Chains work with are sequential data, the type of data we are dealing with at this post.

Markov process is a stochastic process that satisfies the Markov property of memorylessness. A Markov chain is, in fact, a Markov process too in either discrete or continuous time with a countable state space.

Graphically can be represented as a transition diagram along with the corresponding probabilities:

Markov Chain Example

In clickstream analysis, we usually utilize these Markov Chains. The process X(n) takes the state m(n)  from a finite set m at each time n. The order of a Markov Chain is derived from the number of recent states on which the current state, we assume, depends. Based on this, zero-order chains imply that the probability of being in a state in the next step is independent of all previous states.

Higher-order Markov Chain introduced by the Raftery (1985) will lead to more realistic models. At the same time, the parameters needed for the representation increase exponentially and so it is important to find the right balance between these two.

Fitting a Markov Chain

As mentioned before at this point our dataset looks like:

Session1  A8
Session2  A14 A4  A8 A11 A12
Session3  A14 A4  A8 A11 A12
Session4  A14 A4  A9  A8  A9  A8 A11 A12
Session5  A14 A4  A9  A8 A11 A24  A9  A9  A8  A1 A14 A4  A8 A11 A12

We chose to use 3rd order Markov Chain on the above-produced data,  as:

  • The number of parameters needed for the chain’s representation remains manageable. As the order increases the parameters necessary for the representation increase exponentially and thus managing them requires significant computational power
  • As a rule of thumb, we would like at least half of the clickstreams to consist of as many clicks as the order of the Markov Chain that should be fitted. There is no point in selecting a 3rd order chain if the majority of the clickstream consists of 2 states and so there is no state three steps behind to take into consideration.

Fitting the Markov Chain model gives us the transition probabilities matrices and the lambda parameters of the chain for each one of the three lags along with the Start and End Probabilities.

Start and End probabilities correspond to the probability that a clickstream will start or end with this specific event.

The transition probability matrix can be represented as a heat map with the y-axis representing the current state and x-axis the next one.

The more red-ish the color, the more probable the indicated transition will occur. For example, the transition from Action 23 to Action 1 is very likely while the transition from Action 21 to Action 10 is not.

Transition Probabilies heatmap

On the other hand, the Akaike’s information criterion (AIC) and Bayes’ information criterion (BIC) computed based on the log-likelihood can be used to compare two fitted Markov Chain models.

Predicting Clicks

In clickstream analysis, it is often very useful to predict the next click or final click (state) of a user given the pattern he has followed until now. In this way, data-driven personas can be constructed that will incorporate users’ behavior.

Along with the click prediction, the probability of transition is also calculated.

This can be extended to find the most common use case scenario of an app or a web page.

Starting from the state with the highest start probability and following the most probable transitions we end up with the data-driven persona of the

Clustering Clickstream Data

In most cases, due to the complexity of websites or applications, the same clickstreams are difficult to occur as the different paths a user can follow are many. A large number of monitored clickstreams makes the analysis more difficult unless we group together similar clickstreams and user profiles.

This way a company can:

  • Find customer segments
  • Identify communities of visitors with similar interests

In our case, we performed a k-means clustering with 2 centers. At this point, it is important to provide a meaningful interpretation for each of the clusters.

For our clustering, we noticed that cluster by cluster the average length of clickstreams increase. This means that k-means clustered the clickstreams based on the number of actions the user that produced them performed during a session.

Graphically the clusters can be represented as:

Clickstream Clustering

The y-axis represents a unique identifier for each session while the x-axis corresponds to the total number of states changed during each session.

Based on this, the light blue can be interpreted as users that perform a few actions and probably don’t spend much time on our page or application. Especially, for an application, it may represent users that achieved their goal easily and had no problem using the interface.

The dark blue cluster represents users that perform many actions and spend much more time navigating.

This interpretation can completely change on different data. In fact, there is no rule of thumb about how to perform the cluster interpretation and in most cases requires a deep understanding of the data and field expertise.

Mining Associations with cSPADE

Instead of modeling clickstream data as transition probabilities we can represent them as sequential patterns. We can then mine sequential patterns for finding those patterns that have particular minimum support, in other words, occur a small number of times in user’s clickstream data.

SPADE algorithm mines frequent patterns utilizing temporal joins along with efficient lattice search techniques. In the first step, the algorithm computes the frequencies of sequences with only one item, in the second step with two items and so on. (Zaki, 2001).

With this approach, a company can explore, understand or predict the visitor’s’ navigation patterns through a website or application.

Using cSPADE algorithm, we extracted all pattern sequences having minimum support that we defined.

For example, the following 22 pattern sequences are supported at least 40% of the clickstreams.

<{A1}> 0.5632184
<{A11}> 0.8390805
<{A12}> 0.7241379
<{A14}> 0.8390805
<{A4}> 0.8505747
<{A8}> 0.4137931
<{A1},{A4}> 0.4827586
<{A14},{A4}> 0.7356322
<{A1},{A14}> 0.4712644
<{A11},{A12}> 0.7241379
<{A14},{A12}> 0.6206897
<{A4},{A12}> 0.7011494
<{A14},{A4},{A12}> 0.6206897
<{A4},{A11},{A12}> 0.6091954
<{A14},{A11},{A12}> 0.4482759
<{A14},{A4},{A11},{A12}> 0.6896552
<{A1},{A11}> 0.4482759
<{A14},{A11}> 0.8045977
<{A8},{A11}> 0.4022989
<{A14},{A4},{A11}> 0.4367816

For a given sequence pattern S, we can predict the next click by searching for the pattern sequence with the highest support that starts with S.

For example,  after having just performed (A14) the most probable next action is (A4) according to pattern sequence 8 with support 73.5%.

Of course, lowering the support would give us pattern sequences that are less frequent in our clickstreams. This may be useful in the case we want to extract pattern sequences that yield errors or failures in our software.


At this point, we have explored Markov Chains and SPADE algorithm for mining Clickstream sequence data. By deploying the proposed models we can:

  • Extract data-driven personas for the most frequent digital journeys of our customers through an app or a web page.
  • Predict the next actions based on those performed so far.
  • Extract frequent sequential patterns

However, all this analysis is worthless unless it is driving actions. Based on the above results a recurring process of reviewing must be initialized regarding web or application design and content and marketing strategy.

For those interested to investigate further, as an alternative method for this analysis the computational search algorithm, considering a Genmax type approach can also be considered.

The code of the post can be found on Github.

Happy Clickstreaming!

Get our latest posts in your email to stay on top of everyone.