The type of data Markov Chains work with are sequential data, the type of data we are dealing with at this post.
Markov process is a stochastic process that satisfies the Markov property of memorylessness. A Markov chain is, in fact, a Markov process too in either discrete or continuous time with a countable state space.
Graphically can be represented as a transition diagram along with the corresponding probabilities:
In clickstream analysis, we usually utilize these Markov Chains. The process X(n) takes the state m(n) from a finite set m at each time n. The order of a Markov Chain is derived from the number of recent states on which the current state, we assume, depends. Based on this, zero-order chains imply that the probability of being in a state in the next step is independent of all previous states.
Higher order Markov Chain introduced by the Raftery (1985) will lead to more realistic models. At the same time, the parameters needed for the representation increase exponentially and so it is important to find a right balance between these two.
Fitting a Markov Chain
As mentioned before at this point our dataset looks like:
Session2 A14 A4 A8 A11 A12
Session3 A14 A4 A8 A11 A12
Session4 A14 A4 A9 A8 A9 A8 A11 A12
Session5 A14 A4 A9 A8 A11 A24 A9 A9 A8 A1 A14 A4 A8 A11 A12
We chose to use 3rd order Markov Chain on the above-produced data, as:
- The number of parameters needed for the chain’s representation remains manageable. As the order increases the parameters necessary for the representation increase exponentially and thus managing them requires significant computational power
- As a rule of thumb, we would like at least half of the clickstreams to consist of as many clicks as the order of the Markov Chain that should be fitted.There is no point in selecting a 3rd order chain if the majority of the clickstream consists of 2 states and so there are no state three steps behind to take into consideration.
Fitting the Markov Chain model gives us the transition probabilities matrices and the lambda parameters of the chain for each one of the three lags along with the Start and End Probabilities.
Start and End probabilities correspond to the probability that a clickstream will start or end with this specific event.
The transition probability matrix can be represented as a heat map with the y-axis representing the current state and x-axis the next one.The more red-ish the color, the more probable the indicated transition will occur. For example, the transition from Action 23 to Action 1 is very likely while the transition from Action 21 to Action 10 is not.
On the other hand, the Akaike’s information criterion (AIC) and Bayes’ information criterion (BIC) computed based on the log likelihood can be used to compare two fitted Markov Chain models.
In clickstream analysis, it is often very useful to predict the next click or final click (state) of a user given the pattern he has followed until now. In this way, data-driven personas can be constructed that will incorporate users’ behavior.
Along with the click prediction, the probability of transition is also calculated.
This can be extended to find the most common use case scenario of an app or a web page.
Starting from the state with the highest start probability and following the most probable transitions we end up with the data-driven persona of the
Clustering Clickstream Data
In most cases, due to the complexity of websites or applications, same clickstreams are difficult to occur as the different paths a user can follow are many. A large number of monitored clickstreams makes the analysis more difficult unless we group together similar clickstreams and user profiles.
This way a company can:
- Find customer segments
- Identify communities of visitors with similar interests
In our case, we performed a k-means clustering with 2 centers. At this point, it is important to provide a meaningful interpretation for each of the clusters.
For our clustering, we noticed that cluster by cluster the average length of clickstreams increase. This means that k-means clustered the clickstreams based on the number of actions the user that produced them performed during a session.
Graphically the clusters can be represented as:
The y-axis represents a unique identifier for each session while the x-axis corresponds to the total number of states changed during each session.
Based on this,t he light blue can be interpreted as users that perform a few actions and probably don’t spend much time on our page or application. Especially, for an application, it may represent users that achieved their goal easily and had no problem using the interface.
The dark blue cluster represents users that perform many actions and spend much more time navigating.
This interpretation can completely change on different data. In fact, there is no rule of thumb about how to perform the cluster interpretation and in most cases requires deep understanding on the data and field expertise.
Mining Associations with cSPADE
Instead of modeling clickstream data as transition probabilities we can represent them as sequential patterns. We can then mine sequential patterns for finding those patterns that have a particular minimum support, in other words, occur a small number of times in user’s clickstream data.
SPADE algorithm mines frequent patterns utilizing temporal joins along with efficient lattice search techniques. In the first step, the algorithm computes the frequencies of sequences with only one item, in the second step with two items and so on. (Zaki, 2001).
With this approach, a company can explore, understand or predict the visitor’s’ navigation patterns through a website or application.
Using cSPADE algorithm, we extracted all pattern sequences having a minimum support that we defined.
For example, the following 22 pattern sequences are supported at least 40% of the clickstreams.
For a given sequence pattern S, we can predict the next click by searching for the pattern sequence with the highest support that starts with S.
For example, after having just performed (A14) the most probable next action is (A4) according to pattern sequence 8 with support 73.5%.
Of course lowering the support would give us pattern sequences that are less frequent in our clickstreams. This may be useful in the case we want to extract pattern sequences that yield errors or failures in our software.
At this point, we have explored Markov Chains and SPADE algorithm for mining Clickstream sequence data. By deploying the proposed models we can:
- Extract data-driven personas for the most frequent digital journeys of our customers through an app or a web page.
- Predict the next actions based on those performed so far.
- Extract frequent sequential patterns
However, all this analysis is worthless unless it is driving actions. Based on the above results a recurring process of reviewing must be initialized regarding web or application design and content and marketing strategy.
For those interested to investigate further, as an alternative method for this analysis the computational search algorithm, considering a Genmax type approach can also be considered.
The code of the post can be found on Github.