22.03.2018, reading time: ~6 min
Welcome back to the third and final blog in this series. In the previous two blogs I introduced a fake data set and used an unsupervised learning algorithm to find some customer segments in the data. I also showed two ways to visually check the quality of the segmentation, talked about some informative statistics, and started to show some initial insights. This time we'll dig even deeper into the behaviour and habits represented by customers in each of the segments.
Bubble plots are a nice way to see the differences between clusters, as well as some correlations in the data. Let's take a look at a couple of examples. In the following plots the size of each bubble is the number of customers in that segment, and the colour gives some information about the average age. Remember that segment 7 is the oldest (red) and segment 6 is the youngest (blue).
The first interesting plot shows the total number of items bought across all of a customer's orders against the number of orders that customer made. Most of the segments sit roughly on a straight line, but segments 1 and 2 certainly do not. The retailer could focus on getting the customers in segment 2 to buy more items in each order, while looking to see if they can persuade customers in segment 1 to make more orders.
As demonstrated above, we can also use these bubble diagrams to work out what might be done to target particular segments. In the following plot we see that people in segments 1 and 4 both spend relatively little per order, but we might choose to target them differently because customers in segment 4 spend more per item than customers in segment 1, and therefore might be more interested in higher value goods.
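As a rough sketch of how the numbers behind bubble plots like these could be prepared, the snippet below aggregates per-customer records into one row per segment. The records, the field layout and the function name `bubble_table` are all made up for illustration and are not the data or code behind the plots in this blog.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-customer records (illustrative values only):
# (segment, age, number of orders, number of items, total spend)
customers = [
    (1, 34, 2, 12, 40.0),
    (1, 40, 3, 18, 60.0),
    (2, 29, 8, 9, 120.0),
    (2, 31, 10, 11, 150.0),
    (4, 52, 4, 4, 80.0),
    (4, 48, 6, 6, 120.0),
]


def bubble_table(rows):
    """Aggregate customers into one row per segment for a bubble plot."""
    by_segment = defaultdict(list)
    for segment, age, n_orders, n_items, spend in rows:
        by_segment[segment].append((age, n_orders, n_items, spend))

    table = {}
    for segment, members in sorted(by_segment.items()):
        orders = sum(m[1] for m in members)
        items = sum(m[2] for m in members)
        spend = sum(m[3] for m in members)
        table[segment] = {
            "size": len(members),                     # bubble size
            "mean_age": mean(m[0] for m in members),  # bubble colour
            "mean_orders": orders / len(members),     # one plot axis
            "mean_items": items / len(members),       # the other axis
            "items_per_order": items / orders,
            "spend_per_order": spend / orders,
            "spend_per_item": spend / items,
        }
    return table


table = bubble_table(customers)
for segment, row in table.items():
    print(segment, row)
```

Each row of `table` holds everything one bubble needs, so it could be handed to whichever plotting library you prefer. In this toy data, segments 1 and 4 have the same spend per order but very different spend per item, which is the kind of contrast described above.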
All of the analysis we've done so far just looks at the mean value of a quantity within a cluster. However, the spread of the values, which the standard deviation summarises, is also important, and it can be seen directly in a box plot. Here's one showing the loyalty reward points used.
The nice thing about this plot is that we can immediately see the segments saving their points (segments 0, 1, 5 and 7) and those happier to spend them (segment 3 in particular).
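For readers who want the numbers as well as the picture, here is a minimal sketch of the per-segment summary that a box plot draws: the quartiles and the range. The records are invented, `box_stats` is a hypothetical helper, and the "inclusive" quartile method is just one common convention, so a plotting library may place the box edges slightly differently.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical records (illustrative values only): (segment, loyalty points used)
points_used = [
    (0, 0), (0, 5), (0, 10), (0, 15), (0, 20),
    (3, 100), (3, 200), (3, 300), (3, 400), (3, 500),
]


def box_stats(rows):
    """Return the quartiles and range of the values in each segment."""
    by_segment = defaultdict(list)
    for segment, value in rows:
        by_segment[segment].append(value)

    stats = {}
    for segment, values in sorted(by_segment.items()):
        # n=4 gives the three cut points between quarters: Q1, median, Q3.
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        stats[segment] = {
            "min": min(values),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(values),
            "iqr": q3 - q1,  # height of the box
        }
    return stats


stats = box_stats(points_used)
for segment, row in stats.items():
    print(segment, row)
```

In this toy data the whole box for segment 3 sits above the maximum for segment 0, which is how a "spender" segment would stand out from a "saver" segment.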
We're already building a picture of the segments, and in a real analysis we would look at many more bubble and box plots to get a more detailed picture of the demographics and behaviours of each one. However, let's move on and have a deeper look at the times when particular segments are shopping.
In the image below, you see a summary of the shopping times for the 1744 customers in segment 0. These customers shopped primarily in 2015, with Tuesdays being the most popular day and Wednesday the least popular day to shop. The most popular months were January, February and December, which is reflected in both the monthly and quarterly breakdowns. Within each month, there are more peaks in days towards the end of the month than at the beginning.
These plots can be produced for each cluster, and also for all customers (including outliers) for comparison. Here's an example of all customers (left), compared with only customers in segment 7 (right), who we see are relatively more likely to make purchases in July, and less likely to make purchases in September. In this way, the most popular shopping times can be identified, as well as special times to target a particular segment.
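One way to put a number on "relatively more likely" is to divide a segment's share of purchases in each month by the share for all customers. The sketch below does that with invented purchase dates. The function names are hypothetical, and this is not necessarily how the plots above were produced.

```python
from collections import Counter
from datetime import date

# Hypothetical purchases (illustrative values only): (segment, purchase date)
purchases = [
    (7, date(2015, 7, 3)), (7, date(2015, 7, 18)),
    (7, date(2015, 9, 2)), (7, date(2015, 12, 24)),
    (0, date(2015, 1, 6)), (0, date(2015, 9, 8)),
    (0, date(2015, 9, 15)), (0, date(2015, 12, 1)),
]


def month_shares(dates):
    """Fraction of purchases falling in each calendar month (1 to 12)."""
    counts = Counter(d.month for d in dates)
    total = sum(counts.values())
    return {month: counts.get(month, 0) / total for month in range(1, 13)}


def relative_to_all(rows, segment):
    """Ratio of a segment's monthly share to the share for all customers.

    Values above 1 mean the segment shops relatively more in that month.
    Months with no purchases at all are left out.
    """
    overall = month_shares(d for _, d in rows)
    in_segment = month_shares(d for s, d in rows if s == segment)
    return {
        month: in_segment[month] / overall[month]
        for month in range(1, 13)
        if overall[month] > 0
    }


ratio = relative_to_all(purchases, segment=7)
for month, value in ratio.items():
    print(month, round(value, 2))
```

The same idea works for weekdays or quarters by swapping `d.month` for another attribute of the date.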
Let's look at another facet of this data – the times of day that customers are making purchases. I've only included 2 segments in this plot, because with all of them included it's too crowded to pick out any details.
Targeting segment 4 between 2 and 5 a.m. makes little sense, because they are not shopping at all between those times. However, sending a carefully targeted email just before 10:00 or 20:00 could be perfect, since a lot of these customers are clearly making purchases at these times of day.
In segment 5 the peaks are smaller, but there are more of them. For example, there is one at 14:00 that isn't present for segment 4 shoppers.
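Peaks and quiet periods like these can also be picked out programmatically. The following is a small sketch using invented purchase hours for two segments. `hourly_share` and `peak_hours` are hypothetical helpers, and "peak" is defined here simply as an hour that is busier than both of its neighbours, which is an assumption rather than a standard definition.

```python
from collections import Counter

# Hypothetical purchase hours, 0 to 23, per segment (illustrative values only)
purchase_hours = {
    4: [10, 10, 10, 9, 11, 20, 20, 20, 19, 21],
    5: [10, 10, 14, 14, 20, 20, 9, 13, 19, 16],
}


def hourly_share(hours):
    """Fraction of a segment's purchases made in each hour of the day."""
    counts = Counter(hours)
    total = len(hours)
    return [counts.get(h, 0) / total for h in range(24)]


def peak_hours(shares):
    """Hours that are strictly busier than both neighbouring hours.

    The day wraps around, so hour 0 is compared with hours 23 and 1.
    """
    return [
        h for h in range(24)
        if shares[h] > shares[h - 1] and shares[h] > shares[(h + 1) % 24]
    ]


shares = {seg: hourly_share(hours) for seg, hours in purchase_hours.items()}
peaks = {seg: peak_hours(s) for seg, s in shares.items()}
quiet = {seg: [h for h in range(24) if s[h] == 0] for seg, s in shares.items()}

for seg in shares:
    print(seg, "peaks:", peaks[seg], "largest share:", max(shares[seg]))
```

With real data the counts would be noisier, so some smoothing or a minimum share threshold would probably be needed before calling an hour a peak.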
Brands and categories
When looking at real data, I make in-depth analyses of the categories, brands and products that are popular within each cluster. I haven't included that in this blog, in order to protect the customer whose underlying data I have used. However, I can still show part of the analysis, indicating how I would begin such a task.
In the following figure, I've plotted the percentage of purchases made within each of the 8 product categories I've defined in the data. It's clear that while the overall distributions are broadly similar for each segment, we can still learn something. The most obvious feature is that category 3 is much more popular with segment 6 than with any other segment. Examining each category in turn reveals which categories are popular and unpopular with the 8 customer segments.
We can make a similar plot in turn for the brands. Here the plot is dominated by brand 10 (own-brand, perhaps?) so we can remove this to see more detail in the features of the other 9 brands.
There is a spread in popularity for most brands across the different customer segments. Note that brand 3 is popular with cluster 6, so it would be interesting to see whether this brand overlaps much with category 3.
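The percentages behind plots like these are straightforward to compute. Here is a minimal sketch, assuming a list of (segment, brand) purchases. The data and the `percentage_by_label` helper are invented for illustration, and the same function would work for categories. Hiding the dominant brand only drops it from the output and does not rescale the remaining percentages, which is one possible choice among several.

```python
from collections import Counter, defaultdict

# Hypothetical purchases (illustrative values only): (segment, brand)
purchases = [
    (6, 10), (6, 10), (6, 10), (6, 3), (6, 3), (6, 1),
    (0, 10), (0, 10), (0, 10), (0, 10), (0, 3), (0, 1),
]


def percentage_by_label(rows, hide=()):
    """Percentage of each segment's purchases per brand (or category).

    Labels listed in `hide` are left out of the result, but they still
    count towards each segment's total, so the percentages are unchanged.
    """
    counts = defaultdict(Counter)
    for segment, label in rows:
        counts[segment][label] += 1

    result = {}
    for segment, per_label in sorted(counts.items()):
        total = sum(per_label.values())
        result[segment] = {
            label: 100 * n / total
            for label, n in sorted(per_label.items())
            if label not in hide
        }
    return result


# The brand with the most purchases overall dominates the plot.
dominant = Counter(label for _, label in purchases).most_common(1)[0][0]

full = percentage_by_label(purchases)
trimmed = percentage_by_label(purchases, hide={dominant})

print("dominant brand:", dominant)
for segment, row in trimmed.items():
    print(segment, row)
```

Comparing `trimmed` across segments then shows which of the smaller brands are unusually popular in a given segment, in the same way as the second brand plot.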
Looking at the data in this way allows me to quickly build a picture of what categories and brands are preferred by which segments, and therefore what products they might be interested in.
In conclusion, there are many tools available to summarise the results of segmentation. By visualising the data in lots of different ways, we are able to build up a picture not only of the clusters themselves, but also of the different properties of the customers in each one. This reveals the habits and preferences of each cluster, so that the right messages can be targeted to the right people at the right times.
Dr. Fern Watson