Finding hidden correlations


08.03.2018, reading time: ~5 min

Interpreting the results  

Welcome back! In my blog post “Segmenting customer data” we had a lengthy introduction to the ideas behind customer segmentation. I also introduced the data set we'll be looking at and talked briefly about the steps necessary to get to this point. Now comes the fun part: interpreting the results!

The machine learning algorithm is good at finding correlations and grouping similar customers. What it doesn't do is provide a concise and understandable explanation of these correlations, so we need to find them for ourselves! In the remainder of this blog post, I'll show a few ways I like to do that. Remember that the end goal is to build a picture of each segment, especially what makes it unique and what might motivate its customers to be more engaged with the company or brand.

Scatter plot and tSNE 

An initial problem we face is that one customer is represented by a lot of different pieces of information. If we want to look at the data to see the segments, it's hard to know which of the pieces of information to choose. Should we plot age against the number of orders made? Or perhaps the total amount spent against the annual purchase frequency? Fortunately, there are a couple of different options.

One way to visualise the clusters is using a t-SNE plot. The data gets carefully squashed down into two dimensions so that it can be plotted, while trying to keep customers that are similar in the full data close together on the page. Each point is a customer, and the segments are coloured differently, labelled from 0 to 7, meaning a total of 8 segments. Don't worry about what the axes mean!


The most important thing here is that the segments are generally quite well separated and distinct. There seem to be a few exceptions, but we would expect that because of the level of processing required to produce this plot. Overall, a picture like this gives me confidence that the segments identified are reliable.
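For readers who want to try this themselves, here is a minimal sketch of producing a t-SNE embedding. It assumes scikit-learn and NumPy are installed, and it uses a small synthetic data set in place of the real customer table; the names `features` and `segment_labels` are placeholders and not taken from the original analysis.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the customer table (an assumption, not the real data):
# 3 blobs of 40 "customers", each described by 6 numeric features.
rng = np.random.default_rng(0)
centres = rng.normal(scale=5.0, size=(3, 6))
features = np.vstack([c + rng.normal(size=(40, 6)) for c in centres])
segment_labels = np.repeat(np.arange(3), 40)  # hypothetical segment ids

# Squash the data down to 2D. perplexity must be smaller than the number of rows.
embedding = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(features)

# embedding[:, 0] and embedding[:, 1] are the plot coordinates;
# colouring each point by segment_labels gives a picture like the one described.
for seg in np.unique(segment_labels):
    pts = embedding[segment_labels == seg]
    print(f"segment {seg}: {len(pts)} points, centre at {pts.mean(axis=0).round(1)}")
```

The two output columns can be passed to any scatter plot function. As in the text, the axis values themselves carry no meaning; only the grouping of the points does.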

An alternative, less elegant solution is simply to take some 2D slices through the data. Again, the numbers on the axes don't really mean anything. The slices just give an intuitive picture that helps us to feel confident that each segment is well-defined and distinct from the others.

In the top and bottom slices, the segments lie on top of each other, but we see from the middle slice that they are really quite well separated. Again, this gives me confidence that the segments are distinct. Although I expect to see some variance within each segment, the customers in a segment will be more similar than different.
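Taking 2D slices amounts to picking pairs of columns and plotting one against the other. The sketch below lists every such pair using only the standard library; the feature names and values are made up for illustration, and `two_d_slices` is a hypothetical helper rather than anything from the original analysis.

```python
from itertools import combinations

# Hypothetical processed feature rows (one per customer); values are made up.
feature_names = ["f0", "f1", "f2"]
rows = [
    (0.1, 2.0, -1.0),
    (0.2, 2.1, 1.0),
    (0.1, -2.0, -1.1),
    (0.3, -1.9, 0.9),
]

def two_d_slices(rows, names):
    """Yield (x_name, y_name, points) for every pair of feature columns."""
    for i, j in combinations(range(len(names)), 2):
        points = [(row[i], row[j]) for row in rows]
        yield names[i], names[j], points

slices = list(two_d_slices(rows, feature_names))
for x_name, y_name, points in slices:
    print(x_name, "vs", y_name, points)
```

With many features the number of pairs grows quickly, which is one reason a slice can hide a separation that is obvious in another: the segments may overlap in one pair of columns and split cleanly in the next.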

Average values and bubble plots  

Although there's one point in each of the images above for each customer, the points can be really close together so that the exact size of the segments is hard to see. It's probably worth taking a look at the sizes of the segments.

Interestingly, 5 of the 8 segments have roughly 700-800 customers each, and the other three are bigger. The fact that there are no tiny segments, or any that are many, many times bigger than the rest, is good news!
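Counting the customers in each segment takes only a few lines. This is a minimal sketch with a made-up list of labels; `segment_labels` and `size_ratio` are illustrative names, and the real data has labels 0 to 7.

```python
from collections import Counter

# Hypothetical segment label per customer; values are made up for illustration.
segment_labels = [0, 1, 1, 2, 2, 2, 0, 1, 2, 2]

sizes = Counter(segment_labels)
total = sum(sizes.values())
for segment, count in sorted(sizes.items()):
    print(f"segment {segment}: {count} customers ({count / total:.0%})")

# Ratio of the largest to the smallest segment: a quick check for lopsided results.
size_ratio = max(sizes.values()) / min(sizes.values())
print(f"largest / smallest = {size_ratio:.1f}")
```

A very large ratio would point to either a tiny segment or one that swallows most of the customers, which is exactly what we hope not to see.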

Mean values  

We now know that the segments are reasonably sized and make sense visually. Next, I can dig into some of the numbers to see what distinguishes the segments. The first thing we can do is look at the average values of the quantities for each segment. This will tell some of the story, and perhaps draw attention to some of the features we want to understand in more depth.

Here are some of the values listed for the means. When looking at real data I would also look at the medians (middle values) for each segment, but I haven't listed these results.

This is an interesting result! We can immediately pull out segment 7 as the oldest and 6 as the youngest. For some quantities, such as the number of orders per customer, the average value doesn't vary much and is always around 2 orders. This is interesting in itself – the values for individuals must range from 1 to many, many more than 2, but the average is always around 2. This is a key result for our customer, who will want to try to target particular segments in different ways to increase this number.

Other quantities are very similar across the customer segments, such as the number of items per order (41-46), although this differs a lot from the average over all customers (about 34). The spend per order also varies, with segment 5 spending the most (almost €16.80) and segment 1 spending the least (€11.60) per order.
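The per-segment means, and the medians mentioned earlier, can be computed with the standard library alone. The sketch below uses made-up records, and `summarise` is a hypothetical helper; it also reports the minimum and maximum, since a mean of about 2 orders can hide individuals with far more.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records: (segment, age, orders). Values are made up for illustration.
customers = [
    (0, 34, 1), (0, 41, 2), (0, 38, 9),
    (1, 62, 1), (1, 58, 1), (1, 66, 4),
]

def summarise(customers, column):
    """Group one column by segment and return basic statistics per segment."""
    by_segment = defaultdict(list)
    for record in customers:
        by_segment[record[0]].append(record[column])
    return {
        seg: {
            "mean": mean(vals),
            "median": median(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for seg, vals in sorted(by_segment.items())
    }

orders_summary = summarise(customers, column=2)
for seg, stats in orders_summary.items():
    print(f"segment {seg}: {stats}")
```

When the mean sits well above the median, as it does for the orders in this toy data, a few very active customers are pulling the average up, and the median is the better description of a typical customer.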

We're beginning to build up a picture of each of the segments. Tune in next time (Visualising customer segmentation insights) when we'll move on and look at other ways to get insights about them.


Dr. Fern Watson

Data Scientist
