22.02.2018, reading time: ~5 min
Introduction to segmentation
Customer segmentation lies at the heart of our Business Intelligence analysis at Cards & Systems. There are two main reasons for this. Firstly, although companies understand their target customers and desired user base, they don't necessarily have a clear picture of who really uses their services, how they use them, and whether people behave in similar ways when they engage with the brand.
The second reason is that all of the sophisticated analysis we can offer (one example is to look at the goods or services that are commonly purchased together) becomes even more valuable when it is broken out into segments so that differences in patterns can be observed.
Think about people doing their grocery shopping online. If you're responsible for monitoring the online shop, you could easily see the peak times at which people make their purchases. If, however, we apply our customer segmentation and look again, we would expect to see that the peaks throughout the day correspond to different groups of people. We certainly wouldn't be surprised to see that the group making purchases at 2 in the morning is rather different to the group buying at 2 in the afternoon!
How segmentation is performed
The segmentation is done by powerful machine learning tools. The algorithms take all the information about customers and, using the hidden correlations in the data, find groups of customers that are similar. These similarities could be obvious – one clear possibility is differences based on age – but they can also be very complex, spanning many different characteristics. The end goal, and one of the tasks that requires Cards & Systems' extensive market knowledge, is to interpret the results and draw concise, meaningful conclusions about each segment that allow companies to better understand their customers. One example of a segment we might see from the analysis of the grocery shop data is the shift worker, whom we might see buying mainly convenience foods at unusual hours.
An essential part of the segmentation is the information that is included about each customer. The algorithm is unlikely to identify regional differences in leisure centre usage if it has no information about the location of the centres! Typically, the information that is included will broadly fit into one of three categories:
The examples I've given above highlight the types of information we could use to draw distinctions between customer segments. Depending on the industry, the data that is available, and the intended direction of the analysis after segmentation, we would expect to include many, many other distinguishing features. We may even wish to supplement the available data, for example by looking at how people's habits vary according to the weather, social media buzz, or even in response to geopolitical events.
The data set
A great Business Intelligence team needs a wide variety of skills and experience, bringing together the worlds of science, technology, business and marketing. At the core of all the analysis we do, and all of the decisions and recommendations we make, we need data. Building on this data, we use functions and algorithms to process and interpret it, giving rise to insights that would not have been possible without Business Intelligence. This could involve many steps, beginning with understanding, cleaning and pre-processing the data, moving on to machine learning and optimisation tools, and finally writing ad-hoc algorithms to perform the most complex analysis. One strength of the Business Intelligence team at Cards & Systems is our communication – I speak to my colleagues with in-depth business knowledge and experience every day, at every stage of the analysis. This allows me to extract the most meaningful and relevant results from the data.
To demonstrate customer segmentation, I have created a fake data set. I will introduce the data and some ideas behind the analysis in this blog entry, and then show the segmentation and interpretation in subsequent posts.
The data structure
The data structure comes from an online shop, and therefore contains information about customers, orders, and products. The latter two categories contain information about when the order was placed, what products were purchased, how many of each item, and how much they cost. This data set is greatly simplified – in the real analysis I would include much more, for example not only the information about how much the products themselves cost, but whether any discounts were applied, and therefore how much was really paid in the end.
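As a rough illustration of the structure described above, the records could be sketched like this. The class and field names are hypothetical, invented for this example, and are not the schema of the actual data set. Discounts are deliberately left out, just as in the simplified data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Customer:
    customer_id: int  # customer attributes are kept deliberately minimal


@dataclass
class Product:
    product_id: int
    name: str
    unit_price: float


@dataclass
class OrderLine:
    product: Product
    quantity: int  # how many of this item were purchased


@dataclass
class Order:
    order_id: int
    customer_id: int
    placed_at: datetime  # when the order was placed
    lines: List[OrderLine]

    def total(self) -> float:
        """List-price total; discounts are not modelled in this sketch."""
        return sum(line.product.unit_price * line.quantity for line in self.lines)


# Invented example values.
customer = Customer(customer_id=7)
eggs = Product(1, "eggs", 2.50)
apples = Product(2, "apples", 0.40)
order = Order(
    order_id=100,
    customer_id=customer.customer_id,
    placed_at=datetime(2018, 2, 22, 14, 0),
    lines=[OrderLine(eggs, 1), OrderLine(apples, 5)],
)
print(order.total())
```

Keeping the order lines separate from the products means the same product record can appear in many orders, which is what later makes "commonly purchased together" style analysis possible.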
Customer data in the fake data set is limited. This is also true in real data sets, and reflects how seriously we take data protection and privacy, both in law and ethically speaking. Handling data is a great responsibility, and our end goal is to bring real value to both our clients and the end consumers. We believe that certain pieces of information are key to targeting particular messages to the right people, but that including other information is not necessary, or perhaps not even ethical. One key example is gender – in 2018 it feels irrelevant, if not irresponsible, to do any type of analysis based on (particularly binary) gender identification.
Adding more data
After cleaning the data, removing problematic entries (such as dates with the year 3000!) and filling in as many of the blank spaces as possible, I can make some extra calculations. The most obvious ones would be things like the average amount each customer spends per order, per item, and per year. Another thing I could do, which I haven't here, is add completely new data. As mentioned before, this could include weather data, to see if there's a group of people who only shop online when the weather is bad! If the company has physical locations as well as an online shop, we could also see how the proximity to a physical shop affects the customers' engagement with the online shop. This extra information helps to draw out more unexpected correlations and understand the customers' behaviours in more depth.
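The cleaning step and the extra calculations could look something like the following sketch, assuming the pandas library. The column names, the toy values and the cut-off years are all invented for illustration.

```python
import pandas as pd

# Toy order lines; every name and value here is hypothetical.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "order_id":    [10, 10, 11, 20, 21, 22],
    "order_year":  [2016, 2016, 2017, 2017, 2017, 3000],
    "quantity":    [2, 1, 3, 1, 4, 1],
    "unit_price":  [1.50, 4.00, 2.00, 10.00, 1.00, 5.00],
})

# Cleaning: drop rows with an implausible year (the range is an arbitrary choice).
clean = orders[orders["order_year"].between(2000, 2018)].copy()
clean["line_total"] = clean["quantity"] * clean["unit_price"]

# Augmentation: per-customer totals, then the averages derived from them.
per_customer = clean.groupby("customer_id").agg(
    total_spend=("line_total", "sum"),
    n_orders=("order_id", "nunique"),
    n_items=("quantity", "sum"),
    n_years=("order_year", "nunique"),  # years with at least one order
)
per_customer["avg_per_order"] = per_customer["total_spend"] / per_customer["n_orders"]
per_customer["avg_per_item"] = per_customer["total_spend"] / per_customer["n_items"]
per_customer["avg_per_year"] = per_customer["total_spend"] / per_customer["n_years"]

print(per_customer)
```

Note that "per year" here means per year in which the customer placed at least one order. Dividing by the length of the whole observation period would be an equally reasonable definition, and the right choice depends on the question being asked.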
After the pre-processing described above (cleaning and augmentation), the data is ready for segmentation. This is done using a machine learning algorithm – at its simplest, this means that the computer finds a starting solution, assesses its quality, and changes it according to the outcome of that assessment. By repeating these steps, the computer quickly finds a high-quality solution to the problem.
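To make the "start, assess, change, repeat" loop concrete, here is a minimal sketch using k-means on a single feature. K-means is chosen purely for illustration and is not necessarily the algorithm used for the real segmentation. The function names and the spend values are made up.

```python
def assign(values, centres):
    """Give each value the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda k: abs(v - centres[k])) for v in values]


def cost(values, centres, labels):
    """Quality measure: total squared distance from each value to its centre."""
    return sum((v - centres[lab]) ** 2 for v, lab in zip(values, labels))


def kmeans_1d(values, start_centres, max_iter=100):
    centres = list(start_centres)               # starting solution
    labels = assign(values, centres)
    current = cost(values, centres, labels)     # assess its quality
    for _ in range(max_iter):
        # Change the solution: move each centre to the mean of its members.
        for k in range(len(centres)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centres[k] = sum(members) / len(members)
        labels = assign(values, centres)
        new = cost(values, centres, labels)
        improved = new < current
        current = new
        if not improved:                        # no further improvement: stop
            break
    return centres, labels, current


# Hypothetical average spend per order for six customers.
spend = [5, 6, 7, 50, 52, 54]
centres, labels, final_cost = kmeans_1d(spend, start_centres=[0, 10])
print(centres, labels, final_cost)
```

Starting from the poor guess of centres at 0 and 10, the cost drops on each pass until the two centres settle on the low spenders and the high spenders.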
The class of problem that the algorithm is trying to solve is called an unsupervised classification problem. Classification means assigning labels – in this case the label of which segment each customer belongs to. Sometimes part of the data comes already labelled, so the algorithm can learn what characteristics the segments have before assigning new, unlabelled data to the existing groups. This is supervised learning. However, for customer segmentation this labelling doesn't exist in advance, which is why it's called an unsupervised problem.
These types of problems are solved using a clustering algorithm. There are a large number of different clustering algorithms to choose from, each with its own strengths and weaknesses. Here at Cards & Systems we're using the Python programming language's powerful scikit-learn package. The result of the clustering is that every customer acquires a label. If the algorithm has identified 3 segments, they will be labelled 0, 1 and 2. Counting from zero is just a quirk of the Python language! Importantly, the labels themselves don't really mean anything. We could swap '0' and '2', or switch from numerical to alphabetic labels, and all the results would stay the same.
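The point about labels being arbitrary can be checked with a small sketch. It assumes scikit-learn's `KMeans` as the clustering algorithm, which is my choice for this example only, and it runs on synthetic data with two made-up features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Three well-separated synthetic groups of 20 "customers" each.
group_centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in group_centres])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.unique(labels))  # the distinct segment labels

# Swap labels 0 and 2: the names change, the grouping does not.
swapped = labels.copy()
swapped[labels == 0] = 2
swapped[labels == 2] = 0

# The adjusted Rand index compares two groupings while ignoring label names.
print(adjusted_rand_score(labels, swapped))
```

The adjusted Rand index is 1.0 whenever two labellings describe exactly the same grouping, which is the case here even though a third of the customers now carry a different number.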
Interestingly, some customers will also be labelled -1. This has a special meaning – they are outliers. Taking the grocery example again, we expect people to buy a selection of items such as eggs, meat, fruit and vegetables. Of course, we expect there will be some variation, but if a particular customer buys only dried fruit and nothing else, that kind of behaviour is unlikely to fit anyone else's patterns. When we look at the segment interpretations we'll throw away these outlier results, since there aren't really any useful generalisations to be made about them.
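In scikit-learn, the -1 label is produced by density-based algorithms such as `DBSCAN`, which mark points that belong to no dense group as noise. The sketch below assumes `DBSCAN` for illustration, since the algorithm used for the real segmentation isn't named here. The data and the `eps` and `min_samples` values are invented.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Two dense synthetic groups plus one customer unlike anyone else.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(30, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(30, 2))
odd_one = np.array([[20.0, -20.0]])
X = np.vstack([group_a, group_b, odd_one])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# Noise points receive the label -1.
is_outlier = labels == -1
print("outliers:", int(is_outlier.sum()))

# Keep only the customers that were assigned to a segment.
X_segmented = X[~is_outlier]
labels_segmented = labels[~is_outlier]
print(np.unique(labels_segmented))
```

Filtering with a boolean mask like this keeps the feature rows and their labels aligned, so the remaining customers can be passed straight on to the interpretation step.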
In my next blog entry ("Finding hidden correlations in customer data"), I'll look at the segments that have been identified, try to assess their quality, and begin some of the interpretation of the differences between them.
Dr. Fern Watson