One of the most common analyses we perform is to look for patterns in data. What market segments can we divide our customers into? How do we find clusters of individuals in a network of users?

It’s possible to answer these questions with Machine Learning. Even when you don’t know which specific segments to look for, or have unstructured data, you can use a variety of techniques to algorithmically find emergent patterns in your data and properly segment or classify outcomes.

In this post, we’ll walk through one such algorithm called K-Means Clustering, how to measure its efficacy, and how to choose the sets of segments you generate.

When it comes to classifying data, there are two broad types of Machine Learning available: Supervised Learning and Unsupervised Learning.

With Supervised Learning, you can predict classifications of outcomes when you already know which inputs map to which discrete segments. But in many situations, you won’t actually have such labels predefined for you – you’ll only be given a set of unstructured data without any defined segments. In these cases you’ll need to use Unsupervised Learning to infer the segments from unlabeled data.

For clarity, let’s take the example of classifying t-shirt sizes.

If we’re given a dataset like the one in Figure 1A above, we have a set of inputs, width (X1) and length (X2), as well as the corresponding t-shirt size for each observation, say small (blue) or large (green). In such a scenario we can use Supervised Learning techniques like Logistic Regression to draw a clear decision boundary and separate the respective classes of t-shirts.
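To make the supervised case concrete, here is a minimal from-scratch sketch of Logistic Regression trained with plain gradient descent. The t-shirt measurements, the learning rate, and the step count are all made-up values for illustration, and the function names (`fit_logistic`, `predict`) are hypothetical rather than taken from any library.

```python
import math

# Hypothetical labeled t-shirts: (width, length) in inches.
# Label 0 = small, label 1 = large. These values are invented for illustration.
shirts = [(18.0, 26.0), (18.5, 27.0), (19.0, 26.5),
          (22.0, 30.0), (22.5, 31.0), (23.0, 30.5)]
labels = [0, 0, 0, 1, 1, 1]


def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(points, targets, learning_rate=0.1, steps=2000):
    """Fit weights and a bias with batch gradient descent on centered features."""
    n = len(points)
    dims = len(points[0])
    # Centering the inputs keeps gradient descent well behaved on raw inches.
    mean = [sum(p[j] for p in points) / n for j in range(dims)]
    weights = [0.0] * dims
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * dims
        grad_b = 0.0
        for point, target in zip(points, targets):
            x = [point[j] - mean[j] for j in range(dims)]
            z = sum(w * xj for w, xj in zip(weights, x)) + bias
            error = sigmoid(z) - target
            for j in range(dims):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias, mean


def predict(model, point):
    """Return 1 (large) if the point falls on the positive side of the boundary."""
    weights, bias, mean = model
    z = sum(w * (x - m) for w, x, m in zip(weights, point, mean)) + bias
    return 1 if z >= 0 else 0


model = fit_logistic(shirts, labels)
print(predict(model, (18.2, 26.4)))  # a shirt near the small group
print(predict(model, (22.8, 30.2)))  # a shirt near the large group
```

The key point is that `labels` must exist before any of this can run. Without them there is nothing for the model to learn a boundary from.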

But if we are given a dataset like Figure 1B, we’ll have a set of inputs width (X1) and length (X2), but no corresponding label for t-shirt size. In this case, we’ll need to use Unsupervised Learning techniques like K-Means Clustering to find similar sets of t-shirts and cluster them together into the respective classes of small (blue circle) and large (green circle).

In many real-world applications you’ll face cases like that in Figure 2A, so it’s helpful to walk through how to actually find structure in unstructured data.

To find structure in unstructured data, K-Means Clustering is a straightforward application of Unsupervised Machine Learning.

K-Means Clustering works as its name would imply – assigning similar observations in your data to a set of clusters. It operates in four simple, repeatable steps, in which you iteratively look for the set of clusters whose centers have the smallest mean (average) distance to your observations. The intuition is that if a set of observations are close to one another, it’s likely they are part of the same cluster.
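The two building blocks behind that idea are a distance between points and the mean of a group of points. The sketch below assumes Euclidean distance, which is a common choice but is not specified above. The helper names (`distance`, `mean_point`) and the sample measurements are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points (the cluster's center)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


# Three made-up t-shirts as (width, length) pairs.
cluster = [(18.0, 26.0), (19.0, 27.0), (20.0, 28.0)]

print(mean_point(cluster))                 # (19.0, 27.0)
print(distance((0.0, 0.0), (3.0, 4.0)))    # 5.0
```

Observations that sit close together have small distances to their shared mean, which is what lets the algorithm treat that mean as the center of a cluster.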

Let’s walk through the algorithm step by step. The first step is to randomly initialize a set of “centroids” (the Xs in Figure 2A above), which serve as the centers of your clusters. You can place these centroids anywhere to start, but it’s recommended to initialize them at randomly chosen points from your observations. You then use these centroids to group your observations, assigning each observation to the centroid closest to it in distance (the blue and green circles in Figure 2B).
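The initialization and assignment described above can be sketched in a few lines. This is an illustrative version under some assumptions: distance is Euclidean, the observations are made-up t-shirt measurements, and the function names (`initialize_centroids`, `assign_to_nearest`) and the `seed` parameter are hypothetical.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two points (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def initialize_centroids(observations, k, seed=None):
    """Pick k distinct observations at random to serve as the starting centroids."""
    rng = random.Random(seed)
    return rng.sample(observations, k)


def assign_to_nearest(observations, centroids):
    """For each observation, return the index of the centroid closest to it."""
    assignments = []
    for point in observations:
        distances = [distance(point, centroid) for centroid in centroids]
        assignments.append(distances.index(min(distances)))
    return assignments


# Made-up, unlabeled t-shirts as (width, length) pairs.
shirts = [(18.0, 26.0), (18.5, 27.0), (19.0, 26.5),
          (22.0, 30.0), (22.5, 31.0), (23.0, 30.5)]

centroids = initialize_centroids(shirts, k=2, seed=0)
assignments = assign_to_nearest(shirts, centroids)

for shirt, cluster_index in zip(shirts, assignments):
    print(shirt, "-> cluster", cluster_index)
```

Because the starting centroids are chosen at random, a different seed can give a different initial grouping. If both centroids happen to land in the same natural group, the first assignment will split that group rather than separate small shirts from large ones.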

This gives you an initial set of clusters, each one grouping the observations that are closest to the same centroid. But it’s unlikely that these initial clusters will fit your data well on their first assignment.