Automatically Segmenting Data With Clustering

One of the most common analyses we perform is to look for patterns in data. What market segments can we divide our customers into? How do we find clusters of individuals in a network of users?

It’s possible to answer these questions with Machine Learning. Even when you don’t know which specific segments to look for, or have unstructured data, you can use a variety of techniques to algorithmically find emergent patterns in your data and properly segment or classify outcomes.

In this post, we’ll walk through one such algorithm called K-Means Clustering, how to measure its efficacy, and how to choose the sets of segments you generate.

When it comes to classifying data, there are two broad types of Machine Learning available.

With Supervised Learning, you can predict classifications of outcomes when you already know which inputs map to which discrete segments. But in many situations, you won’t actually have such labels predefined for you – you’ll only be given a set of unstructured data without any defined segments. In these cases you’ll need to use Unsupervised Learning to infer the segments from unlabeled data.

For clarity, let’s take the example of classifying t-shirt sizes.

If we’re given a dataset like the one in Figure 1A above, we have a set of inputs, width (X1) and length (X2), along with a corresponding t-shirt size for each observation, say small (blue) and large (green). In such a scenario we can use Supervised Learning techniques like Logistic Regression to draw a clear decision boundary and separate the respective classes of t-shirts.
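To make the supervised case concrete, here is a minimal sketch using scikit-learn's LogisticRegression. The width/length measurements and labels below are made up for illustration; they are not taken from the figures.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical shirt measurements in cm: columns are width (X1) and length (X2).
X = np.array([[40, 55], [42, 57], [41, 56],   # small shirts
              [52, 70], [54, 72], [53, 71]])  # large shirts
y = np.array([0, 0, 0, 1, 1, 1])              # 0 = small, 1 = large (known labels)

# Fit a linear decision boundary that separates the two classes.
model = LogisticRegression()
model.fit(X, y)

# Classify a new, unseen shirt by its measurements.
print(model.predict([[45, 60]]))  # -> array([0]) or array([1])
```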

But if we’re given a dataset like the one in Figure 1B, we have the same inputs, width (X1) and length (X2), but no corresponding label for t-shirt size. In this case, we need to use Unsupervised Learning techniques like K-Means Clustering to find similar sets of t-shirts and cluster them together into the respective classes of small (blue circle) and large (green circle).
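For the unlabeled case, a minimal sketch with scikit-learn's KMeans might look like the following. The measurements are again hypothetical, and we simply ask for two clusters that we can later interpret as "small" and "large".

```python
import numpy as np
from sklearn.cluster import KMeans

# The same hypothetical measurements, but with no size labels attached.
X = np.array([[40, 55], [42, 57], [41, 56],
              [52, 70], [54, 72], [53, 71]])

# Ask K-Means for two clusters and let it infer the grouping from the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index assigned to each shirt, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the learned cluster centers (centroids)
```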

In many real-world applications you’ll face cases like the one in Figure 1B, so it’s helpful to walk through how to actually find structure in unstructured data.

To find structure in unstructured data, K-Means Clustering offers a straightforward application of Unsupervised Machine Learning.

K-Means Clustering works as its name implies, assigning similar observations in your data to a set of clusters. It operates in four simple and repeatable steps, in which you iteratively evaluate a set of clusters that minimize the mean (average) distance to each of your observations. It follows that if a set of observations are close in proximity to one another, it’s likely they are part of the same cluster.

Let’s walk through the algorithm step by step. The first step is to randomly initialize a set of “centroids” (the Xs in Figure 2A above), which serve as the centers of your clusters. You can place these centroids anywhere to start, but it’s recommended to initialize them at a random subset of your observations. You then use these centroids to group your observations, assigning each observation to the centroid it is closest to (the blue and green circles in Figure 2B).
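As a rough sketch of these first two steps, the random initialization and closest-centroid assignment might look like this in plain NumPy. The data, function names, and parameters here are illustrative, not from the original post.

```python
import numpy as np

def initialize_centroids(X, k, rng):
    # Step 1: pick k random observations to serve as the initial centroids.
    indices = rng.choice(len(X), size=k, replace=False)
    return X[indices]

def assign_clusters(X, centroids):
    # Step 2: assign each observation to the centroid it is closest to,
    # using Euclidean distance.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # placeholder observations for illustration
centroids = initialize_centroids(X, k=2, rng=rng)
assignments = assign_clusters(X, centroids)
print(assignments[:10])                    # cluster index for the first ten observations
```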

This gives you an initial set of clusters, grouping the observations in your data by the centroid each one is closest to. But it’s unlikely that these initial clusters fit the data well on the first assignment.


