Why Topological Data Analysis Works

Why Topological Data Analysis Works

Topological data analysis has been very successful in discovering information in many large and complex data sets. In this post, I would like to discuss the reasons why it is an effective methodology.

One of the key messages around topological data analysis is that data has shape and the shape matters. Although it may appear to be a new message, in fact it describes something very familiar.

The example above is a regression line, obtained by fitting a straight line to the data points using a natural measure of fit. A straight line is certainly a shape, and in the above example, we find that a straight line fits the given data quite well. That piece of information is extremely important for a number of reasons. One is that it gives us the qualitative information that the y-variable varies directly with the x-variable (i.e. that y increases as x increases). Another is that it permits us to predict with reasonable accuracy one of the variables if we know the value of the other variable. The idea is that the shape of a line is a useful organizing principle for the data set, which permits us to extract useful information from it.

Unfortunately, the data does not always cooperate and fit along a line. Consider, for example, the data set below.

It is easy to see that no straight line faithfully represents this data.

The reason is that this data set breaks into a set of three tightly concentrated clusters. One might not initially think of this as having anything to do with shape, but after a moment’s reflection, we realize that the most fundamental aspect of any shape is the number of connected pieces it breaks into. So, in this case, we see that the shape of this data set is of fundamental importance, and that its shape is not that of a line.

At this point, we might think that we could now proceed by assuming that any data set is well approximated by a line, a family of clusters, or perhaps a family of lines. Here is another data set that demonstrates that this is not the case.

Notice that this shape also does not fit along a line, and does not break into clusters, but rather has a “loopy” behavior. This kind of structure is often associated with periodic or recurrent behavior in the data set. Here is another example.

The shape is in this case that of a capital letter “Y”. This is another kind of shape, which also occurs frequently. Note that it has a central core and three “flares” extending from it. This might represent a situation where the core represents the most frequently occurring behaviors, and the tips of the flares represent the extreme behaviors in the data. It is clearly distinct from the three other shapes we have already discussed.

One might now say that a way to understand data would be to take each of these types, and attempt to fit a template for each to the data to determine which type one is in. This fitting process is what is done in linear regression, which is the first example above. The problem with this approach is that there are an infinite variety of different possible shapes, a large number of which occur in real data sets. All four that we have shown certainly do, but many others do as well, as demonstrated in the image below.

The immense variety possible among shapes suggests that we should not attempt to enumerate all the possible shapes, and create templates for each, but rather find a flexible way of representing all shapes.

That is one of the problems topological data analysis deals with.

To give an idea of why Topological Data Analysis often works better than other methods of displaying data sets, such as scatterplots based on principal component analysis or multidimensional scaling, it is useful to consider another example.

 

Share it:
Share it:

[Social9_Share class=”s9-widget-wrapper”]

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

You Might Be Interested In

How to make social data more relevant, trustworthy and valuable

1 Feb, 2017

I want to share with you an overview of the steps a user takes to build a project in Watson …

Read more

Clustering Key Terms, Explained

17 Jul, 2018

Getting started with Data Science or need a refresher? Clustering is among the most used tools of Data Scientists. Check …

Read more

Why data integration and collaboration is essential to the success of the smart city

2 Oct, 2017

In today’s always on world, we live with an ever-growing number of connected devices. We no longer just use and …

Read more

Recent Jobs

Senior Cloud Engineer (AWS, Snowflake)

Remote (United States (Nationwide))

9 May, 2024

Read More

IT Engineer

Washington D.C., DC, USA

1 May, 2024

Read More

Data Engineer

Washington D.C., DC, USA

1 May, 2024

Read More

Applications Developer

Washington D.C., DC, USA

1 May, 2024

Read More

Do You Want to Share Your Story?

Bring your insights on Data, Visualization, Innovation or Business Agility to our community. Let them learn from your experience.

Get the 3 STEPS

To Drive Analytics Adoption
And manage change

3-steps-to-drive-analytics-adoption

Get Access to Event Discounts

Switch your 7wData account from Subscriber to Event Discount Member by clicking the button below and get access to event discounts. Learn & Grow together with us in a more profitable way!

Get Access to Event Discounts

Create a 7wData account and get access to event discounts. Learn & Grow together with us in a more profitable way!

Don't miss Out!

Stay in touch and receive in depth articles, guides, news & commentary of all things data.