Topic Modeling Large Amounts of Text Data

The world communicates in text. Our work lives have us treading waist-deep in email, our hobbies spill onto blogs, our complaints go to Yelp, and even our personal lives play out in tweets, Facebook updates, and texts. There is a massive amount of information to be heard in all that text – as long as you know how to listen.

Learning from large amounts of text data is less a question of will and more a question of feasibility. How do you “read” the equivalent of thousands or even millions of novels in an afternoon?

Topic models extract the key concepts in a set of documents. Each concept can be described by a list of keywords, ordered from most to least important. Each document can then be scored against those concepts, or topics, to determine how representative it is of each one.

For example, given a corpus of news stories, topics emerge independent of any one document. One topic may be characterized by terms like “poll,” “campaign,” and “debate,” which an observer will quickly see is a topic about politics and elections. Another may be characterized by terms like “euro,” “prime minister,” and “fiscal,” which an observer immediately recognizes as a topic about the European economy. Commonly, a given document is not wholly about a single topic but rather a mix of topics, so the topic model outputs a probability that the document is about each possible topic. An analyst interested in how an election may affect the European economy can now isolate the two topics of interest and search directly for documents that contain a mixture of both. In a very short period of time, what started as thousands or millions of documents can be whittled down to the most important few.
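To make that workflow concrete, here is a minimal Python sketch. It assumes a topic model has already been fitted and that the two topic indices were identified by inspecting their keywords; the matrix values are invented for illustration:

```python
import numpy as np

# Hypothetical output of a fitted topic model: one row per document,
# one column per topic; each row sums to 1.0.
doc_topics = np.array([
    [0.70, 0.10, 0.20],   # mostly topic 0 (elections)
    [0.05, 0.85, 0.10],   # mostly topic 1 (European economy)
    [0.45, 0.40, 0.15],   # a genuine mixture of topics 0 and 1
])

ELECTIONS, EURO_ECONOMY = 0, 1  # indices assigned by reading each topic's keywords

# Keep documents that are substantially about *both* topics of interest.
mask = (doc_topics[:, ELECTIONS] > 0.3) & (doc_topics[:, EURO_ECONOMY] > 0.3)
print(np.nonzero(mask)[0])  # -> [2]: only the mixed document survives
```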

When the Data Gets Large

To build these topics, an algorithm called Latent Dirichlet Allocation (LDA) is employed. The algorithm must store a vector of term counts for every document – in effect, a matrix whose size is the number of possible terms multiplied by the number of documents. It then iterates through hundreds of thousands of cycles, seeking to improve each abstract topic at each stage.
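Some back-of-envelope arithmetic shows why that storage requirement matters. The document count below matches the corpus described later in this piece, but the vocabulary size and the dense representation are assumptions chosen purely for illustration:

```python
# Rough, illustrative arithmetic: how big is the document-term matrix?
num_docs = 19_000        # hearing documents in the corpus
vocab_size = 50_000      # assumed distinct terms kept after pruning
bytes_per_count = 8      # one 64-bit count per (document, term) cell

dense_matrix_bytes = num_docs * vocab_size * bytes_per_count
print(f"Dense document-term matrix: {dense_matrix_bytes / 1e9:.1f} GB")  # ~7.6 GB
```

Sparse representations shrink this considerably, but the iterative passes over the data still dominate the workload on a single machine.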

Apache Spark is well suited to building a pipeline for this task. Spark pools the memory of many servers to break the problem into many parallel components, and it exploits the comparative speed of RAM to iterate the algorithm quickly. Because the task can be broken into an arbitrary number of smaller pieces and iteration can continue at speed, Spark handles very large amounts of text data as easily as a single machine handles a moderate amount.
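Such a pipeline might look like the following PySpark sketch, which tokenizes raw text, prunes stop words, builds per-document term counts, and fits an LDA model. The file path, vocabulary size, topic count, and iteration count are placeholder assumptions, not prescriptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("topic-modeling").getOrCreate()

# Assumption: one document per file; 'hearings/*.txt' is a placeholder path.
# wholetext=True reads each file as a single row rather than line by line.
docs = spark.read.text("hearings/*.txt", wholetext=True) \
            .withColumnRenamed("value", "text")

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    CountVectorizer(inputCol="filtered", outputCol="features",
                    vocabSize=50_000, minDF=5.0),
    LDA(k=40, maxIter=100, featuresCol="features"),
])

model = pipeline.fit(docs)

# Describe each abstract topic by its ten most important keywords.
vocab = model.stages[2].vocabulary
for row in model.stages[3].describeTopics(maxTermsPerTopic=10).collect():
    print(row.topic, [vocab[i] for i in row.termIndices])

# Attach a per-document topic-probability vector ('topicDistribution').
scored = model.transform(docs)
scored.select("topicDistribution").show(5, truncate=False)
```

Because every stage operates on a distributed DataFrame, the same code runs unchanged whether the corpus holds a thousand documents on a laptop or millions spread across a cluster.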

To see how Spark can handle a massive amount of text data, consider the case of a University of Oklahoma doctoral student in Political Science. She wants to investigate how international politics have changed over the last 20 years, using congressional hearing transcripts from 1995 to 2015. As you would likely expect, Congress loves to talk: over this 20-year period, there are more than 19,000 hearing documents. The average hearing runs approximately 32,000 words, and the longest is nearly 900,000 words.
