How to “Read” A Million Novels in One Afternoon with Spark

The world communicates in text. Our work lives have us treading waist-deep in email, our hobbies spill into blogs, our complaints go to Yelp, and even our personal lives play out in tweets, Facebook updates, and text messages. There is a massive amount of information to be heard in all that text, as long as you know how to listen.

Learning from large amounts of text data is less a question of will and more a question of feasibility. How do you “read” the equivalent of thousands or even millions of novels in an afternoon?

Topic models extract the key concepts in a set of documents. Each concept, or topic, is described by a list of keywords ranked from most to least important. Each document can then be connected to those topics with a score that indicates how strongly it represents each one.

For example, given a corpus of news stories, topics emerge independent of any one document. One topic may be characterized by terms like “poll,” “campaign,” and “debate,” which an observer will quickly recognize as a topic about politics and elections. Another may be characterized by terms like “euro,” “prime minister,” and “fiscal,” which an observer immediately recognizes as a topic about the European economy. Commonly, a given document is not wholly about a single topic but rather a mix of topics, and the topic model outputs a probability that the document belongs to each one. An analyst interested in how an election may affect the European economy can isolate the two topics of interest and search directly for the documents that contain a mixture of both. In a very short time, what started as thousands or millions of documents is whittled down to the most important few.
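Concretely, that filtering step might look like the minimal sketch below; the documents, topic labels, and probabilities are made up for illustration and stand in for real model output.

```python
# Hypothetical topic-model output: each document carries a probability
# for every topic, and the analyst keeps only documents that mix the two
# topics of interest (elections and the European economy).
docs = [
    {"id": "story-1", "topics": {"elections": 0.55, "eu_economy": 0.30, "sports": 0.15}},
    {"id": "story-2", "topics": {"elections": 0.05, "eu_economy": 0.10, "sports": 0.85}},
    {"id": "story-3", "topics": {"elections": 0.40, "eu_economy": 0.45, "sports": 0.15}},
]

# Keep documents where both topics carry meaningful weight
relevant = [
    d["id"]
    for d in docs
    if d["topics"]["elections"] > 0.2 and d["topics"]["eu_economy"] > 0.2
]
print(relevant)  # ['story-1', 'story-3']
```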

To build these topics, an algorithm called Latent Dirichlet Allocation (LDA) is employed. The algorithm must hold a term-count vector for every document, so its working set grows with both the size of the vocabulary and the number of documents. It then works through a very large number of update cycles, refining each abstract topic at every stage.
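To get a rough feel for that storage cost, here is a minimal PySpark sketch of the vectorization step; the corpus path, column names, and vocabulary size are hypothetical, chosen only for illustration. Every document ends up as a sparse count vector whose width is the vocabulary size.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer

spark = SparkSession.builder.appName("term-vectors").getOrCreate()

# Hypothetical input: one document per line of text
docs = spark.read.text("corpus/*.txt").withColumnRenamed("value", "text")

# Split on non-word characters, then drop common stop words
tokens = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+").transform(docs)
filtered = StopWordsRemover(inputCol="tokens", outputCol="filtered").transform(tokens)

# One sparse term-count vector per document; width = vocabulary size
cv = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=50000, minDF=5)
featurized = cv.fit(filtered).transform(filtered)
featurized.select("features").show(3, truncate=False)
```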

Apache Spark is well suited to building a pipeline for this task. Spark pools the memory of many servers, breaks the problem into many parallel pieces, and exploits the speed of RAM to iterate the algorithm quickly. Because the work can be split into an arbitrary number of smaller pieces and iteration proceeds at speed, Spark handles very large amounts of text data as easily as a single machine handles a moderate amount.
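A sketch of what the end-to-end job could look like with Spark MLlib's LDA implementation, folding the vectorization stages above and the topic model into a single Pipeline; the corpus path, number of topics k, and iteration count are placeholders, not recommendations.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-topics").getOrCreate()

# Hypothetical corpus: one document per line of text
docs = spark.read.text("corpus/*.txt").withColumnRenamed("value", "text")

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=50000, minDF=5),
    LDA(k=20, maxIter=100, featuresCol="features"),
])
model = pipeline.fit(docs)

# Top keywords that characterize each abstract topic
model.stages[-1].describeTopics(maxTermsPerTopic=10).show(truncate=False)

# Per-document topic mixture: a probability vector over the k topics
model.transform(docs).select("topicDistribution").show(3, truncate=False)
```

Because every stage runs as a distributed Spark job, the same code that works on a laptop-sized sample scales to a full cluster simply by pointing it at more data.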

To see how Spark can handle a massive amount of text data, consider the case of a University of Oklahoma doctoral student in Political Science. She wants to investigate how international politics have changed over the last 20 years, using congressional hearing transcripts from 1995 to 2015.
