Making big data manageable

Making big data manageable

Making big data manageable

One way to handle big data is to shrink it. If you can identify a small subset of your data set that preserves its salient mathematical relationships, you may be able to perform useful analyses on it that would be prohibitively time consuming on the full set.

The methods for creating such “coresets” vary according to application, however. Last week, at the Annual Conference on Neural Information Processing Systems, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory and the University of Haifa in Israel presented a new coreset-generation technique that’s tailored to a whole family of data analysis tools with applications in natural-language processing, computer vision, signal processing, recommendation systems, weather prediction, finance, and neuroscience, among many others.

“These are all very general algorithms that are used in so many applications,” says Daniela Rus, the Andrew and Erna Viterbi Professor of electrical engineering and Computer Science at MIT and senior author on the new paper. “They’re fundamental to so many problems. By figuring out the coreset for a huge matrix for one of these tools, you can enable computations that at the moment are simply not possible.”

Read Also:
Biogen gains fast-track Alzheimer’s drug review in wake of early data

As an example, in their paper the researchers apply their technique to a matrix — that is, a table — that maps every article on the English version of Wikipedia against every word that appears on the site. That’s 1.4 million articles, or matrix rows, and 4.4 million words, or matrix columns.

That matrix would be much too large to analyze using low-rank approximation, an algorithm that can deduce the topics of free-form texts. But with their coreset, the researchers were able to use low-rank approximation to extract clusters of words that denote the 100 most common topics on Wikipedia. The cluster that contains “dress,” “brides,” “bridesmaids,” and “wedding,” for instance, appears to denote the topic of weddings; the cluster that contains “gun,” “fired,” “jammed,” “pistol,” and “shootings” appears to designate the topic of shootings.

Joining Rus on the paper are Mikhail Volkov, an MIT postdoc in electrical engineering and computer science, and Dan Feldman, a lecturer at the University of Haifa and a former postdoc in Rus’s group.

Read Also:
What industries are next to be disrupted by NLP and Text Analysis?

The researchers’ new coreset technique is useful for a range of tools with names like singular-value decomposition, principal-component analysis, and nonnegative matrix factorization. But what they all have in common is dimension reduction: They take data sets with large numbers of variables and find approximations of them with far fewer variables.

In this, these tools are similar to coresets. But coresets simply reduce the size of a data set, while the dimension-reduction tools change its description in a way that’s guaranteed to preserve as much information as possible. That guarantee, however, makes the tools much more computationally intensive than coreset generation — too computationally intensive for practical application to large data sets.

The researchers believe that their technique could be used to winnow a data set with, say, millions of variables — such as descriptions of Wikipedia pages in terms of the words they use — to merely thousands.

 



Data Science Congress 2017

5
Jun
2017
Data Science Congress 2017

20% off with code 7wdata_DSC2017

Read Also:
5 big data trends that will shape AI in 2017
Read Also:
Why Machine Learning Can Improve Customer Service

AI Paris

6
Jun
2017
AI Paris

20% off with code AIP17-7WDATA-20

Read Also:
From Science to Data Science, a Comprehensive Guide for Transition

Chief Data Officer Summit San Francisco

7
Jun
2017
Chief Data Officer Summit San Francisco

$200 off with code DATA200

Read Also:
Why Machine Learning Can Improve Customer Service

Customer Analytics Innovation Summit Chicago

7
Jun
2017
Customer Analytics Innovation Summit Chicago

$200 off with code DATA200

Read Also:
Is Big Data giving advertising a Bad Reputation?

Big Data and Analytics Marketing Summit London

12
Jun
2017
Big Data and Analytics Marketing Summit London

$200 off with code DATA200

Read Also:
Your custom MDM solution: Don’t make these mistakes (Part 1)

Leave a Reply

Your email address will not be published. Required fields are marked *