Big data problem? Don't forget search

With every cool new technology, people get overly infatuated and start using it for the wrong things. For example: Looking through a bazillion records for a few million marked with a set of criteria is a rather stupid use of MapReduce or your favorite DAG implementation (see: Spark).

For that and similar tasks, don’t forget the original big data technology: search. With great open source tools like Solr and Elasticsearch, and commercial platforms such as Lucidworks that build on them, you have a powerful way to optimize your I/O and personalize your user experience. It’s much better than grabbing fancy new tools by the wrong end.

Not long ago a client asked me how to use Spark to search through a bunch of data they’d streamed into a NoSQL database. The trouble was that their pattern was a simple string search and a drill-down, which the database couldn’t do efficiently: They would have to pull all the data out of storage and parse through it in memory. Even with a DAG it was a little slow (not to mention expensive) on AWS.

Spark is great when you can put a defined data set in memory. Spark is not so great at sucking up the world, in part because in-memory analytics are only as good as your ability to transfer everything to memory and pay for that memory. We still need to think about storage and how to organize it in a way that gets us what we need quickly and cleanly.

For that particular client, the answer was to index the data as it came in and pull back a subset for more advanced machine learning -- but leave search to a search index.
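To make the pattern concrete, here is a minimal sketch in Python of indexing records as they arrive, then answering a string search and a drill-down from the index instead of rescanning raw storage. Everything in it is an assumption for illustration: `TinyIndex`, the `text` and `region` fields, and the sample records are hypothetical, and a toy in-memory inverted index stands in for what Solr or Elasticsearch would do at scale. It is not the client's actual setup.

```python
from collections import Counter, defaultdict


class TinyIndex:
    """Toy in-memory inverted index; a stand-in for a real search engine."""

    def __init__(self):
        self.docs = {}                      # doc_id -> original record
        self.postings = defaultdict(set)    # term -> ids of docs containing it

    def add(self, doc_id, record):
        """Index a record at ingest time, so queries never rescan raw data."""
        self.docs[doc_id] = record
        for term in record["text"].lower().split():
            self.postings[term].add(doc_id)

    def search(self, query, **filters):
        """Return ids of docs containing every query term and matching filters."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(
            d for d in hits
            if all(self.docs[d].get(k) == v for k, v in filters.items())
        )

    def facet(self, doc_ids, field):
        """Count values of `field` across the hits: the basis of a drill-down."""
        return Counter(self.docs[d][field] for d in doc_ids)


# Hypothetical stream of incoming records.
stream = [
    {"text": "disk failure on node seven", "region": "us-east"},
    {"text": "disk latency warning", "region": "eu-west"},
    {"text": "login failure for admin", "region": "us-east"},
]

index = TinyIndex()
for doc_id, record in enumerate(stream):
    index.add(doc_id, record)               # index as the data comes in

hits = index.search("failure")              # simple string search
by_region = index.facet(hits, "region")     # counts to drill down on
# Pull back only the matching subset for heavier analysis elsewhere.
subset = [index.docs[d] for d in index.search("disk", region="us-east")]
```

The point of the design is that the expensive work happens once, at ingest. The query touches only the posting lists for its terms, and the small `subset` is what you would hand to Spark or a machine learning job.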

No clean line exists between search, machine learning, and certain related techniques. Still, information that’s textual or linguistic strongly suggests a search problem. Information that is numeric, binary, or simply not textual or linguistic in nature indicates a machine learning (or other) problem. There is overlap. There are even instances, such as anomaly detection, where either technique may be valid to use.
