Big data problem? Don't forget search

Big data problem? Don’t forget search

Big data problem? Don’t forget search

With every cool new technology, people get overly infatuated and start using it for the wrong things. For example: Looking through a bazillion records for a few million marked with a set of criteria is a rather stupid use of MapReduce or your favorite DAG implementation (see: Spark).

For that and similar tasks, don’t forget the original big data technology: search. With great open source tools like Solr, Lucidworks, and Elasticsearch, you have a powerful way to optimize your I/O and personalize your user experience. It's much better than holding fancy new tools from the wrong end.

Not long ago a client asked me how to use Spark to search through a bunch of data they’d streamed into a NoSQL database. The trouble was that their pattern was a simple string search and a drill-down. It was beyond the capabilities of the database to do efficiently: They would have to pull all the data out of storage and parse through it in memory. Even with a DAG it was a little slow (not to mention expensive) on AWS.

Read Also:
Why Our Leaders Need To Understand Data

Spark is great when you can put a defined data set in memory. Spark is not so great at sucking up the world, in part because in memory analytics are only as good as your ability to transfer everything to memory and pay for that memory. We still need to think about storage and how to organize it in a way that gets us what we need quickly and cleanly.

For that particular client, the answer was to index the data as it came in and pull back a subset for more advanced machine learning -- but leave search to a search index.

No clean line exists between search, machine learning, and certain related techniques. Clearly, information that's textual or linguistic tends to strongly indicate a search problem. Information that is numeric, binary, or simply not textual or linguistic in nature indicate a machine learning (or other) problem. There is overlap. There are even instances, such as anomaly detection, where either technique may be valid to use.

Read Also:
8 Best Practices to Maximize ROI from Predictive Analytics

 



Big Data Innovation Summit London

30
Mar
2017
Big Data Innovation Summit London

$200 off with code DATA200

Read Also:
Why Our Leaders Need To Understand Data

Data Innovation Summit 2017

30
Mar
2017
Data Innovation Summit 2017

30% off with code 7wData

Read Also:
Salesforce launches free Analytics Cloud Playground

Enterprise Data World 2017

2
Apr
2017
Enterprise Data World 2017

$200 off with code 7WDATA

Read Also:
Farmers get analytics 5x faster, thanks to guerrilla big data software

Data Visualisation Summit San Francisco

19
Apr
2017
Data Visualisation Summit San Francisco

$200 off with code DATA200

Read Also:
How to make big data work for SMEs

Chief Analytics Officer Europe

25
Apr
2017
Chief Analytics Officer Europe

15% off with code 7WDCAO17

Read Also:
Doubling down on digital transformation: Big Data, strategy and the IoT

Leave a Reply

Your email address will not be published. Required fields are marked *