With every cool new technology, people get overly infatuated and start using it for the wrong things. For example: Looking through a bazillion records for a few million marked with a set of criteria is a rather stupid use of MapReduce or your favorite DAG implementation (see: Spark).
For that and similar tasks, don’t forget the original big data technology: search. With great open source tools like Solr, Lucidworks, and Elasticsearch, you have a powerful way to optimize your I/O and personalize your user experience. It’s much better than holding fancy new tools from the wrong end.
Not long ago a client asked me how to use Spark to search through a bunch of data they’d streamed into a NoSQL database. The trouble was that their access pattern was a simple string search and a drill-down, which the database couldn’t do efficiently: They would have to pull all the data out of storage and parse through it in memory. Even with a DAG it was a little slow (not to mention expensive) on AWS.
Spark is great when you can put a defined data set in memory. Spark is not so great at sucking up the world, in part because in-memory analytics are only as good as your ability to transfer everything to memory and pay for that memory. We still need to think about storage and how to organize it in a way that gets us what we need quickly and cleanly.
For that particular client, the answer was to index the data as it came in and pull back a subset for more advanced machine learning — but leave search to a search index.
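To make the idea concrete, here is a toy sketch in Python of indexing records as they arrive, then answering a string search and a drill-down from the index instead of scanning everything. It is an illustration only, not how Solr or Elasticsearch work internally, and every name in it (`TinyIndex`, the `text` and `source` fields, the sample records) is made up for the example.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: token -> set of record ids (hypothetical, for illustration)."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of records containing it
        self.docs = {}                    # id -> original record

    def add(self, doc_id, record):
        """Index one record at ingest time."""
        self.docs[doc_id] = record
        for token in record["text"].lower().split():
            self.postings[token].add(doc_id)

    def search(self, query, **filters):
        """Return ids of records containing every query token and matching every field filter."""
        tokens = query.lower().split()
        if not tokens:
            return set()
        # Intersect posting lists; only matching ids are ever touched.
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        # Drill-down: keep hits whose fields equal the requested values.
        return {
            d for d in hits
            if all(self.docs[d].get(k) == v for k, v in filters.items())
        }


# Assumed sample stream; a real pipeline would feed records from the ingest path.
index = TinyIndex()
stream = [
    {"text": "disk failure on node seven", "source": "storage"},
    {"text": "login failure for user", "source": "auth"},
    {"text": "disk quota warning", "source": "storage"},
]
for i, rec in enumerate(stream):
    index.add(i, rec)

subset = index.search("failure")                    # simple string search
drill = index.search("failure", source="storage")   # drill-down on a field
```

The point of the sketch is the shape of the work: the cost is paid once per record at ingest, and a query only touches the posting lists for its terms. The small `subset` that comes back is what you would hand to Spark or a machine learning job, rather than the whole data set.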
No clean line exists between search, machine learning, and certain related techniques. Clearly, information that’s textual or linguistic tends to strongly indicate a search problem. Information that is numeric, binary, or simply not textual or linguistic in nature indicates a machine learning (or other) problem. There is overlap. There are even instances, such as anomaly detection, where either technique may be valid to use.