Check out 5 Big Data projects that you are not likely to have seen before, but which may be useful to you, and perhaps even scratch an itch you didn’t know you had.
The Big Data Ecosystem is big. Some say it’s too damn big!
Consider, first, the behemoths in the space, the Big Data processing frameworks: Hadoop. Spark. Flink. Any of the other umpteen Apache projects. Google’s platforms. Many others. They all work in the same general space, but with various differentiating factors.
Next consider the support tools in the various data processing ecosystems. Then have a look at the various data stores and NoSQL database engines available. Then think about all of the tools that fit particular niches, both “official” and unofficial, that grow out of both large companies and individuals’ ingenuity.
It is this final category that we are concerned with herein. We will take a look at 5 Big Data projects that are outside of the mainstream, but which still have something to offer, perhaps unexpectedly so.
As always, finding overlooked projects is much more art than science. I collected these projects over an extended period of time spent online. The only criteria were that the projects were not alpha-level (subjective, no?), caught my eye for some particular reason, and had GitHub repos. The projects are not presented in any particular order, but are numbered, mostly for ease of reference, but also because I like numbering things.
Luigi was originally developed at Spotify, and is used to craft data pipeline jobs. From its GitHub repository README:
Luigi stresses that it does not replace lower-level data-processing tools such as Hive or Pig, but is instead meant to create workflows between numerous tasks. Luigi supports Hadoop out of the box as well, which potentially makes it a much more attractive option for many users. Luigi also comes with file system abstractions for HDFS and local files that make file system operations atomic, which is essential for keeping state consistent between pipeline tasks.
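To make the two ideas above a little more concrete (tasks chained into a workflow, and outputs that appear atomically), here is a minimal plain-Python sketch. To be clear, this is not Luigi's API: AtomicFileTarget, Task, WriteNumbers, SumNumbers, and build are hypothetical names made up for illustration, and the write-to-a-temp-file-then-rename trick is just one common way to get atomic file output, assumed here rather than taken from Luigi's source.

```python
import os
import tempfile


class AtomicFileTarget:
    """Hypothetical helper: a file that only appears once it is fully written."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)

    def read(self):
        with open(self.path) as handle:
            return handle.read()

    def write(self, text):
        # Write to a temporary file in the same directory, then rename it
        # into place. A crash mid-write leaves no partial file at self.path.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w") as handle:
                handle.write(text)
            os.replace(tmp_path, self.path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise


class Task:
    """Hypothetical base class: a task has dependencies, an output, and a body."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


class WriteNumbers(Task):
    def output(self):
        return AtomicFileTarget(os.path.join(self.workdir, "numbers.txt"))

    def run(self):
        self.output().write("1\n2\n3\n")


class SumNumbers(Task):
    def requires(self):
        return [WriteNumbers(self.workdir)]

    def output(self):
        return AtomicFileTarget(os.path.join(self.workdir, "total.txt"))

    def run(self):
        numbers = self.requires()[0].output().read().split()
        self.output().write(str(sum(int(n) for n in numbers)))


def build(task):
    """Run upstream tasks first; skip any task whose output already exists."""
    if task.output().exists():
        return
    for dependency in task.requires():
        build(dependency)
    task.run()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        build(SumNumbers(workdir))
        print(SumNumbers(workdir).output().read())  # prints 6
```

Because a finished output file is the only record of a task having run, a pipeline that dies halfway can simply be rebuilt: completed tasks are skipped, and no task ever reads a half-written file from the one before it.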
Luigi also comes with a web interface for visualizing and managing your tasks.
Luigi is also gaining in popularity, and currently boasts nearly 5,000 repo stars on GitHub, which is impressive for something I’m categorizing as “not popular.” If you are interested in seeing it in action, here is a tutorial on using Luigi together with Python to build data pipelines, written by Marco Bonzanini.
I’m a fan of pipelines; if you are too, Luigi may be a project worth checking out for managing your data processing tasks and workflows.
Developer Altamira reasoned that the appropriate tools for exploiting data and extracting insight were not developed well enough, and so took it upon themselves to design Lumify, a tool to aggregate, organize, and extract insight from your data.