The folks at Pachyderm believe there’s an elephant in the room when it comes to data analytics, namely the weaknesses of Hadoop. So they set out to build a Hadoop alternative based on containers, version one of which has just been released.
The Pachyderm stack uses Docker containers as well as CoreOS and Kubernetes for cluster management. It replaces HDFS with its Pachyderm File System and MapReduce with Pachyderm Pipelines.
The company has focused on fundamental principles including reproducibility, data provenance and, most importantly, collaboration – a feature they say has been sorely missing from the Big Data world and one that has generated the most excitement from potential users, according to CEO and co-founder Joe Doliner.
To that end, the developers looked to the Git model to create a repository of well-documented, reusable analysis pipelines so that teams no longer have to build everything from scratch.
“We think it’s inevitable that containers will precipitate the end of Hadoop because they cause you to completely rethink all the assumptions that [were the basis] of Hadoop in the early days,” said co-founder Joey Zwicker.
As an example, he notes that in Hadoop, people write their jobs in Java, and it all runs on the JVM.
“This was a great assumption at the time because Java was the leading platform, there were tons of people on it. But fast-forward to today, and now we have containers. They’re a much better tool for the job because you can use literally any tool in the whole world. We’ve built a system that allows you to put anything in a container and use it for big data,” he said.
Rather than being required to use Hadoop-specific tools, such as the Hadoop Image Processing Library, you can use any of the existing image-processing libraries, including open-source computer vision tools such as OpenCV, ccv, or VXL.
“We believe people are going to want to use the tools that are best in class. Containers allow them to do that,” Zwicker said.
Though Pachyderm itself is written in Go, data scientists can use whatever languages or libraries best fit their needs, the founders say.
The two components Pachyderm developed for the stack are a file system and a pipeline system.
Pachyderm Pipelines is a system for stringing containers together and doing data analysis with them. You create a containerized program with the tools of your choice that reads and writes to the local filesystem. Pachyderm uses a FUSE volume to inject data into the container, then automatically replicates the container, showing each copy a different chunk of data. This technique enables Pachyderm to scale any code you write to process massive data sets in parallel, according to Zwicker. It doesn’t require using Java at all: If it fits in a container, you can use it for data analysis.
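To make the "reads and writes to the local filesystem" idea concrete, here is a minimal sketch of the kind of program that could be placed in such a container. It is an illustration only, not Pachyderm code: the function name `count_words`, the output file name `counts.tsv`, and the two directory arguments are assumptions made for this example. The program knows nothing about clusters or parallelism; it simply processes whatever files appear in its input directory.

```python
from collections import Counter
from pathlib import Path


def count_words(input_dir, output_dir):
    """Count words in every file under input_dir and write totals to output_dir.

    Both directories are hypothetical stand-ins for wherever the pipeline
    system mounts a chunk of input data and collects the results.
    """
    counts = Counter()
    for path in sorted(Path(input_dir).rglob("*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            counts.update(text.split())

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # One "word<TAB>count" line per word, sorted for stable output.
    lines = [f"{word}\t{n}" for word, n in sorted(counts.items())]
    (out / "counts.tsv").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return counts
```

Because the code only sees ordinary files, running many copies of it, each against a different chunk of the data, requires no changes to the program itself.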
Pachyderm File System is a distributed file system that draws inspiration from Git, providing version control over all the data. It’s the core data layer that delivers data to containers. The data is stored in generic object storage such as Amazon’s S3, Google Cloud Storage or the open source Ceph file system. And like Apple’s Time Machine, it provides historical snapshots of how your data looked at different points in time.
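The Git-style snapshot idea can be sketched in a few lines. The toy below is an in-memory illustration and makes no claim about how Pachyderm File System is actually implemented: the class `SnapshotStore` and its methods are hypothetical names, and a plain dictionary stands in for object storage. Each commit records a full mapping from file path to content hash, so older versions stay readable after new data is added.

```python
import hashlib


class SnapshotStore:
    """Toy versioned store: every commit is a snapshot of path -> content hash."""

    def __init__(self):
        self.blobs = {}    # content hash -> bytes (stand-in for object storage)
        self.commits = []  # list of (parent_id, {path: content hash})

    def commit(self, changes, parent=None):
        """Create a snapshot from the parent's files plus `changes` (path -> bytes)."""
        files = dict(self.commits[parent][1]) if parent is not None else {}
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs[digest] = data  # identical content is stored only once
            files[path] = digest
        self.commits.append((parent, files))
        return len(self.commits) - 1

    def read(self, commit_id, path):
        """Return the bytes of `path` as they were at `commit_id`."""
        return self.blobs[self.commits[commit_id][1][path]]


store = SnapshotStore()
v0 = store.commit({"users.csv": b"id,name\n1,ann\n"})
v1 = store.commit({"users.csv": b"id,name\n1,ann\n2,bob\n"}, parent=v0)
old_version = store.read(v0, "users.csv")  # still the original one-row file
new_version = store.read(v1, "users.csv")
```

Two commits that share the same parent behave like a fork in this sketch; merging them back together is left out to keep the example short.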
“It lets you see how things have changed; it lets people work together,” Zwicker said. “It allows people to not only collaborate on code but on data. One data scientist can build a data set, and another can fork it and build off of it, then merge the results back with the original one. This is something that has been completely missing from the data science tools out there.”