Pachyderm Challenges Hadoop with Containerized Data Lakes

The folks at Pachyderm believe there’s an elephant in the room when it comes to data analytics, namely the weaknesses of Hadoop. So they set out to build an Hadoop alternative based on containers, version one of which has just been released.

The Pachyderm stack uses Docker containers as well as CoreOS and Kubernetes for cluster management. It replaces HDFS with its Pachyderm File System and MapReduce with Pachyderm Pipelines.

The company has focused on fundamental principles including reproducibility, data provenance and, most importantly, collaboration – a feature they say has been sorely missing from the Big Data world and one that has generated the most excitement from potential users, according to CEO and co-founder Joe Doliner.

To that end, the developers looked to the Git model to create a repository of well-documented, reusable analysis pipelines so that teams no longer have to build everything from scratch.

“We think it’s inevitable that containers will precipitate the end of Hadoop because they cause you to completely rethink all the assumptions that [were the basis] of Hadoop in the early days,” said co-founder Joey Zwicker.

As an example, he notes that in Hadoop, people write their jobs in Java, and it all runs on the JVM.

“This was a great assumption at the time because Java was the leading platform, there were tons of people on it. But fast-forward to today, and now we have containers. They’re a much better tool for the job because you can use literally any tool in the whole world. We’ve built a system that allows you to put anything in a container and use it for big data,” he said.

Rather than being required to use Hadoop-specific tools, such as the Hadoop Image Processing Library, you can any of the existing image-processing libraries. You can use any open-source computer vision tool such as OpenCV, ccv, or VXL.

“We believe people are going to want to use the tools that are best in class. Containers allow them to do that,” Zwicker said.

Though it’s written in Go, data scientists can use any languages or libraries that best fit their needs, they say.

The two components Pachyderm developed for the stack are file system and pipeline system.

Pachyderm Pipelines is a system of stringing containers together and doing data analysis with them. You create a containerized program with the tools of your choice that reads and writes to the local filesystem. It uses a FUSE volume to inject data into the container, then automatically replicates the container, showing each one a different chunk of data. This technique enables Pachyderm to scale any code you write to process massive data sets in parallel, according to Zwicker. It doesn’t require using Java at all: If it fits in a container, you can use it for data analysis.

Pachyderm File System is a distributed file system that draws inspiration from git, providing version control over all the data. It’s the core data layer that delivers data to containers. The data is stored in generic object storage such as Amazon’s S3, Google Cloud Storage or the open source Ceph file system. And like Apple’s Time Machine, it provides historical snapshots of how you data looked at different points in time.

“It lets you see how things have changed; it lets people work together,” Zwicker said. “It allows people to not only collaborate on code but on data. One data scientist can build a data set, and another can fork it and build off of it, then merge the results back with the original one. This is something that has been completely missing from the data science tools out there.

 

Share it:
Share it:

[Social9_Share class=”s9-widget-wrapper”]

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

You Might Be Interested In

How AI helps combat Cyber-crime

21 Sep, 2021

Artificial Intelligence (AI)-based cyber-security methodologies have been developed in response to this unprecedented challenge to help AI information security teams …

Read more

Top 10 Data Predictions leading to zero-latency future in 2019

1 Jan, 2019

Just like electricity, we will soon start to see real-time computing become pervasively available and invisible at the same time, …

Read more

Why YouTube decided to make its own video chip

2 Sep, 2022

Roughly seven years ago, Partha Ranganathan realized Moore’s law was dead. That was a pretty big problem for the Google …

Read more

Do You Want to Share Your Story?

Bring your insights on Data, Visualization, Innovation or Business Agility to our community. Let them learn from your experience.

Get the 3 STEPS

To Drive Analytics Adoption
And manage change

3-steps-to-drive-analytics-adoption

Get Access to Event Discounts

Switch your 7wData account from Subscriber to Event Discount Member by clicking the button below and get access to event discounts. Learn & Grow together with us in a more profitable way!

Get Access to Event Discounts

Create a 7wData account and get access to event discounts. Learn & Grow together with us in a more profitable way!

Don't miss Out!

Stay in touch and receive in depth articles, guides, news & commentary of all things data.