Pachyderm Challenges Hadoop with Containerized Data Lakes

The folks at Pachyderm believe there’s an elephant in the room when it comes to data analytics, namely the weaknesses of Hadoop. So they set out to build a Hadoop alternative based on containers, version one of which has just been released.

The Pachyderm stack uses Docker containers as well as CoreOS and Kubernetes for cluster management. It replaces HDFS with its Pachyderm File System and MapReduce with Pachyderm Pipelines.

The company has focused on fundamental principles including reproducibility, data provenance and, most importantly, collaboration – a feature they say has been sorely missing from the Big Data world and one that has generated the most excitement from potential users, according to CEO and co-founder Joe Doliner.

To that end, the developers looked to the Git model to create a repository of well-documented, reusable analysis pipelines so that teams no longer have to build everything from scratch.

“We think it’s inevitable that containers will precipitate the end of Hadoop because they cause you to completely rethink all the assumptions that [were the basis] of Hadoop in the early days,” said co-founder Joey Zwicker.

As an example, he notes that in Hadoop, people write their jobs in Java, and it all runs on the JVM.

“This was a great assumption at the time because Java was the leading platform, there were tons of people on it. But fast-forward to today, and now we have containers. They’re a much better tool for the job because you can use literally any tool in the whole world. We’ve built a system that allows you to put anything in a container and use it for big data,” he said.

Rather than being required to use Hadoop-specific tools, such as the Hadoop Image Processing Library, you can use any existing image-processing library, including open-source computer vision tools such as OpenCV, ccv or VXL.

“We believe people are going to want to use the tools that are best in class. Containers allow them to do that,” Zwicker said.

Though Pachyderm itself is written in Go, data scientists can use whichever languages or libraries best fit their needs, the founders say.

The two components Pachyderm developed for the stack are a file system and a pipeline system.

Pachyderm Pipelines is a system of stringing containers together and doing data analysis with them. You create a containerized program with the tools of your choice that reads and writes to the local filesystem. It uses a FUSE volume to inject data into the container, then automatically replicates the container, showing each one a different chunk of data. This technique enables Pachyderm to scale any code you write to process massive data sets in parallel, according to Zwicker. It doesn’t require using Java at all: If it fits in a container, you can use it for data analysis.
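To make the idea concrete, here is a minimal Python sketch of that pattern, not Pachyderm's actual implementation. The worker, `count_words`, only reads files from one directory and writes files to another, which is all the article says a containerized program has to do. The `run_parallel` driver is a hypothetical stand-in for the scheduler: it splits the input round-robin and uses threads and temporary directories where Pachyderm would use replicated containers and a FUSE volume. All function names and the chunking scheme are assumptions made for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_words(in_dir: str, out_dir: str) -> None:
    """The 'containerized program': reads plain files, writes plain files."""
    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(in_dir)):
        with open(os.path.join(in_dir, name), encoding="utf-8") as src:
            total = len(src.read().split())
        with open(os.path.join(out_dir, name + ".count"), "w", encoding="utf-8") as dst:
            dst.write(str(total))


def split_into_chunks(names, workers):
    """Round-robin assignment of file names to workers (hypothetical scheme)."""
    return [names[i::workers] for i in range(workers)]


def run_parallel(files: dict, workers: int) -> dict:
    """Stand-in for the scheduler: each worker sees only its own chunk."""
    with tempfile.TemporaryDirectory() as root:
        out_dir = os.path.join(root, "out")
        in_dirs = []
        for i, chunk in enumerate(split_into_chunks(sorted(files), workers)):
            in_dir = os.path.join(root, f"in-{i}")
            os.makedirs(in_dir)
            for name in chunk:
                with open(os.path.join(in_dir, name), "w", encoding="utf-8") as f:
                    f.write(files[name])
            in_dirs.append(in_dir)
        # Threads play the role of the replicated containers in this sketch.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(lambda d: count_words(d, out_dir), in_dirs))
        results = {}
        for name in os.listdir(out_dir):
            with open(os.path.join(out_dir, name), encoding="utf-8") as f:
                results[name] = int(f.read())
        return results
```

Note that `count_words` knows nothing about parallelism or the cluster. That separation is the point of the design described above: the analysis code stays ordinary file-handling code, and scaling it out is left to whatever replicates the worker.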

Pachyderm File System is a distributed file system that draws inspiration from Git, providing version control over all the data. It’s the core data layer that delivers data to containers. The data is stored in generic object storage such as Amazon’s S3, Google Cloud Storage or the open source Ceph file system. And like Apple’s Time Machine, it provides historical snapshots of how your data looked at different points in time.
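The snapshot idea can be illustrated with a toy, in-memory store in Python. This is a sketch under stated assumptions, not the Pachyderm File System's design or API: the `VersionedStore` class and its methods are hypothetical, a dictionary keyed by content hash stands in for object storage, and each commit is simply a complete mapping from file path to content hash.

```python
import hashlib


class VersionedStore:
    """Toy commit-based store: every commit is an immutable snapshot."""

    def __init__(self):
        self.blobs = {}     # content hash -> bytes (stand-in for object storage)
        self.commits = []   # each commit maps file path -> content hash

    def commit(self, changes: dict) -> int:
        """Apply changes on top of the latest snapshot; return the new commit id."""
        snapshot = dict(self.commits[-1]) if self.commits else {}
        for path, data in changes.items():
            if data is None:
                snapshot.pop(path, None)   # None marks a deletion
            else:
                digest = hashlib.sha256(data).hexdigest()
                self.blobs[digest] = data  # identical content is stored once
                snapshot[path] = digest
        self.commits.append(snapshot)
        return len(self.commits) - 1

    def read(self, commit_id: int, path: str) -> bytes:
        """Return a file's content as it looked at the given commit."""
        return self.blobs[self.commits[commit_id][path]]

    def diff(self, old: int, new: int) -> set:
        """Paths added, removed or modified between two commits."""
        a, b = self.commits[old], self.commits[new]
        return {p for p in a.keys() | b.keys() if a.get(p) != b.get(p)}
```

Because earlier commits are never modified, reading an old commit id always returns the data exactly as it was at that point, which is the Time Machine-style behavior the article describes.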

“It lets you see how things have changed; it lets people work together,” Zwicker said. “It allows people to not only collaborate on code but on data. One data scientist can build a data set, and another can fork it and build off of it, then merge the results back with the original one. This is something that has been completely missing from the data science tools out there.”
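The fork-and-merge workflow Zwicker describes resembles a three-way merge in Git. The Python sketch below shows that general technique applied to data sets, each modeled as a mapping from file path to content. It is an illustration only: the article does not say how Pachyderm merges data, and the function name and conflict policy here are assumptions.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two forks of a data set, each a mapping of path -> content.

    Returns (merged, conflicts). A path conflicts when both sides changed it
    relative to the base and ended up with different content.
    """
    merged, conflicts = {}, []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            winner = o          # both sides agree (including both deleting)
        elif o == b:
            winner = t          # only their side changed the path
        elif t == b:
            winner = o          # only our side changed the path
        else:
            conflicts.append(path)
            winner = o          # keep ours and report the conflict
        if winner is not None:  # None means the path is absent or deleted
            merged[path] = winner
    return merged, conflicts
```

Here `base` is the data set at the moment it was forked, while `ours` and `theirs` are the two versions that diverged from it. Changes made on only one side are carried over automatically, and only paths edited differently on both sides are reported for a person to resolve.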

 


