Pachyderm Challenges Hadoop with Containerized Data Lakes

The folks at Pachyderm believe there’s an elephant in the room when it comes to data analytics, namely the weaknesses of Hadoop. So they set out to build a Hadoop alternative based on containers, version one of which has just been released.

The Pachyderm stack uses Docker containers as well as CoreOS and Kubernetes for cluster management. It replaces HDFS with its Pachyderm File System and MapReduce with Pachyderm Pipelines.

The company has focused on fundamental principles including reproducibility, data provenance and, most importantly, collaboration – a feature they say has been sorely missing from the Big Data world and one that has generated the most excitement from potential users, according to CEO and co-founder Joe Doliner.

To that end, the developers looked to the Git model to create a repository of well-documented, reusable analysis pipelines so that teams no longer have to build everything from scratch.

“We think it’s inevitable that containers will precipitate the end of Hadoop because they cause you to completely rethink all the assumptions that [were the basis] of Hadoop in the early days,” said co-founder Joey Zwicker.

As an example, he notes that in Hadoop, people write their jobs in Java, and it all runs on the JVM.

“This was a great assumption at the time because Java was the leading platform, there were tons of people on it. But fast-forward to today, and now we have containers. They’re a much better tool for the job because you can use literally any tool in the whole world. We’ve built a system that allows you to put anything in a container and use it for Big Data,” he said.

Rather than being required to use Hadoop-specific tools, such as the Hadoop Image Processing Library, you can use any existing image-processing library, including open-source computer vision tools such as OpenCV, ccv, or VXL.

“We believe people are going to want to use the tools that are best in class. Containers allow them to do that,” Zwicker said.

Though Pachyderm itself is written in Go, data scientists can use whichever languages or libraries best fit their needs, the founders say.

The two components Pachyderm developed for the stack are a file system and a pipeline system.

Pachyderm Pipelines is a system for stringing containers together and doing data analysis with them. You create a containerized program with the tools of your choice that reads from and writes to the local filesystem. Pachyderm uses a FUSE volume to inject data into the container, then automatically replicates the container, showing each replica a different chunk of the data. This technique enables Pachyderm to scale any code you write to process massive data sets in parallel, according to Zwicker. It doesn’t require using Java at all: If it fits in a container, you can use it for data analysis.
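To make that model concrete, here is a minimal Python sketch of the kind of program that could run inside such a container. It is an illustration under stated assumptions, not Pachyderm's API: the names `process_chunk` and `simulate_workers`, the word-count task, and the directory layout are all hypothetical. The program only reads files from an input directory and writes results to an output directory; the second function stands in for the scheduler by handing each simulated replica a different subset of the files.

```python
import tempfile
from collections import Counter
from pathlib import Path
from typing import Dict


def process_chunk(input_dir: str, output_dir: str) -> Counter:
    """Count words in every file under input_dir and write totals to output_dir.

    The code knows nothing about the cluster: it only touches the local
    filesystem, which is what lets the same program run in many replicas.
    """
    counts: Counter = Counter()
    for path in sorted(Path(input_dir).rglob("*")):
        if path.is_file():
            counts.update(path.read_text(encoding="utf-8").split())

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "counts.txt", "w", encoding="utf-8") as f:
        for word, n in sorted(counts.items()):
            f.write(f"{word}\t{n}\n")
    return counts


def simulate_workers(files: Dict[str, str], workers: int) -> Counter:
    """Hypothetical stand-in for the scheduler (not Pachyderm code).

    Each simulated replica gets its own temporary input directory holding
    only its share of the files, then runs the unchanged process_chunk.
    """
    total: Counter = Counter()
    names = sorted(files)
    for w in range(workers):
        with tempfile.TemporaryDirectory() as root:
            in_dir, out_dir = Path(root, "in"), Path(root, "out")
            in_dir.mkdir()
            # Round-robin split: replica w sees every workers-th file.
            for name in names[w::workers]:
                (in_dir / name).write_text(files[name], encoding="utf-8")
            total += process_chunk(str(in_dir), str(out_dir))
    return total
```

Because `process_chunk` is written against plain directories, the combined totals do not depend on how many replicas the data is split across, which is the property that makes this style of program easy to parallelize.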

Pachyderm File System is a distributed file system that draws inspiration from Git, providing version control over all the data. It’s the core data layer that delivers data to containers. The data is stored in generic object storage such as Amazon’s S3, Google Cloud Storage or the open source Ceph file system. And like Apple’s Time Machine, it provides historical snapshots of how your data looked at different points in time.
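The Git-inspired idea of commits and historical snapshots can be sketched in a few lines of Python. This is a toy in-memory model, not Pachyderm File System's implementation: the `VersionedRepo` class, its method names, and the choice to copy the full snapshot on every commit are assumptions made for illustration (a real system would store data in object storage and avoid full copies).

```python
import hashlib
from typing import Dict, List, Optional, Tuple


class VersionedRepo:
    """Toy data repository with commits, branches, and snapshots (hypothetical)."""

    def __init__(self) -> None:
        # commit id -> (parent commit id, full snapshot of path -> content)
        self.commits: Dict[str, Tuple[Optional[str], Dict[str, str]]] = {}
        # branch name -> id of the latest commit on that branch
        self.branches: Dict[str, Optional[str]] = {"master": None}

    def commit(self, branch: str, changes: Dict[str, str]) -> str:
        """Record a new snapshot on a branch and return its commit id."""
        parent = self.branches[branch]
        snapshot = dict(self.commits[parent][1]) if parent else {}
        snapshot.update(changes)
        # Derive the id from the parent and contents, loosely as Git does.
        payload = repr((parent, sorted(snapshot.items()))).encode("utf-8")
        commit_id = hashlib.sha1(payload).hexdigest()[:8]
        self.commits[commit_id] = (parent, snapshot)
        self.branches[branch] = commit_id
        return commit_id

    def fork(self, source: str, new_branch: str) -> None:
        """Start a new branch at the current head of an existing one."""
        self.branches[new_branch] = self.branches[source]

    def read(self, commit_id: str, path: str) -> str:
        """Read a file as it looked at a given commit."""
        return self.commits[commit_id][1][path]

    def history(self, branch: str) -> List[str]:
        """List commit ids on a branch, newest first."""
        ids: List[str] = []
        current = self.branches[branch]
        while current is not None:
            ids.append(current)
            current = self.commits[current][0]
        return ids
```

In this sketch, one data scientist could commit `sales.csv` twice on `master`, a colleague could call `fork("master", "experiment")` and commit there, and either could still `read` the first version by its commit id. Merging a fork back is left out because it would need a conflict policy that the article does not describe.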

“It lets you see how things have changed; it lets people work together,” Zwicker said. “It allows people to not only collaborate on code but on data. One data scientist can build a data set, and another can fork it and build off of it, then merge the results back with the original one. This is something that has been completely missing from the data science tools out there.”
