When Doug Cutting created the Hadoop framework 10 years ago, he never expected it to bring massive-scale computing to the corporate world.
“My expectations were more moderate than what we’ve seen, for sure,” he said, speaking at the Strata and Hadoop World conference.
Today, Hadoop is used by many household names, helping Facebook analyse traffic for its more than 1.6 billion monthly users and aiding Visa in uncovering billions of dollars’ worth of fraud.
The attraction of Hadoop lies in how it allows big data to be processed more cheaply and, in certain respects, more simply. The platform provides a group of technologies that allow very large datasets to be spread across large clusters of commodity servers and processed in parallel.
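The split-and-process-in-parallel idea can be illustrated with a toy word count. This is a minimal sketch in plain Python, not Hadoop's own API: threads stand in for servers, and the function names (`split_into_blocks`, `map_block`, `merge_counts`, `word_count`) are hypothetical, chosen only for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_blocks(records, n_blocks):
    """Spread records across n_blocks, standing in for blocks held on different servers."""
    return [records[i::n_blocks] for i in range(n_blocks)]


def map_block(block):
    """Map step: a worker counts words in its own block only."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine the partial counts from two workers."""
    return left + right


def word_count(records, n_workers=4):
    """Count words by mapping over blocks in parallel, then merging the partial results."""
    blocks = split_into_blocks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_block, blocks))
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    lines = ["big data", "big clusters", "data data"]
    print(word_count(lines))
```

In a real cluster the blocks would live on separate machines and the work would be shipped to wherever the data sits; the sketch only shows the shape of the computation.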
Yet there are limitations to what the platform can do. Today, the speed at which Hadoop clusters can process very large datasets is capped by the rate at which data is shuttled between secondary storage (SSDs or even slower spinning disks) and a computer’s memory and CPU.
This I/O bottleneck has arisen because processor speed and efficiency are increasing faster than storage read-write rates.
But now storage technology is poised to undergo a significant shift, one which Cutting said will help take the brakes off big data processing.
This year, Intel plans to release its 3D XPoint storage chips, which the company says can retrieve data up to 1,000 times faster than the NAND flash typically used in SSDs, while storing data at ten times the density of DRAM, the memory typically used today.
While XPoint will initially be offered as storage in the form of Optane-branded SSDs, Intel is planning to follow that up by releasing XPoint memory modules. Because XPoint stores data at far higher densities than traditional DRAM, these modules will allow servers to have far more memory than is the norm today. Intel has talked about Xeon servers being available next year with 6TB of memory, made up of a mixture of DDR4 DRAM and XPoint. That said, XPoint won’t match DDR4 DRAM for performance. The nine-microsecond latency and 70,300 read/write IOPS of pre-release XPoint SSDs are slower than DRAM and, by some estimates, no more than 20 times faster than high-performance SSDs.
Regardless, Cutting predicts the use of XPoint and other non-volatile memory in Hadoop clusters will open up the platform to new uses, allowing users to process much larger datasets in memory, which in turn will bypass the latency inherent in fetching data from disk.
“If you could have a petabyte of data in memory, accessible from any node within cycles, that’s several layers of magnitude performance improvement, if you’re doing certain kinds of algorithms,” said Cutting, who is now chief architect at Cloudera, which offers its own distribution of Hadoop.
“Things that are very expensive now, like graph operations, various sorts of iterative machine learning algorithms, clustering, things that have traditionally taken a very long time, can now be done very quickly and over pretty impressive amounts of data.”
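Why iterative algorithms stand to gain most can be seen with some back-of-envelope arithmetic: an algorithm that rereads its whole dataset on every pass pays the storage cost once per iteration. The sketch below is a rough model only. The dataset size, number of passes and read rates are hypothetical round figures chosen for illustration, not measurements of any product, and the model ignores compute time, network transfer and the spreading of work across nodes.

```python
TB = 10**12  # bytes in a terabyte (decimal)
GB = 10**9   # bytes in a gigabyte (decimal)


def total_scan_seconds(dataset_bytes, bytes_per_second, passes):
    """Time to read the whole dataset once per pass at a fixed sustained rate."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return passes * dataset_bytes / bytes_per_second


if __name__ == "__main__":
    # All figures below are assumptions for illustration only.
    dataset = 1 * TB
    passes = 50  # e.g. iterations of a machine learning algorithm
    assumed_read_rates = {
        "spinning disk": 150 * 10**6,  # assumed 150 MB/s
        "SSD": 500 * 10**6,            # assumed 500 MB/s
        "memory": 10 * GB,             # assumed 10 GB/s
    }
    for tier, rate in assumed_read_rates.items():
        hours = total_scan_seconds(dataset, rate, passes) / 3600
        print(f"{tier}: {hours:.1f} hours spent reading")
```

Whatever figures are plugged in, the time spent reading scales with the number of passes, which is why workloads that loop over the same data many times are the ones most sensitive to where that data lives.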