International Business Machines Corp. has thrown its weight behind Spark, an increasingly popular tool used to analyze large amounts of data in real time. Uses for such analysis are expected to expand greatly as Internet-connected sensors become pervasive in physical objects, a phenomenon known as the Internet of Things. By working to push Spark into the mainstream now, IBM hopes to catch an emerging market in its earliest stages.
IBM isn’t the first company to embrace Spark, and Spark isn’t the only tool that can be used to analyze real-time data. But given the scale of IBM and its customer base, its support for the technology, already one of the fastest-growing open-source software projects, could be significant.
Spark, which emerged from the University of California at Berkeley in 2009, addresses some of the limitations of Hadoop, an older open-source software framework that employs a distributed architecture to analyze large amounts of data. Hadoop is less suited for analyzing data in real time, and it has a reputation for being tricky for developers to use, according to Bob Picciano, senior vice president of IBM Analytics. Gartner Inc. has characterized the adoption of Hadoop as steady but slow, with 26% of surveyed companies saying they had a Hadoop project underway, and 46% saying they expected to within a few years. Some users have found it difficult to scale, as the Wall Street Journal has reported.
IBM is expected to formally announce Monday that it will embed Spark into its analytics and commerce platforms, and to offer Spark as a service on its Bluemix development platform. IBM also is expected to announce that it will assign more than 3,500 engineers and developers to work on Spark-related projects around the world, and that it will donate its IBM SystemML machine learning technology to the Spark open-source ecosystem. It also is expected to launch a Spark technology center in San Francisco and help educate data scientists and engineers in the use of Spark on a mass basis.
Spark is part of a broader group of tools and companies that are pushing the frontiers of data analytics.
Several startups are also working on analytics technology that approaches real time. “What we can do is know everyone who is on the call right now, and analyze whether they were at Starbucks yesterday,” said J. Andrew Rogers, the founder and CTO of a startup in Seattle called SpaceCurve Inc. Mr. Rogers, a database expert who said he worked on large-scale analytic and geospatial systems with Google Earth, developed new technology for SpaceCurve.
Bottlenose Inc., a startup based in Los Angeles, has developed technology that helps companies analyze what co-founder Nova Spivack calls “streaming data.” It can be used to spot real-time trends on social media, with applications in areas such as business, politics and government.
IBM has bet on Spark from the very beginning. It was one of the founders of the Berkeley lab where Spark was first developed.