Apache Kafka, the open source distributed streaming platform, is making an increasingly vocal claim for stream data "world domination" (to borrow Linus Torvalds's whimsically modest initial goal for Linux).
Last summer I wrote about Kafka and the company behind its enterprise rise, Confluent. Kafka adoption was accelerating as the central platform for managing streaming data in organizations, with production deployments of Kafka claiming six of the top 10 travel companies, seven of the top 10 global banks, eight of the top 10 insurance companies, and nine of the top 10 US telecom companies.
Today, it's used in production by more than a third of the Fortune 500.
But 2016 may be most noted for Kafka joining the "Four Commas Club," a nod to the popular HBO comedy series "Silicon Valley," in which the character Russ Hanneman is the flashy and obnoxious billionaire investor who is quick to point out to the tormented heroes of the show that it takes three commas to make the number 1,000,000,000. Last year LinkedIn, Microsoft, and Netflix all passed the threshold of processing more than one TRILLION messages a day over Kafka. That's four commas: 1,000,000,000,000. That's scale.
I asked Confluent CTO and co-founder Neha Narkhede what was behind these numbers.
TechRepublic: Kafka is putting up very large numbers, even for the big data world. I think a lot of people saw the technology as a messaging queue that scaled, kind of a scale-out enterprise bus that moved data very fast. Something more seems to be happening here.
Narkhede: My co-founders and I originally created Kafka at LinkedIn in 2010 when our own systems ran up against the limits of a monolithic, centralized database. We saw the need for a distributed architecture with microservices that we could scale quickly and robustly. The legacy systems couldn't help us anymore. On one hand, the traditional messaging queues were real-time but didn't scale and, on the other, the ETL systems couldn't handle data in real time.
We looked deeply into the architecture of existing systems and why they didn't work, and combined that with our experience in modern distributed systems to create Kafka. It was built to be real-time, could store data to feed batch systems from the same pipe, and could enable stream processing to make sense of the data in real time in addition to moving it around. We had the vision of building the entire company's business logic as stream processors that express transformations on streams of data.
In order to do that, you need a highly efficient pipe to move data around, need connectors to existing systems, and need a stream processing layer. That is what we call a complete streaming platform. So though Kafka started off as a very scalable messaging system, it grew to complete our vision of being a distributed streaming platform.
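To make the "same pipe" idea concrete, here is a minimal sketch in plain Python. It is not Kafka's API and not how Kafka is implemented; the names `Log`, `Consumer`, and `stream_processor` are hypothetical, and the whole thing assumes a single in-memory log with no partitions, replication, or persistence. It only illustrates the pattern Narkhede describes: one stored log, readers that each track their own position (so a real-time reader and a batch reader can share it), and a stream processor that expresses a transformation from one stream to another.

```python
from typing import Callable


class Log:
    """Toy append-only log (hypothetical stand-in for one stream of records)."""

    def __init__(self) -> None:
        self._records: list[str] = []

    def append(self, record: str) -> int:
        """Store a record and return its offset (position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int) -> list[str]:
        """Return every record from the given offset to the end of the log."""
        return self._records[offset:]


class Consumer:
    """Reader that remembers its own offset, independent of other readers."""

    def __init__(self, log: Log) -> None:
        self.log = log
        self.offset = 0

    def poll(self) -> list[str]:
        """Return records not yet seen by this consumer and advance its offset."""
        records = self.log.read(self.offset)
        self.offset += len(records)
        return records


def stream_processor(source: Consumer, sink: Log,
                     transform: Callable[[str], str]) -> int:
    """Apply a transformation to new records and write results to another log."""
    records = source.poll()
    for record in records:
        sink.append(transform(record))
    return len(records)


events = Log()    # the shared "pipe"
derived = Log()   # output stream of the processor

realtime = Consumer(events)         # polls frequently
batch = Consumer(events)            # polls once, long after the fact
processor_input = Consumer(events)  # feeds the stream processor

events.append("view:home")
events.append("view:pricing")
first_poll = realtime.poll()        # sees the first two records
events.append("view:docs")
second_poll = realtime.poll()       # sees only the new record

batch_dump = batch.poll()           # the same log, read in one large batch
processed = stream_processor(processor_input, derived, str.upper)
```

Because records stay in the log after they are read, the late batch reader receives everything the real-time reader already consumed, which is the property a traditional queue that deletes messages on delivery would not give you.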
TechRepublic: Are there many enterprises doing real-time stream processing at scale today? A lot of the Fortune 500 is still wondering how to monetize all the data they sucked into their Hadoop clusters in the first place.
Narkhede: It's true that most of the sophisticated backend data processing in enterprises is actually conducted by big batch processes that run on big daily data dumps (the Hadoop people rely on Kafka as the preferred data pipeline to their Hadoop clusters, by the way).