IT teams looking to build big data architectures have an abundance of technology choices they can mix and match to meet their data processing and analytics needs. But there’s a downside to all that: Putting the required pieces together is a daunting task.
Finding and deploying the right big data technologies within the expanding Hadoop ecosystem is a lengthy process frequently measured in years, unless corporate executives throw ample amounts of money and resources at projects to speed them up. Missteps are common, and one company’s architectural blueprint won’t necessarily translate to other organizations, even in the same industry.
“I tell people that it’s not something you can order from Amazon or get from the Apple Store,” said Bryan Lari, director of institutional analytics at the University of Texas MD Anderson Cancer Center in Houston. Fully constructing a big data architecture, he added, “is complex, and it’s a journey. It’s not something we’re going to implement in six months or a year.” Nor is there an easy-to-apply technology formula to follow. “Depending on the use case or user, there are different tools that do what we need to do,” Lari said.
MD Anderson’s big data environment is centered on a Hadoop cluster that went into production use in March, initially to process vital-signs data collected from monitoring devices in patients’ rooms. But the data lake platform also includes HBase, the Hadoop ecosystem’s NoSQL database; the Hive SQL-on-Hadoop software; and various other Apache open source technologies, such as Pig, Sqoop, Oozie and ZooKeeper. In addition, the cancer treatment and research organization has deployed an Oracle data warehouse as a downstream repository to support analytics and reporting applications, plus IBM’s Watson cognitive computing system to provide natural language processing and machine learning capabilities. New data visualization, governance and security tools are due to be added in the future as well.
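As a toy illustration of the kind of rollup such a pipeline supports, the minimal Python sketch below averages a batch of vital-signs readings per patient, roughly what a Hive query over the data lake might compute. The function and field names are invented for illustration; this is not MD Anderson's actual code.

```python
from collections import defaultdict

def average_vitals(readings):
    """Compute the mean value of each vital sign per patient.

    `readings` is an iterable of (patient_id, sign, value) tuples,
    e.g. ("p001", "heart_rate", 72.0).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for patient_id, sign, value in readings:
        key = (patient_id, sign)
        sums[key] += value
        counts[key] += 1
    # One averaged row per (patient, vital sign) pair
    return {key: sums[key] / counts[key] for key in sums}

readings = [
    ("p001", "heart_rate", 70.0),
    ("p001", "heart_rate", 74.0),
    ("p002", "heart_rate", 88.0),
]
print(average_vitals(readings))
```

At production scale, the same aggregation would run across the cluster via Hive rather than in a single Python process.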
The IT team at MD Anderson began working with Hadoop in early 2015. To demo some potential applications and learn about the technology, the center first built a pilot cluster using the base Apache Hadoop software; later, it brought in the Hortonworks distribution of Hadoop for the production deployment.
Vamshi Punugoti, associate director of research information systems at MD Anderson, said the experience gained in the pilot project should also help make it easier to cope with modifications to the architecture that likely will be needed as new big data tools emerge to augment or replace existing ones. “It’s a continually evolving field, and even the data we’re collecting is constantly changing,” Punugoti said. “It would be naïve to assume we have it all covered.”
A platform engineering team at ride-sharing company Uber similarly spent about 12 months building a multifaceted big data architecture, but with even more technology components and in more of a hurry-up mode. Vinoth Chandar, a senior software engineer on Uber’s Hadoop team, said the San Francisco-based company’s existing systems couldn’t keep up with the volumes of data that its fast-growing business operations were generating. As a result, much of the data couldn’t be analyzed in a timely manner, a big problem because Uber’s business “is inherently real time” in nature, Chandar said.

To enable operations managers to become more data-driven, Chandar and his colleagues set up a Hadoop data lake environment that also includes HBase, Hive, the Spark processing engine, the Kafka message queuing system and a mix of other technologies. Some of those technologies are homegrown, such as a data ingestion tool called Streamific.

With the architecture in place, Uber is “catching up to the state of the art on big data and analytics,” Chandar said. But it wasn’t easy getting there. “To pull it all together, I can say that 10 people didn’t sleep for a year,” he added half-jokingly.
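The producer/consumer pattern behind a message queue like Kafka, where writers append events and downstream readers drain them asynchronously, can be sketched in miniature with Python’s standard library. This is an illustrative toy under made-up names, not Uber’s Streamific or a real Kafka client:

```python
import queue
import threading

def produce(q, events):
    """Simulate trip events arriving from the business in real time."""
    for event in events:
        q.put(event)
    q.put(None)  # sentinel: no more events

def consume(q, sink):
    """Drain events off the queue into a downstream store."""
    while True:
        event = q.get()
        if event is None:
            break
        sink.append(event)

q = queue.Queue()
sink = []
events = [{"trip_id": i, "city": "SF"} for i in range(5)]

producer = threading.Thread(target=produce, args=(q, events))
consumer = threading.Thread(target=consume, args=(q, sink))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(len(sink))  # 5 events delivered
```

The decoupling shown here is the point of the design: producers never wait on analytics jobs, which is what lets a pipeline keep pace with data that arrives continuously.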