Building a Common Data Platform for the Enterprise on Apache Hadoop

To become a data-driven enterprise, organizations must process all types of data, whether structured transactions or unstructured data such as file server content, social media, IoT, or machine data. Competitive advantage is at stake, and companies that fail to evolve into data-driven organizations risk serious business disruption from competitors and startups.

Fortunately, we live in a time of unprecedented innovation in enterprise software, and enterprise data has finally become manageable on a large scale. Thanks to the Apache Hadoop open source framework, which underpins enterprise archives, data lakes, and advanced analytics applications, enterprise data management solutions can now turn the tide on data growth challenges.

Enter the Common Data Platform (CDP): a uniform data collection system for structured and unstructured data featuring low-cost data storage and advanced analytics. In this article, I’m going to define the components of a CDP and explain where it stands alongside the traditional enterprise data warehouse.

Apache Hadoop is the backbone of the CDP. Hadoop is an open source data management system that distributes large amounts of data across multiple servers and processes it in parallel on those nodes. It’s engineered with scalability and efficiency in mind and designed to run on low-cost commodity hardware. Using the Hadoop Distributed File System (HDFS), Hive, and the MapReduce or Spark programming model, Apache Hadoop can service almost any enterprise workload.
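
To make the MapReduce programming model mentioned above more concrete, here is a minimal single-process sketch in plain Python. It does not use any Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample log lines are illustrative assumptions. On a real cluster, the map and reduce steps would run in parallel on many nodes and the framework would handle the grouping step.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different node; here we simply loop.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical machine-log lines used only for illustration.
logs = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(logs)
print(counts["error"], counts["disk"])
```

Because each call to `map_phase` depends only on its own input line, and each call to `reduce_phase` only on its own key, the work can be split across commodity servers without coordination beyond the grouping step.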

Hadoop supports any data, structured or unstructured, in many different formats, making it ideal as a uniform data collection system across the enterprise. By denormalizing data into an Enterprise Business Record (EBR), all enterprise data can be text searched and processed through queries and reports. Unstructured data from file servers, email systems, machine logs, and social sources is easily ingested and retrieved as well.
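
As a rough illustration of what denormalizing into an EBR can look like, the sketch below flattens three related tables into one self-contained record per order and then runs a naive text search over the result. The table layout, field names, and the `build_ebr` and `text_search` helpers are all hypothetical; a real platform would define its own record format and use a proper search index.

```python
import json

# Hypothetical normalized source tables.
customers = {101: {"name": "Acme Corp", "region": "EMEA"}}
orders = [
    {"order_id": 1, "customer_id": 101, "status": "shipped"},
    {"order_id": 2, "customer_id": 101, "status": "open"},
]
order_lines = [
    {"order_id": 1, "sku": "PUMP-200", "qty": 3},
    {"order_id": 1, "sku": "VALVE-10", "qty": 12},
    {"order_id": 2, "sku": "PUMP-200", "qty": 1},
]


def build_ebr(order):
    """Join an order with its customer and line items into one flat record."""
    customer = customers[order["customer_id"]]
    return {
        "order_id": order["order_id"],
        "status": order["status"],
        "customer_name": customer["name"],
        "customer_region": customer["region"],
        "lines": [
            {"sku": line["sku"], "qty": line["qty"]}
            for line in order_lines
            if line["order_id"] == order["order_id"]
        ],
    }


def text_search(records, term):
    """Naive case-insensitive search over the serialized record.

    Note: this also matches field names, which a real index would avoid.
    """
    term = term.lower()
    return [r for r in records if term in json.dumps(r).lower()]


ebrs = [build_ebr(order) for order in orders]
print([r["order_id"] for r in text_search(ebrs, "valve")])
```

The point of the design is that each record carries its full business context, so a search or report no longer needs to join back to the source tables.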

A Hadoop data lake functions as a central repository for data. Data is either transformed as required prior to ingestion or stored “as is,” eliminating the need for heavy extract, transform and load (ETL) processes. Data needed to drive the enterprise may be queried, text searched, or staged for further processing by downstream NoSQL analytics or other applications and systems.
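
Storing data “as is” and deferring interpretation until query time is often called schema-on-read. The following is a minimal sketch of that idea, assuming an in-memory list (`raw_store`) as a stand-in for files landed in the lake; the helper names and sample records are hypothetical.

```python
import json

raw_store = []  # stand-in for raw files landed in the data lake


def ingest(raw_line):
    """Store the record exactly as received, with no upfront transformation."""
    raw_store.append(raw_line)


def read_with_schema(parse):
    """Apply a schema at read time, skipping records that do not fit it."""
    for line in raw_store:
        try:
            yield parse(line)
        except (ValueError, KeyError, TypeError):
            continue  # the record stays in the lake for other readers


def temperature_schema(line):
    """One possible reader's view of the data: device temperature readings."""
    doc = json.loads(line)
    return {"device": doc["device"], "temp_c": float(doc["temp_c"])}


# Hypothetical mixed input: two sensor readings and one free-text note.
ingest('{"device": "sensor-7", "temp_c": 21.5}')
ingest('{"device": "sensor-9", "temp_c": 48.0}')
ingest("free-text note from a technician")

hot = [r["device"] for r in read_with_schema(temperature_schema) if r["temp_c"] > 40]
print(hot)
```

Nothing is rejected at ingestion, so a different consumer could later read the same store with a different schema, for example to text search the technician notes.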

Data lakes also significantly reduce the high cost of interface management and data conversion between production systems. With a data lake deployed as a data hub, data conversion and interface management can be centralized, decoupling customizations and point-to-point interfaces from production systems.
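
A quick back-of-the-envelope calculation shows why a hub helps. The sketch below assumes the worst case, in which every pair of production systems exchanges data directly, and compares that with one interface per system to a central hub. The function names are illustrative, and real landscapes usually fall somewhere between the two extremes.

```python
def point_to_point_interfaces(systems):
    """Worst case: one interface for every pair of systems, n * (n - 1) / 2."""
    return systems * (systems - 1) // 2


def hub_interfaces(systems):
    """Hub model: each system maintains a single interface to the data hub."""
    return systems


for n in (5, 10, 20):
    print(n, point_to_point_interfaces(n), hub_interfaces(n))
```

Under these assumptions the point-to-point count grows quadratically with the number of systems, while the hub count grows linearly.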

Information governance defines how data is managed and accessed throughout its lifecycle. It is an essential component of any enterprise data management strategy, whether or not you are using a CDP.

Information Lifecycle Management (ILM) provides the data governance control framework needed to meet risk and compliance objectives, and ensures that best practices for data retention and classification are deployed.
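
As one way to picture retention and classification rules in practice, here is a small sketch of a retention check. The classifications, the retention periods in `RETENTION_YEARS`, and the legal-hold flag are hypothetical values chosen for illustration; actual periods depend on the regulations and policies that apply to each organization.

```python
from datetime import date

# Hypothetical retention periods, in years, per data classification.
RETENTION_YEARS = {"financial": 7, "hr": 5, "machine_log": 1}


def retention_expiry(created, classification):
    """Return the date on which a record's retention period ends."""
    years = RETENTION_YEARS[classification]
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # Created on Feb 29 and the target year is not a leap year.
        return created.replace(year=created.year + years, day=28)


def is_purge_eligible(created, classification, today, legal_hold=False):
    """A record may be purged once retention has ended and no hold applies."""
    if legal_hold:
        return False
    return today >= retention_expiry(created, classification)


today = date(2023, 1, 1)
print(is_purge_eligible(date(2015, 3, 1), "financial", today))
print(is_purge_eligible(date(2022, 6, 1), "machine_log", today))
```

Keeping the policy in a single table, separate from the check itself, means a change in retention rules does not require touching the code that enforces them.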

