Big data analytical ecosystem architecture is in early stages of development. Unlike traditional data warehouse / business intelligence (DW/BI) architecture which is designed for structured, internal data, big data systems work with raw unstructured and semi-structured data as well as internal and external data sources. Additionally, organizations may need both batch and (near) real-time data processing capabilities from big data systems. Lambda architecture – developed by Nathan Marz – provides a clear set of architecture principles that allows both batch and real-time or stream data processing to work together while building immutability and recomputation into the system. Batch processes high volumes of data where a group of transactions is collected over a period of time. Data is collected, entered, processed and then batch results produced. Batch processing requires separate programs for input, process and output. An example is payroll and billing systems. In contrast, real-time data processing involves a continual input, process and output of data. Data must be processed in a small time period (or near real-time). Customer services and bank ATMs are examples. Lambda architecture has three (3) layers:
Batch Layer (Apache Hadoop) Hadoop is an open source platform for storing massive amounts of data. Lambda architecture provides “human fault-tolerance” which allows simple data deletion (to remedy human error) where the views are recomputed (immutability and recomputation). The batch layer stores the master data set (HDFS) and computes arbitrary views (MapReduce). Computing views is continuous: new data is aggregated into views when recomputed during MapReduce iterations.