Big data is a universal phenomenon. Every business sector and aspect of society is being touched by the expanding flood of information from sensors, social networks, and streaming data sources. The financial sector is riding this wave as well. We examine here some of the features and benefits of Hadoop (and its family of tools and services) that enable large-scale data processing in finance (and consequently in nearly every other sector).
Three of the greatest benefits of big data are discovery, improved decision support, and greater return on innovation. In the world of finance, these also represent critical business functions.
When confronted with the inevitable avalanche of financial data from many business and customer channels, the modern data-driven firm can find help in the supporting technologies that make up the Hadoop ecosystem. Hadoop gives the business data analyst much-needed functionality in several areas: big data storage, access, warehousing, query, and processing (mining and analytics).
The Hadoop Distributed File System (HDFS, for storage), HBase (for read/write access and database-like querying), Hive (for data warehouse functionality), and Pig (for processing and workflow management) have been around for a while. In addition to these, there are now some new tools and techniques in the Hadoop toolkit.
One of the most recent additions to the Hadoop family is Spark. Spark is a fast, general-purpose engine for large-scale data processing. It speeds up processing by enabling parallel, complex, interactive, in-memory calculations on big data. Spark also provides capabilities for interactive querying, machine learning, graph processing, and stream processing. As financial data streams grow not only in size but also in their real-time response requirements, the opportunities to use Spark will only increase in the months and years ahead.
Another powerful member of the Hadoop stack is Drill. Drill allows financial data analysts to perform what they love the most: interactive self-service ad hoc analyses! These analyses can now be performed on a large scale using Drill, which enables analytics across billions of records. The SQL capabilities of Drill provide a familiarity that we can all appreciate. But it doesn’t stop there. Usually, when we mention “SQL,” we tend to think of relational (schema-based) databases. But Drill can query schema-less datasets as well, such as those kept in the data stores commonly referred to as NoSQL.
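To make the idea of querying schema-less data concrete, here is a minimal sketch in plain Python. It does not use Drill or any of its APIs; the records, field names, and the helper functions `read_records` and `total_by` are all hypothetical, chosen only for illustration. The aggregation is roughly what a SQL `SUM(amount) ... GROUP BY account` would express, applied to records that do not share a fixed set of fields.

```python
import json

# Hypothetical JSON-lines data: the records do not share a fixed set of fields.
raw = """\
{"account": "A1", "amount": 250.0, "channel": "web"}
{"account": "A2", "amount": 75.5}
{"account": "A1", "amount": 100.0, "channel": "branch", "currency": "USD"}
"""


def read_records(text):
    """Parse one JSON object per non-empty line; no schema is declared up front."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def total_by(records, group_field, value_field):
    """Sum value_field per distinct group_field, tolerating missing fields."""
    totals = {}
    for rec in records:
        key = rec.get(group_field)  # None when the record lacks the field
        totals[key] = totals.get(key, 0.0) + rec.get(value_field, 0.0)
    return totals


records = read_records(raw)
by_account = total_by(records, "account", "amount")
by_channel = total_by(records, "channel", "amount")
```

The structure of each record is discovered when it is read, so a new field such as `currency` can appear in later records without any change to the earlier ones or to the query logic.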
A flat file containing key-value pairs can be easily constructed, incrementally updated, quickly edited, and readily partitioned to different processing nodes on a Hadoop cluster. All of this can be done without the time sink of rebuilding database indices, or modifying the schema, or re-normalizing the database relations.
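The following is a minimal sketch of that idea in plain Python, not a Hadoop API. It assumes a tab-separated `key<TAB>value` line format and a simple hash-based assignment of keys to nodes; the account records and the function names (`parse_line`, `partition_for`, `partition_records`) are hypothetical. How a real cluster partitions data depends on its configuration.

```python
import zlib
from collections import defaultdict


def parse_line(line):
    """Split one 'key<TAB>value' line into a (key, value) pair."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def partition_for(key, num_nodes):
    """Pick a node index from a stable hash of the key.

    zlib.crc32 is used because Python's built-in hash() for strings
    varies between interpreter runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_records(lines, num_nodes):
    """Group key-value lines by the node index that would process them."""
    nodes = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, value = parse_line(line)
        nodes[partition_for(key, num_nodes)].append((key, value))
    return dict(nodes)


# Hypothetical trade records held as flat key-value lines.
records = [
    "acct-1001\tBUY 100 XYZ",
    "acct-1002\tSELL 50 ABC",
    "acct-1001\tSELL 20 XYZ",
]

# An incremental update is just an appended line: no index rebuild, no schema change.
records.append("acct-1003\tBUY 10 DEF")

assignment = partition_records(records, num_nodes=3)
```

Because the node is derived from the key alone, all records for the same account land on the same node, and adding a record never requires touching the lines already written.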