New smart data tools are rapidly overcoming the common challenges presented by newly emerging data lakes. These tools make it easy to semantically link, analyze and manage diverse data, both structured and unstructured, at big data scale and to make it available for self-service consumption by business users.
Data lakes are usually defined as large repositories of data, stored in native format and hosted on commodity hardware. Their appeal lies in the ability to rapidly assemble large volumes of unfiltered data and to store it cheaply relative to traditional data warehouses. The challenges with data lakes arise when it comes to harmonizing the data and making it available to business users. This process is labor intensive and requires skilled data science and IT staff.
Smart data tools take a different approach, using business-friendly semantic models to describe, link and contextualize the data and to make it easily accessible to business users. Following are five key capabilities critical to “democratizing” your data lake project, making it accessible and usable for all business users:
1. A common way to transform and harmonize your enterprise data, regardless of its source, structured or unstructured, inside or outside the enterprise.
Driven by the semantic model, scalable servers can convert data from all formats, structured and unstructured, into an RDF graph format. An appropriate number of servers may be deployed to accommodate the number of sources and total volume of incoming data, including automatic incremental updates. Depending on the nature of each data source, one or more of the following techniques will apply:
- Mapping and transformation of structured or tabular data
- Text analytics, converting unstructured data to structured graphs
- Custom plugins for data sources with APIs or proprietary formats
- High-performance mapping and transforming using Apache Spark to bring existing Hadoop data into the smart data lake environment
2. A unified way to describe all the data in your lake – what it means and how it links together – in terms that are meaningful to business users.
A semantic data model can easily capture and deliver the “meaning” of data in a data lake with all of its inherent relationships and attributes. In the relational world, such a model would require translation to a relational logical model – schema and tables carefully constructed by database experts with indexes to optimize sets of known or anticipated questions. Posing such questions requires translation into SQL queries with joins and optimization – out of the reach of business users and even most data scientists.
A semantic graph model, on the other hand, requires no such translation. The data is stored exactly in the way it is modeled – the way business users think – allowing questions to be asked and new hypotheses explored on the fly.
3. Business user self-service capabilities for browsing, searching and combining the data sets required for their use case.
Business users should be able to configure search and visualization dashboards for valuable views, analytics and insights. Once a dashboard is configured, the underlying in-memory graph can be analyzed and traversed interactively in any direction.
4. Ad hoc data discovery and analytics tools that allow business users to ask any question of the data, including ones not anticipated in advance.
A top-notch query engine based on elastic clustered, in-memory computing is key. The engine should support interactive ad hoc query and analytics on datasets with billions of triples, giving end users powerful analytic workflows. Instead of relying on rectangular extracts of data, users should be able to create tables, filters, charts and visualizations by intuitively exploring paths through the full model, applying filters to refine what specific data is relevant. This approach combines data discovery and analytics with speed and agility – arriving at answers to new and ad hoc questions quickly and without requesting support from IT.
5. Support for enterprise data quality, governance and access control policies.
Democratized big data – enabling everyone to discover and analyze all the data – requires applying governance, security and flexible policies. A careful balance of flexibility and reuse with methodology and controls will ensure that access control, security, full data lineage (provenance) and data context are all preserved.