Wrangling Data in a Big Data World
- by 7wData
It’s Monday morning and Luke, the BIGCO brand manager, walks into his office. He asks the digital assistant device on his desk how BIGCO’s Acme soft-drinks brand performed over the weekend. After a second, the assistant replies that Acme’s share of sales has fallen by 0.5%. Luke asks what’s driving the decrease, and is told it’s due to issues in the BIGCO West region. His digital assistant offers to email Tom, the sales manager, with a summary of their findings.
Two hours later in California, Tom uses the report to drill down to the root cause, combining sales data with BIGCO’s shipment data. After he blends in third-party data, including market share, weather and econometrics, it looks as if failure to promote Acme during prolonged periods of good weather is a contributing factor. Tom is able to simulate the effect of various promotions on brand share and profitability, and creates a plan. Luke approves the plan, and his digital assistant makes a note to monitor the situation and report progress.
Digital assistants and smart machines are cool, but the most useful business insights come from combining internal data with a multiplicity of external data sources, whether it be sales, shipments, promotions, financials, or a hundred other things. It’s data integration that allows this scenario to play out. Without data integration, the only thing cool technology can do is stare helplessly at a pile of bricks it can’t assemble into anything useful.
Data integration relies on the ability to link fields containing the same information, for example information about states, in different datasets. If all datasets used the same identifiers, this is would be easy, but they don’t: some use two-digit identifiers (IL, CT), some use full names (Illinois, Connecticut), and so on. And that’s a simple example: Universal Product Codes (UPC) identify a type of product (10-oz. can of BIGCO’s Acme soda, say) and can be used for point of sale and stock keeping. Electronic Product Codes can be used to identify individual items—so every single can (more likely, every bottle of champagne, as people aren't terribly interested in tracking by the soda can) could have a different code. Other codes are used to identify aggregations of products, such as in-store combo packs and warehouse pallets.
Data integration means reconciling these different entities and coding systems. Part of the process is to make the data ready for analysis by either aggregating it up or disaggregating it down to a common basis so that, for example, point-of-sale data at UPC level can be combined with advertising data at brand level. Finally, data has to be enriched to enhance its usefulness for analytics: supplementing a brief product description, say, with codified attributes such as manufacturer, brand, size, flavor, packaging, health claims and ingredients.
The gold standard for data integration is the extract, transform and load (ETL) process associated with the data warehouse. ETL provides an automated, high-quality process with defined outcomes, and is the best way to curate long-lived, high-value assets, such as the data used in C-suite dashboards and KPIs.
[Social9_Share class=”s9-widget-wrapper”]
Upcoming Events
Evolving Your Data Architecture for Trustworthy Generative AI
18 April 2024
5 PM CET – 6 PM CET
Read MoreShift Difficult Problems Left with Graph Analysis on Streaming Data
29 April 2024
12 PM ET – 1 PM ET
Read More