Demand for data among today’s business users is growing on two fronts. First, business users have exhausted the opportunities in the data they already hold: they want more sources of data to find new value, and they want that data to be accurate enough to deliver reliable analytic outcomes. Second, the population of data-savvy business analysts is larger than ever and growing fast. To satisfy this rising demand, IT departments must field a continuous stream of data requests, big and small.
While business executives see tremendous potential in putting more data at more users’ fingertips, organizations struggle to deliver that data. Traditional data integration tools, developed in the 1990s for populating data warehouses, are too brittle to scale to many data sources. They’re also prohibitively time-consuming and costly.
At the same time, new big data platforms solve only the data collection part of the problem. The data loaded into Hadoop is never enterprise-ready; it requires preparation before reaching a business user. The reality is that most enterprises are dealing with hundreds or even thousands of sources. The only path to delivering clean, unified data across all of these sources is automation.
Let’s take a simple example from the insurance industry. Business analysts want to understand the claim risk from an upcoming flood. They pull customers in the affected geographic region and filter for those with active home insurance. Immediately they need to drill into high property values and sparsely populated areas, which requires classifying individual policies into more granular categories. Then they want a broad, accurate view so that they can correlate customers who hold both home and car insurance (managed in a different system, of course).
They also want to enrich the data with up-to-date property value information, so they need to bring in an external benchmark of real estate pricing. Finally, the analysis needs to happen in near-real time to be actionable; waiting a quarter, or even a full month, is out of the question. At every step, the analysts must work from clean, trusted data in order to draw correct conclusions and make data-driven decisions. To summarize, the challenges that need to be met are:

- classifying individual policies into more granular categories;
- matching customer records across separate home and car insurance systems;
- enriching internal records with external benchmark data;
- delivering clean, trusted data in near-real time.
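The analyst workflow described above can be sketched in a few lines of code. This is a minimal illustration with entirely hypothetical records, field names, and thresholds, not a representation of any real insurance system: filter active home policies in the affected region, enrich them with an external property-value benchmark, and flag customers who also hold a car policy in a separate system.

```python
# Hypothetical home policies from one system.
home_policies = [
    {"customer_id": 1, "region": "gulf_coast", "status": "active", "property_value": 850_000},
    {"customer_id": 2, "region": "gulf_coast", "status": "lapsed", "property_value": 300_000},
    {"customer_id": 3, "region": "midwest",    "status": "active", "property_value": 400_000},
    {"customer_id": 4, "region": "gulf_coast", "status": "active", "property_value": 250_000},
]

# Car policies live in a different system, keyed by the same customer id.
auto_policies = {1: "active", 4: "active"}

# External real-estate benchmark used to refresh stale property values.
benchmark = {1: 900_000, 4: 260_000}

def at_risk(policies, region, value_floor):
    """Active home policies in the affected region, enriched and filtered."""
    out = []
    for p in policies:
        if p["region"] != region or p["status"] != "active":
            continue
        enriched = dict(p)
        # Prefer the up-to-date external benchmark when one exists.
        enriched["property_value"] = benchmark.get(p["customer_id"], p["property_value"])
        # Correlate with the separate car-insurance system.
        enriched["has_auto"] = auto_policies.get(p["customer_id"]) == "active"
        if enriched["property_value"] >= value_floor:
            out.append(enriched)
    return out

exposed = at_risk(home_policies, "gulf_coast", 500_000)
```

Even this toy version shows where the hard work hides: the filter and join are trivial, but only because the customer ids, categories, and benchmark keys are assumed to already line up across systems.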
The two most challenging aspects of automating data delivery across many different sources are mastering and classification. Competent ETL engineers can handle basic transforms like look-ups or minor calculations quickly and easily, but advanced tasks, such as identifying global corporate entities or product categories across millions of records, can’t be easily scripted and maintained. We’ve all heard the example of matching “I.B.M.” to “International Business Machines,” but the problem is actually much harder: IBM has hundreds of subsidiaries, brands, and products. Mastering all of those together and packaging the result for a businessperson is no small feat.
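To see why “I.B.M.” versus “International Business Machines” is only the tip of the iceberg, consider a simple token-overlap matcher. This sketch is purely illustrative (it is not Tamr’s algorithm), and the alias table is a hypothetical hard-coded stand-in for the many equivalences a real mastering system has to learn:

```python
import re

# Hypothetical alias table; a real mastering system must learn thousands
# of such equivalences rather than hard-coding them.
ALIASES = {"ibm": "international business machines", "intl": "international"}

def normalize(name):
    """Lowercase, strip punctuation, and expand known abbreviations."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower().replace(".", ""))
    tokens = []
    for t in cleaned.split():
        tokens.extend(ALIASES.get(t, t).split())
    return set(tokens)

def similarity(a, b):
    """Jaccard similarity between the normalized token sets."""
    ta, tb = normalize(a), normalize(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

score = similarity("I.B.M.", "International Business Machines Corp.")
```

The brittleness is immediate: every subsidiary, brand, or misspelling that isn’t in the alias table drags the score down, which is exactly why hand-maintained rules stop scaling at enterprise volumes.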
At Tamr, we’ve built a solution that automates these complex tasks to deliver rich, accurate data. Tamr uses a machine-driven but human-guided workflow to keep the automation efficient, accurate, and trustworthy. Machine learning algorithms predict how individual records should be classified and matched (as products, organizations, or individuals). For instance, if an invoice comes in with the description “Latex Gloves,” our algorithm might classify it as “Laboratory Supplies” and match it to the product “Rubber Gloves.” Tamr uses the entire record to make these predictions, everything from the description to the price to less obvious signals like who created that invoice.
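The idea of using the entire record as evidence can be sketched with a toy token-count classifier. This is a minimal stand-in for the real machine learning (the categories, records, and feature scheme are all hypothetical), but it shows how description, price, and creator each contribute signal:

```python
from collections import defaultdict

def featurize(record):
    """Turn every field of the record into evidence, not just the description."""
    feats = ["desc:" + w for w in record["description"].lower().split()]
    feats.append("price:low" if record["price"] < 100 else "price:high")
    feats.append("creator:" + record["creator"])
    return feats

# Hypothetical labeled invoices.
training = [
    ({"description": "latex gloves", "price": 40, "creator": "lab_ops"}, "Laboratory Supplies"),
    ({"description": "nitrile gloves box", "price": 35, "creator": "lab_ops"}, "Laboratory Supplies"),
    ({"description": "desk chair", "price": 250, "creator": "facilities"}, "Office Furniture"),
]

# Count how often each feature appears under each category label.
counts = defaultdict(lambda: defaultdict(int))
for record, label in training:
    for f in featurize(record):
        counts[label][f] += 1

def classify(record):
    """Pick the category whose training records share the most features."""
    scores = {label: sum(fc[f] for f in featurize(record)) for label, fc in counts.items()}
    return max(scores, key=scores.get)

category = classify({"description": "latex gloves", "price": 42, "creator": "lab_ops"})
```

Note how the low price and the `lab_ops` creator reinforce the description tokens; that is the benefit of predicting from the whole record rather than a single field.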
To ensure that these predictions are accurate, we provide a workflow for experts and users to give feedback, and Tamr’s algorithms are built to iterate on that feedback. Under the covers, we use supervised learning techniques to tune weights and improve accuracy.
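The feedback loop can be illustrated with a perceptron-style update, a standard supervised learning technique (offered here as a generic sketch, not a description of Tamr’s internals; the feature names and learning rate are hypothetical). Each time an expert confirms or rejects a candidate match, the weights are nudged toward the expert’s answer:

```python
# Weights over hypothetical match features, all starting untrained.
weights = {"same_zip": 0.0, "name_similarity": 0.0, "same_phone": 0.0}

def predict(features):
    """Declare a match when the weighted evidence clears a threshold."""
    score = sum(weights[k] * v for k, v in features.items())
    return score > 0.5

def learn(features, expert_says_match, lr=0.1):
    """Perceptron-style update: nudge weights toward the expert's answer."""
    error = (1.0 if expert_says_match else 0.0) - (1.0 if predict(features) else 0.0)
    for k, v in features.items():
        weights[k] += lr * error * v

feedback = [
    ({"same_zip": 1, "name_similarity": 0.9, "same_phone": 1}, True),
    ({"same_zip": 0, "name_similarity": 0.2, "same_phone": 0}, False),
] * 10  # iterate on the same feedback, as described above

for features, label in feedback:
    learn(features, label)
```

After a few passes the weights stabilize: records that an expert would call a match score above the threshold, and the non-matches stay below it.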