When data wrangling vendor Trifacta was launched in 2011, it was driven by two goals - to improve the productivity of data analysts and to solve one of the biggest and hardest problems of dealing with high-volume unstructured data in a Hadoop environment. Its free desktop edition, Wrangler, has been adopted by more than 4,000 companies in 132 countries to explore, transform and join diverse data for analysis.
DataIQ caught up with chief strategy officer and co-founder Joe Hellerstein, who is also Jim Gray Chair of Computer Science at UC Berkeley, when he was in the UK for the Big Data LDN event and to launch the company’s new solution, Wrangler Edge, which is aimed at analytics teams and for the first time handles data sets beyond Hadoop. Explaining the move, Hellerstein said: “You run into tonnes of people who have data problems, but who don’t want to manage a Hadoop cluster, so we can now engage with them as well.”
“When we built Trifacta, analysts told us they were spending all their time doing data wrangling. We published that as an academic study. As we’ve brought the product to market, it’s gratifying to discover it was not just an academic problem,” he explained. “The business is going the way I expected. We’ve built the product we said we would build, the problem is the one we said we would solve, and the market for the software is what we hoped.”
Unlike many tech start-ups, Trifacta has not had any need to pivot towards a different market - instead, the problem of enabling the analysis of unstructured data has simply grown bigger and more complex, with new data sources and types adding to the challenge. The machine learning which drives Trifacta continues to be the source of its value, but Wrangler Edge has picked up on users’ desire to be able to intervene and apply their human understanding.
“Our philosophy has been that AI and machine learning are very important, but there are times when human context has a part to play. That’s why we made the interface very easy to go back and forth between the two,” explained Hellerstein.
“That said, our interface was more usable than what most of the market was offering, but was perceived as very technical by some. So we have been working hard to build a ramp between the two that is very smooth, from user-guided to technical. The middle - Expression Builder - was not there until recently for people who need to get involved and over-ride recommendations, but who are not technical. We learned from experience there was a gap there - we didn’t see that at first.”
Hellerstein brings a deep academic understanding of AI to bear on the product’s development, pointing out that a constant thematic question is the tension between model accuracy and model comprehensibility. “Models that have been working very well in recent years, like deep learning, are not humanly comprehensible once they have been trained as to why they do what they do. Simpler models, like decision trees - if x is less than this, go down this branch - are more human understandable, but have less accuracy. Depending on the context, you may want one or the other,” he explained.
“The machine doesn’t have enough signal to know the user context. The data patterns may not be relevant to the user - they need to bring their domain expertise,” he added. For example, web log data may have been used in an IT environment to predict outages and plan upgrades. If marketing wants to use the same data for customer analytics, none of that previous modelling is relevant.
Says Hellerstein: “We knew from the beginning that the machine learning was an aid, but the user had to drive it.”