Extracting structured information from semi-structured or unstructured data sources (“dark data”) is an important problem. One can take it a step further by attempting to automatically build a knowledge graph from the same data sources. Knowledge databases and graphs are built using (semi-supervised) machine learning, and then subsequently used to power intelligent systems that form the basis of AI applications. The more advanced messaging and chat bots you’ve encountered rely on these knowledge stores to interact with users.
In this episode of the Data Show, I spoke with Mike Tung, founder and CEO of Diffbot – a company dedicated to building large-scale knowledge databases. Diffbot is at the heart of many web applications, and it’s starting to power a wide array of intelligent applications. We talked about the challenges of building a web-scale platform for doing highly accurate, semi-supervised, structured data extraction. We also took a tour through the AI landscape, and the early days of self-driving cars.
Here are some highlights from our conversation:
If you think about the Web as a virtual world, there are more pixels on the surface area of the Web than there are square millimeters on the surface of the earth. As a surface for computer vision and parsing, it's amazing, and you don't have to actually build a physical robot in order to traverse the Web. It is pretty tricky though.

… For example, Google has a knowledge graph team, which, as I'm sure your listeners are aware, came from a startup that was building something called Freebase, which is crowdsourced, kind of like a Wikipedia for data. They've continued to build upon that at Google, adding more and more human curators. … It's a mix of software, but there are definitely thousands and thousands of people that actually contribute to their knowledge graph. Whereas in contrast, we are a team of 15 of the top AI people in the world. We don't have anyone that's curating the knowledge. All of the knowledge is completely synthesized by our AI system. When our customers use our service, they're directly using the output of the AI. There's no human involved in the loop of our business model.

… Our high-level goal is to build the largest structured database of knowledge—the most comprehensive map of all of the entities and the facts about those entities. The way we're doing it is by combining multiple data sources. One of them is the Web, so we have this crawler that's crawling the entire surface area of the Web.
If you look at other groups doing AI research, a lot of them are focused on much the same academic style of research, which is coming up with new algorithms and publishing to sort of the same conferences.