Data processing and analytical modelling are major bottlenecks in today’s big data world, because human intelligence is still needed to decide the relationships between data, the required data engineering tasks, and the analytical models and their parameters. This article introduces the Smart Data Platform, which aims to help solve such problems.
The concept of big data has been in vogue for about 5 years now. Judging by Google Trends, big data gained visibility rapidly and consistently from 2011 to 2015, after which interest gradually flat-lined. The truth is that big data has moved beyond the “visionary” stage of development. People are now waiting for big data to be applied to numerous industries and to generate a tremendous amount of value. TalkingData has been cultivating the field of big data in China for 5 years now. Having experienced rapid growth, we lead in big data applications for many traditional sectors. However, our growth has placed tremendous demands on our R&D, consulting, and data science resources. In order to ensure optimal service quality, we have had to turn away many potential clients. That’s because the value-realization process is extremely expensive. Aside from basic hardware and software investments, the biggest cost comes from human resources. A great deal of manpower is needed to build and maintain such applications. When we want to modify these applications’ goals, each change requires further resources.
What small and medium-sized businesses and traditional-sector players really need is a relatively cheap, no-frills version of big data: in other words, a big data platform that drastically lowers the barrier to entry. Smart Data Platform is such a platform. It will drastically reduce a business’s cost to build, operate, and maintain its data platform. Businesses will be able to make their core operations more efficient at minimal marginal cost; what’s more, they will be able to bolster their earnings from small use cases and small scenarios without incurring prohibitively high expenses.
The idea of Smart Data Platform encompasses data management, data engineering, and data science. Right now big data’s biggest bottlenecks are data processing and analytical modelling. TalkingData has been working on solutions to these two problems, and here we want to talk about their future outlook.
Currently, data processing is almost entirely reliant on individual human minds. Humans are needed to decide how to cleanse, correct, standardize, and aggregate similar data, not to mention how to identify data relationships. Before the arrival of big data, few regarded this as a problem. Since big data became “hot” in 2012, however, a whopping 204 papers about data processing have been submitted to conferences such as VLDB and SIGMOD. Even so, we are only beginning to tackle the problem of smart data processing, and there is no mature open source project or commercial product available. Drawing on our practical experience with this topic and our follow-up research on it, TalkingData has divided smart data processing into two phases: data relationship identification and data item aggregation.
Data relationship identification involves first identifying all the metadata in a set of tables/files, then using the relationships between the metadata to identify the relationships between the tables/files themselves. If we are to automate this process, we must first tackle three problems.
The first, and the simplest of the three, is how to identify metadata directly. This can be achieved by establishing rules based on human experience. For example, if we want to identify cell phone number fields, we can establish rules based on how such fields are usually named. Of course, it is unrealistic to expect pre-established rules to cover every scenario, and this is where active learning comes in. When a case is uncertain, the user can intervene and make a decision, which the computer will then use to establish a new rule.
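The rule-plus-feedback loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TalkingData’s implementation: the names `STARTER_RULES`, `identify`, and `identify_with_feedback`, the regular expressions, and the `ask_user` callback are all hypothetical. For simplicity it treats “no rule matched” as the uncertain case and learns the user’s answer as an exact-name rule.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical starter rules: metadata type -> patterns on column names.
# The patterns are deliberately crude (e.g. "tel" would also match "hotel").
STARTER_RULES: Dict[str, List[re.Pattern]] = {
    "phone_number": [re.compile(r"(mobile|cell|phone|tel)", re.IGNORECASE)],
    "email": [re.compile(r"e[-_]?mail", re.IGNORECASE)],
}


def identify(column: str, rules: Dict[str, List[re.Pattern]]) -> Optional[str]:
    """Return the first metadata type with a pattern matching the column name."""
    for meta_type, patterns in rules.items():
        if any(p.search(column) for p in patterns):
            return meta_type
    return None  # no rule applies: the case is uncertain


def identify_with_feedback(
    column: str,
    rules: Dict[str, List[re.Pattern]],
    ask_user: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Apply the rules; when none matches, ask the user and record a new rule."""
    meta_type = identify(column, rules)
    if meta_type is None:
        meta_type = ask_user(column)  # human decision for the uncertain case
        if meta_type is not None:
            # Learn from the decision: add an exact-name rule for this column.
            new_rule = re.compile(rf"^{re.escape(column)}$", re.IGNORECASE)
            rules.setdefault(meta_type, []).append(new_rule)
    return meta_type


if __name__ == "__main__":
    # Work on a copy so the starter rules stay unchanged.
    rules = {k: list(v) for k, v in STARTER_RULES.items()}
    # Stand-in for a real user prompt: always answers "phone_number".
    answer = identify_with_feedback("msisdn", rules, lambda col: "phone_number")
    print(answer, identify("MSISDN", rules))
```

In this sketch the column `msisdn` matches no starter rule, so the decision is delegated to the user, and later columns with the same name are recognized without intervention. A production system would presumably generalize from user decisions rather than memorize exact names, and could also inspect the values in a column, not just its name.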