Big data’s emergence promised tremendous opportunities for businesses to gain real-time insights and make more informed decisions. But as is often the case with disruptive technologies, the innovations behind big data created a critical problem: data drift.
Data drift creates serious technical and business challenges for organizations looking to harness the full potential of big data. Fortunately, understanding data drift is the first step toward mitigating its harmful effects on data quality.
Data drift is a natural consequence of the diversity of big data sources. The operation, maintenance and modernization of the systems that produce the data cause unpredictable, unannounced and unending mutations of data characteristics.
Consider sources such as mobile interactions, sensor logs and web clickstreams. The data they create changes constantly as the business tweaks, updates or even re-platforms the underlying systems. The sum of these changes is data drift.
Data drift exists in three forms: structural drift, semantic drift and infrastructure drift.
Structural drift occurs when the data schema changes at the source. Common examples of structural drift are fields being added, deleted or re-ordered, or a field’s type being changed. For instance, a bank adds leading characters to its text-based account numbers to support a growing customer base. The change to the data causes the bank’s customer service system to conflate data related to bank account 00-23456 with account 01-23456.
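One way to catch structural drift early is to compare each incoming record against the schema a downstream system expects. The sketch below is a minimal illustration, not a production validator: the function name `diff_schema`, the field names and the expected types are all hypothetical, and it assumes records arrive as ordered Python dictionaries.

```python
from typing import Any


def diff_schema(expected: dict[str, type], record: dict[str, Any]) -> dict[str, list[str]]:
    """Report how a record's structure differs from an expected schema.

    `expected` maps field names to Python types, in the order the
    consumer expects them (a hypothetical schema for illustration).
    """
    expected_fields = list(expected)
    observed_fields = list(record)

    # Fields that appeared or disappeared at the source.
    added = [f for f in observed_fields if f not in expected]
    removed = [f for f in expected_fields if f not in record]

    # Fields present on both sides, listed in each side's own order.
    # If the two orderings differ, the source re-ordered its fields.
    common_expected = [f for f in expected_fields if f in record]
    common_observed = [f for f in observed_fields if f in expected]
    reordered = common_observed if common_expected != common_observed else []

    # Fields whose value no longer has the expected type.
    retyped = [f for f in common_expected if not isinstance(record[f], expected[f])]

    return {"added": added, "removed": removed, "reordered": reordered, "retyped": retyped}


if __name__ == "__main__":
    schema = {"account": str, "balance": float}
    incoming = {"balance": 10, "account": "01-23456", "branch": "X"}
    print(diff_schema(schema, incoming))
```

A check like this only sees changes to field names, order and types. A change inside a value, such as the new leading characters in the account-number example, keeps the same field and the same type, so it would need an additional format check on the value itself.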
Semantic drift occurs when the meaning of the data changes, even when the structure hasn’t. A real-world example comes from a digital marketing firm that saw a sudden revenue spike. After some deep digging, the firm determined that the spike was in fact a false positive caused by its migration from IPv4 to IPv6 network addressing, which led its analytics system to misrepresent the data.
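Because semantic drift leaves the schema intact, it tends to show up only in the values themselves. As a rough sketch of the idea, the code below profiles what share of an address field is IPv4, IPv6 or unparseable, and flags a batch whose mix differs from a baseline. The function names, the 10% tolerance and the sample addresses are assumptions made for illustration; the source does not describe how the marketing firm actually diagnosed its issue.

```python
import ipaddress
from collections import Counter


def address_family_share(addresses: list[str]) -> dict[str, float]:
    """Return the fraction of values that are IPv4, IPv6, or invalid."""
    counts: Counter[str] = Counter()
    for raw in addresses:
        try:
            version = ipaddress.ip_address(raw).version  # 4 or 6
            counts[f"ipv{version}"] += 1
        except ValueError:
            counts["invalid"] += 1
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


def has_semantic_drift(baseline: dict[str, float],
                       current: dict[str, float],
                       tolerance: float = 0.10) -> bool:
    """Flag drift when any category's share moves by more than `tolerance`.

    The 0.10 default is an arbitrary illustrative threshold.
    """
    kinds = set(baseline) | set(current)
    return any(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) > tolerance
               for k in kinds)


if __name__ == "__main__":
    # Documentation-range addresses used purely as sample data.
    last_week = address_family_share(["192.0.2.1", "192.0.2.2"])
    today = address_family_share(["192.0.2.1", "2001:db8::1"])
    print(has_semantic_drift(last_week, today))
```

The same pattern of profiling a field's values and comparing the profile against a known-good baseline can be applied to other fields whose meaning may shift while their structure stays the same.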