Internet of things (IoT) data promises to unlock unique and unprecedented business insights, but only if enterprises can successfully manage the data flowing into their organizations from IoT sources. One problem enterprises will encounter as they try to elicit value from their IoT initiatives is data drift: changes to the structure, content, and meaning of data that result from frequent and unpredictable changes to source devices and data processing infrastructure.
Whether processed in stream or batch form, data typically moves from source to final storage locations through a variety of tools. Changes anywhere along this chain — be they schema changes to source systems, shifts in the meaning of coded field values, or an upgrade or addition to the software components involved in data production — can result in incomplete, inaccurate, or inconsistent data in downstream systems.
The effects of this data drift can be especially pernicious because they often go undetected for long periods of time, polluting data stores and subsequent analyses with low-fidelity data. Until detected, the use of this problematic data can lead to false findings and poor business decisions. When the problem is finally detected, it is usually fixed through manual data cleanup and preparation by data scientists, which adds hard costs, opportunity costs, and delays to the analysis.
Using StreamSets Data Collector to build and manage big data ingest pipelines will help mitigate the effects of data drift while vastly reducing the amount of time spent cleansing data. In this article, we will walk through a typical use case of real-time data ingest of IoT sensor data into HDFS for analysis and visualization using Impala or Hive.
StreamSets Data Collector can ingest streaming and batch data from a large number of sources without requiring you to write a single line of code. It can perform transformations and sanitize the data in-stream, then write to a large number of destinations. When the pipeline is placed in operation, you get fine-grained data flow metrics, detection of anomalous data, and alerting so that you can stay on top of pipeline performance. StreamSets Data Collector can run standalone or be deployed onto a Hadoop cluster, and it offers connectors to a variety of data source and destination types.
The following use case involves data generated in real time from shipping containers.
The first example of data drift manifests itself in the IoT sensors that the shipping company uses. Due to upgrades over time, the sensors in the field run one of three different firmware versions. Each revision adds new data fields and changes the schema. To derive value from that sensor data, the system we use to ingest the information must be able to handle this diversity.
Our pipeline reads data from a RabbitMQ system that receives MQTT messages from the sensors in the field. We first verify that the incoming messages are the ones we want to work with. To do so, we use a stream selector processor to define a data rule for the incoming messages: records that match the rule’s criteria are routed downstream, and anything that doesn’t match is discarded.
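The logic of that filtering rule can be pictured in a few lines of plain Python. This is only a sketch of the idea, not StreamSets configuration: in Data Collector the rule is defined in the stream selector itself, without code. The sketch assumes JSON message payloads, and the field names `container_id` and `firmware_version` are hypothetical, chosen purely for illustration.

```python
import json

# Hypothetical required fields; the real sensor schema is not given here.
REQUIRED_FIELDS = ("container_id", "firmware_version")


def is_wanted(record):
    """Return True if the record is an object carrying every required field."""
    return isinstance(record, dict) and all(
        record.get(field) is not None for field in REQUIRED_FIELDS
    )


def filter_messages(raw_messages):
    """Yield parsed records that match the rule; discard everything else."""
    for raw in raw_messages:
        try:
            record = json.loads(raw)
        except (TypeError, ValueError):
            continue  # payload is not valid JSON: discard it
        if is_wanted(record):
            yield record
```

Messages that fail to parse, or that lack one of the required fields, never reach the downstream stages.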
We then use another stream selector to route data based on the firmware version of the device. All records matching firmware version 1 go to one path, those matching version 2 go to another, and so forth. We also specify a default catch-all rule to send any outliers to an “error” path. With modern data streams, we should expect the data to change without warning, so we set up graceful error handling that shunts anomalous records to a local file, a Kafka stream, or a secondary pipeline. That way we can keep the pipeline running and still reprocess, after the fact, any data that doesn’t fit the pipeline’s primary intent.
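Conceptually, this second selector behaves like a lookup with a default branch. The following Python sketch illustrates that routing pattern under stated assumptions; it is not how Data Collector is configured, and the field name `firmware_version` and the path labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical path labels, one per known firmware version.
KNOWN_VERSIONS = {1: "firmware_v1", 2: "firmware_v2", 3: "firmware_v3"}
ERROR_PATH = "error"  # default catch-all for outliers


def route(record):
    """Return the name of the path a record should follow."""
    version = record.get("firmware_version")
    # Only an exact integer match on a known version takes a primary path.
    if type(version) is int and version in KNOWN_VERSIONS:
        return KNOWN_VERSIONS[version]
    return ERROR_PATH


def route_all(records):
    """Group records by path; anomalous records never stop the loop."""
    paths = defaultdict(list)
    for record in records:
        paths[route(record)].append(record)
    return dict(paths)
```

Records collected under the `error` key stand in for the anomalous data that the real pipeline would shunt to a local file, a Kafka stream, or a secondary pipeline for later reprocessing.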