Instrumenting your data for real-time analysis and streaming

Creating a clean, maintainable and powerful dataset can be a real challenge, especially when you’re collecting data from sources that are still evolving. Changes at the source need to be reflected in the format of the data you collect, and maintaining compatibility over time can be incredibly time-consuming and frustrating - a growing backlog of ETL and data-munging, janitorial tasks that have to come before the actual analysis of the data.

If you’re responsible for instrumenting your data sources and determining formats, there are a few helpful heuristics that can keep these janitorial tasks from eating up too much of your time down the road - and that also make the data easier to analyze and interact with.

One of the key considerations for designing a healthy dataset is making sure that the values you collect are in the best type for what you’re measuring. By (irresponsibly) oversimplifying the world of computing, we can say that there are five key types to use:

Assigning the right type to an attribute can make all the difference in the world: doing analysis across types is painful, time consuming and only gets more so at scale. Having worked with a ton of in-house analytics solutions, we’ve seen first-hand on many occasions how costly a poorly typed attribute can be.

A great example of a poorly typed value is an attribute Latency that returns a string like “24 ms” rather than a simple number like 24. The engineer responsible for this real-world example almost certainly thought including the unit of measurement in the value was a nice way of documenting his code, but he had inadvertently shot himself in the foot - you can’t ask for the average of a string without extra work. His string value generated great human-readable logs for his latency checks, but it didn’t allow for simple aggregate analysis. In a NoSQL environment he’d have to run a MapReduce job to get the answer, and in a SQL environment he’d have to write unnecessarily complex, low-performance queries that run the risk of failing at runtime. And if another string like “unknown” - something that couldn’t be meaningfully converted to a number - made its way into the mix, you’d have a serious headache.
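
To make that concrete, here is a minimal sketch in Python (the field names are made up for illustration) of what the two approaches feel like the moment you want an average:

    # Well-typed events: latency is a plain number, the unit lives in the field name.
    typed_events = [{"latency_ms": 24}, {"latency_ms": 31}, {"latency_ms": 18}]
    average = sum(e["latency_ms"] for e in typed_events) / len(typed_events)

    # Poorly typed events: latency is a human-readable string, sometimes not a number at all.
    string_events = [{"latency": "24 ms"}, {"latency": "31 ms"}, {"latency": "unknown"}]

    def parse_latency(value):
        # Hope the number comes first and the unit is always milliseconds.
        try:
            return float(value.split()[0])
        except ValueError:
            return None  # "unknown" and friends land here

    parsed = [p for p in (parse_latency(e["latency"]) for e in string_events) if p is not None]
    fragile_average = sum(parsed) / len(parsed) if parsed else None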

Having strongly typed data makes such undesirable states unrepresentable. This is why many experienced data wizards hate the .csv format - the incredibly helpful type information is not carried between the commas. We built a schema inference engine and a version control system for data models into our platform, Traintracks, to make sure data is always perfectly type-safe, which lets us generate interfaces that strictly prohibit malformed questions. Strongly typed data pays for itself every day.
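
The details of that system are specific to our platform, but the underlying idea can be sketched with nothing more than Python’s own type hints (the field names here are hypothetical):

    from dataclasses import dataclass

    @dataclass
    class LatencyCheck:
        endpoint: str      # free-text label, fine as a string
        latency_ms: float  # measurement: a pure number, unit in the field name
        healthy: bool      # a real boolean, not a "yes"/"no" string

    ok = LatencyCheck(endpoint="/login", latency_ms=24.0, healthy=True)

    # A type checker (or a schema validator at ingestion time) rejects this outright,
    # so a value like "24 ms" never reaches your analysis:
    # bad = LatencyCheck(endpoint="/login", latency_ms="24 ms", healthy=True)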

A healthy dataset will contain unique identifiers that let you drill down on specific entities. Although something like username can be a nice thing to store, it’s unique identifiers like userID or sessionID that allow you to isolate specific entities without risk of confusion or bad data. We strongly recommend recording all unique identifiers as strings - even if they look like numbers to the human eye. Experience tells us that most queries you’ll want to run on unique identifiers won’t ask for the minimum, maximum or average of your userID attribute - they’ll look for userIDs that match, contain or begin with something.
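
A small sketch of why that matters, using made-up IDs:

    user_ids = ["1007", "1042", "20731", "A-1042"]

    # The questions you actually ask about identifiers are string questions:
    in_cohort = [u for u in user_ids if u.startswith("10")]
    related = [u for u in user_ids if "1042" in u]

    # Storing IDs as integers drops leading zeros, chokes on values like "A-1042",
    # and invites meaningless aggregates such as the average userID.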

As the latency example above showed, it’s helpful to make sure you record your measurements as pure numbers.
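
Put together, a healthy event record might look something like this (hypothetical field names, with milliseconds and Unix seconds assumed as the units):

    event = {
        "userID": "10482",         # identifier: always a string
        "sessionID": "a9f3-77c1",  # identifier: always a string
        "endpoint": "/login",      # label: string
        "latency": 24,             # measurement: pure number (milliseconds)
        "timestamp": 1498521600,   # measurement: pure number (Unix seconds)
        "healthy": True,           # flag: boolean
    }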


