Instrumenting your data for real-time analysis and streaming

Creating a clean, maintainable and powerful dataset can be a real challenge, especially when collecting data from sources that are evolving. Changes at the source need to be reflected in the format of the data you collect, and maintaining compatibility over time can be incredibly time-consuming and frustrating - a growing backlog of ETL and data-munging, janitorial tasks that have to come before the actual analysis of the data.

If you’re responsible for instrumenting your data sources and determining formats, there are a few helpful heuristics that can keep these janitorial tasks from eating up your time later - and that also make the data easier to analyze and interact with.

One of the key considerations for designing a healthy dataset is making sure that the values you collect are in the type that best fits what you’re measuring. By (irresponsibly) oversimplifying the world of computing, we can say that there are five key types to use:
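
As a minimal sketch - assuming a typical set of primitives (strings, integers/floats, booleans and datetimes) and a hypothetical latency-check event, rather than any particular product’s schema - a well-typed record might look like this:

```python
# A sketch of a well-typed event record; the field names (userID, latencyMs,
# success, timestamp) are illustrative assumptions, not a prescribed schema.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class LatencyCheck:
    userID: str          # unique identifier: a string, even when it looks numeric
    endpoint: str        # plain string attribute
    latencyMs: float     # measurement: a pure number, with the unit documented in the name
    success: bool        # boolean flag
    timestamp: datetime  # datetime, ideally timezone-aware (UTC)

event = LatencyCheck(
    userID="000482",
    endpoint="/api/login",
    latencyMs=24.0,
    success=True,
    timestamp=datetime.now(timezone.utc),
)
```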

Assigning the right type to an attribute can make all the difference in the world: doing analysis across types is painful and time-consuming, and it only gets more so at scale. Having worked with a ton of in-house analytics solutions, we’ve seen first-hand on many occasions how costly a poorly typed attribute can be.

A great example of a poorly typed value is an attribute Latency that returns a string like “24 ms” rather than a simple number like 24. The engineer responsible for this real-world example almost certainly thought that including the unit of measurement in the value was a nice way of documenting his code, but he had inadvertently shot himself in the foot - you can’t meaningfully ask for the average of a string. His string values generated great human-readable logs for his latency checks, but didn’t allow for simple aggregate analysis. In a NoSQL environment he’d have to run a MapReduce job to get the answer, and in a SQL environment he’d have to write unnecessarily complex, low-performance queries that ran the risk of falling over entirely. And if another string like “unknown”, something that can’t be meaningfully converted to a number, made its way into the mix, you’d have a serious headache.
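
To make the cost concrete, here is a small illustration (the values are invented, not taken from the original system) of what aggregating the string-typed attribute forces you to do compared with a properly numeric one:

```python
# Hypothetical latency samples recorded as strings vs. as numbers.
string_latencies = ["24 ms", "31 ms", "unknown", "18 ms"]
numeric_latencies = [24, 31, 18]

# With numbers, the average is a one-liner.
print(sum(numeric_latencies) / len(numeric_latencies))

# With strings, you first have to parse and defensively discard junk values
# like "unknown" - extra code, extra CPU, and silently dropped data.
parsed = []
for value in string_latencies:
    try:
        parsed.append(float(value.split()[0]))
    except ValueError:
        continue  # "unknown" cannot be converted, so it quietly disappears
print(sum(parsed) / len(parsed))
```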

Having strongly typed data makes such undesirable states unrepresentable. This is also why many experienced data wizards hate the .csv format - the incredibly helpful type information is not carried between the commas. We built a schema inference engine and version control system for data models into our platform, Traintracks, to make sure data is always type-safe, which lets us generate interfaces that strictly prohibit malformed queries. Strongly typed data pays for itself every day.
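
As a rough illustration of how type information gets lost between the commas - the column names and values here are invented for the example:

```python
import csv
import io

# Invented CSV snippet: every value arrives as a string, so "000482" and "24"
# carry no hint of whether they are identifiers, measurements or free text.
raw = "userID,latencyMs\n000482,24\n000517,unknown\n"
for row in csv.DictReader(io.StringIO(raw)):
    print(type(row["latencyMs"]), row["latencyMs"])  # always <class 'str'>

# With an explicit schema, the malformed row is rejected at ingestion time
# instead of surfacing later as a broken aggregate.
schema = {"userID": str, "latencyMs": float}
for row in csv.DictReader(io.StringIO(raw)):
    try:
        typed = {name: cast(row[name]) for name, cast in schema.items()}
        print("accepted:", typed)
    except ValueError:
        print("rejected malformed row:", dict(row))
```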

A healthy dataset will contain unique identifiers that let you drill down on specific entities. Although something like username can be a nice thing to store, it’s unique identifiers like userID or sessionID that allow you to isolate specific entities without risk of confusion or bad data. We strongly recommend recording all unique identifiers as strings - even if they look like numbers to the human eye. Experience tells us that most queries you’ll want to run against unique identifiers won’t ask for the minimum, maximum or average of your userID attribute, but rather for userIDs that match, contain or begin with something.
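
A small sketch of the kind of lookups that matter for identifiers, using a hypothetical events table with userID stored as a string:

```python
import sqlite3

# Hypothetical events table; userID is stored as TEXT even though the values
# look numeric, so prefix and substring matches behave as expected.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (userID TEXT, action TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("000482", "login"), ("000517", "purchase"), ("104820", "login")])

# Typical identifier queries: exact match, prefix, substring - not averages.
print(con.execute("SELECT * FROM events WHERE userID = '000482'").fetchall())
print(con.execute("SELECT * FROM events WHERE userID LIKE '000%'").fetchall())
print(con.execute("SELECT * FROM events WHERE userID LIKE '%482%'").fetchall())

# Stored as a number, '000482' would silently become 482, losing the leading
# zeros and breaking any prefix-based lookup.
```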

As the latency example above showed, it’s helpful to make sure you record your measurements as pure numbers.


