Instrumenting your data for real-time analysis and streaming

Creating a clean, maintainable and powerful dataset can be a real challenge, especially when collecting data from sources that are evolving. Changes at the data source need to be reflected in the format of the collected data, and maintaining compatibility over time can be incredibly time-consuming and frustrating - a growing backlog of ETL and data-munging janitorial tasks that have to come before any actual analysis of the data.

If you’re responsible for instrumenting your data sources and determining formats, a few helpful heuristics can keep these janitorial tasks from eating up your time in the future - and, just as importantly, make the data easier to analyze and interact with.

One of the key considerations for designing a healthy dataset is making sure that the values you collect are stored in the best type for what you’re measuring. Oversimplifying (irresponsibly) the world of computing, we can say that there are five key types to choose from.

Assigning the right type to an attribute can make all the difference in the world: doing analysis across types is painful and time-consuming, and only gets more so at scale. Having worked with a ton of in-house analytics solutions, we’ve seen first-hand on many occasions how costly a poorly typed attribute can be.
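As a sketch of what well-typed instrumentation can look like - the field names and types here are our own illustration, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative event record with one deliberately chosen type per field.
@dataclass
class RequestEvent:
    user_id: str         # identifiers as strings, even when they look numeric
    endpoint: str        # free-form text
    latency_ms: float    # measurements as pure numbers, unit in the name
    cache_hit: bool      # yes/no facts as booleans
    timestamp: datetime  # points in time as timestamps, not strings
```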

A great example of a poorly typed value is an attribute Latency that returns a string like “24 ms” rather than a simple number like 24. The engineer responsible for this real-world example almost certainly thought including the unit of measurement in the value was a nice way of documenting his code, but he had inadvertently shot himself in the foot - you can’t ask for the average of a string. His string values generated great human-readable logs for his latency checks, but didn’t allow for simple aggregate analysis. In a NoSQL environment he’d have to run a MapReduce job to get the answer, and in a SQL environment he’d have to write unnecessarily complex, low-performance queries that could fail at runtime. And if another string like “unknown”, something that can’t be meaningfully converted to a number, made its way into the mix, he’d have a serious headache.
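Here’s a minimal sketch of that pitfall (the sample values are invented): averaging numbers is one expression, while averaging strings forces every consumer to re-parse the values and quietly decide what to do with entries like “unknown”.

```python
string_latencies = ["24 ms", "31 ms", "unknown", "18 ms"]
numeric_latencies = [24.0, 31.0, 18.0]

# With numbers, the aggregate is trivial:
print(sum(numeric_latencies) / len(numeric_latencies))

# With strings, every consumer re-parses and re-decides:
parsed = []
for value in string_latencies:
    try:
        parsed.append(float(value.split()[0]))
    except ValueError:
        pass  # dropping "unknown" silently skews the result
print(sum(parsed) / len(parsed))
```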

Having strongly typed data makes such undesirable states unrepresentable. This is why many experienced data wizards hate the CSV format - the incredibly helpful type information isn’t carried between the commas. We built a schema inference engine and a version control system for data models into our platform, Traintracks, to make sure data is always type-safe, which lets us generate interfaces that strictly prohibit malformed questions. Strongly typed data pays for itself every day.
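To make the idea concrete, here’s a minimal sketch of schema enforcement at ingestion time - a hand-written check for illustration, not our actual inference engine:

```python
SCHEMA = {"userID": str, "latency_ms": float, "cache_hit": bool}

def validate(event: dict) -> dict:
    """Reject any event whose fields don't match the declared types."""
    for field, expected in SCHEMA.items():
        if not isinstance(event.get(field), expected):
            raise TypeError(f"{field!r} must be {expected.__name__}")
    return event

validate({"userID": "u-1042", "latency_ms": 24.0, "cache_hit": True})   # passes
# validate({"userID": "u-1042", "latency_ms": "24 ms", "cache_hit": True})  # TypeError
```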

A healthy dataset contains unique identifiers that let you drill down on specific entities. Something like username can be a nice thing to store, but it’s unique identifiers like userID or sessionID that allow you to isolate specific entities without risk of confusion or bad data. We strongly recommend recording all unique identifiers as strings - even if they look like numbers to the human eye. Experience tells us that the queries you’ll want to run against unique identifiers are rarely about the minimum, maximum or average of your userID attribute, but rather about finding userIDs that match, contain or begin with something.
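For example (with invented IDs), the string representation makes the prefix and substring queries you actually run feel natural:

```python
user_ids = ["us-east-10423", "us-east-10987", "eu-west-20311"]

# Prefix matching is the natural question to ask of an identifier:
east = [uid for uid in user_ids if uid.startswith("us-east-")]

# Stored as integers, IDs like "007" lose their leading zeros, and a
# prefix query means converting back to strings anyway.
```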

As the latency example above showed, it’s helpful to make sure you record your measurements as pure numbers.
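A minimal sketch of instrumentation done this way - the event shape is our own illustration - keeps the unit in the field name and the value as a pure number, so the log stays human-readable and aggregatable at the same time:

```python
import json
import time

start = time.perf_counter()
time.sleep(0.024)  # stand-in for the operation being timed
event = {
    "event": "request_completed",
    "latency_ms": (time.perf_counter() - start) * 1000.0,
}
print(json.dumps(event))  # e.g. {"event": "request_completed", "latency_ms": 24.3}
```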
