Creating a clean, maintainable and powerful dataset can be a real challenge, especially when collecting data from sources that evolve over time. Changes at the source often need to be reflected in the format of the data collected, and maintaining compatibility over time can be incredibly time-consuming and frustrating: a growing backlog of ETL, data munging and janitorial tasks that have to come before the actual analysis of the data.
If you’re responsible for instrumenting your data sources and determining formats, a few helpful heuristics can keep these janitorial tasks from taking too much time in the future, and can also make the data easier to analyze and interact with.
One of the key considerations for designing a healthy dataset is making sure that the values you collect are in the best type for what you’re measuring. By (irresponsibly) oversimplifying the world of computing we can say that there are five key types to use:
Assigning the right type to an attribute can make all the difference in the world: doing analysis across types is painful and time-consuming, and only gets more so at scale. Having worked with a ton of in-house analytics solutions, we’ve seen first-hand on many occasions how costly a poorly typed attribute can be.
A great example of a poorly typed value is having an attribute Latency that returns a string like “24 ms”, rather than a simple number like 24. The engineer responsible for this real-world example almost certainly thought including the unit of measurement in the value was a nice way of documenting his code, but he had inadvertently shot himself in the foot - it’s unnecessarily complex to ask for the average of a string. His string value generated great human-readable logs for his latency checks, but didn’t allow for simple aggregate analysis. In a NoSQL environment he’d have to run a MapReduce job to get the answer, and in a SQL environment he’d have to write unnecessarily complex and low-performance queries that would run the risk of crashing the program. If another string like “unknown”, something that couldn’t be meaningfully converted to a number, made its way into the mix, he’d have a serious headache.
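To make the cost concrete, here is a minimal Python sketch of the latency example. The record layout, the field names latency_ms and latency, and the helper parse_latency are all hypothetical and chosen only for illustration; they are not taken from the system described above.

```python
from statistics import mean

# Hypothetical records: latency stored as a plain number of milliseconds.
typed = [{"latency_ms": 24}, {"latency_ms": 31}, {"latency_ms": 26}]
typed_average = mean(r["latency_ms"] for r in typed)  # one line, no parsing

# The same kind of measurement stored as human-readable strings.
stringly = [{"latency": "24 ms"}, {"latency": "31 ms"}, {"latency": "unknown"}]


def parse_latency(value):
    """Return the latency in ms, or None if the string is not a measurement."""
    number, _, unit = value.partition(" ")
    if unit != "ms":
        return None
    try:
        return float(number)
    except ValueError:
        return None


parsed = [parse_latency(r["latency"]) for r in stringly]
usable = [p for p in parsed if p is not None]
string_average = mean(usable)         # average over whatever could be parsed
skipped = len(parsed) - len(usable)   # values silently dropped from the result
```

With the numeric attribute the average is a single call. With the string attribute every query has to carry its own parsing logic and decide what to do with values like “unknown”, and any two analysts may decide differently.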
Having strongly typed data makes such undesirable states unrepresentable. This is why many experienced data wizards hate the .csv format - the incredibly helpful type information is not carried between commas. We wrote a schema inference engine and version control system for data models into our platform, Traintracks, to make sure data was always perfectly type-safe, and it allows us to generate interfaces that strictly prohibit malformed questions. Strongly typed data pays for itself every day.
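The point about .csv can be seen with a small round trip using Python’s standard csv module. This is only a sketch; the field names user_id, latency_ms and cached are assumptions made up for the example.

```python
import csv
import io

# A hypothetical record with a string, an integer and a boolean.
row = {"user_id": "00742", "latency_ms": 24, "cached": True}

# Write the record to an in-memory CSV "file".
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: the values survive, but their types do not.
buffer.seek(0)
restored = next(csv.DictReader(buffer))
restored_types = {key: type(value).__name__ for key, value in restored.items()}
```

After the round trip every value in restored is a string, so whoever reads the file has to guess which columns were numbers, booleans or identifiers.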
A healthy dataset will contain unique identifiers that make sure you can drill down on specific entities. Although something like username can be a nice thing to store, it’s unique identifiers like userID or sessionID that allow you to isolate specific entities without risk of confusion or bad data. We strongly recommend recording all unique identifiers as strings - even if they look like numbers to the human eye. Experience tells us that most queries you’ll want to run on unique identifiers won’t ask for the minimum, maximum or average of your userID attribute, but rather for userIDs that match, contain or begin with something.
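A short Python sketch of why string identifiers are the safer choice. The userID values below are invented for illustration.

```python
# Hypothetical identifiers that look like numbers to the human eye.
user_ids = ["00742", "00751", "10742"]

# Stored as numbers, leading zeros are lost and the original ID cannot be rebuilt.
as_numbers = [int(u) for u in user_ids]
round_trip = [str(n) for n in as_numbers]

# Stored as strings, the typical identifier queries are direct.
exact = [u for u in user_ids if u == "00742"]             # match
begins_with = [u for u in user_ids if u.startswith("007")]  # begin with
contains = [u for u in user_ids if "742" in u]            # contain
```

Here round_trip no longer equals user_ids, because “00742” has become “742”, while the match, begins-with and contains queries work on the string form without any conversion.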
As the latency example above showed, it’s helpful to make sure you record your measurements as pure numbers.