“It is a capital mistake to theorise before one has data.”
Are you stuck with a problem? Great! Previously I wrote a general introduction to predictive functions and where you might use them to provide “killer” features in your applications. I argued that data analytics will become a part of modern software engineering. In both disciplines problem solving is an essential skill, and the harder the problems you can crack, the more unique the applications you will get. Luckily, many of the same good old principles apply. One is test-driven design, which can also guide you through the dark alleys of data modeling mysteries.
A solution to a typical prediction problem can be separated into two parts. The first is a prediction model that contains the best current knowledge about the subject area, or the problem domain. The other is the inference algorithm that lets you ask questions, or queries, of that gained knowledge. The prediction model is then a potentially reusable part of the software architecture, while typically there is just one inference query for every predictive function.
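As a rough illustration of that split, here is a minimal Python sketch. The class name, the function names and the labels such as "high_sugar" are all hypothetical and chosen only for this example. The model is a deliberately simple table of counts, not the technique used later in this post.

```python
from collections import Counter, defaultdict


class PredictionModel:
    """The reusable part: what has been learned so far.

    Hypothetical toy model that only counts how often each outcome
    was observed under each condition.
    """

    def __init__(self):
        self.counts = defaultdict(Counter)

    def learn(self, condition, outcome):
        self.counts[condition][outcome] += 1


def most_likely_outcome(model, condition):
    """Inference query 1: the most frequently seen outcome, or None if unseen."""
    seen = model.counts.get(condition)
    if not seen:
        return None
    return seen.most_common(1)[0][0]


def outcome_probability(model, condition, outcome):
    """Inference query 2: the observed share of one outcome for a condition."""
    seen = model.counts.get(condition)
    if not seen:
        return 0.0
    return seen[outcome] / sum(seen.values())


# Made-up observations, for illustration only.
model = PredictionModel()
for outcome in ("glucose_up", "glucose_up", "glucose_flat"):
    model.learn("high_sugar", outcome)

print(most_likely_outcome(model, "high_sugar"))
print(outcome_probability(model, "high_sugar", "glucose_flat"))
```

Both queries read from the same model object, which is the point of the separation: the knowledge is stored once, and each predictive function is a thin query on top of it.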
Obviously, you need to start gaining knowledge before you can ask questions. But before looking at, or even collecting, the real data, it is essential to understand your problem domain well and formulate good questions. For that, I advise you, against Sherlock’s quote above, to theorise before having the data.
The reason for this precaution is that without a good understanding, anything can be explained by the data. The more data you have, the bigger the false conclusions it lets you draw!
The traditional statistical approach is to have a predefined hypothesis that is either accepted or rejected based on the evidence from the data. The machine learning camp is a bit more relaxed, but you still need separate sets of data for training and testing the model. So, what can you do if you only have a faint idea about the problem and no real data at all?
I am currently working on a problem where we want to study statistically how the human body responds to different kinds of food, and how to estimate that response for an individual person. You can create some seriously beneficial applications with this knowledge. But in addition to the data-related problems listed above, the nutritional diaries and laboratory results needed for the baseline modeling are highly personal and sensitive, and they are also hard to collect in a clinically competent way. This means that you really want to be sure about your methods before you start, so that the data is collected correctly the first time.
This is where the test-driven approach has helped me a lot. The idea is to make a best guess about the features of your data, generate test data with those features, and then test how well your modeling technique can capture them. After a successful capture, it is good to tweak the data until you reach the limits of the model and it breaks down. Then you adapt and iterate again. This way you can gain deep insight into both your problem and your modeling algorithm, be it a proprietary implementation or one of your own.
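The loop described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the planted feature is a straight-line relation with a known slope, the "modeling technique" is ordinary least squares, and every name and number here (generate_data, fit_line, slope_error, the slope 0.5, the noise levels) is made up for illustration.

```python
import random


def generate_data(slope, intercept, noise_sd, n, seed=0):
    """Generate synthetic (x, y) pairs with a known, planted linear relation."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def slope_error(true_slope, noise_sd, n=200, seed=0):
    """How far the fitted slope lands from the slope we planted."""
    xs, ys = generate_data(true_slope, 1.0, noise_sd, n, seed)
    estimated_slope, _ = fit_line(xs, ys)
    return abs(estimated_slope - true_slope)


# Step 1: the "test". With little noise the planted slope should be recovered.
TRUE_SLOPE = 0.5  # assumed value, for illustration only
assert slope_error(TRUE_SLOPE, noise_sd=0.1) < 0.05

# Step 2: tweak the data. Raise the noise and watch where recovery degrades.
for noise_sd in (0.1, 1.0, 10.0, 50.0):
    print(noise_sd, slope_error(TRUE_SLOPE, noise_sd))
```

The assertion plays the role of a unit test for the modeling technique, and the loop at the end is the "tweak until it breaks" step. Swapping in a different generator or a different fitting method keeps the same structure.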
Next I will walk you through how I started this process. I will also try to show it visually, so if you cannot follow all the long-winded theoretical underpinnings, just look at the images and see how things click together.
Here I took some of the actual parameters that are studied and connected them. I assumed that the nutrition parameters in the top row affect the lab results in the bottom row, and not the other way around, to justify the use of directed arrows. Note that the size of the actual effects, if any, is unknown at this point, and that is what we would like to find out.
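In code, such a structure could be written down as a set of directed edges whose effect sizes are still unknown. The parameter names below are hypothetical placeholders, not the parameters of the actual study, and the helper functions are likewise only an illustrative sketch.

```python
# Hypothetical parameter names, for illustration only.
NUTRITION = {"fiber", "sugar", "saturated_fat"}
LAB_RESULTS = {"blood_glucose", "cholesterol"}

# Each key is a directed edge (cause, effect). The value is the effect size,
# which is None because it is unknown until it has been estimated from data.
EFFECTS = {
    ("fiber", "cholesterol"): None,
    ("saturated_fat", "cholesterol"): None,
    ("sugar", "blood_glucose"): None,
    ("fiber", "blood_glucose"): None,
}


def validate(effects):
    """Check the assumption that arrows only run from nutrition to lab results."""
    for cause, effect in effects:
        if cause not in NUTRITION or effect not in LAB_RESULTS:
            raise ValueError(f"edge {cause} -> {effect} breaks the assumed direction")


def parents(effects, lab_result):
    """List the nutrition parameters assumed to affect one lab result."""
    return sorted(cause for cause, effect in effects if effect == lab_result)


validate(EFFECTS)
print(parents(EFFECTS, "cholesterol"))
```

Keeping the effect sizes as None makes the modeling task explicit: the structure is assumed up front, and the numbers are what the data is supposed to fill in.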
Let’s look at one of these relationships more closely. For my applications it is enough to approximate the relation as linear.
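Written out, one common way to state such a linear approximation for a single nutrition parameter and a single lab result is the following. The symbols are my own notation for this sketch, and the normally distributed noise term is an added assumption rather than something fixed by the problem: x_i is the nutrition value and y_i the lab result for observation i, alpha is the baseline level, beta is the unknown effect size, and epsilon_i is the noise.

```latex
y_i = \alpha + \beta x_i + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

Estimating beta, and judging whether it differs from zero, is then the quantitative version of asking whether the arrow in the graph carries any effect.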