How to Approach a Data Intensive Problem


“It is a capital mistake to theorise before one has data.”

Are you stuck with a problem? Great! Previously I wrote a general introduction to predictive functions and where you might use them to provide "killer" features in your applications. I argued that data analytics will become a part of modern software engineering. In both disciplines problem solving is an essential skill, and the harder the problems you can crack, the more unique the applications you will get. Luckily, many of the same good old principles apply. One is test-driven design, which can also guide you through the dark alleys of data modeling mysteries.

A solution to a typical prediction problem can be separated into two parts. The first is a prediction model that contains the best current knowledge about the subject area, or problem domain. The second is the inference algorithm that lets you ask questions, or queries, of that knowledge. The prediction model is then a potentially reusable part of the software architecture, while typically there is just one inference query for every predictive function.
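To make the split concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the model is a single straight line; the names `LinearModel`, `predict` and `input_for_target` are hypothetical and not taken from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearModel:
    """The reusable part: learned knowledge about one relationship."""
    slope: float
    intercept: float


def predict(model: LinearModel, x: float) -> float:
    """One inference query: the expected response for a given input."""
    return model.intercept + model.slope * x


def input_for_target(model: LinearModel, y_target: float) -> float:
    """A second query on the same model: the input that yields a target response."""
    return (y_target - model.intercept) / model.slope


# Illustrative values only; a real model would be learned from data.
model = LinearModel(slope=2.0, intercept=1.0)
print(predict(model, 3.0))
print(input_for_target(model, 7.0))
```

Each query function would back a different predictive function in an application, while both read from the same model object.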

Obviously, you need to start gaining knowledge before you can ask questions. But before looking at, or even collecting, the real data, it is essential to understand your problem domain well and to formulate good questions. For that, I advise you, contrary to Sherlock's quote above, to theorise before having the data.

The reason for this precaution is that, without a good understanding, the data can be made to explain anything. The more data you have, the greater the false conclusions it allows you to draw!

The traditional statistical approach is to have a predefined hypothesis that is either supported or rejected by the evidence from the data. The machine learning camp is a bit more relaxed, but you still need separate sets of data for training and testing the model. So, what can you do if you only have a faint idea about the problem and no real data at all?
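The separation of training and testing data mentioned above can be sketched with the standard library alone. The function name `train_test_split`, the 25 % test fraction and the fixed seed are assumptions made for this example, not a prescribed setup.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))                 # stand-in for real observations
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fitted on `train` only, and `test` is kept aside to check how well it generalises.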

I am currently working on a problem where we want to study statistically how the human body responds to different kinds of food, and how to estimate that response for an individual. You can create some seriously beneficial applications with this knowledge. However, in addition to the data-related problems listed above, the nutritional diaries and laboratory results needed for the baseline modeling are highly personal and sensitive, and they are hard to collect in a clinically competent way. This means that you really want to be sure about your methods before you start, so that the data is collected correctly the first time.

This is where the test-driven approach has helped me a lot. The idea is to make a best guess about the features of your data, generate test data with those features, and then test how well your modeling technique can capture them. After a successful capture, it is good to tweak the data until you reach the limits of the model and it breaks down. Then you adapt and iterate again. This way you gain intimate insights into your problem and into your modeling algorithm, be it a proprietary implementation or one of your own.
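A minimal sketch of that loop follows. It is not the modeling technique used in the actual study; it assumes a single linear effect with Gaussian noise, fits it with ordinary least squares, and uses made-up values for the "true" slope and intercept. All function names are hypothetical.

```python
import random

TRUE_SLOPE, TRUE_INTERCEPT = 0.5, 2.0   # the "best guess" features we plant in the data


def generate_data(n, slope, intercept, noise_sd, seed=0):
    """Generate n synthetic (x, y) pairs from a line plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def slope_error(noise_sd, n=200):
    """How far the recovered slope is from the planted one."""
    xs, ys = generate_data(n, TRUE_SLOPE, TRUE_INTERCEPT, noise_sd)
    slope, _ = fit_line(xs, ys)
    return abs(slope - TRUE_SLOPE)


# Step 1: the model should capture the planted feature on clean data.
assert slope_error(noise_sd=0.1) < 0.05

# Step 2: tweak the data (here: add noise) to see where recovery degrades.
for noise_sd in (0.1, 1.0, 10.0, 100.0):
    print(f"noise_sd={noise_sd:>6}: slope error = {slope_error(noise_sd):.3f}")
```

The same pattern works for other tweaks, such as fewer samples or a relation that is not actually linear; each one probes a different limit of the model.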

Next I will walk you through how I started this process. I will also try to show it visually, so if you cannot follow all the long-winded theoretical underpinnings, just look at the images and see how things click together.

Here I took some of the actual parameters under study and connected them. I assumed that the nutrition parameters on the top row affect the lab results on the bottom row, and not the other way around, which justifies the use of directed arrows. Note that the size of the actual effects, if any, is unknown at this point, and that is what we would like to find out.
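In code, such a structure could be sketched as a set of directed edges whose effect sizes are still unknown. The parameter names below (`nutrient_a`, `lab_x`, and so on) are placeholders invented for this example, since the actual parameters are not listed in the text.

```python
# Hypothetical placeholder names; the real study parameters are not given here.
NUTRITION = {"nutrient_a", "nutrient_b"}
LAB_RESULTS = {"lab_x", "lab_y"}

# Each edge points from a nutrition parameter to a lab result.
# The effect size is None because it is unknown before modeling.
EDGES = {
    ("nutrient_a", "lab_x"): None,
    ("nutrient_a", "lab_y"): None,
    ("nutrient_b", "lab_y"): None,
}


def edges_respect_direction(edges, sources, targets):
    """Check the modeling assumption: every arrow runs from a source to a target."""
    return all(src in sources and dst in targets for src, dst in edges)


print(edges_respect_direction(EDGES, NUTRITION, LAB_RESULTS))
```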

Let’s look at one of these relationships more closely. For my applications it is enough to approximate the relation as linear.
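Written out, a linear approximation of one such relationship could look like the following. The symbols are my own notation for this sketch: $x_i$ is one nutrition parameter, $y_i$ one lab result, $\beta_1$ the unknown effect and $\beta_0$ a baseline level. The Gaussian noise term is an added assumption, not something stated above.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

% Least-squares estimates from n observations:
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                     {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

Here $\bar{x}$ and $\bar{y}$ are the sample means. Estimating $\beta_1$ is what "finding out the size of the effect" amounts to under this approximation.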

 



