How to Approach a Data Intensive Problem

by 7wData
November 9, 2016

“It is a capital mistake to theorise before one has data.”

Are you stuck with a problem? Great! Previously I wrote a general introduction to predictive functions and where you might use them to provide "killer" features in your applications. I argued that data analytics will become a part of modern software engineering. In both disciplines problem solving is an essential skill, and the harder the problems you can crack, the more unique the applications you will get. Luckily, many of the same good old principles apply. One of them is test-driven design, which can also guide you through the dark alleys of data modeling mysteries.

A solution to a typical prediction problem can be separated into two parts. One is a prediction model that contains the best current knowledge about the subject area, or the problem domain. The other is an inference algorithm that lets you ask questions, or queries, of that knowledge. The prediction model is then a potentially reusable part of the software architecture, while typically there is just one inference query for every predictive function.

Obviously, you need to start gaining knowledge before you can ask questions. But before looking at, or even collecting, the real data, it is essential to understand your problem domain well and to formulate good questions. For that, I advise you to go against the Sherlock quote above and theorise before having the data.

The reason for this precaution is that, without a good understanding, anything can be explained by the data. The more data you have, the greater the false conclusions it allows you to make!
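A minimal sketch of one way this can happen, under a stated assumption: here "more data" is read as "more measured variables", which is only one possible reading of the claim. The target and every feature are pure random noise, so any correlation found is spurious by construction. The function names, sample size and seed are hypothetical choices for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def running_max_spurious_correlation(n_samples, n_features, seed=0):
    """Return the strongest |correlation| seen after each added noise feature.

    Neither the target nor the features carry any real signal.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    strongest = 0.0
    running = []
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        strongest = max(strongest, abs(pearson(feature, target)))
        running.append(strongest)
    return running


running = running_max_spurious_correlation(n_samples=30, n_features=1000)
for k in (10, 100, 1000):
    print(f"best |r| among the first {k} noise features: {running[k - 1]:.2f}")
```

The strongest correlation can only stay the same or grow as more unrelated variables are screened, which is why a question formulated before looking at the data is a useful safeguard.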

The traditional statistical approach is to have a predefined hypothesis that is either supported or rejected by the evidence from the data. The Machine Learning camp is a bit more relaxed, but you still need separate sets of data for training and testing the model. So, what can you do if you only have a faint idea about the problem and no real data at all?

I am currently working on a problem where we want to study statistically how the human body responds to different kinds of food, and how to estimate that response for each person. You can create some seriously beneficial applications with this knowledge. However, in addition to the data-related problems listed above, the nutritional diaries and laboratory results needed for the baseline modeling are highly personal and sensitive, and they are also hard to collect in a clinically competent way. This means that you really want to be sure about your methods before you start, so that the data is collected correctly the first time.

This is where the test-driven approach has helped me a lot. The idea is to make a best guess about the features of your data, generate test data with those features, and then test how well your modeling technique can capture them. After a successful capture, it is good to tweak the data until you reach the limits of the model and it breaks down. Then you adapt and iterate again. This way you gain intimate insights into both your problem and your modeling algorithm, be it a proprietary implementation or one of your own.
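The loop described above can be sketched as follows. This is only an illustration under assumptions that are not from the original work: the "feature" planted in the data is a single linear slope, the modeling technique is ordinary least squares, and the helper names, tolerance and noise levels are all hypothetical.

```python
import random


def generate_data(slope, intercept, noise_sd, n, seed):
    """Generate test data with a known linear feature plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def recovers_slope(true_slope, noise_sd, n=50, tol=0.2, seed=1):
    """The 'test': does the model capture the planted slope within tol?"""
    xs, ys = generate_data(true_slope, 1.0, noise_sd, n, seed)
    estimated_slope, _ = fit_line(xs, ys)
    return abs(estimated_slope - true_slope) <= tol


# Tweak the data (here: raise the noise) to look for the point
# where the model stops capturing the planted feature.
for noise_sd in (0.0, 0.5, 2.0, 8.0, 32.0):
    print(f"noise_sd={noise_sd:5.1f} captured={recovers_slope(2.0, noise_sd)}")
```

Because the true slope is known in advance, a failed check says something about the limits of the method or the amount of data, not about the real world. That is the information you want before collecting sensitive data.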

Next I will walk you through how I started this process. I will also try to show it visually; if you cannot follow all the long-winded theoretical underpinnings, just look at the images and see how things click together.

Here I took some of the actual parameters that are studied and connected them. I assumed that the nutrition parameters in the top row affect the lab results in the bottom row, and not the other way around, to justify the use of directed arrows. Note that the size of the actual effects, if any, is unknown at this point, and that is what we would like to find out.

Let’s look at one of these relationships more closely. For my applications it is enough to approximate the relationship as linear.
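To make the directed, linear assumption concrete, here is a small sketch of how such a structure could be used to generate test data. The parameter names (`nutrient_a`, `lab_x` and so on) and the effect sizes are hypothetical placeholders, not the parameters or effects from the actual study.

```python
import random

# Hypothetical directed edges: (nutrition parameter, lab result) -> linear effect.
# The arrows only point from nutrition to lab results, never the other way.
EDGES = {
    ("nutrient_a", "lab_x"): 0.8,
    ("nutrient_b", "lab_x"): -0.3,
    ("nutrient_b", "lab_y"): 0.5,
}


def simulate_person(rng, noise_sd=0.1):
    """Draw nutrition values, then derive lab results as noisy linear sums."""
    nutrition = {
        "nutrient_a": rng.uniform(0, 1),
        "nutrient_b": rng.uniform(0, 1),
    }
    labs = {"lab_x": 0.0, "lab_y": 0.0}
    for (source, target), effect in EDGES.items():
        labs[target] += effect * nutrition[source]
    for name in labs:
        labs[name] += rng.gauss(0, noise_sd)  # measurement noise
    return nutrition, labs


rng = random.Random(42)
nutrition, labs = simulate_person(rng)
print(nutrition)
print(labs)
```

Data generated this way has known effect sizes, so a modeling technique can be checked against them before any real diaries or lab results are collected.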
