Data Preparation Tips, Tricks, and Tools

by 7wData
October 18, 2016

Data preparation and preprocessing tasks constitute a high percentage of any data-centric operation. To provide some insight, we asked a trio of experts to answer a few questions on the subject.

No matter how you view it, data preparation and preprocessing tasks constitute a high percentage of any data-centric operation, be it of a descriptive or predictive nature. They can also be among the most frustrating tasks for a data practitioner, though this frustration is often driven by a fundamental lack of understanding of the importance of preparatory measures. Whether we come from academia or industry, data preparation is the one thing we all have in common; the great equalizer, if you will.

In an effort to shed some light on the importance and ubiquity of data preparation, we have asked a trio of experts to provide some insight into the subject:

Sebastian Raschka is a 'Data Scientist' (his quotes) and Machine Learning enthusiast, author of the authoritative book 'Python Machine Learning,' and a PhD candidate in Computational Biology at Michigan State University.

Clare Bernard holds a PhD in Experimental High-Energy Physics and is a Product Lead at Tamr, an organization which "transforms dark, dirty, and disparate data into clean, connected data that can be delivered quickly and repeatedly throughout your organization."

Joe Boutros is a computer scientist and Director of Product Engineering at data.world, which aims to build "the most meaningful, collaborative, and abundant data resource in the world."

What follows are the answers our experts provided to a few somewhat open-ended questions related to the data preparation process. (Keep in mind that each expert is responding to the question posed, not to the previous responses, as the answers were solicited separately and compiled after the fact.)

Matthew Mayo: Why is it that data preparation is often described as 80% of the work involved in data-related tasks, and do you think this is an accurate generalization?

Sebastian Raschka: 80%? I often hear >90%! Joking aside, I think it's really true that data preparation makes up most of the work in typical data-related projects.

For simplicity, let's use "data preparation" as a category that summarizes tasks such as data acquisition, data storage and handling, data cleaning, and maybe even early stages of feature engineering.

First, we start with the question we want to answer, or a problem that we want to solve. And in order to address this problem, we typically -- not always -- need to *get* the data! This means that we have to search and ask for datasets and resources that are relevant, trustworthy, relatively up to date (depending on the task), and in a format that we may be able to work with. If we are lucky, there are APIs or maybe even curated datasets out there. For instance, for a soccer-prediction hobby project, I ended up writing crontab-powered Python web scrapers for dozens of websites.
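
To make the scraping step concrete, here is a minimal sketch of the parsing half of such a scraper, using only Python's standard library. It is not the code from the project described above: the HTML snippet, the `TableRowParser` class, and the `parse_results` helper are hypothetical names invented for illustration, and the fetching step (for example with `urllib.request.urlopen`) is left out so the example runs offline.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this from a website.
SAMPLE_HTML = """
<table id="results">
  <tr><th>Home</th><th>Away</th><th>Score</th></tr>
  <tr><td>Club A</td><td>Club B</td><td>2-1</td></tr>
  <tr><td>Club C</td><td>Club D</td><td>0-0</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []            # finished rows, each a list of cell strings
        self._current_row = None  # cells of the row being parsed, if any
        self._in_cell = False     # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag == "td" and self._current_row is not None:
            self._in_cell = True
            self._current_row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._current_row is not None:
            # Header rows contain only <th> cells, so they stay empty and are skipped.
            if self._current_row:
                self.rows.append(self._current_row)
            self._current_row = None

    def handle_data(self, data):
        if self._in_cell:
            self._current_row[-1] += data.strip()


def parse_results(html):
    """Return the data rows of the table as lists of strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


for row in parse_results(SAMPLE_HTML):
    print(row)
```

Scheduling is then a matter of a crontab entry that runs the script at a fixed interval; the exact schedule and paths depend on the machine and are not specified in the interview.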

Now that we have our data, we may want to do several sanity checks: checking for missing data, formatting issues, or other problems. Usually, we have to come back to this step a couple of times during our data exploration stage ...
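
The kinds of sanity checks mentioned here can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the column names, the sample rows, and the `sanity_check` function are assumptions made for the example, and a real project would adapt the checks to its own schema.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample data; clubs, dates, and scores are invented for illustration.
RAW_CSV = """date,home_team,away_team,home_goals,away_goals
2016-08-13,Club A,Club B,2,1
2016-08-13,Club C,Club D,0,
13/08/2016,Club E,Club F,1,1
2016-08-14,Club G,Club H,three,4
"""


def sanity_check(rows, date_format="%Y-%m-%d"):
    """Return a list of (row_number, column, problem) tuples.

    Row numbers count data rows only, starting at 1 (the header is excluded).
    """
    problems = []
    for row_number, row in enumerate(rows, start=1):
        # Check 1: missing values in any column.
        for column, value in row.items():
            if value is None or not value.strip():
                problems.append((row_number, column, "missing value"))

        # Check 2: dates must follow the expected format.
        date_value = (row.get("date") or "").strip()
        if date_value:
            try:
                datetime.strptime(date_value, date_format)
            except ValueError:
                problems.append((row_number, "date", "unexpected date format"))

        # Check 3: goal counts must parse as integers.
        for column in ("home_goals", "away_goals"):
            value = (row.get(column) or "").strip()
            if value:
                try:
                    int(value)
                except ValueError:
                    problems.append((row_number, column, "not an integer"))
    return problems


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
problems = sanity_check(rows)
for problem in problems:
    print(problem)
```

Returning the problems as data rather than raising on the first one makes it easy to rerun the same checks each time the exploration stage turns up something new.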

Often, we do not collect data from just a single source, so we have to come up with ways to combine data in a meaningful way. Going back to my soccer-prediction example, one particular challenge was that each website spelled the (English Premier League) soccer clubs or players differently.
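
One common way to reconcile differing spellings, sketched below, is to map every variant to a canonical name: first through a hand-maintained alias table, then through fuzzy string matching as a fallback. This is an assumed approach for illustration, not the method used in the project described above. The club list, the `ALIASES` table, the `normalize_club` function, and the 0.85 cutoff are all hypothetical choices.

```python
import difflib

# Canonical spellings that all sources are mapped to (a small illustrative subset).
CANONICAL_CLUBS = [
    "Manchester United",
    "Manchester City",
    "Tottenham Hotspur",
    "West Ham United",
]

# Hand-maintained aliases for spellings too different for fuzzy matching.
ALIASES = {
    "man utd": "Manchester United",
    "spurs": "Tottenham Hotspur",
}


def normalize_club(name, cutoff=0.85):
    """Map a raw club name to its canonical spelling, or None if no match is found."""
    # Lowercase and collapse repeated whitespace before comparing.
    cleaned = " ".join(name.lower().split())

    if cleaned in ALIASES:
        return ALIASES[cleaned]

    lookup = {club.lower(): club for club in CANONICAL_CLUBS}
    if cleaned in lookup:
        return lookup[cleaned]

    # Fall back to the closest canonical name above the similarity cutoff.
    matches = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


for raw in ["Man Utd", "  tottenham   hotspurs ", "West Ham Utd", "Real Madrid"]:
    print(repr(raw), "->", normalize_club(raw))
```

The cutoff deserves care: similar names such as the two Manchester clubs score close to each other, so unmatched or borderline names are better reviewed by hand and added to the alias table than accepted automatically.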
