5 Data Management Mistakes to Avoid during Data Integration Projects

Canada-based Maira Bay de Souza of Product Data Lake Technologies shares her view on data integration and the mistakes to avoid along the way:

Throughout my 5 years of working with Data Integration, Data Migration and Data Architecture, I’ve noticed some common (but sometimes serious) mistakes related to Data Management and Software Quality Management. I hope that by reading about them you will be able to avoid them in your future Data Integration projects.

Defining the Data Architecture in a Data Integration project is the equivalent of defining the Requirements in a normal (non-data-oriented) software project. A normal software application is, most of the time, defined by its actions and interactions with the user. That’s why, in the first phase of software development (the Requirements phase), one of the key steps is creating Use Cases (or User Stories). A Data Integration application, on the other hand, is defined by its operations on datasets. Interacting with data structures is at the core of its functionality, so we need a clear picture of what those data structures look like in order to define what operations we will perform on them.

It is widely accepted in normal software development that having well-defined requirements is key to success. The common saying “If you don’t know where you’re going, any road will get you there” also applies to Data Integration applications. When ETL developers don’t have a clear definition of the Data Architecture they’re working with, they will inevitably make assumptions. Those assumptions will not always match the ones you made, or worse, the ones your customer made.

Simple but detailed questions like “can this field be null or not?” need to be answered, because the wrong decision can have serious consequences. Most senior Java programmers like me are well aware of the infamous “Null Pointer Exception”. If you feed a null value into a variable that doesn’t accept null (and you don’t know that, because you’ve never seen an architecture specification), you will get exactly that error.
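To make that concrete, here is a minimal, hypothetical Java sketch (the class and field names are mine, not from any particular ETL tool) of what happens when a transformation assumes a field is always populated:

```java
// Minimal sketch (hypothetical names): a record whose "city" field is
// assumed, but never guaranteed, to be non-null.
public class NullFieldExample {

    record Customer(String name, String city) {}

    public static void main(String[] args) {
        // The source system delivers a record with a missing city.
        Customer customer = new Customer("Acme Corp", null);

        // The developer assumed city is never null, so this line
        // throws a NullPointerException at runtime.
        String normalizedCity = customer.city().trim().toUpperCase();
        System.out.println(normalizedCity);
    }
}
```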

Because it is a vague message, it can be time-consuming to debug and find the root cause (especially for junior programmers): you have to open your ETL in the IDE, go to the code view, find the line of code that is causing the problem (sometimes you might even have to run the ETL yourself), then find where that variable is located in the design view of your IDE, add a fix there, test it to make sure it’s working, and then deploy to production again. It also means that this error normally causes the ETL application to stop functioning altogether (unless there is some sort of error handling). Depending on your domain, that can have serious, even life-threatening consequences (for example, in healthcare or aviation), or lead to major financial losses (for example, in e-commerce).
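One common mitigation is to validate each record and route bad rows to a reject stream instead of letting the whole job die on the first null. The sketch below is a hand-rolled, hypothetical example of that idea; most ETL tools offer an equivalent reject-row mechanism:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of per-record error handling in a hand-rolled transform step:
// instead of letting one null field stop the whole job, invalid rows are
// routed to a reject list for later inspection while valid rows keep flowing.
public class SafeTransformStep {

    record Customer(String name, String city) {}

    public static void main(String[] args) {
        List<Customer> input = List.of(
                new Customer("Acme Corp", "Toronto"),
                new Customer("Globex", null)); // the problematic record

        List<String> loaded = new ArrayList<>();
        List<Customer> rejected = new ArrayList<>();

        for (Customer c : input) {
            if (c.city() == null || c.city().isBlank()) {
                rejected.add(c); // route to an error stream instead of crashing
                continue;
            }
            loaded.add(c.city().trim().toUpperCase());
        }

        System.out.println("Loaded: " + loaded);
        System.out.println("Rejected for review: " + rejected);
    }
}
```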

Knowing the format, boundaries, constraints, relationships and other information about your data is imperative to developing a high-quality Data Integration application. Taking the time to define the Data Architecture up front will prevent a lot of problems down the road.

Data profiling is another key element of developing good Data Integration applications.

When doing data profiling, most ETL developers look at the current dataset in front of them and develop the ETL to clean and process the data in that dataset. Unfortunately, that is not enough: it is just as important to think about how the dataset might change over time.
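As a rough, hand-rolled illustration of what a basic profiling pass looks at (real profiling tools compute far more), the idea is to summarise each field (null rate, distinct values, length range, suspicious patterns) before writing any transformation logic:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Rough sketch of profiling a single "city" column: count nulls and blanks,
// collect distinct values and the length range, and flag values that look odd
// (digits in a city name often mean a postal code ended up in the wrong field).
public class CityFieldProfile {

    public static void main(String[] args) {
        List<String> cityValues = Arrays.asList(
                "Toronto", "Vancouver", null, "", "Toronto", "M5V 2T6");

        int nulls = 0, blanks = 0, minLen = Integer.MAX_VALUE, maxLen = 0;
        Set<String> distinct = new HashSet<>();

        for (String city : cityValues) {
            if (city == null) { nulls++; continue; }
            if (city.isBlank()) { blanks++; continue; }
            distinct.add(city);
            minLen = Math.min(minLen, city.length());
            maxLen = Math.max(maxLen, city.length());
        }

        System.out.printf("nulls=%d blanks=%d distinct=%d length=%d..%d%n",
                nulls, blanks, distinct.size(), minLen, maxLen);
        System.out.println("Values containing digits: "
                + distinct.stream().filter(v -> v.matches(".*\\d.*")).toList());
    }
}
```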

For example, let’s say we find a customer in our dataset with the postal code in the city field. We then add an instruction to the ETL so that when it encounters that specific customer’s record, it extracts the postal code from the city field and moves it into the postal code field.
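A minimal sketch of that kind of fix might look like the following (the customer ID, field names and postal-code pattern are illustrative assumptions, not from the original project):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the narrow, record-specific fix described above: for one known
// customer, pull a Canadian-style postal code out of the city field and move
// it into the postal code field. IDs, field names and the pattern are illustrative.
public class PostalCodeFix {

    // Simplified Canadian postal code pattern, e.g. "M5V 2T6".
    private static final Pattern POSTAL_CODE =
            Pattern.compile("[A-Z]\\d[A-Z] ?\\d[A-Z]\\d");

    static String[] fix(String customerId, String city, String postalCode) {
        if ("CUST-0042".equals(customerId)) { // the one record we know is bad
            Matcher m = POSTAL_CODE.matcher(city == null ? "" : city);
            if (m.find()) {
                postalCode = m.group();                    // move the code...
                city = city.replace(m.group(), "").trim(); // ...and tidy the city
            }
        }
        return new String[] { city, postalCode };
    }

    public static void main(String[] args) {
        String[] fixed = fix("CUST-0042", "Toronto M5V 2T6", null);
        System.out.println("city=" + fixed[0] + ", postalCode=" + fixed[1]);
    }
}
```

A rule this narrow only covers the bad record we have already seen, which is exactly why the previous paragraph stresses thinking about how the dataset might change over time.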
