The Key to Building Data Pipelines for Machine Learning

The Key to Building Data Pipelines for Machine Learning

As a consumer of goods and services, you experience the results of machine learning (ML) whenever the institutions you rely on use ML processes to run their operations. You may receive a text message from a bank requiring verification after the bank has paused a credit card transaction. Or, an online travel site may send you an email that offers personalized accommodations for your next personal or business trip.

The work that happens behind the scenes to facilitate these experiences can be difficult to fully realize or appreciate. An important portion of that work is done by the data engineering teams that build the data pipelines to help train and deploy those ML models. Once focused on building pipelines to support traditional data warehouses, today’s data engineering teams now build more technically demanding continuous data pipelines that feed applications with artificial intelligence and ML algorithms. These data pipelines must be cost-effective, fast, and reliable regardless of the type of workload and use case.

Due to the diversity of data sources and the volume of data that needs to be processed, traditional data processing tools fail to meet the performance and reliability requirements for modern machine learning applications. The need to build reliable pipelines for these workloads coupled with advances in distributed high performance computing has given rise to big data processing engines such as Hadoop.

Let’s quickly review the different engines and frameworks often used in data engineering aimed at supporting ML efforts:

While the typical steps of data engineering may be standardized across industries — data exploration, building pipelines, data orchestration, and delivery — organizations have various workload types and data engineering needs. That’s where support for multiple engines and workloads becomes essential. Let’s talk about which engines are most effective for each stage of the data engineering cycle:

Data engineering always starts with data exploration. Data exploration involves inspecting the data to understand its characteristics and what it represents. The learnings acquired during this stage will impact the amount and type of work that the data scientist will conduct during their data preparation phase.

Hadoop/Hive is an excellent choice for data exploration of larger unstructured data sets, because of its inexpensive storage and its compatibility with SQL. Spark works well for data sets that require a programmatic approach, such as with file formats widely used in healthcare insurance processing. On the other hand, Presto provides a quick and easy way to access data from a variety of sources using the industry standard SQL query language.

Data pipelines carry and process data from data sources to the business intelligence (BI) and ML applications that take advantage of it.

Share it:
Share it:

[Social9_Share class=”s9-widget-wrapper”]

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

You Might Be Interested In

Why the Data Scientist and Data Engineer Need to Understand Virtualization in the Cloud

31 Jan, 2017

This article covers the value of understanding the virtualization constructs for the data scientist and data engineer as they deploy …

Read more

Key steps to success with self-service analytics

10 Aug, 2018

As data—and its applications—grow exponentially, the appeal of self-service analytics can grow with it. Coursing through your business’ veins like …

Read more

Multi-cloud, edge computing and stream processing investments take center stage

11 Jan, 2020

As we enter 2020, companies will have to address new and persistent digital challenges that affect system and business performance. …

Read more

Recent Jobs

IT Engineer

Washington D.C., DC, USA

1 May, 2024

Read More

Data Engineer

Washington D.C., DC, USA

1 May, 2024

Read More

Applications Developer

Washington D.C., DC, USA

1 May, 2024

Read More

D365 Business Analyst

South Bend, IN, USA

22 Apr, 2024

Read More

Do You Want to Share Your Story?

Bring your insights on Data, Visualization, Innovation or Business Agility to our community. Let them learn from your experience.

Get the 3 STEPS

To Drive Analytics Adoption
And manage change

3-steps-to-drive-analytics-adoption

Get Access to Event Discounts

Switch your 7wData account from Subscriber to Event Discount Member by clicking the button below and get access to event discounts. Learn & Grow together with us in a more profitable way!

Get Access to Event Discounts

Create a 7wData account and get access to event discounts. Learn & Grow together with us in a more profitable way!

Don't miss Out!

Stay in touch and receive in depth articles, guides, news & commentary of all things data.