The Key to Building Data Pipelines for Machine Learning

by 7wData
February 15, 2021

As a consumer of goods and services, you experience the results of machine learning (ML) whenever the institutions you rely on use ML processes to run their operations. You may receive a text message from a bank requiring verification after the bank has paused a credit card transaction. Or, an online travel site may send you an email that offers personalized accommodations for your next personal or business trip.

The work that happens behind the scenes to facilitate these experiences can be difficult to fully realize or appreciate. An important portion of that work is done by the data engineering teams that build the data pipelines to help train and deploy those ML models. Once focused on building pipelines to support traditional data warehouses, today’s data engineering teams now build more technically demanding continuous data pipelines that feed applications with artificial intelligence and ML algorithms. These data pipelines must be cost-effective, fast, and reliable regardless of the type of workload and use case.

Due to the diversity of data sources and the volume of data that needs to be processed, traditional data processing tools fail to meet the performance and reliability requirements of modern machine learning applications. The need to build reliable pipelines for these workloads, coupled with advances in distributed high-performance computing, has given rise to big data processing engines such as Hadoop.

Let’s quickly review the different engines and frameworks often used in data engineering aimed at supporting ML efforts:

While the typical steps of data engineering may be standardized across industries — data exploration, building pipelines, data orchestration, and delivery — organizations have various workload types and data engineering needs. That’s where support for multiple engines and workloads becomes essential. Let’s talk about which engines are most effective for each stage of the data engineering cycle:

Data engineering always starts with data exploration. Data exploration involves inspecting the data to understand its characteristics and what it represents. The learnings acquired during this stage will impact the amount and type of work that the data scientist will conduct during their data preparation phase.
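To make "inspecting the data to understand its characteristics" a little more concrete, here is a minimal sketch in plain Python. The `profile` function, the `transactions` records, and the column names are all hypothetical and chosen only for illustration. A real exploration would run on one of the engines discussed in this article rather than on an in-memory list.

```python
from collections import Counter


def profile(rows):
    """Summarize each column of a list of dict records (illustrative helper)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            stats = columns.setdefault(
                name,
                {"count": 0, "missing": 0, "types": Counter(), "distinct": set()},
            )
            stats["count"] += 1
            # Treat None and empty strings as missing values.
            if value is None or value == "":
                stats["missing"] += 1
            else:
                stats["types"][type(value).__name__] += 1
                stats["distinct"].add(value)
    # Convert the working structures into a plain, printable summary.
    return {
        name: {
            "count": s["count"],
            "missing": s["missing"],
            "types": dict(s["types"]),
            "distinct": len(s["distinct"]),
        }
        for name, s in columns.items()
    }


# Hypothetical sample records, not taken from any real data set.
transactions = [
    {"card_id": "A1", "amount": 42.5, "country": "US"},
    {"card_id": "A1", "amount": 980.0, "country": ""},
    {"card_id": "B7", "amount": None, "country": "FR"},
]

summary = profile(transactions)
for column, stats in summary.items():
    print(column, stats)
```

A summary like this (row counts, missing values, observed types, distinct values) is the kind of finding that tells the data scientist how much cleaning to expect during data preparation.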

Hadoop/Hive is an excellent choice for exploring larger unstructured data sets because of its inexpensive storage and its compatibility with SQL. Spark works well for data sets that require a programmatic approach, such as the file formats widely used in healthcare insurance processing. Presto, in turn, provides a quick and easy way to access data from a variety of sources using the industry-standard SQL query language.
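As a rough illustration of SQL-based exploration, the sketch below runs a typical exploratory query. It uses Python's built-in `sqlite3` module purely as a local stand-in for a SQL engine. It is not Hive or Presto, and the `claims` table and its columns are hypothetical. The query itself is ordinary SQL of the kind those engines also accept.

```python
import sqlite3

# In-memory database used only as a stand-in for a SQL query engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id INTEGER, provider TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        (1, "clinic_a", 120.0),
        (2, "clinic_a", 80.0),
        (3, "clinic_b", None),  # a record with a missing amount
    ],
)

# Exploratory query: row counts, non-null counts, and averages per provider.
query = """
SELECT provider,
       COUNT(*)      AS n_rows,
       COUNT(amount) AS n_amounts,
       AVG(amount)   AS avg_amount
FROM claims
GROUP BY provider
ORDER BY provider
"""
exploration = conn.execute(query).fetchall()
for row in exploration:
    print(row)

conn.close()
```

Comparing `COUNT(*)` with `COUNT(amount)` is a quick way to spot missing values per group before any pipeline work begins.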

Data pipelines carry and process data from data sources to the business intelligence (BI) and ML applications that take advantage of it.
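The idea of carrying and processing data from a source to a consuming application can be sketched as three small stages. This is a simplified, single-machine illustration under stated assumptions, not a production design. The stage names `extract`, `transform`, and `load`, the `RAW` sample text, and the list used as a sink are all hypothetical.

```python
import csv
import io

# Hypothetical raw source data; in practice this would come from files,
# databases, or event streams.
RAW = "card_id,amount\nA1,42.50\nB7,\nA1,980.00\n"


def extract(text):
    """Read raw CSV text and yield one dict per record."""
    yield from csv.DictReader(io.StringIO(text))


def transform(records):
    """Drop records with a missing amount and convert amounts to floats."""
    for record in records:
        if not record["amount"]:
            continue
        yield {"card_id": record["card_id"], "amount": float(record["amount"])}


def load(records, sink):
    """Append processed records to a sink (a list stands in for a real store)."""
    for record in records:
        sink.append(record)
    return sink


# Chain the stages: source -> cleaning -> destination used by BI or ML code.
feature_rows = load(transform(extract(RAW)), [])
print(feature_rows)
```

Because each stage is a generator, records flow through one at a time instead of being held in memory all at once, which loosely mirrors how continuous pipelines process data as it arrives.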
