SparkOscope gives developers more complete insights into Apache Spark

SparkOscope gives developers more complete insights into Apache Spark

During the last year, the High Performance Systems team for IBM Research, Ireland has been using Apache Spark to perform analytics on large volumes of sensor data. The Spark applications it has developed revolve around cleansing, filtering and ingesting historical data. These applications must be executed daily; therefore, it was essential for the team to understand Spark resource utilization—and thus, provide customers with optimal time to insight and cost control through better-calculated infrastructure needs.

Presently, the conventional way of identifying bottlenecks in a Spark application include inspecting the Spark Web UI either throughout the duration of the jobs/staging execution or the postmortem, which is limited to job-level application metrics reported by the Spark built-in metric system (for example, stage completion time). The current version of the Spark metric system supports recording the values of the metrics in local CSV files and also integrating with external metrics systems such as Ganglia.

The team found it cumbersome to manually consume and efficiently inspect these CSV files generated at the Spark worker nodes. Although using an external monitoring system such as Ganglia would automate this process, the team was still plagued with the inability to derive temporal associations between system-level metrics such as CPU utilization and job-level metrics as reported by Spark (for example, job or stage ID). For instance, the team could not trace the root cause of a peak in HDFS reads or CPU usage to the code in the Spark application causing the bottleneck.

To overcome these limitations, IBM developed SparkOscope. This tool takes advantage of the job-level information available through the existing Spark Web UI and minimizes source code pollution by using the current Spark Web UI to monitor and visualize job-level metrics of a Spark application, such as completion time. More importantly, it extends the Web UI with a palette of system-level metrics about the server, virtual machine or container that each of the Spark job’s executor ran on.

 

Share it:
Share it:

[Social9_Share class=”s9-widget-wrapper”]

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

You Might Be Interested In

McKinsey’s State Of Machine Learning And AI, 2017

3 Aug, 2017

Tech giants including Baidu and Google spent between $20B to $30B on AI in 2016, with 90% of this spent …

Read more

In-memory technology: Serving up application data to users on the go

8 Oct, 2018

Data can be viewed to be as valuable in the digital business world as oil is to the economy. Oil …

Read more

Why You Need a Fully Automated Data Pipeline

1 Sep, 2020

The four main reasons to implement a fully automated data pipeline are: When you think about the core technologies that …

Read more

Do You Want to Share Your Story?

Bring your insights on Data, Visualization, Innovation or Business Agility to our community. Let them learn from your experience.

Get the 3 STEPS

To Drive Analytics Adoption
And manage change

3-steps-to-drive-analytics-adoption

Get Access to Event Discounts

Switch your 7wData account from Subscriber to Event Discount Member by clicking the button below and get access to event discounts. Learn & Grow together with us in a more profitable way!

Get Access to Event Discounts

Create a 7wData account and get access to event discounts. Learn & Grow together with us in a more profitable way!

Don't miss Out!

Stay in touch and receive in depth articles, guides, news & commentary of all things data.