
Quality Assurance at Cloudera: Fault Injection and Elastic Partitioning

In this installment of our series about how quality assurance is done at Cloudera, learn about the important role of fault injection in the overall process.

Apache Hadoop is the consummate example of a scalable distributed system (SDS); such systems are designed to provide 24/7 services reliably and to scale elastically, and cost-effectively, with the addition of industry-standard hardware. They must also be resilient and fault-tolerant in the face of various environmental anomalies.

As you would expect, the software layer in Hadoop that provides risk-aware management is quite complex. Thus, for customers running Hadoop in production, severe system-level design flaws can potentially stay hidden even after deployment—risking the derailment of services, loss of data, and possibly even loss of revenue or customer satisfaction. Testing of SDSs is not only costly but also technically challenging.

As described in the first installment of this multipart series, Cloudera pursues continuous improvement and verification of our distribution of the Hadoop ecosystem (CDH) via an extensive QA process during the software-development life cycle. Subjecting these releases to fault injection is one step in that cycle, which also includes unit testing, static/dynamic code analysis, integration testing, system and scale/endurance testing, and finally, validation of real workloads (including customer and partner ones) on internal systems prior to release.

In this post, I will describe the homegrown fault-injection tools and elastic-partitioning techniques we use internally against most of the services shipped in CDH; they have already proven useful for finding unknown bugs in releases before shipment to customers.

Chaos Monkey, which was invented and open-sourced by Netflix, is perhaps the best-known example of a fault-injection tool for testing an SDS. Although Cloudera Engineering strongly considered using "plain-vanilla" Chaos Monkey when this effort began, we eventually determined that writing our own, similar tools would better meet our long-term requirements.

These internally developed tools, called AgenTEST and Sapper, in combination serve as our “Swiss army knife” for fault injections: AgenTEST knows how to inject (does the dirty job of starting and stopping the injections), while Sapper is in charge of when and what to inject. The following figure illustrates how AgenTEST and Sapper work together and in combination with elastic partitioning (more about that later):

Next, I’ll describe how these tools work.

AgenTEST runs as a daemon service on each node of the cluster. It monitors a pre-configured folder, waiting for instructions about when and what to inject; those directions are fed to it as files. Every time a new file is added to the folder, an injection starts; whenever the file is removed, the injection stops. The injection to apply is encoded in the name of the file. For example, when a file encoding a "port-not-reachable" injection against 10.0.0.1:3060 is created, AgenTEST will reply with a "port-not-reachable" message for every packet sent to the address 10.0.0.1 at port 3060.
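The post doesn't show AgenTEST's internals or its exact file-name grammar, but the watch-folder protocol it describes can be sketched in a few lines of Python. The folder path, the "@"-separated name format, and the handler functions below are all assumptions made for illustration:

    import os
    import time

    # Hypothetical watch folder; the real path is not given in the post.
    WATCH_DIR = "/var/lib/agentest/injections"

    def parse(name):
        # Assumed file-name format "<injection>@<arg>@<arg>...",
        # e.g. "port_unreachable@10.0.0.1@3060".
        kind, *args = name.split("@")
        return kind, args

    def start_injection(kind, args):
        print("starting", kind, args)  # dispatch to the real injector here

    def stop_injection(kind, args):
        print("stopping", kind, args)

    def watch():
        active = set()
        while True:
            present = set(os.listdir(WATCH_DIR))
            for name in present - active:   # new file: start that injection
                start_injection(*parse(name))
            for name in active - present:   # file gone: stop the injection
                stop_injection(*parse(name))
            active = present
            time.sleep(1)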

AgenTEST can perform different types of injections, ranging across network, disk, CPU, memory, and so on, with a set of fault-injection operations at each of those levels.

Moreover, we are currently testing new injections such as "commission nodes" and "decommission nodes."
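The post doesn't reveal how the injections themselves are implemented. One plausible mechanism for the network faults, sketched below purely as an assumption, is an iptables rule inserted when the instruction file appears and deleted when it goes away; this reproduces the "port-not-reachable" example above:

    import subprocess

    # Reject TCP traffic to 10.0.0.1:3060 with an ICMP "port unreachable",
    # as if the service on that endpoint had vanished.
    RULE = ["OUTPUT", "-p", "tcp", "-d", "10.0.0.1", "--dport", "3060",
            "-j", "REJECT", "--reject-with", "icmp-port-unreachable"]

    def start():
        subprocess.run(["iptables", "-I"] + RULE, check=True)  # begin fault

    def stop():
        subprocess.run(["iptables", "-D"] + RULE, check=True)  # lift fault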

Sapper (a "sapper" is the colloquial term for a combat engineer) is the only one of our tools that is aware of the service (e.g., HDFS) and role (e.g., NameNode, DataNode) being tested.

To use Sapper, the tester has to provide a “policy” and a CDH cluster (not necessarily driven by Cloudera Manager). An example of such a policy is shown below.
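As a rough illustration, such a policy might look like the following sketch; the field names are invented, and the values mirror the description that follows:

    # Hypothetical policy sketch; the field names are not Cloudera's.
    policy = {
        "service": "SOLR",          # target service
        "role": "SOLR_SERVER",      # inject only on nodes with this role
        "strategy": "round-robin",  # how to traverse the target nodes
        "min_interval_s": 5,        # pause between injections, drawn
        "max_interval_s": 10,       #   uniformly from 5-10 seconds
    }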

This policy tells Sapper to target Apache Solr, and in particular all nodes running Solr Server roles. It also says to traverse those nodes in round-robin fashion, pausing for an interval selected at random between 5 and 10 seconds.
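Putting the two tools together, Sapper's side of the protocol might look like the loop below; the function names and division of labor are assumptions, with inject() and clear() standing in for creating and removing an instruction file in a node's AgenTEST watch folder:

    import itertools
    import random
    import time

    def run(policy, nodes, inject, clear):
        # nodes: the hosts running the targeted role (here, Solr Server).
        for node in itertools.cycle(nodes):        # round-robin traversal
            time.sleep(random.uniform(policy["min_interval_s"],
                                      policy["max_interval_s"]))
            inject(node)                           # start a fault...
            time.sleep(random.uniform(policy["min_interval_s"],
                                      policy["max_interval_s"]))
            clear(node)                            # ...then lift it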
