Quality Assurance at Cloudera: Fault Injection and Elastic Partitioning

Quality Assurance at Cloudera: Fault Injection and Elastic Partitioning

In this installment of our series about how quality assurance is done at Cloudera, learn about the important role of fault injection in the overall process.

Apache Hadoop is the consummate example of a scalable distributed system (SDS); such systems are designed to provide 24/7 services reliably and to scale elastically with the addition of industry-standard hardware cost-effectively. They must be resilient and fault-tolerant to various environmental anomalies.

As you would expect, the software layer in Hadoop that provides risk-aware management is quite complex. Thus, for customers running Hadoop in production, severe system-level design flaws can potentially stay hidden even after deployment—risking the derailment of services, loss of data, and possibly even loss of revenue or customer satisfaction. Testing of SDSs is not only costly but also technically challenging.

As described in the first installment of this multipart series, Cloudera pursues continuous improvement and verification of our distribution of the Hadoop ecosystem (CDH) via an extensive QA process during the software-development life cycle. Subjecting these releases to fault injection is one step in that cycle, which also includes unit testing, static/dynamic code analysis, integration testing, system and scale/endurance testing, and finally, validation of real workloads (including customer and partner ones) on internal systems prior to release.

Read Also:
Artificial intelligence will soon replace the human mind

In this post, I will describe the homegrown fault-injection tools and elastic-partitioning techniques used internally against most of the services shipped in CDH, which have already proven useful for finding unknown bugs in releases before shipment to customers.

Chaos Monkey, which was invented and open-sourced by Netflix, is perhaps the best-known example of a fault-injection tool for testing an SDS. Although Cloudera Engineering strongly considered using “plain-vanilla” Chaos Monkey when this effort began, eventually, we determined that writing our own similar tools would better meet long-term requirements, including the abilities to:

These internally developed tools, called AgenTEST and Sapper, in combination serve as our “Swiss army knife” for fault injections: AgenTEST knows how to inject (does the dirty job of starting and stopping the injections), while Sapper is in charge of when and what to inject. The following figure illustrates how AgenTEST and Sapper work together and in combination with elastic partitioning (more about that later):

Read Also:
The opportunities Silicon Valley doesn’t see

Next, I’ll describe how these tools work.

AgenTEST runs as a daemon service on each node of the cluster. It monitors a pre-configured folder waiting for instructions about when and what to inject. In particular, those directions are fed using files. Every time that a new file is added to the folder an injection is started; whenever the file is removed, the injection is stopped. The injection to apply is encoded in the name of the file. For example, if the file is created, AgenTEST will send a “port-not-reacheable” message for every packet sent to the address at port 3060.

AgenTEST is able to achieve different type of injections, ranging from network to disk, CPU, memory, and so on. Currently, it supports the following fault-injection operations:

Moreover, we are currently testing new injections such as “commission nodes” and “de-commission nodes.”

Sapper (a “sapper” is the colloquial term for a combat engineer) is the only tool we have that has awareness of the service (e.g. HDFS) and role (e.g. NameNode, DataNode) being tested.

Read Also:
Charting the data lake: Rethinking data models for data lakes

To use Sapper, the tester has to provide a “policy” and a CDH cluster (not necessarily driven by Cloudera Manager). An example of such a policy is shown below.

This policy tells Sapper to target Apache Solr, and in particular all nodes running Solr Server roles. It also says to traverse those nodes in a round-robin fashion at a frequency that is randomly selected in the interval of 5-10 seconds.


Chief Data Officer Europe
20 Feb

15% off with code CDO7W17

Read Also:
Can Big Data operations be manageable? Two companies say yes.
Predictive Analytics Innovation summit San Diego
22 Feb

$200 off with code DATA200

Read Also:
DDS solves major big data IoT problems
Read Also:
Applications of Predictive Analytics in various industries
Big Data Paris 2017
6 Mar
Big Data Paris 2017

15% off with code BDP17-7WDATA

Read Also:
How Big Data helps banks know their customers better -

Leave a Reply

Your email address will not be published. Required fields are marked *