Quality Assurance at Cloudera: Fault Injection and Elastic Partitioning

In this installment of our series about how quality assurance is done at Cloudera, learn about the important role of fault injection in the overall process.

Apache Hadoop is the consummate example of a scalable distributed system (SDS); such systems are designed to provide reliable 24/7 service and to scale elastically and cost-effectively with the addition of industry-standard hardware. They must also be resilient and fault-tolerant in the face of a variety of environmental anomalies.

As you would expect, the software layer in Hadoop that provides risk-aware management is quite complex. Thus, for customers running Hadoop in production, severe system-level design flaws can potentially stay hidden even after deployment—risking the derailment of services, loss of data, and possibly even loss of revenue or customer satisfaction. Testing of SDSs is not only costly but also technically challenging.

As described in the first installment of this multipart series, Cloudera pursues continuous improvement and verification of our distribution of the Hadoop ecosystem (CDH) via an extensive QA process during the software-development life cycle. Subjecting these releases to fault injection is one step in that cycle, which also includes unit testing, static/dynamic code analysis, integration testing, system and scale/endurance testing, and finally, validation of real workloads (including customer and partner ones) on internal systems prior to release.

In this post, I will describe the homegrown fault-injection tools and elastic-partitioning techniques used internally against most of the services shipped in CDH, which have already proven useful for finding unknown bugs in releases before shipment to customers.

Chaos Monkey, which was invented and open-sourced by Netflix, is perhaps the best-known example of a fault-injection tool for testing an SDS. Although Cloudera Engineering strongly considered using "plain-vanilla" Chaos Monkey when this effort began, we eventually determined that writing our own similar tools would better meet our long-term requirements.

These internally developed tools, called AgenTEST and Sapper, together serve as our "Swiss army knife" for fault injection: AgenTEST knows how to inject (it does the dirty work of starting and stopping the injections), while Sapper is in charge of when and what to inject. The following figure illustrates how AgenTEST and Sapper work together and in combination with elastic partitioning (more about that later):

Next, I’ll describe how these tools work.

AgenTEST runs as a daemon service on each node of the cluster. It monitors a pre-configured folder, waiting for instructions about when and what to inject; those directions are fed to it as files. Every time a new file is added to the folder, an injection is started; whenever the file is removed, the injection is stopped. The injection to apply is encoded in the name of the file. For example, when the corresponding file is created, AgenTEST will answer with a "port not reachable" message for every packet sent to the address 10.0.0.1 at port 3060.
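This file-driven protocol can be sketched in a few lines of Python. Note that the filename convention (`reject.<ip>.<port>`) and the use of an iptables REJECT rule are assumptions for illustration; AgenTEST's actual encoding and injection mechanism are internal.

```python
# Sketch of a file-driven injection protocol, assuming a hypothetical
# filename convention "<action>.<ip>.<port>", e.g. "reject.10.0.0.1.3060".

def parse_injection(filename):
    """Decode an injection request from a file name."""
    action, a, b, c, d, port = filename.split(".")
    return {"action": action, "ip": f"{a}.{b}.{c}.{d}", "port": int(port)}

def iptables_cmd(spec, start):
    """Build the command that starts (-A) or stops (-D) the injection.
    REJECT with icmp-port-unreachable answers every packet sent to the
    target with a "port not reachable" message, as in the example above."""
    flag = "-A" if start else "-D"
    return (f"iptables {flag} OUTPUT -p tcp -d {spec['ip']} "
            f"--dport {spec['port']} -j REJECT "
            f"--reject-with icmp-port-unreachable")

spec = parse_injection("reject.10.0.0.1.3060")
print(iptables_cmd(spec, start=True))
# iptables -A OUTPUT -p tcp -d 10.0.0.1 --dport 3060 -j REJECT --reject-with icmp-port-unreachable
```

A real daemon would watch the folder (e.g., via inotify), run the start command when a file appears, and run the corresponding stop command when it is removed.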

AgenTEST can perform different types of injections, ranging across network, disk, CPU, memory, and more, and currently supports a broad set of fault-injection operations in each of those categories.

Moreover, we are currently testing new injections such as "commission nodes" and "decommission nodes."

Sapper (a “sapper” is the colloquial term for a combat engineer) is the only tool we have that has awareness of the service (e.g. HDFS) and role (e.g. NameNode, DataNode) being tested.

To use Sapper, the tester provides a "policy" and a CDH cluster (not necessarily managed by Cloudera Manager).

For example, a policy might tell Sapper to target Apache Solr, and in particular all nodes running the Solr Server role, traversing those nodes in round-robin fashion at a frequency randomly selected from the interval of 5-10 seconds.
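The traversal just described can be sketched as follows. The policy fields and their names here are hypothetical, since Sapper's actual policy format is internal; only the behavior (round-robin over Solr Server nodes, random 5-10 second pauses) comes from the description above.

```python
import itertools
import random

# Hypothetical policy mirroring the description: target all Solr Server
# roles, traverse round-robin, pause a random 5-10 s between injections.
policy = {
    "service": "SOLR",
    "role": "SOLR_SERVER",
    "traversal": "round-robin",
    "interval_s": (5, 10),
}

def schedule(nodes, policy, steps):
    """Yield (node, delay) pairs: cycle round-robin over the role's
    nodes, with a random delay (per the policy) before each injection."""
    lo, hi = policy["interval_s"]
    ring = itertools.cycle(nodes)
    for _ in range(steps):
        yield next(ring), random.uniform(lo, hi)

nodes = ["solr1", "solr2", "solr3"]  # nodes running the Solr Server role
for node, delay in schedule(nodes, policy, steps=5):
    print(f"inject on {node} after {delay:.1f}s")
```

In the real tool, each scheduled step would hand an injection request to the local AgenTEST daemon on the selected node.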
