In this installment of our series about how quality assurance is done at Cloudera, learn about the important role of fault injection in the overall process.
Apache Hadoop is the consummate example of a scalable distributed system (SDS); such systems are designed to provide 24/7 services reliably and to scale elastically with the addition of industry-standard hardware cost-effectively. They must be resilient and fault-tolerant to various environmental anomalies.
As you would expect, the software layer in Hadoop that provides risk-aware management is quite complex. Thus, for customers running Hadoop in production, severe system-level design flaws can potentially stay hidden even after deployment—risking the derailment of services, loss of data, and possibly even loss of revenue or customer satisfaction. Testing of SDSs is not only costly but also technically challenging.
As described in the first installment of this multipart series, Cloudera pursues continuous improvement and verification of our distribution of the Hadoop ecosystem (CDH) via an extensive QA process during the software-development life cycle. Subjecting these releases to fault injection is one step in that cycle, which also includes unit testing, static/dynamic code analysis, integration testing, system and scale/endurance testing, and finally, validation of real workloads (including customer and partner ones) on internal systems prior to release.
In this post, I will describe the homegrown fault-injection tools and elastic-partitioning techniques used internally against most of the services shipped in CDH, which have already proven useful for finding unknown bugs in releases before shipment to customers.
Chaos Monkey, which was invented and open-sourced by Netflix, is perhaps the best-known example of a fault-injection tool for testing an SDS. Although Cloudera Engineering strongly considered using “plain-vanilla” Chaos Monkey when this effort began, eventually, we determined that writing our own similar tools would better meet long-term requirements, including the abilities to:
These internally developed tools, called AgenTEST and Sapper, in combination serve as our “Swiss army knife” for fault injections: AgenTEST knows how to inject (does the dirty job of starting and stopping the injections), while Sapper is in charge of when and what to inject. The following figure illustrates how AgenTEST and Sapper work together and in combination with elastic partitioning (more about that later):
Next, I’ll describe how these tools work.
AgenTEST runs as a daemon service on each node of the cluster. It monitors a pre-configured folder waiting for instructions about when and what to inject. In particular, those directions are fed using files. Every time that a new file is added to the folder an injection is started; whenever the file is removed, the injection is stopped. The injection to apply is encoded in the name of the file. For example, if the file is created, AgenTEST will send a “port-not-reacheable” message for every packet sent to the address 10.0.0.1 at port 3060.
AgenTEST is able to achieve different type of injections, ranging from network to disk, CPU, memory, and so on. Currently, it supports the following fault-injection operations:
Moreover, we are currently testing new injections such as “commission nodes” and “de-commission nodes.”
Sapper (a “sapper” is the colloquial term for a combat engineer) is the only tool we have that has awareness of the service (e.g. HDFS) and role (e.g. NameNode, DataNode) being tested.
To use Sapper, the tester has to provide a “policy” and a CDH cluster (not necessarily driven by Cloudera Manager). An example of such a policy is shown below.
This policy tells Sapper to target Apache Solr, and in particular all nodes running Solr Server roles. It also says to traverse those nodes in a round-robin fashion at a frequency that is randomly selected in the interval of 5-10 seconds.