Qunar Performs Real-Time Data Analytics up to 300x Faster with Alluxio

We would like to thank Xueyan Li, Lei Xu, and Xiaoxu Lv from Qunar for contributing this guest blog.
 
At Qunar, we have been running Alluxio in production for over 9 months, resulting in a 15x speedup on average, and a 300x speedup at peak service times. In addition, Alluxio’s unified namespace enables different applications and frameworks to easily interact with our data from different storage systems.
Introduction
Real-time data analytics is becoming increasingly important for Internet companies like Qunar, the leading travel search engine in China. Alluxio, a memory-speed virtual distributed storage system, plays an important role in the big data ecosystem and brings great performance improvements to applications. In this article, we present how Alluxio is used as the storage management layer in Qunar’s stream processing platform, and how we use Alluxio to improve performance. At Qunar, we have been running Alluxio in production for over 9 months, and have observed a 15x performance improvement on average, and a 300x improvement at peak service times.
At Qunar, the streaming platform processes around 6 billion system log entries (4.5 TB) daily. Many jobs running on the platform are business critical, and therefore impose strict requirements on both stability and latency. For example, our real-time user recommendations are generated primarily from analysis of users’ click and search logs; faster analysis delivers more accurate feedback to users. Therefore, low latency and high stability are the top priorities of our system.
Our real-time analytics architecture builds on several technologies, including Mesos, Spark Streaming, Flink, HDFS, and now Alluxio. The architecture is described in more detail in the following section. The benefits that Alluxio brings to our system include:
Alluxio’s tiered storage feature manages various storage resources, including memory, SSD, and disk.
Multiple computing frameworks can share data at memory speed through Alluxio.
Alluxio’s unified namespace manages remote storage systems and provides a unified interface to applications, making it possible for different applications and frameworks to access different storage systems in the same way.
Alluxio provides several user-friendly APIs, which simplify the transition to Alluxio.
Alluxio enables some large jobs to complete by storing data in Alluxio-managed memory instead of in application memory.
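As an illustration of the tiered storage and unified namespace features above, here is a minimal sketch of a worker configuration and a mount command. All hostnames, paths, and quotas are hypothetical, not Qunar’s actual settings:

```shell
# alluxio-site.properties: a two-tier store (memory over SSD);
# paths and quotas below are illustrative
#   alluxio.worker.tieredstore.levels=2
#   alluxio.worker.tieredstore.level0.alias=MEM
#   alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
#   alluxio.worker.tieredstore.level0.dirs.quota=16GB
#   alluxio.worker.tieredstore.level1.alias=SSD
#   alluxio.worker.tieredstore.level1.dirs.path=/mnt/ssd
#   alluxio.worker.tieredstore.level1.dirs.quota=100GB

# Mount a remote HDFS directory into the Alluxio namespace, so that
# applications see one file system regardless of the backing store
./bin/alluxio fs mount /logs hdfs://namenode:9000/logs
./bin/alluxio fs ls /logs
```

With a mount like this, a job addresses data through a single scheme (e.g. a path under `alluxio://`), and Alluxio transparently serves hot data from the memory tier while persisting to the mounted under-store.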
The rest of this blog compares the stream processing architecture before and after deploying Alluxio. At the end, we briefly describe our future plans with Alluxio.
Original Architecture and Its Problems
Our real-time stream processing system uses Mesos for cluster management, running Spark, Flink, Logstash, and Kibana on top of it.
As shown in the figure above, logs come from multiple sources and are consolidated by Kafka. Every day, our business applications generate about 6 billion log entries. The main computing frameworks, Spark Streaming and Flink, subscribe to the data in Kafka, process it, and persist the results to HDFS. This architecture has the following bottlenecks:
The HDFS storage system storing the input and output data is located in a remote cluster, introducing large network latency. Data exchange becomes one of the bottlenecks in the stream processing system.
HDFS uses spinning disks, so I/O operations, especially writes, have high latency. Spark Streaming executors need to read from HDFS, and the repetitive cross-cluster read operations further decrease overall performance.
When a Spark streaming executor runs out of memory and fails, it will be restarted on a different node, and won’t be able to reuse the checkpoint information from the previous node. Therefore, certain jobs can never finish.
When Spark Streaming tasks persist data using MEMORY_ONLY, data is replicated in JVM memory, causing garbage collection pressure and sometimes failures.
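Several of these bottlenecks stem from jobs addressing the remote HDFS cluster directly. Once Alluxio sits in front of storage, the change on the job side is largely a matter of pointing checkpoint and output paths at Alluxio and putting the Alluxio client on the executor classpath. A hedged sketch follows; the job class, jar paths, and hostnames are hypothetical:

```shell
# Before: checkpoints and results go to a remote HDFS cluster,
# paying cross-cluster network and spinning-disk latency
spark-submit --class com.example.LogAnalysis job.jar \
  hdfs://remote-namenode:9000/checkpoints \
  hdfs://remote-namenode:9000/output

# After: the same job reads and writes Alluxio-managed memory;
# the Alluxio client jar makes the alluxio:// scheme available
spark-submit --jars /opt/alluxio/client/alluxio-client.jar \
  --class com.example.LogAnalysis job.jar \
  alluxio://alluxio-master:19998/checkpoints \
  alluxio://alluxio-master:19998/output
```

Because the checkpoint path now lives in Alluxio rather than in a single executor’s JVM, a restarted executor on a different node can also pick up where the failed one left off.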
