SQL Performance: Part 1 - Input File Formats

08 Aug 2018. This is a blog post from a user of the Crail project.

This is the first user blog post in a multi-part series where we focus on relational data processing performance (e.g., SQL) in the presence of high-performance network and storage devices - the kind of devices that Crail targets. Relational data processing is one of the most popular and versatile workloads people run in the cloud. The general idea is that data is stored in tables with a schema and is processed using a domain-specific language like SQL. Examples of popular systems that support such relational data analytics in the cloud are Apache Spark/SQL, Apache Hive, Apache Impala, etc. In this post, we discuss the important first step in relational data processing: reading the input data tables.

Hardware and Software Configuration

The specific cluster configuration used for the experiments in this blog is a 3-node HDFS cluster of 16-core Xeon machines connected over a 100 Gbps RoCE network, with one datanode holding 4 NVMe devices that deliver a collective aggregate bandwidth of 12.5 GB/s (matching the 100 Gbps network).

Overview

In a typical cloud-based relational data processing setup, the input data is stored on an external data storage solution like HDFS or AWS S3. Data tables and their associated schemas are converted into a storage-friendly format for optimal performance. Examples of some popular and familiar file formats are Apache Parquet, Apache ORC, Apache Avro, JSON, etc. More recently, Apache Arrow has been introduced to standardize the in-memory columnar data representation between multiple frameworks. To be precise, Arrow is not a storage format, but it defines an interprocess communication (IPC) format that can be used to store data in a storage system (our binding for reading Arrow IPC messages from HDFS is available here). There is no one-size-fits-all: each of these formats has its own strengths, weaknesses, and features. In this blog, we are specifically interested in the performance of these formats on modern high-performance networking and storage devices.
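To make the reading path concrete, here is a minimal sketch (not the HDFS binding linked above) of how Arrow IPC stream messages stored on HDFS can be consumed with the stock Arrow Java and Hadoop APIs. The namenode address, file path, and the choice of the ss_quantity column (an integer column of TPC-DS store_sales) are illustrative assumptions.

```java
// Minimal sketch: read Arrow IPC stream batches from HDFS and sum one column.
// Host, port, path, and column name are placeholders.
import java.net.URI;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArrowHdfsSum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
        long sum = 0;
        try (FSDataInputStream in = fs.open(new Path("/tpcds/store_sales.arrow"));
             RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            while (reader.loadNextBatch()) {             // one IPC record batch at a time
                IntVector qty = (IntVector) root.getVector("ss_quantity");
                for (int i = 0; i < root.getRowCount(); i++) {
                    if (!qty.isNull(i)) sum += qty.get(i);
                }
            }
        }
        System.out.println("sum(ss_quantity) = " + sum);
    }
}
```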

Figure 1: The benchmarking setup with HDFS and file formats on a 100 Gbps network with NVMe flash devices. All formats contain routines for compression, encoding, and value materialization, together with the associated I/O buffer management and data copy routines.

To benchmark the performance of file formats, we wrote a set of micro-benchmarks, which are available at https://github.com/zrlio/fileformat-benchmarks. We cannot use typical SQL micro-benchmarks because every SQL engine has its own favorite file format, on which it performs best. Hence, in order to ensure parity, we decoupled the performance of reading the input file format from the SQL query processing by writing simple table-reading micro-benchmarks. Our benchmark reads the store_sales table from the TPC-DS dataset (scale factor 100) and calculates a sum of the values present in the table. The table contains 23 columns of integers, doubles, and longs.
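As a rough illustration of what such a table-reading micro-benchmark looks like, here is a simplified sketch for the Avro format that reads a local file and sums all numeric column values. The actual benchmarks in the repository above read from HDFS and cover JSON, Avro, Parquet, ORC, and Arrow; the file name here is a placeholder.

```java
// Simplified read-and-sum micro-benchmark for an Avro file (local path for brevity).
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroReadSumBench {
    // Read every row of the Avro file and sum all numeric column values.
    static double readAndSum(File input) throws IOException {
        double sum = 0;
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(input, new GenericDatumReader<GenericRecord>())) {
            GenericRecord rec = null;
            while (reader.hasNext()) {
                rec = reader.next(rec);                 // reuse the record object
                for (Schema.Field f : rec.getSchema().getFields()) {
                    Object v = rec.get(f.pos());        // int, long, double, or null
                    if (v instanceof Number) sum += ((Number) v).doubleValue();
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        File input = new File(args.length > 0 ? args[0] : "store_sales.avro");
        long start = System.nanoTime();
        double sum = readAndSum(input);
        double secs = (System.nanoTime() - start) / 1e9;
        double gbps = input.length() * 8 / secs / 1e9;  // bandwidth based on on-disk file size
        System.out.printf("sum=%.2f bandwidth=%.2f Gbps%n", sum, gbps);
    }
}
```

Note that this sketch reports bandwidth based on the on-disk file size; the goodput figures discussed later in this post count only the materialized data bytes.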

Figure 2: Performance of JSON, Avro, Parquet, ORC, and Arrow on NVMe devices over a 100 Gbps network.

We evaluate the performance of the benchmark on a 3-node HDFS cluster connected using 100 Gbps RoCE. One datanode in HDFS contains 4 NVMe devices with a collective aggregate bandwidth of 12.5 GB/s (equal to 100 Gbps; hence, we have balanced network and storage performance). Figure 2 shows our results, where none of the file formats is able to deliver the full hardware performance for reading input files. One third of the performance is already lost in HDFS (maximum throughput of 74.9 Gbps out of a possible 100 Gbps). The rest of the performance is lost inside the file format implementations, which need to deal with encoding, buffer and I/O management, compression, etc. The best performer is Apache Arrow, which is designed for in-memory columnar datasets. The performance of these file formats is bounded by the performance of the CPU, which is 100% loaded during the experiment. For a detailed analysis of the file formats, please refer to our paper - Albis: High-Performance File Format for Big Data Systems (USENIX ATC’18). As a side note on the Arrow performance: we evaluated the implementation of Arrow’s Java library. As this library has been focused on interactions with off-heap memory, there is headroom for optimizing the HDFS/on-heap reading path of Arrow’s Java library.

Albis: High-Performance File Format for Big Data Systems

Based on these findings, we have developed a new file format called Albis. Albis is built on design choices similar to Crail's. The top-level idea is to leverage the performance of modern networking and storage devices without being bottlenecked by the CPU. While designing Albis we revisited many outdated assumptions about the nature of I/O in a distributed setting and came up with a set of design ideas, which we discuss in detail in our Albis paper referenced above.
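To give a flavor of what avoiding CPU-heavy work on the read path means, here is a toy sketch - explicitly not the Albis on-disk format or API - of reading rows laid out as plain, fixed-width binary values with a simple null bitmap. With no compression or decoding step, the per-row work reduces to offset arithmetic and direct loads.

```java
// NOT the Albis format. A toy illustration of the general idea: plain,
// fixed-width binary values plus a per-row null bitmap keep the read path short.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PlainBinaryRowReader {
    // Toy schema: 4 INT columns followed by 4 DOUBLE columns, row-major layout.
    private static final int INT_COLS = 4;
    private static final int DOUBLE_COLS = 4;
    private static final int ROW_SIZE = 4 /* null bitmap */ + INT_COLS * 4 + DOUBLE_COLS * 8;

    public static double sumAll(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        double sum = 0;
        int rows = buf.remaining() / ROW_SIZE;
        for (int r = 0; r < rows; r++) {
            int base = r * ROW_SIZE;
            int nulls = buf.getInt(base);               // 1 bit per column
            int off = base + 4;
            for (int c = 0; c < INT_COLS; c++, off += 4) {
                if ((nulls & (1 << c)) == 0) sum += buf.getInt(off);
            }
            for (int c = 0; c < DOUBLE_COLS; c++, off += 8) {
                if ((nulls & (1 << (INT_COLS + c))) == 0) sum += buf.getDouble(off);
            }
        }
        return sum;
    }
}
```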

Figure 3: Core scalability of JSON, Avro, Parquet, ORC, Arrow, and Albis on HDFS/NVMe.

Using the Albis format, we revisit our previous experiment where we read the input store_sales table from HDFS. In the figure above, we show the performance of Albis and the other file formats as we vary the number of CPU cores involved. At the right-hand end of the x-axis, all 16 cores are engaged, hence representing the peak possible performance. As evident, Albis delivers 59.9 Gbps out of the 74.9 Gbps possible bandwidth with HDFS over NVMe. Albis's performance is 1.9 - 21.4x better than that of the other file formats. To give an impression of where the performance is coming from, in the table below we show some micro-architectural metrics for Parquet, ORC, Arrow, and Albis. Our previously discussed design ideas in Albis result in a shorter code path (shown as fewer instructions required per row), better cache behavior (shown as fewer cache misses per row), and clearly better performance (shown as fewer nanoseconds required per row for processing). For a detailed evaluation of Albis, please refer to our paper.

Table 1: Micro-architectural analysis for Parquet, ORC, Arrow, and Albis on a 16-core Xeon machine.

                  Parquet   ORC    Arrow   Albis
Instructions/row  6.6K      4.9K   1.9K    1.6K
Cache misses/row  9.2       4.6    5.1     3.0
Nanoseconds/row   105.3     63.0   31.2    20.8
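The per-row counters in Table 1 come from hardware performance counters. As a purely illustrative sketch (the paper, not this post, describes the actual measurement methodology), one way to collect comparable numbers for a JVM-based reader is JMH's perfnorm profiler, which reports Linux perf counters normalized per benchmark invocation; dividing those by the table's row count then gives per-row figures. The class below and its reuse of the readAndSum routine from the earlier sketch are assumptions.

```java
// Hypothetical harness, not the tooling behind Table 1. JMH's "perfnorm"
// profiler reports Linux perf counters (instructions, cache-misses, cycles, ...)
// normalized per benchmark invocation; divide by the input's row count to get
// per-row figures like the ones in the table above.
import java.io.File;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Benchmark)
public class FormatCounterBench {

    @Benchmark
    public double readStoreSales() throws Exception {
        // One full pass over the table: reuse the read-and-sum routine from the
        // earlier Avro sketch (or swap in an equivalent Parquet/ORC/Arrow reader).
        return AvroReadSumBench.readAndSum(new File("store_sales.avro"));
    }

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(FormatCounterBench.class.getSimpleName())
                .addProfiler("perfnorm")   // requires Linux perf_events support
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}
```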

Apache Crail (Incubating) with Albis

For our final experiment, we try to answer the question of what it would take to deliver the full 100 Gbps bandwidth for Albis. Certainly, the first bottleneck to address is the performance of the base storage layer. Here we use Apache Crail (Incubating) with its NVMeF storage tier. This tier uses the jNVMf library to implement the NVMeF stack in Java. As we have shown in a previous blog post, Crail's NVMeF tier can deliver performance (97.8 Gbps) very close to the hardware limits. Hence, Albis with Crail is a perfect setup to evaluate on high-performance NVMe and RDMA devices. Before we get there, let's get some calculations right. The store_sales table in the TPC-DS dataset has a data density of 93.9% (out of every 100 bytes, only 93.9 bytes are data; the rest are null values). As we measure the goodput, the expected performance of Albis on Crail is 93.9% of 97.8 Gbps, which comes to 91.8 Gbps. In our experiments, Albis on Crail delivers 85.5 Gbps. Figure 4 shows more detailed results.

Figure 4: Performance of Albis on Crail.

The left half of the figure shows the performance scalability of Albis on Crail from 1 core (8.9 Gbps) to 16 cores (85.5 Gbps). In comparison, the right half of the figure shows the performance of Albis on HDFS/NVMe at 59.9 Gbps and on Crail/NVMe at 85.5 Gbps. The last bar shows the performance of Albis if the benchmark does not materialize Java object values. In this configuration, Albis on Crail delivers 91.3 Gbps, which is very close to the expected peak of 91.8 Gbps.
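A practical note on how such a backend switch looks in code: since Crail exposes a Hadoop-compatible FileSystem interface, a read benchmark written against the Hadoop FileSystem API can, in principle, be pointed at Crail instead of HDFS simply by changing the URI scheme. The sketch below is illustrative only; it assumes Crail's HDFS adapter is installed and configured, and the hosts, ports, and paths are placeholders rather than the exact setup used for Figure 4.

```java
// Illustrative only: switching the benchmark's storage backend by URI scheme,
// assuming Crail's Hadoop FileSystem adapter is on the classpath and configured.
// Hosts, ports, and paths are placeholders.
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StorageBackendSwitch {
    static InputStream open(String uri, String file) throws Exception {
        Configuration conf = new Configuration();         // picks up core-site.xml
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        return fs.open(new Path(file));                    // same reader code for both backends
    }

    public static void main(String[] args) throws Exception {
        try (InputStream hdfs = open("hdfs://namenode:9000/", "/tpcds/store_sales")) { /* ... */ }
        try (InputStream crail = open("crail://namenode:9060/", "/tpcds/store_sales")) { /* ... */ }
    }
}
```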

Summary

In this first blog of a multi-part series, we have looked at the data ingestion performance of file formats on high-performance networking and storage devices. We found that popular file formats are in need of a performance revision. Based on our analysis, we designed and implemented Albis - a new file format for storing relational data. Albis and Crail share many design choices. Their combined performance of 85+ Gbps on a 100 Gbps network gives us confidence in our approach and the underlying software philosophy for both Crail and Albis.

Stay tuned for the next part, where we look at workload-level performance in Spark/SQL on modern high-performance networking and storage devices. In the meantime, let us know if you have any feedback or comments.