The Crail Project: Crail Python API: Python -> C/C++ call overhead

With python being used in many machine learning applications, serverless frameworks, etc. as the go-to language, we believe a Crail client Python API would be a useful tool to broaden the use-case for Crail. Since the Crail core is written in Java, performance has always been a concern due to just-in-time compilation, garbage collection, etc. However with careful engineering (Off heap buffers, stateful verbs calls, ...) we were able to show that Crail can devliever similar or better performance compared to other statically compiled storage systems. So how can we engineer the Python library to deliver the best possible performance?

Python's reference implementation, also the most widely-used, CPython has historically always been an interpreter and not a JIT compiler like PyPy. We will focus on CPython since its alternatives are in general not plug-and-play replacements.

Crail is client-driven so most of its logic is implemented in the client library. For this reason we do not want to reimplement the client logic for every new language we want to support as it would result in a maintance nightmare. However interfacing with Java is not feasible since it encurs in to much overhead so we decided to implement a C++ client (more on this in a later blog post). The C++ client allows us to use a foreign function interface in Python to call C++ functions directly from Python.

Options, Options, Options

There are two high-level concepts of how to integrate (C)Python and C: extension modules and embedding.

Embedding Python uses Python as a component in an application. Our aim is to develop a Python API to be used by other Python applications so embeddings are not what we look for.

Extension modules are shared libraries that extend the Python interpreter. For this use-case CPython offers a C API to interact with the Python interpreter and allows to define modules, objects and functions in C which can be called from Python. Note that there is also the option to extend the Python interpreter through a Python library like ctypes or cffi. They are generally easier to use and should preserve portability (extension modules are CPython specific). However they do not give as much flexibility as extension modules and incur in potentially more overhead (see below). There are multiple wrapper frameworks available for CPython's C API to ease development of extension modules. Here is an overview of frameworks and libraries we tested:

Cython: optimising static compiler for Python and the Cython programming language (based on Pyrex). C/C++ function, objects, etc. can be directly accessed from Cython. The compiler generates C code from Cython which interfaces with the CPython C-API.
SWIG: (Simplified Wrapper and Interface Generator) is a tool to connect C/C++ with various high-level languages. C/C++ interfaces that should be available in Python have to be defined in a SWIG interface file. The interface files are compiled to C/C++ wrapper files which interface with the CPython C-API.
Boost.Python: is a C++ library that wraps CPython’s C-API. It uses advanced metaprogramming techniques to simplify the usage and allows wrapping C++ interfaces non-intrusively.
ctypes: is a foreign function library. It allows calling C functions in shared libraries with predefined compatible data types. It does not require writing any glue code and does not interface with the CPython C-API directly.

Benchmarks

In this blog post we focus on the overhead of calling a C/C++ function from Python. We vary the number of arguments, argument types and the return types. We also test passing strings to C/C++ since it is part of the Crail API e.g. when opening or creating a file. Some frameworks expect bytes when passing a string to a underlying const char *, some allow to pass a str and others allow both. If C++ is supported by the framework we also test passing a std::string to a C++ function. Note that we perform all benchmarks with CPython version 3.5.2. We measure the time it takes to call the Python function until it returns. The C/C++ functions are empty, except a return statement where necessary.

The plot shows that adding more arguments to a function increases runtime. Introducing the first argument increases the runtime the most. Adding a the integer return type only increased runtime slightly.

As expected, cytpes as the only test which is not based on extension modules performed the worst. Function call overhead for a function without return value and any arguments is almost 300ns and goes up to 1/2 a microsecond with 4 arguments. Considering that RDMA writes can be performed below 1us this would introduce a major overhead (more on this below in the discussion section).

SWIG and Boost.Python show similar performance where Boost is slightly slower and out of the implementations based on extension modules is the slowest. Cython is also based on extension modules so it was a surprise to us that it showed the best performance of all methods tested. Investigating the performance difference between Cython and our extension module implementation we found that Cython makes better use of the C-API.

Our extension module implementation follows the official tutorial and uses PyArg_ParseTuple to parse the arguments. However as shown below we found that manually unpacking the arguments with PyArg_UnpackTuple already significantly increased the performance. Although these numbers still do not match Cython’s performance we did not further investigate possible optimizations to our code.

Let’s take a look at the string performance. bytes and str is used whereever applicable. To pass strings as bytes the ‘b’ prefix is used.

Again Cython and the extension module implementation with manual unpacking seem to deliver the best performance. Passing a 64bit value in form of a const char * pointer seems to be slightly faster than passing an integer argument (up to 20%). Passing the string to a C++ function which takes a std::string is ~50% slower than passing a const char *, probably because of the instantiation of the underlying data buffer and copying however we have not confirmed this.

Discussion

One might think a difference of 100ns should not really matter and you should anyway not call to often into C/C++. However we believe that this is not true when it comes to latency sensitive or high IOPS applications. For example using RDMA one can perform IO operations below a 1us RTT so 100ns is already a 10% performance hit. Also batching operations (to reduce amount of calls to C) is not feasible for low latency operations since it typically incurs in wait time until the batch size is large enough to be posted. Furthermore, even in high IOPS applications batching is not always feasible and might lead to undesired latency increase.

Efficient IO is typically performed through an asynchronous interface to allow not having to wait for IO to complete to perform the next operation. Even with an asynchronous interface, not only the latency of the operation is affected but the call overhead also limits the maximum IOPS. For example, in the best case scenario, our async call only takes one pointer as an argument so 100ns call overhead. And say our C library is capable of posting 5 million requests per seconds (and is limited by the speed of posting not the device) that calculates to 200ns per operation. If we introduce a 100ns overhead we limit the IOPS to 3.3 million operations per second which is a 1/3 decrease in performance. This is already significant consider using ctypes for such an operation now we are talking about limiting the throughput by a factor of 3.

Besides performance another aspect is the usability of the different approaches. Considering only ease of use ctypes is a clear winner for us. However it only supports to interface with C and is slow. Cython, SWIG and Boost.Python require a similar amount of effort to declare the interfaces, however here Cython clearly wins the performance crown. Writing your own extension module is feasible however as shown above to get the best performance one needs a good understanding of the CPython C-API/internals. From the tested approaches this one requires the most glue code.

Setup & Source Code

All tests were run on the following system:

Intel(R) Core(TM) i7-3770
16GB DDR3-1600MHz
Ubuntu 16.04 / Linux kernel version 4.4.0-142
CPython 3.5.2

The source code is available on GitHub