Deployment Options for Tiered Storage Disaggregation

13 Mar 2019,

In the last blog post we discussed the basic design of the Crail disaggregated shuffler as well as its performance under different configurations for two workloads. In this short follow-up blog we briefly describe the various options in Crail for deploying disaggregated storage.

Mixing Disaggregated and Co-located Configurations

In a traditional "non-disaggregated" Crail deployment, the Crail storage servers are deployed co-located with the compute nodes running the data processing workloads. By contrast, a disaggregated Crail deployment refers to a setup where the Crail storage servers -- or more precisely, the storage resources exposed by the storage servers -- are seperated (via a network) from the compute servers running the data processing workloads. Storage disaggregation may be implemented at the level of an entire data center (by provisioning dedicated compute and storage racks), or at the level of individual racks (by dedicating some nodes in a rack exlusively for storage).

Remember that Crail is a tiered storage system where each storage tier consists of a subset of storage servers. Crail permits each storage tier (e.g., RDMA/DRAM, NVMf/flash, etc.) to be deployed and configured independently. This means we can decide to disaggregate one storage tier but use a co-located setup for another tier. For instance, it is more natural to disaggregate the flash storage tier than to disaggregate the memory tier. High-density all-flash storage enclosures are commonly available and often provide NVMe over Fabrics (NVMf) connectivity, thus, exposing such a flash enclosure in Crail is straightforward. High-density memory servers, on the other hand (e.g., AWS x1e.32xlarge), would be wasted if we were not using the CPU to run memory intensive computations. Exporting the memory of compute servers into Crail, however, may still make sense as it allows any server to operate on remote memory as soon as it runs out of local memory. The figure below illustrates three possible configurations of Crail for a single rack deployment:




Remember that a Crail storage server is entirely a control path entity, responsible for (a) registering storage resources (and corresponding access endpoints) with Crail metadata servers, and (b) monitoring the health of the storage resources and reporting this information to the Crail metadata servers. Therefore, a storage server does not necessarily need to run co-located with the storage resource it exports. For instance, one may export an all-flash storage enclosure in Crail by deploying a Crail storage server on one of the compute nodes.

Fine-grained Tiering using Storage and Location Classes

In all of the previously discussed configurations there is a one-to-one mapping between storage media type and storage tier. There are situations, however, where it can be useful to configure multiple storage tiers of a particular media type. For instance, consider a setup where the compute nodes have access to disaggregated flash (e.g., on a remote rack) but are also attached to some amount of local flash. In this case, you may want to priotize the use of flash in the same rack over disaggregated flash in a different rack. And of course you want to also priortize DRAM over any flash if DRAM is available. The way this is done in Crail is through storage and location classes. A reasonable configuration would be to create three storage classes. The first storage class contains the combined DRAM of all compute nodes, the second storage class contains all of the local flash, and the third storage class represents disaggregated flash. The figure below illustrates such a configuration with three storage classes in a simplified single-rack deployment.




Storage classes can easily be defined in the slaves file as follows (see the Crail documentation for details):

crail@clustermaster:~$ cat $CRAIL_HOME/conf/slaves
clusternode1 -t org.apache.crail.storage.rdma.RdmaStorageTier -c 0
clusternode2 -t org.apache.crail.storage.rdma.RdmaStorageTier -c 0
clusternode1 -t org.apache.crail.storage.nvmf.NvmfStorageTier -c 1
clusternode2 -t org.apache.crail.storage.nvmf.NvmfStorageTier -c 1
disaggnode -t org.apache.crail.storage.nvmf.NvmfStorageTier -c 2

One can also manually attach a storage server to a particular storage class:

crail@clusternode2:~$ $CRAIL_HOME/bin/crail datanode -t org.apache.crail.storage.nvmf.NvmfStorageTier -c 2

Remember that the storage class ID is implicitly ordering the storage tiers. During writes, Crail either allocates blocks from the highest priority tier that has free space, or from a specific tier if explicitly requested. The following timeline shows a set of Crail operations and a possible resource allocation in a 3-tier Crail deployment (abbreviations: W-A refers to the create and write operation of a file A, D-A refers to the deletion of file A). Note that at time t10 the system runs out of DRAM space across the entire rack, forcing file C to be partially allocated in local flash. At time t11 the system runs out of tier 1 storage, forcing file D to be partially allocated in disaggregated flash. Subsequently, a set of delete operations (time t13 and t14) free up space in tier 0, allowing file F to be allocated in DRAM.




If applications want to further prioritize the specific local resource of a machine over any other resource in the same storage class they can do so via the location class parameter when creating an object in Crail.

CrailLocationClass local = fs.getLocationClass();
CrailFile file = fs.create("/tmp.dat", CrailNodeType.DATAFILE, CrailStorageClass.DEFAULT, local).get().asFile();

In this case, Crail would first try to allocate storage blocks local to the client machine. Note also that the location class preference is always weighed lower than the storage class preference, therefore Crail would still prioritize a remote block over a local block if the remote block is part of a higher priority storage class. In any case, if no local block can be found, Crail falls back to the default policy of filling up storage tiers in their oder of preference.

Resource Provisioning

During the deployment of Crail, one has to decide on the storage capacity of each individual storage tier or storage class, which is a non-trivial task. One approach is to provision sufficient capacity to make sure that under normal operation the storage demands can be served by the the highest performing storage class, and then allocate additional resources in the local and disaggregated flash tiers to absorb the peak storage demands.




Ideally, we would want individual storage tiers to be elastic in a way that storage capacities can be adjutsed dynamically (and automatically) based on the load. Currently, Crail does not provide elastic storage tiers (adding storage servers on the fly is always possible, but not removing). A recent research project has been exploring how to build elastic storage in the context of serverless computing and in the future we might integrate some of these ideas into Crail as well. Have a look at the Pocket OSDI'18 paper for more details or check out the system at https://github.com/stanford-mast/pocket.

Summary

In this blog we discussed various configuration options in Crail for deploying tiered disaggrated storage. Crail allows mixing traditional non-disaggregated storage with disaggregated storage in a single storage namespace and is thereby able to seamlessly absorb peak storage demands while offering excellent performance during regular operation. Storage classes and location classes in Crail further provide fine-grained control over how storage resources are provisoned and allocated. In the future, we are considering to make resource provisioning in Crail dynamic and automatic, similar to Pocket.