RonDB YCSB Benchmarks

by Salman Niazi and Mikael Ronström
RonDB provides low Latency, high Availability, high Throughput, scalable Storage (LATS). In this in-depth Article, we present different benchmark results demonstrating the LATS capabilities of RonDB.

We use YCSB, a popular industry standard for benchmarking tool for cloud databases, to benchmark the performance of RonDB. We compare the performance of  (1) using native RonDB clients and (2) using MySQL Server(s). Using native clients not only delivers higher throughput and lower latency but also saves significant CPU resources.  For applications that cannot use native clients, we show that RonDB still delivers very high throughput and low latency via MySQL servers.

You can also navigate via the LATS visual; simply click on the portion of interest.

Part 1: RonDB Performance for In-Memory Data

In the first part, we will demonstrate the low Latency and high Throughput of RonDB. In the second part of the article, we will test the high Availability of RonDB.  We will show that RonDB can handle node failures without significantly impacting the performance of the cluster.

RonDB is an in-memory database,
in the last part we will demonstrate how RonDB can handle datasets (tables) that are too large to fit in the cluster memory. RonDB supports scalable Storage, where such large tables can store their non-indexed data on NVMe drives. We will demonstrate that RonDB manages to deliver low latencies and high throughput for tables stored on disks.

For this test, we have chosen Yahoo! Cloud Serving Benchmark (YCSB), a widely used benchmarking tool. YCSB is a well-known benchmark developed at Yahoo! research. YCSB focuses on comparing the performance of a new generation of cloud data serving systems. YCSB contains many workloads with varying percentages of “Get”, “Put”, “Delete” and “Scan” operations. These workloads represent real-life cloud applications. More information can be found in this paper about benchmarking with YSCB.

Experiment Setup

The benchmark was run in Amazon EC2. The database consisted of two datanodes running on i3.4xlarge (16 vCPUs, 122 GB Mem, 2 x 1,900 GB NVMe SSDs, 10 Gbps network interface) instances with a gp3 root disk of 512 GB volume. These instance types have high memory and support NVMe drives needed to test both the in-memory and on-disk performance of RonDB. 

Data stored in RonDB can be accessed directly, that is, using a native client which performs the operations directly on the RonDB datanodes. We can also access the data stored in RonDB using SQL queries that run on a MySQL server connected to RonDB datanodes (using the NDB storage engine).

RonDB native API does have a security model. We recommend using native API in closed environments where the database is accessed by trusted clients in the network. For secure deployments, we recommend using MySQL server which has a well-known security model. We ran the YCSB benchmarks using both the MySQL Server and using the native Java client (ClusterJ). We ran a total of 3 MySQL Servers using c5.4xlarge instance types (16 VCPUs) and 3 YCSB benchmarking clients also running on c5.4xlarge instances. The RonDB management server was running on a separate c5.xlarge instance. The root disks for YCSB, MySQL Server, and management node were 128 GB gp2 volumes.

Experiment setup with YSCB

For the dataset, we created the following `usertable` table. By default, YCSB uses a table with ten columns. The default column size is 100. We used utf8_unicode_ci encoding for all the columns. Utf8_unicode_ci uses 4 bytes to store each character. Therefore the actual combined size of data columns for each row is ~4 KB not including the primary key. The initial dataset size was set to 60 million rows. 

create table code

In the following section we discuss different YCSB workloads and the performance of RonDB in details. The conclusion of this benchmark is that even a small RonDB cluster consisting of only two datanodes provides excellent performance.  Native clients not only deliver higher throughput and lower latency but also save significant CPU resources. Also, RonDB still delivers very high throughput and low latency via MySQL servers albeit requiring additional CPU resources for the MySQL Server(s). 

Workload A is an update-heavy workload with 50% read operations and 50% Update operations.

It uses Zipfian distribution for record selection. This workload represents applications such as session stores that record users' activities during a user session.  

Using one client for benchmarking, the throughput for ClusterJ and JDBC is 51 thousand ops/sec and 46 thousand ops/sec respectively. Using two benchmark clients, the throughput increases by 51% for ClusterJ based YCSB Clients and 49% for JDBC-based YCSB clients. With three benchmark clients, the throughput increases by 23% for the ClusterJ, 24% for the JDBC JDBC-based YCSB clients.

This means we are writing around 200 MBytes per second to RonDB and reading the same amount of data.

Workload A Throughput

As expected the average latency for reading data is lower using ClusterJ than JDBC. JDBC requires an additional hop to a MySQL Server to read data stored in RonDB.

On average it takes approximately 400 microseconds to read data using ClusterJ and 500 microseconds to read the data using JDBC-based YCSB clients. 95 percentile read latencies are also lower for ClusterJ based YCSB clients than the JDBC base YCSB clients.  95 percentile latency for data read operations is 521,  559, and 542 microseconds for one, two, and three ClusterJ-based YCSB clients respectively. The 95 percentile latency for data read operations is 547, 649, and 657 microseconds for one, two, and three JDBC-based YCSB clients. However, for 99 percentile latency of read operations, we observed that MySQL Server was better at handling the high load. 

Workload A read latency

Similarly, as expected the average latency for updating the data is lower using ClusterJ than JDBC. On average it takes on average 1.42 milliseconds, 1.99 milliseconds, and 2.53 milliseconds to update a row using one, two, and three ClusterJ based YCSB clients respectively. Average latencies using JDBC are slightly higher. JDBC on average takes 1.5 milliseconds, 2.15 milliseconds, and 2.73 milliseconds to update a row using one, two, and three JDBC-based YCSB clients respectively. ClusterJ also has lower 95 and 99 percentile latencies for update operations.

As can be seen, the read latency goes down with more YCSB clients. The likely reason for this is that with only one YCSB client much of the latency comes from latency in the YCSB client. As can be seen from the numbers for 3 YCSB clients the latency coming from the RonDB data nodes is below 1 millisecond even for the 99 percentile.

Workload A Update

Workload B is a read-heavy workload with 95% read operations and 5% Update operations.
It uses Zipfian distribution for record selection. This workload represents applications such as photo tagging. Adding a tag is an update operation, but most operations are read-only.

For one ClusterJ based YCSB client, the average throughput is 97.2 thousand operations per second. For one JDBC-based YCSB client, the average throughput is 81.3 thousand operations per second. Using two benchmarking nodes the throughput increases by 73% for ClusterJ and by 72% for JDBC base YCSB clients.

For 3 benchmarking nodes the throughput only increases by 16% reaching 196.8 thousand operations per second for ClusterJ based YCSB clients, and increases by 30%  reaching 182.6 thousand operations per second for JDBC base YCSB clients. This suggests that for this workload two YCSB clients can almost fully load a  RonDB cluster with two database nodes.

This means we are reading almost 800 MB/sec from RonDB and writing about 40 MB/sec, this we are approaching the limits of the Ethernet bandwidth that is provided in this VM type.

Workload B Throughput

Similar to workload A the average latency for the read operations using ClusterJ is lower than JDBC. Using ClusterJ the average read latency for read operations is 433 microseconds, 485 microseconds, and 614 microseconds for one, two, and three benchmarking clients respectively.

While for JDBC the average read latency is 542 microseconds, 609 microseconds, and 690 microseconds for one, two, and three JDBC-based YCSB clients respectively. 95 percentile read latencies are lower for ClusterJ with one and two YCSB clients.

With three nodes the 95 percentile read latency is more than the JDBC benchmarking setup because ClusterJ is able to squeeze maximum performance for the read operations from the RonDB cluster with two clients and adding additional clients does not significantly increase throughput for this workload, increasing the latency of the operations.

Workload B Read

Similarly, ClusterJ has lower average latency for update operations. Using ClusterJ the average update latency for update operations is 1.4 milliseconds, 1.9 milliseconds, and 2.5 milliseconds for one, two, and three benchmarking clients respectively.

While for JDBC the average update latency is 1.5 milliseconds, 2.15 milliseconds, and 2.7  milliseconds for one, two, and three YCSB clients respectively.  ClusterJ also has lower 95 percentile latencies for update operations. ClusterJ has 3.3 milliseconds, 4.6 milliseconds, and 6.295 milliseconds 99 percentile update latencies for one, two, and three YCSB clients respectively.

While JDBC has 2.9 milliseconds, 5.5 milliseconds, 7.8 milliseconds 99 percentile update latencies for one, two, and three YCSB clients setups respectively. 

Workload B Update

Workload C is a read-heavy workload with 100% read operations. It uses Zipfian distribution for record selection. This workload represents applications such as user profile caching, where profiles are constructed elsewhere (e.g., Hadoop).

For one ClusterJ-based YCSB client, the average throughput is 108.2 thousand operations per second, and for one JDBC-based YCSB client, the average throughput is 86.2 thousand operations per second. The throughput is higher than workload B because this workload does not have any update operations which are significantly slower than read operations. Using two benchmarking nodes the throughput increases by 71% for ClusterJ and by 77% for JDBC base YCSB clients.

For 3 benchmarking nodes the throughput only increases by 18% reaching 219.7 thousand operations per second for ClusterJ based YCSB clients, and increases by 23%  reaching 188.7 thousand operations per second for JDBC base YCSB clients. For read-only workload, this suggests that two YCSB clients can almost fully load a  RonDB cluster with two database nodes. 

In managed RonDB we provide the options to use VM instance types where the servers use 100G Ethernet rather than 25G Ethernet. This removes the network as the limiting factor. In this benchmark, we read almost 900 MByte per second and with overhead we get very close to the network bandwidth that this VM type can handle. Using network-optimized VM types should also improve the latency.

Workload C Throughput

Similar to workload B the average latency for the read operations using ClusterJ is lower than JDBC. Using ClusterJ the average read latency for read operations is 431 microseconds, 505 microseconds, and 638 microseconds for one, two, and three benchmarking clients respectively. While for JDBC the average read latency is 546 microseconds, 617 microseconds, and 750 microseconds for one, two, and three JDBC-based YCSB clients respectively.

95 percentile read latencies are lower for ClusterJ. ClusterJ has 99 percentile read latencies of 774 microseconds, 2.044 milliseconds, and 1.8 milliseconds; while JDBC has 99 percentile read latencies of 914 microseconds, 1.211 milliseconds, and 1.67 milliseconds for one, two, three benchmarking clients respectively.

One important reason why ClusterJ has a slightly higher 99% latency number is simply that it has higher throughput and the benchmark is executed at the limits of both CPU and network performance.

Workload B Read

Workload D is also a read-heavy workload with 95% read operations and 5% insert (PUT) operations. It uses the "Latest" request distribution scheme for record selection. This workload represents applications, such as user status updates; people want to read the latest statuses.

This workload is very similar to workload B.  Workload B contains 5% update operations that modify the existing rows in the dataset; while workload D contains 5% insert operations, that is new rows are added to the dataset, and the size of the dataset increases as the workload runs. For one ClusterJ based YCSB client, the average throughput is 98.8 thousand operations per second, and for one JDBC-based YCSB client, the average throughput is 78.1 thousand operations per second. Similar to workload B using two benchmarking nodes the throughput increases by 73% for ClusterJ and by 72% for JDBC base YCSB clients. For 3 benchmarking nodes the throughput only increases by 13% reaching 194.6 thousand operations per second for ClusterJ based YCSB clients, and increases by 30%  reaching 175.9 thousand operations per second for JDBC base YCSB clients.

This benchmark shows that Update operations and Insert operations have very similar characteristics. The Insert operation is slightly more expensive in RonDB compared to Update, but the difference is fairly small. Delete operations are cheaper simply since they do not need to perform any writes of the data, it is similar to very small update operations.

Workload D Throughput

The read operation latency statistics are similar to workload B. Using ClusterJ the average read latency for read operations is 431 microseconds, 479 microseconds, and 624 microseconds for one, two, and three benchmarking clients respectively. While for JDBC the average read latency is 551 microseconds, 618 microseconds, and 702 microseconds for one, two, and three JDBC-based YCSB clients respectively. 95 percentile read latencies are lower for ClusterJ with one and two YCSB clients.

With three nodes the 95 percentile read latency is more than the JDBC benchmarking setup because ClusterJ is able to squeeze maximum performance for the read operations from the RonDB cluster with two clients and adding additional clients does not significantly increase throughput for this workload, increasing the latency of the operations. Similarly, 99 percentile latencies for read operations are lower for ClusterJ for setups with one, and two YCSB clients.

ClusterJ has 765 microseconds, 1.03 milliseconds, and 2.04 milliseconds 99 percentile read latency; while JDBC has 958 microseconds, 1.2 milliseconds, and 1.56 milliseconds 99 percentile read latency for one, two, and three-node YCSB benchmarking setups

Workload D Read

ClusterJ has lower average latency for Insert operations. Using ClusterJ the average latency for insert operations is 1.26 milliseconds, 1.83 milliseconds, and 2.56 milliseconds for one, two, and three benchmarking clients respectively. While for JDBC the average insert latency is 1.59 milliseconds, 2.22 milliseconds, and 2.77  milliseconds for one, two, and three YCSB clients respectively. 

ClusterJ also has lower 95  percentile latencies for insert operations. ClusterJ has 2.3 milliseconds, 3.8 milliseconds, and 7.26 millisecond 99 percentile insert latencies for one, two, and three YCSB clients respectively. While JDBC has 2.8 milliseconds, 4.3 milliseconds, 5.6 milliseconds 99 percentile insert latencies for one, two, and three YCSB clients setups respectively. 

Workload D Insert

Workload E is also a read-heavy workload. However, unlike the previous workloads, it comprises 95% scan operations and 5% insert operations. It uses Zipfian distribution for record selection, and it uses uniform distribution for the length of the scan operations. Default scan length is set to 100. This workload represents application with threaded conversations, where each scan is for the posts in a given thread (assumed to be clustered by thread id)

As this workload contains 95% scan operations, this workload reads a significant amount of more data over the network than any other workload in YCSB. For one ClusterJ based YCSB client, the average throughput is 4.4 thousand operations per second, and for one JDBC-based YCSB client, the average throughput is 12.8 thousand operations per second. Using two benchmarking nodes the throughput increases by 104% for ClusterJ and by 79% for JDBC base YCSB clients.

For 3 benchmarking nodes the throughput increases by 47% reaching 13.3 thousand operations per second for ClusterJ based YCSB clients, and increases by 5%  reaching 24.4 thousand operations per second for JDBC base YCSB clients. In this case, JDBC outperforms ClusterJ as MySQL is optimized for performing multiple scan operations in parallel. Although ClusterJ API supports scan operations, the implementation of the scan operations is not optimal.

Workload E Throughput

For workload E, ClusterJ has lower latency for insert operations. Using ClusterJ the average latency for insert operations is 1.96  milliseconds, 2.2 milliseconds, and 2.5 milliseconds for one, two, and three benchmarking clients respectively.

While for JDBC the average insert latency is 2.6 milliseconds, 3.7 milliseconds, and 4.9 milliseconds. for one, two, and three JDBC-based YCSB clients respectively. ClusterJ has lower 95 and 99 percentile latencies for insert operations to two and three-node YCSB benchmarking setups. For single YCSB node benchmarking setups, ClusterJ has slightly higher latency due to unoptimized implementation of scan operations putting a load on the client. 

Workload E insert

The following graph shows the latency for scan operations. ClusterJ has two to three times higher latency for scan operations than JDBC. As mentioned earlier this is due to the unoptimized implementation of scan operations in ClusterJ API.

For 99 percentile latencies for scan operation, ClusterJ has 41.2 milliseconds, 39.151 milliseconds, and 39.103 milliseconds latency for one, two, and three-node YCSB benchmarking setups respectively. While JDBC has 7.4 milliseconds, 12.139 milliseconds, and 22.4 milliseconds 99 percentile latency for scan operations for one, two, and three YCSB benchmarking setups respectively. 

Workload E Scan

Workload F is an update-heavy workload with 50% read operations and 50% update operations. This workload is similar to workload A, which also contains 50% read operations and 50% update operations.

The main difference between the two workloads is that workload A does not guarantee that it updates the same rows returned by the read operations. While workload F only updates the rows that are read from the database. It uses Zipfian distribution for record selection. This workload represents applications such as a user database, where user records are read and modified by the user. 

Using one client for benchmarking, the throughput for ClusterJ and JDBC is 42 thousand ops/sec and 37 thousand ops/sec respectively. Using two benchmark clients, the throughput increases by 54% for ClusterJ based YCSB Clients and 62% for JDB based YCSB clients. With three benchmark clients, the throughput increases by 11% for the ClusterJ, 26% for the JDBC JDBC-based YCSB clients. Maximum throughput observed using ClusterJ was 73 thousand operations per second, and 76 thousand operations per second using JDBC. 

Workload F Throughput

There is a minor difference in latencies for read-modify-write operations for one and two-node benchmarking setups. For three-node benchmarking setups ClusterJ has higher latency for read-modify-write operations. 

Workload F Throughput

Part 2: RonDB Performance and recovery during failures

Experiment Setup

The experiments are set up exactly as described in the first part detailing the performance of RonDB for in-memory datasets. The only difference is that this time we used three RonDB datanodes to test failures of datanodes. In these experiments, we run the three YCSB clients using ClusterJ and record performance statistics every ten seconds. For these experiments, we choose to run a read-intensive workload with 95% read, and 5% update operations(workload B, see part 1).

After every 10 seconds, we record average throughput, read, and update operations’ latencies in the last 10 seconds.
There are two restart scenarios for a RonDB datanode. 1) Node restart: this is an online restart of a cluster node while the database cluster itself is running. Node restart is needed when the database process fails due to a bug or for online upgrades. 2) Initial node restart: This is the same as a node restart, except that the node is reinitialized and started with a clean file system.

Datanode Restart 

When a node is started it goes through three main stages. The operations performed in these stages differ slightly depending on whether it is a normal node restart or initial node restart. 

For normal restart following operations are performed in these stages:
Local Recovery Stage (LRS): In this stage modules for node initialization, cluster management, and signal handling are started. The node starts the heartbeat protocol and joins the cluster. Then it reads data from the local checkpoint. It applies all redo information until it reaches the latest restorable global checkpoint and rebuilds indexes for the restored data
Sync Stage (SS): The node synchronizes its data by pulling the latest data changes from other live nodes in the cluster.
Wait for LCP Stage (WLCP): Once all the data has been synchronized it waits for local checkpoints to restart. After this stage is done the newly-joined node takes over responsibility for delivering its primary data to subscribers.

The following graphs show throughput and latencies for read and update operations when one node in the cluster is restarted. The horizontal axis represents test run duration, and the vertical axis shows throughput or latency in milliseconds.

The following table shows different times taken by each restart stage. 

Table _ Experiment on failure

It must be noted that the cluster is operating at maximum load while performing the node restart. The RonDB cluster is operating normally for the first 800 seconds.  The average throughput of the cluster is around 220 thousand operations per second before the restart is initiated.

One of the nodes in the cluster is restarted, indicated by "R" in the graphs. We see a sudden drop in the performance of the cluster. There are multiple reasons why we observed a drop in performance at the point of datanode failure. First, when a RonDB datanode fails then all the active transactions by the clients will fail. Subsequent operations keep failing until the clients' backend threads update the cluster topology and start working normally. This may take a second or more if the failure is found through heartbeats, if the node fails in a controlled manner or if the operating system shuts down the process, the cluster topology reconfiguration will only take a few milliseconds.

Second, in this experiment, we have a fixed number of clients (3 clients, one per YCSB node) that are connected to RonDB nodes. Each client that had an active transaction in the failed node will have to wait until the node failure handling has ensured all ongoing transactions were handled, this can take a few seconds in a busy cluster like this. Clients not blocked by node failure handling can proceed immediately after cluster topology reconfiguration, thus after a few milliseconds in the normal case. The drop in performance will thus be lower if there are more clients.

Throughput datanode restart

In the graphs we have shown the end of each of the three main phases, LRS, SS and WLCP.

After losing one node the throughput of the cluster drops from 220 thousand operations per second and varies between 180 and 200 thousand operations per second until the failed node joins the cluster again. The read latency goes up as now there are few nodes to read the data from, that is, before the node failure, a read operation could have been satisfied by one of the three nodes. After failure, we are left with two RonDB datanodes that have to pick up the read workload for the failed node. This puts additional load on the remaining nodes and read latency increases. On the other hand, the write latency goes down as now the data is written to two nodes instead of three datanodes.

When the datanode comes back online, after the completion of stage WLCP, we see that the throughput increases and the read latency go down as now the read operations can be satisfied at one of the three nodes. The write latency goes up to pre-failure levels as now the write operations have to update three replicas instead of two. 

average read latency datanode restartaverage update latency datanode restart

Datanode Initialization

This test deals with hard failures, for example, when a node fails due to hardware failure then the system administrator may start (aka initialize) a new node. Another reason could be that we want to add one more replica to the cluster, thus adding a new VM to the RonDB cluster.

For initial node restart following operations are performed.
Local Recovery Stage (LRS): In this stage modules for node initialization, cluster management, and signal handling are started. The datanode file systems are cleared, the node is assigned a nodeID, cluster connections are set up, inter-block communications are established, and metadata is copied. Finally, the node starts the heartbeat protocol and joins the cluster. This phase also creates the REDO log files, UNDO log files and tablespace files.
Synch Stage (SS): The node synchronizes its data by first pulling the latest data. It rebuilds indexes for the restored data. It then performs a second synchronization bringing it online. It then writes all the data to a local checkpoint.
Wait for LCP Stage (WLCP): Once all the data has been synchronized it waits for local checkpoints to complete. After this phase the newly-joined node takes over responsibility for delivering its primary data to subscribers.

The five different stages as described above are marked with numbers on the graph at times when these stages complete their operations. The following table shows different times taken by each restart stage. 

Table _ Datanode initialization

Similar to the previous experiment, the cluster is operating at maximum load. The following graphs show throughput and latencies for read and update operations when one node in the cluster is re-initialized.  

The RonDB cluster is operating normally for the first 700 seconds.  As before the average throughput of the cluster is around 220 thousand operations per second. Then one of the nodes in the cluster is stopped and it is restarted with the "initial" flag. This wipes the file system of the datanode, and the datanode starts from a clean slate.   The initialization of the datanode is marked with "init" in the graphs. 

The graphs are very similar to the previous experiment and all of the arguments regarding the changes in the throughput and latencies are also valid here. Here, we will highlight only the differences. In LRS although there is no local checkpoint to recover the data, it takes some time to complete as RonDB pre-allocate disk space for REDO log files. By default, RonDB allocates 64 GB of REDO log files.

This takes time as EBS drives in AWS can not sustain high write throughput for a long duration. SS stage takes the most time as the datanode does not have any data and it has to pull all the data from other datanodes.

An initial Node restart has to write a complete checkpoint since it didn’t have any previous LCPs to use as a base. This takes longer, effectively this is performed by first running a “local” LCP in the starting node, and when this is done we participate in a normal distributed LCP. It is the first “local” LCP in the initial node restart that takes the extra time since during this time we have to write the entire database to disk.

Throughtput (initial Datanode Start)Average read latency (initial Datanode Start)Average update latency (initial datanode start)

Conclusion
RonDB can handle a very high load even during node restarts and also when initializing new nodes into the cluster. The impact on latency is fairly small during those recovery operations. Thus RonDB shows its LATS capabilities during high load and recovery situations.

Part 3: RonDB Scalable Storage.

On Disk Data Performance

We will demonstrate how RonDB can handle datasets (tables) that are too large to fit in the cluster memory. RonDB supports scalable Storage, where such large tables can be stored on NVMe drives, and RonDB manages to deliver low latencies and high throughput for data stored on disk(s).

Experiment Setup

The experiments are set up exactly as described in the first two parts. The only difference is that this time we use the NVMe Drives on the RonDB datanodes. In our setup, we configured RonDB to store the YCSB tables on NVMe Drives.

We only use two YCSB benchmarking nodes as we observed in our experiments that we only needed up to two YCSB clients to squeeze maximum performance from RonDB using NVMe Drives. We performed the experiments with two different row sizes, 4 KB and 16 KB size rows. 

Conclusion
RonDB can handle very large data sizes with very large bandwidth at low latency. Compared to in-memory tables we see obviously a limitation in that NVMe drives have higher latency and lower throughput.

At the same time, we show here that the performance and latency are still at an excellent level even for tables storing non-indexed columns on disk(s). In other experiments, we have tested RonDB to scale to at least 30 TByte of disk data per datanode, and at least 2 GByte per second of read and write bandwidth.

Thus with this set of three blogs, we have shown that RonDB can operate under all four quality aspects of LATS (low Latency, high Availability, high Throughput, and scalable Storage) at the same time. This makes RonDB uniquely qualified for use in applications such as Feature Stores, On-line Gaming, On-line Financial applications, networking applications, and all sorts of telecom applications.

Workload A is an update-heavy workload with 50% read operations and 50% Update operations. It uses Zipfian distribution for record selection. This workload represents applications such as session stores that record users' activities during a user session.  

For 4 KB row size, maximum throughput is achieved by one YCSB client using ClusterJ. Further adding YCSB clients did not improve the performance. With JDBC, we observe a throughput of 32.5 thousand operations per second, which only increases by 11%, reaching 36 thousand operations per second using two YCSB clients. For 16 KB both the ClusterJ and JDBC have a similar performance of approximately 26 thousand operations per second.

Workload A Throughput

The following graphs show latency for read and update operations. We see that ClusterJ has lower latency for all the setups with one YCSB client. In some cases the latency for ClusterJ is higher using two YCSB clients, this is because in those cases ClusterJ managed to achieve maximum throughput using one 1 YCSB client and adding more YCSB clients did not increase throughput,  and it only resulted in higher latency for operations.

This clearly shows that performance, in this case, is completely bound to the performance of the NVMe drives. We expect major improvements to this in AWS with the introduction of the i4 VM types (around 60% improvements are expected).

Average Read LatencyA - 95 Percentile Read LatencyA - 99 Percentile Read LatencyA - Avg Update LatencyA - 95 Percentile Update Latency A - 99 Percentile Update Latency

Workload B is a read-heavy workload with 95% read operations and 5% Update operations. It uses Zipfian distribution for record selection. This workload represents applications such as photo tagging. Adding a tag is an update operation, but most operations are read-only.

For both 4 KB and 16 KB, ClusterJ delivers higher throughput than JDBC. For 4 KB row size, we observed a maximum throughput of 109 thousand per second using ClusterJ, while we only managed to achieve a throughput of 94 thousand per second using JDBC. Similarly, for 16 KB row size, ClusterJ had a throughput of 79 thousand operations per second and JDBC had a throughput of 64.7 thousand operations per second.

B - ThroughputB - Avg Read LatencyB - 95 Percentile Read LatencyB - 99 percentile Read LatencyB - Avg update latencyB - 95 Percentile Update LatencyB - 99 Percentile Update Latency

Workload C is a read-heavy workload with 100% read operations. It uses Zipfian distribution for record selection. This workload represents applications such as user profile caching, where profiles are constructed elsewhere (e.g., Hadoop).

For 4 KB row size, we observed maximum throughput of 159.8 thousand per second using ClusterJ, while we only managed to achieve a throughput of 138.6 thousand per second using JDBC. Similarly, for 16 KB row size, ClusterJ had a throughput of 97.8 thousand operations per second and JDBC had a throughput of 102.1 thousand operations per second.

Workload C ThroughputC - Avg Read LatencyC - 95 percentile Read LatencyC - 99 Percentile Read Latency

Workload D is also a read-heavy workload with 95% read operations and 5% insert (PUT) operations. It uses the "Latest" request distribution scheme for record selection. This workload represents applications, such as user status updates; people want to read the latest statuses.

For 4 KB row size, we observed maximum throughput of 125.2 thousand per second using ClusterJ, while we only managed to achieve a throughput of 116.9 thousand per second using JDBC. Similarly, for 16 KB row size, ClusterJ had a throughput of 59.9 thousand operations per second and JDBC had a throughput of 66.1 thousand operations per second.

Workload D ThroughputD -  Avg Read LatencyD - 95 Percentile Read latencyD - 99 Percentile Read LatencyD - Avg Update LatencyD - 95 Percentile Update LatencyD - 99 Percentile Update Latency

Workload E is also a read-heavy workload. However, unlike the previous workloads, it comprises 95% scan operations and 5% insert operations. It uses Zipfian distribution for record selection, and it uses uniform distribution for the length of the scan operations. Default scan length is set to 100. This workload represents application with threaded conversations, where each scan is for the posts in a given thread (assumed to be clustered by thread id)

For 4 KB row size, we observed a maximum throughput of 9.7 thousand per second using ClusterJ and throughput of 10.3 thousand per second using JDBC using one YCSB client. With two YCSB clients, the throughput did not increase, in fact for ClusterJ the performance was reduced by 18% due to additional load caused by two YCSB clients. For 16 KB row size, ClusterJ had a throughput of 4.5 thousand operations per second, and JDBC had a throughput of 5.2 thousand operations per second.

E - ThroughputE - Avg Read LatencyE - 95 Percentile Read latencyE - 99 Percentile Read LatencyE - Avg Update LatencyE - 95 Percentile Update LatencyE - 99 Percentile Update Latency

Workload F is an update-heavy workload with 50% read operations and 50% Update operations. This workload is similar to workload A, which also contains 50% read operations and 50% update operations.

The main difference between the two workloads is that workload A does not guarantee that it updates the same rows returned by the read operations. While workload F only updates the rows that are read from the database. It uses Zipfian distribution for record selection. This workload represents applications such as a user database, where user records are read and modified by the user. 

For the 4 KB row size, we observed maximum throughput of 34.6 thousand per second using ClusterJ and JDBC using two YCSB clients. For 16 KB row size, ClusterJ had a throughput of 22 thousand operations per second, and JDBC had a throughput of 23.8 thousand operations per second.

F - ThroughputF - Avg RMW LatencyF -  95 Percentile RMW LatencyF - 99 Percentile RMW Latency