What if your database could remember everything—every edit, every state, every version—across time, while scaling effortlessly beyond the limits of a single machine?
That’s exactly the kind of challenge we’ve been tackling with Dydra, a revisioned RDF graph store, and RonDB, a high-performance, clustered database built for scale.
At its core, Dydra is a flexible RDF graph storage system. It speaks SPARQL, GraphQL, Linked Data Platform (LDP), and other web-native data protocols. You can use it in the cloud, run it locally, or embed it in your application. Whether you’re querying a personal dataset or building a collaborative data platform, Dydra acts as the semantic backbone, allowing you to reason about your data—not just as projections, but as a versioned timeline.
And that’s where Dydra becomes special.
Dydra doesn’t just store the current state of your graph. It also retains previous store states—fully addressable, versioned snapshots—like a Git for graphs. These are accessible via a REVISION argument in SPARQL queries, or even streamed incrementally using MQTT with a REVISION-WINDOW. This turns the graph into a temporal data structure, enabling collaborative use cases where each client has a consistent, convergent view of the data over time.
Internally, this is implemented by annotating each RDF statement with temporal metadata—think of it as “time-traveling triples.” Combined with smart transfer protocols, Dydra can act like a conflict-free replicated data type (CRDT) for graphs. This makes it suitable for distributed collaboration and even disconnected operation.
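As a minimal sketch of that idea (the field names and encoding are illustrative assumptions, not Dydra's actual storage layout), picture each statement stamped with the revision interval during which it is visible:

#include <cstdint>
#include <limits>
#include <string>

// Hypothetical layout: a quad annotated with the revision that asserted it
// and the revision that retracted it (open-ended while still current).
struct RevisionedQuad {
    std::string subject, predicate, object, graph;
    uint64_t added;                                           // asserting revision
    uint64_t removed = std::numeric_limits<uint64_t>::max();  // retracting revision
};

// A statement belongs to revision r iff r lies in [added, removed).
bool visible_at(const RevisionedQuad& q, uint64_t r) {
    return q.added <= r && r < q.removed;
}

Reconstructing the store as of any revision is then just a filter over these intervals, which, as we will see below, is exactly the operation we push down into the backend.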
In short: Dydra stores the what and the when of your data. But to do so efficiently, it needs a fast, scalable backend.
Dydra’s default backend has long been LMDB, a memory-mapped key–value store. LMDB is elegant, efficient, and impressively fast on single-server setups. On large systems with many cores and multiple non-uniform memory access (NUMA) nodes, it handles billions of triples with ease—our biggest production repository reached 3.25 billion RDF statements.
We will certainly continue to use LMDB as a very reliable storage backend.
But LMDB has limits:
- writes are serialized through a single write transaction at a time
- it is bound to a single machine, with no built-in replication or clustering
- it offers no cloud-native orchestration story
For a distributed service with long-term ambitions (think: trillion-triple graphs), that wasn't going to keep up.
That’s where RonDB enters the picture.
RonDB is not your typical SQL database. Originally derived from MySQL NDB Cluster, it has been heavily optimized into a high-availability, low-latency, and cloud-native DBMS.
It features:
- clustering across multiple data nodes, with data automatically partitioned among them
- synchronous replication for high availability and automatic failover
- low-latency, in-memory row storage
- cloud-native deployment and orchestration
In short, RonDB brings what LMDB lacks: replication, clustering, and cloud-native orchestration.
We’ve already migrated some Dydra repositories to RonDB clusters—both local and remote—and reached the same 3.25B triple milestone, now backed by a clustered system. That’s only the beginning. With RonDB, we’re building toward 100B, 500B, and eventually 1T RDF statement repositories.
This isn’t theoretical. It’s a concrete step toward a trillion-triple store, and we’re just getting started.
A cluster backend like RonDB offers not just more storage and replication—it opens up a deeper possibility: pushing query logic down into the data layer itself.
For us, that meant extending RonDB’s internal interpreter—originally designed for lightweight filtering logic—and teaching it how to execute rich, client-defined logic in parallel on all data nodes.
We started by exposing the full RonDB NDBAPI to Common Lisp. This library, called cl-ndbapi, lets us talk to RonDB clusters directly and efficiently from Lisp, with everything that the NDBAPI offers. This gave us the low-level hooks we needed to generate and compile interpreted programs from Lisp code.
In other words, we built a cross-language compiler that translates Lisp functions into RonDB’s interpreted code language. The program represented by the generated code is deployed onto the data nodes and executed close to the data, massively reducing the volume of intermediate results transferred to the application.
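The NDB API exposes this interpreter through its NdbInterpretedCode class. As a rough C++ sketch of the kind of filter such a compiler might emit for one fixed revision (the column ids, register numbers, and interval layout are assumptions for illustration, not our actual generated code):

#include <NdbApi.hpp>

// Illustrative only: keep rows whose [added, removed) revision interval
// contains the requested revision. The code object must be constructed
// against the target table so that attribute reads can be resolved.
const Uint32 ADDED_COL = 4, REMOVED_COL = 5;  // assumed column ids

int build_revision_filter(NdbInterpretedCode& code, Uint32 revision) {
    code.load_const_u32(1, revision);  // register 1 := requested revision
    code.read_attr(2, ADDED_COL);      // register 2 := row's "added" revision
    code.read_attr(3, REMOVED_COL);    // register 3 := row's "removed" revision
    code.branch_gt(2, 1, 0);           // added > revision: reject at label 0
    code.branch_le(3, 1, 0);           // removed <= revision: reject at label 0
    code.interpret_exit_ok();          // interval contains revision: keep row
    code.def_label(0);
    code.interpret_exit_nok();         // drop the row on the data node
    return code.finalise();
}

In our pipeline the equivalent of this function is, of course, generated from Lisp via cl-ndbapi rather than written by hand.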
Instead of pulling millions or even billions of candidate rows to the client to figure out which statements were valid at a certain revision, we now resolve the revision logic inside RonDB.
That means:
- far fewer intermediate rows cross the network
- revision checks run in parallel on every data node
- the client receives only statements that are valid at the requested revision
Let’s look at a SPARQL query that collects the prevalence of descriptors across thesaurus revisions in a dataset of the Standard Thesaurus for Economics (STW):
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix zbwext: <http://zbw.eu/namespaces/zbw-extensions/>
#
# Show the number of versions in which a descriptor is present
#
select ?prevalence (count(?prevalence) as ?frequency)
where {
  select ?s (count(?r) as ?prevalence)
  where {
    revision ?r { ?s a zbwext:Descriptor . }
  }
  group by ?s
}
group by ?prevalence
order by desc(?frequency)
This query makes use of the REVISION clause, which is analogous to a GRAPH or SERVICE clause in SPARQL and which was introduced in our industry paper "Transaction-Time Queries in Dydra" by James Anderson and Arto Bendiken, presented at the MEPDaW workshop in 2016. The REVISION clause thus lets you query revisions with the same ease with which you query graphs or formulate federated queries.
Traditionally, something like this would require acquiring the full set of candidate statements and filtering them post-hoc by validity intervals. But with revision filtering now inlined inside RonDB, each data node returns only the statements already filtered by temporal logic. We use RonDB's internal interpreter to evaluate revision intervals on each partition in parallel.
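Concretely, such a program is attached to the scan itself, so every partition evaluates it locally. A sketch using the NDB API's scan options (illustrative; error handling omitted):

#include <NdbApi.hpp>

// Illustrative: run a table scan with an interpreted filter so each data
// node evaluates the revision logic and ships back only matching rows.
NdbScanOperation* scan_with_filter(NdbTransaction* trans,
                                   const NdbRecord* result_record,
                                   const NdbInterpretedCode* filter) {
    NdbScanOperation::ScanOptions options;
    options.optionsPresent = NdbScanOperation::ScanOptions::SO_INTERPRETED;
    options.interpretedCode = filter;  // evaluated per row on the data nodes
    return trans->scanTable(result_record, NdbOperation::LM_CommittedRead,
                            nullptr,   // default column mask
                            &options, sizeof(options));
}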
Result? The query is faster, and the system uses less memory and bandwidth.
And since the programs can now take arguments, we don’t even need to generate a new interpreted program for each revision—just reuse an interpreted program with different parameters.
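As a toy model of that reuse (deliberately not RonDB's actual parameter mechanism), think of the compiled filter as a closed function that reads its revision from a parameter block supplied at execution time:

#include <cstdint>
#include <functional>
#include <vector>

// Toy model: the filter is compiled once; only the parameters change.
using Params = std::vector<uint64_t>;
struct Row { uint64_t added, removed; };
using Program = std::function<bool(const Row&, const Params&)>;

// "Compile" once: the program reads the revision from parameter slot 0.
Program compile_revision_filter() {
    return [](const Row& row, const Params& p) {
        return row.added <= p.at(0) && p.at(0) < row.removed;
    };
}

int main() {
    Program filter = compile_revision_filter();  // generated once
    Row r{10, 42};                               // visible in [10, 42)
    bool at_12 = filter(r, Params{12});          // true: revision 12 sees it
    bool at_42 = filter(r, Params{42});          // false: retracted at 42
    return (at_12 && !at_42) ? 0 : 1;
}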
To support this tight integration, RonDB was extended in several key ways through collaboration between the teams behind RonDB and Dydra:
- the internal interpreter, originally limited to lightweight filtering, now executes richer, client-defined programs
- interpreted programs can take arguments, so a single program can be reused across revisions
- programs run in parallel on all data nodes and return only the rows that pass the revision logic
These extensions turned RonDB from a passive store into an active execution platform—capable of hosting revision-aware SPARQL logic, running it in parallel across data nodes, and returning only what’s truly needed.
By pushing the revision filtering logic into RonDB’s execution layer, we offload expensive temporal reasoning from the client. This avoids loading massive historical datasets into memory and lets Dydra behave more like a streaming, temporal graph engine—while scaling toward hundreds of billions of triples.
Dydra gains:
- faster revision-aware queries
- lower memory and bandwidth consumption
- the behavior of a streaming, temporal graph engine
- a clear path toward repositories with hundreds of billions of triples
This is just the first phase. With RonDB’s parallelism and replication, and Dydra’s expressive revision model, we’re building a database that not only scales across machines, but also across time.
Next up, of course: a trillion triples.
This article was written by the team at Dydra, with support from Hopsworks.