If you’ve followed distributed databases for any length of time, you’ve probably read a Jepsen analysis. If you’ve read one, you know the feeling: a database vendor claims linearizability, Kyle Kingsbury introduces some network partitions, and a few weeks later we all learn what the database actually does under failure.
That feeling is the reason we wrote 34 Jepsen tests for ArcadeDB. We wanted to know what we actually do under failure, before we ask anyone else to trust us.
Today we’re publishing the full test suite, the methodology, and the results.
First, the disclaimer. This is not an official Jepsen analysis. Jepsen LLC did not commission, run, review, or certify these tests. We wrote them in-house using the open-source Jepsen framework (the same framework Kyle uses for his official analyses), but the design, execution, and results are entirely ours. We’re publishing everything so the community can scrutinize the methodology, and we’d genuinely love a real analysis from Jepsen LLC one day. Hi Kyle, if you’re reading this, please tear it apart.
Summary
- Database under test: ArcadeDB on the `apache-ratis` branch, with high availability built on Apache Ratis (Raft consensus).
- Cluster: 5 Debian nodes in Docker, controlled by a Jepsen 0.3.11 control node.
- Workloads (6): `bank`, `set`, `elle`, `register`, `register-follower`, `register-bookmark`.
- Faults (7 nemeses): `none`, `partition`, `kill`, `pause`, `clock`, `all`, `all+clock`.
- Total runs: 34 (20 leader-workload runs + 14 follower-workload runs).
- Result: 34 / 34 PASS. Zero linearizability violations, zero lost writes, zero ACID anomalies.
- Source code: github.com/ArcadeData/arcadedb-jepsen (Apache 2.0).
- Caveat: This is in-house testing, not a Jepsen LLC certification. Independent review welcome.
What is Jepsen?
Jepsen is the gold-standard open-source framework for testing distributed systems. Created by Kyle Kingsbury (better known as aphyr), it became famous through the Call Me Maybe blog series, which methodically dismantled the consistency claims of databases like MongoDB, Redis, Cassandra, Elasticsearch, and many others.
What makes Jepsen special isn’t just the fault injection (network partitions via iptables, process kills with SIGKILL, GC-style pauses with SIGSTOP/SIGCONT, clock skew via date -s). It’s the checkers:
- Knossos: a linearizability checker that takes the history of operations and tries to find a serial ordering consistent with each client’s observed responses. If no such ordering exists, your “linearizable” register isn’t.
- Elle: a black-box transaction-isolation checker that builds a dependency graph from the transaction history and looks for cycles. Cycles map to specific anomalies: G0 (dirty write), G1a (aborted read), G1b (intermediate read), G1c (circular information flow), G2 (anti-dependency cycle), and lost updates.
You can’t bluff your way past either of them. They either find a counterexample, or they certify the history.
What we tested
The tests run against the ArcadeDB apache-ratis branch, where high availability is implemented on top of Apache Ratis (the production-grade Raft library that also powers Apache Ozone). The cluster is 5 Debian nodes in Docker, plus a control node running Leiningen and Jepsen 0.3.11. Each test gets a fresh cluster to eliminate cross-test contamination.
Six workloads
| Workload | What it checks | Checker |
|---|---|---|
| bank | ACID balance conservation across 5 accounts during concurrent transfers | Custom conservation invariant |
| set | No acknowledged write is ever lost during replication | Custom set checker |
| elle | Transaction isolation: G0, G1a, G1b, G2, lost updates | Elle |
| register | Linearizability of single-key read/write/CAS, leader reads | Knossos |
| register-follower | Linearizability when reads are routed to a non-leader (ReadIndex path) | Knossos |
| register-bookmark | Read-your-writes via commit-index bookmarks on follower reads | Knossos |
Seven nemeses
| Nemesis | Description |
|---|---|
| none | Baseline, no faults |
| partition | Random network partitions via iptables |
| kill | SIGKILL random nodes (crash) |
| pause | SIGSTOP/SIGCONT random nodes (long GC pause) |
| clock | Random ±60s clock shifts via date -s |
| all | partition + kill + pause concurrently |
| all+clock | all + clock skew |
The leader workloads run against 5 nemeses (we omit clock and all+clock because leader-only reads aren’t sensitive to follower clock drift). The follower workloads run the full 7. That’s 20 + 14 = 34 tests.
The Results
Behind every green check is a 90-second run (30 seconds for the most expensive Knossos workloads) of concurrent client operations against the cluster while the chosen nemesis hammers the nodes. Then the checker takes the recorded history and either says :valid? true or hands you a counterexample.
The Faults, Visually
The interesting Jepsen tests aren't the none baseline; they're what happens while the cluster is actively being broken. Here's what we throw at the 5-node cluster: partitions, process kills, pauses, and clock skew, with the all and all+clock nemeses applying them concurrently.

What Each Workload Actually Proves
Passing 34 tests sounds nice in a header, but each workload is asking a specific question. Here’s what we’re actually claiming.
bank: ACID under partitions
Five accounts, 1000 each, total 5000. Concurrent clients transfer random amounts between random pairs of accounts inside multi-statement transactions. After every operation the checker sums the balances. The total must always equal 5000. If a transfer is partially applied (debit succeeds, credit fails, or vice versa), the sum drifts and the test fails. Under partitions, kills, pauses, and the combined all nemesis: conservation holds.
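The invariant itself fits in a few lines. Here is a minimal sketch in Java of what the checker asserts after each read of all five balances (the actual checker in the repo is written in Clojure; this just shows the shape of the assertion):

```java
import java.util.Map;

public class BankInvariant {
    static final long EXPECTED_TOTAL = 5 * 1000L; // 5 accounts, 1000 each

    // Called after every read of all balances; a partially applied transfer
    // (debit without credit, or vice versa) makes the sum drift and fails here.
    static void checkConservation(Map<String, Long> balances) {
        long total = balances.values().stream().mapToLong(Long::longValue).sum();
        if (total != EXPECTED_TOTAL) {
            throw new AssertionError("conservation violated: expected "
                    + EXPECTED_TOTAL + ", read " + total + " from " + balances);
        }
    }
}
```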
set: no acknowledged write is lost
Insert unique integers, periodically read them all back. Every integer for which the server returned a successful write must appear in subsequent reads. This is the cleanest test for replication completeness: it doesn’t matter how the cluster reorders things, only that nothing acknowledged is silently dropped. Zero lost writes across all five nemeses.
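In code, the property reduces to a set difference: acknowledged minus observed must be empty. Again a small Java sketch standing in for the real Clojure checker:

```java
import java.util.HashSet;
import java.util.Set;

public class SetChecker {
    // Elements the server acknowledged but that never appeared in the final
    // read. An empty result means no acknowledged write was lost.
    static Set<Integer> lostWrites(Set<Integer> acknowledged, Set<Integer> finalRead) {
        Set<Integer> lost = new HashSet<>(acknowledged);
        lost.removeAll(finalRead);
        return lost;
    }
}
```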
elle: real transaction isolation, checked by cycles
This is where we throw multi-key read/write transactions at the cluster and let Elle build the dependency graph. Elle then looks for cycles that correspond to specific anomalies: G0 (dirty write), G1a (read of an aborted write), G1b (read of an intermediate value), G2 (anti-dependency cycle), and lost updates. We exclude G1c because, in our HTTP-based harness, reads after commit happen as separate calls; that creates a test-implementation pattern that Elle correctly flags as a “circular information flow” but which doesn’t reflect a real isolation violation. Every other anomaly class: none observed.
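To make one of those anomaly classes concrete, here is what a lost update looks like: two transactions read the same value, each writes back an increment, and one increment silently disappears. The snippet below is plain in-memory Java with nothing ArcadeDB-specific in it; it only illustrates the interleaving Elle would flag if the database ever produced it:

```java
public class LostUpdateExample {
    public static void main(String[] args) {
        long x = 100;

        // T1 and T2 both read x = 100 inside their own transactions.
        long readByT1 = x;
        long readByT2 = x;

        // Each writes back its read plus 10. T2 overwrites T1's update.
        x = readByT1 + 10;   // x = 110
        x = readByT2 + 10;   // still 110: T1's increment is lost

        // Any serial order of the two transactions would end at 120.
        System.out.println("final x = " + x + " (a serial execution gives 120)");
    }
}
```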
register: leader-side linearizability
A single integer, hammered with concurrent reads, writes, and compare-and-swap operations, all routed to the Raft leader. Knossos then attempts to find a serial ordering of those operations consistent with each client's observed responses. Knossos is brutal: it'll happily spend minutes searching, and if your "linearizable" register isn't, it'll hand you the exact interleaving that can't be serialized. All five leader-side nemeses certified linearizable.
register-follower: linearizability when reads go to a follower
Writes still go to the leader, but reads are deliberately routed to a non-leader with the X-ArcadeDB-Read-Consistency: LINEARIZABLE header. This exercises the ReadIndex path on followers (RaftHAServer.ensureLinearizableFollowerRead()): the follower issues sendReadOnly() to the leader, the leader confirms it still holds quorum and returns its current commit index, the follower waits for its local state machine to catch up, then serves the read. Without that round-trip, a lagging follower would serve stale data and Knossos would catch it instantly. With it: linearizable across all 7 nemeses, including clock skew and all+clock.
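A minimal sketch of that round-trip, assuming Ratis's RaftClient API and two hypothetical helpers (awaitApplied() and read()) standing in for the follower's local state machine. The actual logic in RaftHAServer.ensureLinearizableFollowerRead() differs in detail, and treating the reply's getLogIndex() as the leader's confirmed commit index is a simplification:

```java
import org.apache.ratis.client.RaftClient;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;

public class FollowerReadSketch {

    // Hypothetical view of the follower's local state machine.
    interface LocalState {
        void awaitApplied(long index) throws InterruptedException; // block until applied >= index
        String read(String key);                                   // read from local storage
    }

    // Sketch of the ReadIndex path: ask the leader for a quorum-confirmed
    // index, wait until this follower has applied it, then read locally.
    static String linearizableRead(RaftClient client, LocalState local, String key)
            throws Exception {
        RaftClientReply reply = client.io().sendReadOnly(Message.EMPTY); // leader confirms quorum
        long readIndex = reply.getLogIndex();                            // index the leader answered at
        local.awaitApplied(readIndex);                                   // catch up locally
        return local.read(key);                                          // cannot be stale
    }
}
```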
register-bookmark: read-your-writes via commit-index bookmarks
Same follower-read setup, but instead of a full ReadIndex round-trip on every read, the client captures X-ArcadeDB-Commit-Index from each write response and echoes it back as X-ArcadeDB-Read-After on subsequent reads. The follower waits for its local apply to reach that index before serving. This is cheaper than ReadIndex but only guarantees read-your-writes for the issuing client, not global linearizability across clients. All 7 nemeses pass.
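From the client's side the bookmark handshake looks roughly like this. A sketch with java.net.http: the endpoint path, database name, SQL, and omitted authentication are placeholders, while the two headers are the ones described above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BookmarkReadExample {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // 1. Write through the leader and capture the commit-index bookmark.
        HttpRequest write = HttpRequest
                .newBuilder(URI.create("http://leader:2480/api/v1/command/mydb"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"language\":\"sql\",\"command\":\"UPDATE Register SET value = 42 WHERE id = 0\"}"))
                .build();
        HttpResponse<String> writeResp = http.send(write, HttpResponse.BodyHandlers.ofString());
        String bookmark = writeResp.headers().firstValue("X-ArcadeDB-Commit-Index").orElseThrow();

        // 2. Read from a follower, echoing the bookmark. The follower holds the
        //    request until its local apply index reaches the bookmark, so this
        //    client is guaranteed to see its own write.
        HttpRequest read = HttpRequest
                .newBuilder(URI.create("http://follower:2480/api/v1/command/mydb"))
                .header("Content-Type", "application/json")
                .header("X-ArcadeDB-Read-After", bookmark)
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"language\":\"sql\",\"command\":\"SELECT value FROM Register WHERE id = 0\"}"))
                .build();
        System.out.println(http.send(read, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```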
The two follower modes matter because most real applications don't need global linearizability; they need their own writes to be visible to their own subsequent reads. The bookmark path gives that property at much lower cost than ReadIndex.
How read consistency works in ArcadeDB
The follower-read tests are the most novel piece, and they map directly to a configurable knob in the database:
| Level | Performance | Consistency | Use case |
|---|---|---|---|
| eventual | Fastest | May read stale data on followers | Analytics, dashboards |
| read_your_writes (default) | Fast | Leader reads from local DB; followers wait for client's last write | Most OLTP workloads |
| linearizable | +1 RTT when lease expired | Full linearizability even under process pauses | Financial transactions, coordination |
You set it globally via arcadedb.ha.readConsistency or per request via the X-ArcadeDB-Read-Consistency HTTP header. The Jepsen runs use linearizable for the follower workloads (the most demanding setting) and the default read_your_writes for the leader workloads.
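For example, a single request can opt in to the strongest level while the rest of the application keeps the default. A sketch with java.net.http (endpoint, query, and omitted authentication are placeholders; the header is the real one):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LinearizableReadRequest {
    public static void main(String[] args) throws Exception {
        // Only this read opts in to the linearizable path; everything else in
        // the application keeps the global arcadedb.ha.readConsistency default.
        HttpRequest read = HttpRequest
                .newBuilder(URI.create("http://any-node:2480/api/v1/command/mydb"))
                .header("Content-Type", "application/json")
                .header("X-ArcadeDB-Read-Consistency", "LINEARIZABLE")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"language\":\"sql\",\"command\":\"SELECT balance FROM Account WHERE id = 1\"}"))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(read, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
    }
}
```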
In linearizable mode, the leader checks its Raft lease before every read via Ratis’s sendReadOnly() API (Section 6.4 of the Raft paper). When the lease is valid (the common case), this is a local timestamp check with no network round-trip. When the lease has expired (e.g., after a long VM suspend or extreme GC pause), Ratis sends heartbeats to a majority before serving the read. About 1 extra RTT in the worst case, which is exactly the cost you’d expect for a correctness guarantee under arbitrary process pauses.
Reproduce it yourself
The full test suite is open source and Apache 2.0 licensed: github.com/ArcadeData/arcadedb-jepsen.
The repository includes the Docker setup, all six workloads, the nemesis implementations, and the run-all-tests.sh script that reproduces the entire 34-test sweep on your own hardware. A full sweep takes about 60 minutes on a modern laptop.
```bash
git clone https://github.com/ArcadeData/arcadedb-jepsen
cd arcadedb-jepsen
./build-local.sh /path/to/your/arcadedb
cd docker && docker compose up -d
docker exec jepsen-control sh /jepsen/docker/setup-ssh.sh
./run-all-tests.sh 90
```
Inspect the recorded histories, the Knossos and Elle outputs, the timeline plots: everything Jepsen produces is in store/ after each run.
What we did not test
Honest disclosure matters more than the green checkmarks, so here’s what these 34 tests do not cover:
- Long-duration runs. Each nemesis combination ran on the order of minutes, not hours. Slow-burn anomalies (memory leaks, file-handle exhaustion, Raft log compaction edge cases that only surface after millions of entries) are out of scope.
- Disk corruption, fsync lying, and Byzantine faults. We assume the kernel honors `fsync()` and that nodes are non-malicious. We do not inject bit-flips, truncate WAL files, or simulate filesystems that ack writes without persisting.
- Geo-replication scenarios. All five nodes live in the same Docker network with single-digit-millisecond latencies. We have not tested cross-region links, asymmetric latency, or sustained high jitter.
- Compounded worst-case for follower reads. We exercised expired Raft lease, clock skew, and partitions individually (and clock + partition + kill + pause together via `all+clock`), but we did not run the specific stack of expired lease + clock skew + active partition simultaneously against the linearizable follower-read path.
Some of these (longer runs, Byzantine fsync, geo-replication) are on the roadmap. Others (true Byzantine resilience) are explicitly out of scope for a CFT (crash-fault-tolerant) Raft system. If you think any of these should be in the next pass, open an issue or send a PR.
Help us break it
We’re publishing this for two reasons.
One: we want the upcoming Ratis-based HA release to be the most thoroughly tested HA stack ArcadeDB has ever shipped. Internal tests pass; that’s the floor, not the ceiling.
Two: we’d love independent scrutiny. We’re open to PRs that add workloads, tighter checkers, more aggressive nemeses, or just better failure modes we haven’t thought of. If you find a real linearizability violation, a lost write, or an isolation anomaly, please open an issue. And Kyle, if you ever want to run a real Jepsen analysis on ArcadeDB, our doors are wide open. We’d love to read it. Even if (especially if) it turns up things our in-house tests missed.
Until then: 34 tests in, 34 tests passed, every line of the framework and every line of the test suite open for your inspection.
Further reading
- ArcadeDB Client-Server architecture and HA cluster
- ArcadeDB use cases: graph, document, key-value, search, vector, time-series in one engine
- Neo4j alternatives in 2026
- GraphBatch: up to 8x faster graph ingestion
- Apache Ratis - the Raft library powering ArcadeDB HA
- Raft consensus paper (Ongaro & Ousterhout)