ArcadeDB

Building Memory for LLM Agents with ArcadeDB

2026-07-30T00:00:00+00:00

An agent that cannot remember is a chatbot with extra steps. This post shows how to build persistent memory for an LLM agent on a knowledge graph: storing episodes and the entities they mention, keeping facts current without losing their history, and recalling context with vector similarity and graph traversal in one query.

The core examples run in-process after pip install arcadedb-embedded, with no server and no separate vector database. Only the LangChain integration further down needs a running server.

What Agent Memory Actually Is

“Memory” gets used loosely. In practice an agent needs three different things, and only two of them belong in a database.

Working memory is the context window: the current conversation, tool outputs, and scratchpad. It lives in the prompt and disappears when the request ends.

Episodic memory is what happened. Turn by turn, session by session: the user asked X, the agent called tool Y, the result was Z. It is append-only and inherently temporal.

Semantic memory is what the agent has learned. Durable facts about entities: this user prefers Postgres, this project depends on that service, this customer is on the enterprise plan. Facts change, and the changes matter.

The common mistake is to implement all three as a single vector index over conversation transcripts. That works for the first demo and degrades quickly.

Why a Vector Store Alone Is Not Enough

Embedding every turn and retrieving the top-k nearest is a reasonable baseline. It fails in four specific ways, and all four are relationship problems.

Superseded facts. In March the user said they work at Acme; in June they said they work at Globex. Both turns are in the index, both are semantically similar to “where does the user work”, and the retriever has no notion that one replaced the other, so the agent can confidently return the old employer. A vector index has no concept of validity.

Entity identity. “Ana”, “Ana Ruiz”, and “a.ruiz@example.com” are the same person. Cosine similarity does not know that. Without entity resolution, memory fragments across aliases and recall silently degrades as the number of sessions grows.

Structural questions. “What has this user asked about the billing service?” is a filter over a relationship, not a similarity search. A vector index can only approximate it, and approximation is exactly the wrong tool when the answer is deterministic.

Multi-hop recall. “Which of the user’s projects depends on the library that just broke?” requires traversing project to dependency to incident. No single stored turn contains that chain.

A knowledge graph answers all four directly, because it stores the relationships instead of hoping similarity implies them. What it is not good at on its own is fuzzy recall, which is why the useful design keeps both, in one engine.
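
To make the multi-hop case concrete, here is a minimal in-memory sketch in plain Python, with no database involved. The edge labels (OWNS, DEPENDS_ON, AFFECTED_BY) and the sample identifiers are hypothetical, chosen only for illustration. The point is that the answer comes from following typed edges, not from any single stored turn.

```python
from collections import defaultdict

# Hypothetical toy graph: (source, edge_label, target) triples.
EDGES = [
    ("user:ana", "OWNS", "project:checkout"),
    ("user:ana", "OWNS", "project:reports"),
    ("project:checkout", "DEPENDS_ON", "lib:payments-sdk"),
    ("project:reports", "DEPENDS_ON", "lib:charts"),
    ("lib:payments-sdk", "AFFECTED_BY", "incident:42"),
]

def build_index(edges):
    """Index outgoing edges by (source, label) for direct lookups."""
    out = defaultdict(list)
    for src, label, dst in edges:
        out[(src, label)].append(dst)
    return out

def projects_hit_by(out, user, incident):
    """Follow user -> project -> library -> incident; keep matching projects."""
    hit = []
    for project in out[(user, "OWNS")]:
        for lib in out[(project, "DEPENDS_ON")]:
            if incident in out[(lib, "AFFECTED_BY")]:
                hit.append(project)
                break  # one affected library is enough for this project
    return hit

out = build_index(EDGES)
print(projects_hit_by(out, "user:ana", "incident:42"))
```

No individual triple mentions both the user and the incident; the result only exists once the three hops are joined, which is the part a top-k similarity search cannot do.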

The Schema

Three vertex types and three edge types are enough to start.

import arcadedb_embedded as arcadedb

DIM = 1536  # match your embedding model

db_ctx = arcadedb.create_database("./agent-memory")

with db_ctx as db:
    with db.transaction():
        # An episode is one thing that happened
        db.command("sql", "CREATE VERTEX TYPE Episode")
        db.command("sql", "CREATE PROPERTY Episode.content STRING")
        db.command("sql", "CREATE PROPERTY Episode.session STRING")
        db.command("sql", "CREATE PROPERTY Episode.ts DATETIME")
        db.command("sql", "CREATE PROPERTY Episode.embedding LIST OF FLOAT")

        # Entities the agent has learned about
        db.command("sql", "CREATE VERTEX TYPE Entity")
        db.command("sql", "CREATE PROPERTY Entity.name STRING")
        db.command("sql", "CREATE PROPERTY Entity.kind STRING")
        db.command("sql", "CREATE INDEX ON Entity (name) UNIQUE")

        # A durable fact, valid over an interval
        db.command("sql", "CREATE VERTEX TYPE Fact")
        db.command("sql", "CREATE PROPERTY Fact.predicate STRING")
        db.command("sql", "CREATE PROPERTY Fact.value STRING")
        db.command("sql", "CREATE PROPERTY Fact.valid_from DATETIME")
        db.command("sql", "CREATE PROPERTY Fact.valid_to DATETIME")

        db.command("sql", "CREATE EDGE TYPE MENTIONS")   # Episode -> Entity
        db.command("sql", "CREATE EDGE TYPE ASSERTS")    # Episode -> Fact
        db.command("sql", "CREATE EDGE TYPE ABOUT")      # Fact    -> Entity

        # HNSW index for semantic recall over episodes
        db.command("sql",
            f"CREATE INDEX ON Episode (embedding) LSM_VECTOR "
            f"METADATA {{ dimensions: {DIM}, similarity: 'COSINE' }}")

The separation matters: an Episode is immutable and says what was observed, a Fact is the interpretation, and the ASSERTS edge records which episode produced it. When the agent later gets something wrong, you can trace the belief back to the turn that caused it.

Writing a Turn to Memory

Each turn does three things: store the episode with its embedding, resolve the entities it mentions, and assert or update any facts.

from datetime import datetime, timezone

def remember(db, session, text, entities, facts):
    """Persist one turn: episode, entities, and any facts it asserts."""
    now = datetime.now(timezone.utc).isoformat()

    with db.transaction():
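        # embed() is not defined in this post: supply your own function that
        # returns a list of DIM floats for a string.
        db.command("sql",
            "CREATE VERTEX Episode SET content = ?, session = ?, "
            "ts = ?, embedding = ?",
            text, session, now, embed(text))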

        for name, kind in entities:
            # UPSERT gives one vertex per distinct entity name
            db.command("sql",
                "UPDATE Entity SET name = ?, kind = ? UPSERT WHERE name = ?",
                name, kind, name)
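            # The episode is looked up by content and session, so identical
            # text repeated within one session would match more than one row.
            db.command("sql",
                "CREATE EDGE MENTIONS "
                "FROM (SELECT FROM Episode WHERE content = ? AND session = ?) "
                "TO (SELECT FROM Entity WHERE name = ?)",
                text, session, name)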

        for subject, predicate, value in facts:
            assert_fact(db, subject, predicate, value, now, source=text,
                        session=session)
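
The UPSERT above keys on the exact name, so “Ana” and “Ana Ruiz” still become two vertices unless aliases are resolved before the call. Below is a minimal sketch of that step, assuming a hand-maintained alias table. ALIASES, canonical_name, and resolve_entities are hypothetical names introduced here, not part of any library, and a production system would more likely use a dedicated entity-resolution step.

```python
# Hypothetical alias table: lower-cased alias -> canonical entity name.
ALIASES = {
    "ana": "Ana Ruiz",
    "ana ruiz": "Ana Ruiz",
    "a.ruiz@example.com": "Ana Ruiz",
}

def canonical_name(name, aliases=ALIASES):
    """Map a mention to its canonical name; unknown mentions pass through trimmed."""
    cleaned = " ".join(name.split())  # collapse runs of whitespace
    return aliases.get(cleaned.lower(), cleaned)

def resolve_entities(entities, aliases=ALIASES):
    """Canonicalise (name, kind) pairs and drop duplicates, preserving order."""
    seen = set()
    resolved = []
    for name, kind in entities:
        canon = canonical_name(name, aliases)
        if canon not in seen:
            seen.add(canon)
            resolved.append((canon, kind))
    return resolved

mentions = [("Ana", "person"), ("a.ruiz@example.com", "person"), ("Globex", "org")]
print(resolve_entities(mentions))
```

Passing resolve_entities(entities) to remember instead of the raw list means the UPSERT sees one name per real-world entity.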

Keeping Facts Current Without Losing History

This is the part a vector store cannot do at all. When a new value arrives for a predicate that already has one, do not overwrite it: close the old fact and open a new one.

def assert_fact(db, subject, predicate, value, now, source, session):
    """Close any currently-valid fact for this predicate, then open a new one."""
    # 1. Invalidate the previous value, if any
    db.command("sql",
        "UPDATE Fact SET valid_to = ? "
        "WHERE predicate = ? AND valid_to IS NULL "
        "AND @rid IN (SELECT in('ABOUT').@rid FROM Entity WHERE name = ?)",
        now, predicate, subject)

    # 2. Record the new value
    db.command("sql",
        "CREATE VERTEX Fact SET predicate = ?, value = ?, valid_from = ?",
        predicate, value, now)
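    # The lookup below matches on predicate and value only, so an open fact
    # with the same pair about a different entity would also match.
    db.command("sql",
        "CREATE EDGE ABOUT "
        "FROM (SELECT FROM Fact WHERE predicate = ? AND value = ? "
        "      AND valid_to IS NULL) "
        "TO (SELECT FROM Entity WHERE name = ?)",
        predicate, value, subject)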
    db.command("sql",
        "CREATE EDGE ASSERTS "
        "FROM (SELECT FROM Episode WHERE content = ? AND session = ?) "
        "TO (SELECT FROM Fact WHERE predicate = ? AND value = ? "
        "    AND valid_to IS NULL)",
        source, session, predicate, value)

Now “where does the user work?” has one right answer, and “where did the user work in April?” is still answerable. Nothing was destroyed.
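
The close-then-open rule can be checked without a database. The sketch below is an in-memory stand-in that mirrors the valid_from and valid_to semantics of the Fact vertex. FactLog and its methods are hypothetical names, and the dates are example values echoing the Acme and Globex scenario above.

```python
from datetime import datetime, timezone

class FactLog:
    """In-memory stand-in for Fact vertices with validity intervals."""

    def __init__(self):
        self.facts = []  # dicts: subject, predicate, value, valid_from, valid_to

    def assert_fact(self, subject, predicate, value, now):
        # Close the currently open fact for this subject and predicate, if any.
        for f in self.facts:
            if (f["subject"], f["predicate"]) == (subject, predicate) \
                    and f["valid_to"] is None:
                f["valid_to"] = now
        # Open the new fact; valid_to stays None while it is current.
        self.facts.append({"subject": subject, "predicate": predicate,
                           "value": value, "valid_from": now, "valid_to": None})

    def current(self, subject, predicate):
        """Value whose interval is still open, or None."""
        for f in self.facts:
            if (f["subject"], f["predicate"]) == (subject, predicate) \
                    and f["valid_to"] is None:
                return f["value"]
        return None

    def as_of(self, subject, predicate, when):
        """Value whose interval [valid_from, valid_to) contains `when`, or None."""
        for f in self.facts:
            if (f["subject"], f["predicate"]) != (subject, predicate):
                continue
            if f["valid_from"] <= when and (f["valid_to"] is None
                                            or when < f["valid_to"]):
                return f["value"]
        return None

def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)

log = FactLog()
log.assert_fact("user", "works_at", "Acme", utc(2026, 3, 1))
log.assert_fact("user", "works_at", "Globex", utc(2026, 6, 1))

print(log.current("user", "works_at"))                  # the open fact
print(log.as_of("user", "works_at", utc(2026, 4, 15)))  # the fact valid in April
```

Both rows remain in the log after the second call; only the first row's valid_to changed. That is what keeps the April question answerable.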

Recall: Similarity, Then Structure

Retrieval runs in two layers. Semantic recall finds episodes that resemble the question; structural recall pulls the facts that are currently true.

def recall(db, question, k=5):
    q = embed(question)

    # 1. Episodes that resemble the question, plus what they mention
    episodes = db.query("sql",
        "SELECT content, ts, distance, out('MENTIONS').name AS entities "
        "FROM ( SELECT expand(vector.neighbors('Episode[embedding]', ?, ?)) )",
        q, k)

    # 2. Facts that are currently true about those entities
    facts = db.query("sql",
        "SELECT predicate, value, out('ABOUT').name AS subject "
        "FROM Fact WHERE valid_to IS NULL "
        "AND out('ABOUT').name IN ?",
        [e for row in episodes for e in (row.get("entities") or [])])

    return episodes, facts

The second query is the one that makes the agent reliable. It returns only facts whose valid_to is null, so superseded beliefs cannot leak back into the prompt no matter how semantically similar the old episode was.
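
One detail in recall worth tightening is the entity list handed to the second query: the inline comprehension can contain the same name several times, and it may be empty. The helper below is a small sketch of that flattening step. entity_names is a hypothetical name, and the rows here are plain dicts standing in for result rows; the only assumption is that each row supports .get("entities").

```python
def entity_names(rows):
    """Collect distinct entity names from recall rows, preserving first-seen order."""
    seen = set()
    names = []
    for row in rows:
        # A row with no MENTIONS edges may carry None or an empty list.
        for name in (row.get("entities") or []):
            if name not in seen:
                seen.add(name)
                names.append(name)
    return names

# Plain dicts standing in for result rows.
rows = [
    {"content": "Ana moved to Globex", "entities": ["Ana Ruiz", "Globex"]},
    {"content": "thanks!", "entities": None},
    {"content": "Globex uses the billing service", "entities": ["Globex", "billing"]},
]
print(entity_names(rows))
```

If the returned list is empty, there is nothing to look up, so the facts query can be skipped rather than run with an empty IN list.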

Wiring It Into a Framework

If you already drive your agent through a framework, you do not need to hand-roll the retrieval layer.

LangChain. The official langchain-arcadedb package provides ArcadeDBGraph, which implements LangChain’s GraphStore protocol and connects over ArcadeDB’s Bolt protocol.

from langchain_arcadedb import ArcadeDBGraph

graph = ArcadeDBGraph(
    url="bolt://localhost:7687",
    username="root",
    password="playwithdata",
    database="agent-memory",
)
print(graph.get_schema)

LlamaIndex. llama-index-graph-stores-arcadedb implements the LlamaIndex PropertyGraphStore interface.

MCP. ArcadeDB ships an MCP server, so an LLM can query the memory graph directly as a tool. See connecting your LLM to your database with MCP.

Note that the LangChain path needs a running server with the Bolt plugin enabled, whereas the embedded path above needs no server at all. Both talk to the same storage format.

Why This Runs Well

Recall sits in the agent’s request path, so traversal speed is not an academic concern. On the LDBC Graphalytics suite, run on identical hardware against seven other graph engines, ArcadeDB completes weakly connected components in 0.08s and PageRank in 0.10s, winning 5 of the 6 standard algorithms. On the LSQB pattern-matching benchmark, Q6, a two-hop traversal counting 1.67 billion rows, finishes in 110ms. Full methodology and the reproducible harness are on the benchmarks page.

Embedded mode removes the network entirely, which for a single-process agent is usually the largest remaining latency term.

Honest Comparison

ArcadeDB is not the only option that stores graph and vectors together. Memgraph and FalkorDB both index embeddings in the same store as the graph and both target agent memory explicitly, and either is a reasonable choice.

The differences worth weighing are licensing and breadth. Memgraph Community is BSL 1.1 and FalkorDB is SSPLv1, both source-available rather than OSI open source; ArcadeDB is Apache 2.0. ArcadeDB also stores documents and time series alongside the graph, and runs embedded in your Python or JVM process rather than only as a server. The full side-by-side is in open source knowledge graph and GraphRAG databases compared.

Frequently Asked Questions

What is memory for an LLM agent?

Agent memory is the state an agent keeps between turns and between sessions. It usually splits into episodic memory (what happened, turn by turn), semantic memory (durable facts learned about entities), and working memory (the current context window). Only the first two need a database.

Why is a vector store alone not enough for agent memory?

A vector store retrieves passages that resemble the query, but it cannot express that a fact was superseded, that two names refer to one person, or that a preference belongs to a project rather than a user. Those are relationships and validity intervals, which is what a graph stores.

How do you stop an agent from recalling outdated facts?

Do not overwrite facts; invalidate them. Give each fact a valid_from and valid_to timestamp, and when a new value arrives, close the old fact instead of deleting it. Retrieval then filters to facts where valid_to is null, and the history stays auditable.

Can ArcadeDB store agent memory and the embeddings together?

Yes. Embeddings live in a property on the same records that carry the graph, indexed with HNSW. A single transaction writes an episode, its vector, and its relationships, so semantic recall and relationship traversal run against one consistent copy of the data.

Does ArcadeDB work with LangChain for agent memory?

Yes. The official langchain-arcadedb package provides ArcadeDBGraph, which implements the LangChain GraphStore protocol and connects over ArcadeDB’s Bolt protocol. There is also a LlamaIndex PropertyGraphStore integration, and an MCP server that lets an LLM query the database directly.

Can agent memory run without a database server?

Yes. The arcadedb-embedded Python package runs the full engine inside your agent process, with a bundled Java runtime. For a single-process agent or a local assistant, that removes the server, the network hop, and the deployment entirely while keeping graph and vector recall.

Getting Started

pip install arcadedb-embedded

That is the whole install: the wheel bundles a Java runtime, so there is no JDK to set up. From there, the schema above is a working starting point you can adapt.

For the retrieval patterns behind this, see GraphRAG on ArcadeDB. For entity extraction and knowledge graph construction, see knowledge graphs. ArcadeDB is Apache 2.0 and free in production, with no node limits and no enterprise edition.

Open Source Knowledge Graph & GraphRAG Databases Compared (2026)

2026-07-30T00:00:00+00:00

The open source knowledge graph and GraphRAG databases worth evaluating in 2026 are ArcadeDB, Neo4j Community Edition, Memgraph, FalkorDB, JanusGraph, TerminusDB, Cayley, and Kuzu. Below we compare each on licensing, maintenance status, query languages, and whether it can serve GraphRAG retrieval, meaning graph traversal plus vector similarity, without a second database bolted alongside it.

The short version: three of the eight (ArcadeDB, Memgraph, and FalkorDB) index vectors in the same store as the graph and can therefore serve GraphRAG on their own. They differ mainly on licence. ArcadeDB is Apache 2.0, Memgraph Community is BSL 1.1, FalkorDB is SSPLv1. Of those three, only ArcadeDB is OSI open source, and only ArcadeDB also stores documents and time series and runs embedded.

A word on where this is published. This is the ArcadeDB blog, we build ArcadeDB, and we think it is the best fit for most knowledge graph projects in 2026. You should discount our opinion accordingly. What we can do is be precise about the things that are checkable: which licence each project uses, when it last shipped a release, and what it does not do. Every status claim in this article was verified against the project’s own repository on 30 July 2026, and we say plainly where an alternative is the better choice.

Two of the eight are not really live options any more, and we say so rather than padding the list.

What Makes a Knowledge Graph Database?

A knowledge graph is not just a graph. It is a graph where the nodes represent real entities (people, documents, products, concepts), the edges carry typed meaning (authored-by, cites, depends-on), and the whole thing is used to answer questions that require following those relationships rather than filtering a table.

That imposes requirements beyond “can store nodes and edges”:

Multi-hop traversal at usable speed. The value of a knowledge graph is in the questions that cross three or four relationships. If those queries take minutes, nobody asks them.
Semantic retrieval. Users do not know your terminology. Someone searching for “how to handle errors in the payment service” needs to find a document titled “Exception Management in Billing Module”. That requires vector embeddings, not keyword matching.
Exact retrieval too. Sometimes the query is a specific error code or clause number, where semantic similarity is exactly wrong and you need literal full-text matching.
A licence you can actually build on. If you plan to embed the database in a product you distribute, a copyleft licence is a business decision, not a footnote.
Someone still maintaining it. A knowledge graph is long-lived infrastructure. Adopting an unmaintained engine means you have adopted its unfixed bugs permanently.

That last criterion eliminates more candidates than people expect.

1. ArcadeDB

Licence: Apache 2.0 · Latest release: 26.7.3 (July 2026) · Status: actively developed

ArcadeDB is a multi-model database that stores graph, document, key-value, full-text search, vector, and time-series data in one engine, under a single ACID transaction boundary.

Why It Stands Out

For knowledge graph work specifically, the argument is that a knowledge graph needs three retrieval modes and most stacks make you run three databases to get them.

Graph, vectors, and full-text in one query. A realistic knowledge graph query does several things at once: find the entities semantically related to a question, filter to those matching an exact identifier, then traverse out to everything connected. On ArcadeDB that is one statement against one copy of the data. On most alternatives it is a graph query, a separate vector search, and an application layer merging the results, with the consistency between the two stores becoming your problem.

Native vector search. ArcadeDB indexes embeddings with JVector, using DiskANN and HNSW with SIMD acceleration, over the same records that hold the graph. There is no plugin to install and no external service to keep in sync.

Five query languages. SQL, OpenCypher 25, Apache TinkerPop Gremlin, GraphQL, and the MongoDB query language all run against the same data. The Cypher engine is native rather than a translation layer and passes 97.8% of the official TCK, which matters if you are moving existing queries across.

Apache 2.0, with no edition split. Clustering with Raft consensus, replication, embedded mode, and vector search are all in the free build. There is no Enterprise edition holding back the features you need in production, and no node or core limit.

Embedded or server. ArcadeDB runs inside your JVM, or in Python in-process via pip install arcadedb-embedded with a bundled JRE. For a knowledge graph feeding a local RAG pipeline, that removes the server entirely.

Where It’s Not the Best Fit

Honesty requires a real list here, not a token one.

It is the smallest community of the eight. ArcadeDB has roughly 1,055 GitHub stars against Neo4j’s enormous ecosystem, Cayley’s 15,000, and JanusGraph’s 5,800. That translates into fewer Stack Overflow answers, fewer blog posts when you hit an edge case, and fewer engineers who already know it. If hiring people who have used your database before is a hard requirement, this is a genuine argument against us.

It is not a triplestore. If your knowledge graph is RDF, your data is in Turtle or N-Triples, and your team writes SPARQL, ArcadeDB is the wrong shape. It is a property graph engine. Use a dedicated RDF store.

It does not shard a single graph across machines. ArcadeDB replicates for availability rather than partitioning one graph across a cluster. If your graph genuinely exceeds what one machine can hold, JanusGraph over Cassandra is the more honest answer.

It runs on the JVM. The bundled-JRE Python package hides this well, but the engine is Java. If your operational policy excludes the JVM, that is decisive.

2. Neo4j Community Edition

Licence: GPLv3 · Status: actively developed

Neo4j is the reference point for property graphs, and Community Edition is a genuinely capable database. It is also the option most often adopted without reading the licence.

The Good

The ecosystem is unmatched. More documentation, more tutorials, more courses, more consultants, and more people who already know Cypher than every other option on this list combined. For a team learning graph modelling from scratch, that is worth a great deal.

Cypher is excellent. Neo4j designed the language that the rest of the industry now implements. It is genuinely pleasant for pattern matching.

Full ACID and a mature engine. Community Edition is not a crippled demo. For single-instance workloads it is a solid production database.

The Problems

GPLv3 is a business decision. Community Edition is copyleft. If you distribute software that includes or links to it, the licence reaches your application. For internal deployments this is usually irrelevant; for anything you ship to customers it is often a blocker, and the escape hatch is a commercial Enterprise licence.

No clustering. Community Edition does not support clustering, so it is limited to single-instance deployments. High availability and read scaling are Enterprise features. For a knowledge graph that becomes load-bearing infrastructure, that ceiling arrives eventually.

Graph only. Neo4j stores graphs. Documents, time series, and the rest live in other systems, which for knowledge graph work usually means running a separate store for the source documents alongside the graph of extracted entities.

When to Choose It Anyway

If your team already knows Cypher, your deployment is internal and single-instance, and you value ecosystem depth over licence flexibility, Neo4j Community Edition is a reasonable and low-risk choice. Plenty of successful knowledge graphs run on it.

3. Memgraph

Licence: BSL 1.1 (Community) · Latest release: v3.12.0 (July 2026) · Status: actively developed

Memgraph is an in-memory, Cypher-compatible graph database written in C++, and it has aimed squarely at GraphRAG and agent memory. If you have asked an AI assistant about GraphRAG recently, there is a good chance it cited Memgraph.

The Good

Built for this workload. Memgraph ships text and vector indexes in the same store as the graph, so a retrieval pipeline can run graph traversal and similarity search as one database operation rather than two systems stitched together. This is the same architectural argument we make for ArcadeDB, and it is a fair one when they make it.

In-memory speed. Holding the working set in memory gives Memgraph excellent latency on traversals, which matters when retrieval sits in the request path of an agent loop.

Cypher compatible. Existing Cypher and Neo4j-shaped tooling largely carries over.

The Problems

BSL is not open source. Memgraph Community is licensed under the Business Source License 1.1, which the Open Source Initiative does not recognise as an open-source licence. It restricts commercial use in ways a permissive licence does not, and Enterprise sits behind a separate proprietary licence. If your reason for avoiding Neo4j was licensing, read the BSL carefully before treating Memgraph as the escape.

In-memory is a cost model. RAM is the constraint. For a knowledge graph that is large and mostly cold, keeping it resident is a different budget from a disk-based engine.

Graph plus vectors, not multi-model. Documents and time series live elsewhere.

When to Choose It Anyway

If latency is your dominant constraint, your graph fits comfortably in memory, and BSL is acceptable to your legal team, Memgraph is genuinely strong at exactly this workload. It is the most direct competitor to ArcadeDB on GraphRAG and we would rather say so than pretend otherwise.

4. FalkorDB

Licence: SSPLv1 · Latest release: v4.20.1 (July 2026) · Status: actively developed

FalkorDB is the continuation of RedisGraph after Redis discontinued it, rebuilt around GraphBLAS sparse adjacency matrices and aimed explicitly at knowledge graphs for LLMs.

The Good

GraphBLAS is a genuinely good fit. Representing the graph as sparse matrices turns multi-hop traversal into linear algebra, and it makes FalkorDB fast on the pattern-matching workloads GraphRAG generates.

HNSW vector index in the same store. Like Memgraph, FalkorDB indexes embeddings alongside the graph, so hybrid retrieval does not require a second database.

Cypher, with a low-friction operational model. It runs as a Redis module, which is familiar territory for a lot of teams.

The Problems

SSPL is source-available, not open source. The Server Side Public License restricts how you may offer the software as a service and is not an OSI-approved open-source licence. For most self-hosted users this is not a practical constraint, but it is not the same freedom Apache 2.0 gives you, and it should not be described as open source.

Graph and vectors only. Same limitation as Memgraph: no document or time-series model.

Redis-module operational shape. Convenient if you already run Redis, an additional dependency if you do not.

When to Choose It Anyway

If you want fast GraphRAG retrieval, you already operate Redis, and SSPL is acceptable, FalkorDB is a well-built option with a clear focus on exactly this use case.

5. JanusGraph

Licence: Apache 2.0 · Latest release: 1.1.0 (November 2024) · Status: maintained, slow release cadence

JanusGraph is a distributed graph database under the Linux Foundation, and the direct descendant of Titan. It is the option built for graphs that genuinely do not fit on one machine.

The Good

Real horizontal scale. JanusGraph is a graph layer over a distributed storage backend, shipping support for Apache Cassandra, Apache HBase, and Oracle Berkeley DB Java Edition. If you already run Cassandra at scale, JanusGraph inherits that operational maturity and partitions a graph across it in a way most single-node engines cannot.

Genuinely open, genuinely neutral. Apache 2.0, under the Linux Foundation, with no vendor holding an Enterprise edition back. There is no commercial upsell waiting.

TinkerPop native. Gremlin is the query language, and JanusGraph is a first-class TinkerPop implementation, so the wider Gremlin tooling ecosystem works.

The Problems

You are operating at least two distributed systems. JanusGraph is not a complete database. Production means running and tuning Cassandra or HBase underneath it, plus usually Elasticsearch or Solr for indexing. The operational burden is substantially higher than any single-binary option here, and that is the dominant cost for most teams.

Slow release cadence. Version 1.0.0 arrived in October 2023 and 1.1.0 in November 2024. The repository is active, but a project shipping roughly one release a year is a different maintenance proposition from one shipping monthly.

Gremlin only, and no native vectors. There is no Cypher and no SQL. For semantic search you will pair it with a separate vector database, which for knowledge graph work means the multi-store architecture again.

When to Choose It Anyway

If your graph is genuinely too large for one machine and you already operate Cassandra or HBase, JanusGraph is the right answer and we would tell you so. That is a real scenario and no amount of single-node performance changes it.

6. TerminusDB

Licence: Apache 2.0 · Latest release: 12.0.6 (June 2026) · Status: actively developed

TerminusDB is a document graph database whose distinguishing idea is git-style version control for data: branching, merging, diffing, and time travel over the database itself.

The Good

Versioning is a first-class feature, not a pattern you implement. For a knowledge graph curated by humans, being able to branch the graph, make changes, review a diff, and merge is genuinely valuable and nothing else on this list offers it natively. Regulatory and scientific use cases where you must prove what the data said last quarter are a natural fit.

Apache 2.0 and actively shipping. Releases are current, the repository saw commits the week this was written, and the licence is permissive.

Document plus graph model. Data is modelled as documents with a schema, connected as a graph, which fits knowledge graph work better than a pure triplestore for most application developers.

The Problems

Smaller ecosystem than its age suggests. Roughly 3,370 GitHub stars and a modest contributor base. Documentation and community answers are thinner than for Neo4j or JanusGraph.

No native vector search. TerminusDB does not index embeddings alongside the graph, so semantic retrieval means a separate vector store.

WOQL is a learning curve. The native query language is its own thing rather than Cypher, SQL, or Gremlin, so existing team knowledge does not transfer.

When to Choose It Anyway

If versioning and provenance are the primary requirement rather than a nice-to-have, TerminusDB is the strongest option here and the comparison is not close.

7. Cayley

Licence: Apache 2.0 · Latest tagged release: v0.7.7 (October 2019) · Status: effectively dormant

Cayley is a linked-data graph database written in Go, inspired by the graph infrastructure behind Google’s Knowledge Graph. It has 15,045 GitHub stars, which is more than anything else on this list, and it is the clearest illustration of why star counts are a poor proxy for project health.

The Good

The design is genuinely nice. A Go binary with pluggable backends, a clean HTTP API, and a small footprint. For a read-mostly linked-data store it is pleasant to work with.

Permissively licensed with no vendor. Apache 2.0, no commercial edition, no strings.

The Problems

No tagged release since October 2019. That is the single fact that matters. The repository is not archived and there is occasional commit activity, but a database that has not cut a release in nearly seven years is not something to build a knowledge graph on in 2026. Bug fixes, security patches, and dependency updates are not arriving on any schedule you can rely on.

No vector search, no modern AI integration. Cayley predates the entire embedding era and has no answer for semantic retrieval.

Query languages are its own. Gizmo, a JavaScript-based query API, alongside GraphQL and MQL variants.

Verdict

We would not start a new knowledge graph on Cayley in 2026, and we would say the same if it were our own project. It is included here because its star count keeps it near the top of search results, and readers deserve to know that the number reflects 2015 enthusiasm rather than 2026 maintenance.

8. Kuzu

Licence: MIT · Latest release: v0.11.3 (October 2025) · Status: archived

Kuzu was an embedded analytical graph database with a columnar storage engine and a Cypher interface, and it was genuinely good at what it did.

What Happened

The GitHub repository was archived in October 2025 following the team’s acquisition by Apple. Development stopped. The code remains available under MIT, and community forks exist, but there is no funded maintainer and no release schedule.

Strengths, While It Lasted

Excellent analytical performance. Kuzu’s columnar engine was fast on multi-hop analytical queries, and it still appears in benchmark comparisons, including our own LDBC results, where it is marginally faster than ArcadeDB on LSQB Q2. We are not going to pretend otherwise.

Embedded and Python-first. It brought embedded graph workloads to Python properly, which is exactly the niche arcadedb-embedded now occupies.

Limitations

It is archived. Everything else is secondary. An archived database is not a foundation for new infrastructure, however good the engine was.

MIT means the code survives. If you already run Kuzu, you are not stranded, and forking is legally straightforward. But you now own a database engine, which is a larger commitment than most teams intend to make.

The Comparison at a Glance

	ArcadeDB	Neo4j CE	Memgraph	FalkorDB	JanusGraph	TerminusDB	Cayley	Kuzu
Licence	Apache 2.0	GPLv3	BSL 1.1	SSPLv1	Apache 2.0	Apache 2.0	Apache 2.0	MIT
OSI open source	Yes	Yes	No	No	Yes	Yes	Yes	Yes
Status (Jul 2026)	Active	Active	Active	Active	Maintained	Active	Dormant	Archived
Latest release	26.7.3	Current	v3.12.0	v4.20.1	1.1.0 (2024)	12.0.6	v0.7.7 (2019)	v0.11.3 (2025)
Vectors in-store	Yes	Partial	Yes	Yes	No	No	No	No
Serves GraphRAG alone	Yes	Partial	Yes	Yes	No	No	No	No
Beyond graph	Doc, KV, TS, FTS	No	No	No	No	Doc	No	No
Clustering free	Yes	No	Yes	Yes	Yes	Yes	n/a	n/a
Query languages	5	Cypher	Cypher	Cypher	Gremlin	WOQL	Gizmo	Cypher
Embedded mode	Yes	No (Enterprise)	No	No	No	No	Yes	Yes
Separate storage backend	No	No	No	Redis module	Required	No	Optional	No
GitHub stars	1,055	Very large	4,293	4,853	5,816	3,370	15,045	4,028

Star counts and release data verified against each project’s GitHub repository on 30 July 2026.

Why Licensing Matters for Knowledge Graphs

Knowledge graphs have a licensing problem that other databases do not, because they tend to end up embedded in products rather than sitting behind a service boundary.

A knowledge graph that powers an internal search tool is a service you deploy, and GPLv3 is largely irrelevant. A knowledge graph that ships inside a desktop application, an on-premise product, or a customer-installed agent is distributed software, and a copyleft licence reaches your code. Teams frequently discover this after the architecture is settled.

Apache 2.0, used by ArcadeDB, JanusGraph, TerminusDB, and Cayley, permits embedding in proprietary software with no source obligation. MIT, used by Kuzu, is similarly permissive. GPLv3, used by Neo4j Community Edition, does not, and the commercial escape hatch is priced accordingly.

The second licensing question is what the free edition withholds. Neo4j reserves clustering and embedded mode for Enterprise. ArcadeDB, JanusGraph, and TerminusDB do not have a paid edition holding features back at all.

So Which One Should You Choose?

Choose ArcadeDB if you want graph traversal, vector search, and full-text retrieval in one engine under a permissive licence, and you would rather not operate three databases to build one knowledge graph. Accept that you are picking the smallest community here.

Choose Neo4j Community Edition if ecosystem depth matters more than licence flexibility, your deployment is internal and single-instance, and your team already writes Cypher.

Choose Memgraph if retrieval latency is the dominant constraint, your graph fits in memory, and your legal team is comfortable with BSL 1.1.

Choose FalkorDB if you want fast GraphRAG retrieval, already operate Redis, and SSPL is acceptable.

Choose JanusGraph if your graph genuinely does not fit on one machine and you already operate Cassandra or HBase. Do not choose it to avoid that operational burden, because it adds to it.

Choose TerminusDB if data versioning, branching, and provenance are core requirements rather than conveniences.

Do not start on Cayley in 2026, despite the star count, unless you are prepared to maintain it.

Do not start on Kuzu, because it is archived, unless you are deliberately adopting a fork and accepting ownership.

Key Takeaways

Six of these eight are actively maintained. Cayley has not cut a release since 2019, and Kuzu was archived in October 2025.
GitHub stars measure historical enthusiasm, not current health. Cayley has the most stars and the least maintenance.
Licensing determines whether you can embed the database in a distributed product. Apache 2.0 and MIT permit it; GPLv3 constrains it.
GraphRAG needs graph traversal and vector similarity together. Three of these eight do both in one store: ArcadeDB, Memgraph, and FalkorDB. The other five expect a separate vector database.
The GraphRAG shortlist comes down to licence: ArcadeDB is Apache 2.0, Memgraph is BSL, FalkorDB is SSPL. Of the three, only ArcadeDB is OSI open source, and only ArcadeDB is multi-model and embeddable.
ArcadeDB’s genuine weakness is community size: it has the fewest GitHub stars of the eight.

Frequently Asked Questions

What is the best open source knowledge graph database in 2026?

It depends on what you already run. ArcadeDB, Memgraph, and FalkorDB all index vectors alongside the graph and can serve GraphRAG alone; they differ on licence, with only ArcadeDB being OSI open source. JanusGraph is better at scale on Cassandra or HBase, and TerminusDB is better when versioning matters most.

Which open source database is best for GraphRAG?

GraphRAG needs graph traversal and vector similarity in the same query. ArcadeDB, Memgraph, and FalkorDB all index embeddings in the same store as the graph, so any of the three can serve GraphRAG without a separate vector database. ArcadeDB is the only one of the three under an OSI-approved open source licence.

Is Neo4j Community Edition suitable for a production knowledge graph?

It can be, with two caveats. Community Edition is GPLv3, a copyleft licence that affects how you can distribute software built on it, and it does not support clustering, so it is limited to single-instance deployments. Clustering and a commercial licence without the copyleft terms both require Enterprise Edition.

Which open source knowledge graph databases are still actively maintained?

As of July 2026, ArcadeDB, Memgraph, FalkorDB, TerminusDB, JanusGraph, and Neo4j all ship releases. Kuzu was archived on GitHub in October 2025 after its team was acquired by Apple. Cayley’s repository is not archived but its last tagged release, v0.7.7, dates from October 2019.

Do I need a separate vector database for a knowledge graph?

Not necessarily. A knowledge graph that also serves semantic search needs both relationships and embeddings. ArcadeDB indexes vectors natively with JVector alongside the graph, so a single query can rank by similarity and traverse edges. Most alternatives require pairing the graph with a separate vector store.

What is the difference between a knowledge graph and an RDF triplestore?

A triplestore models data strictly as subject-predicate-object triples and typically queries with SPARQL. A property graph attaches arbitrary properties to nodes and edges and queries with Cypher, Gremlin, or SQL. Both can express a knowledge graph; property graphs are usually easier for application developers.

Does JanusGraph need Cassandra or HBase to run?

JanusGraph is a graph layer rather than a complete storage engine, so it runs on top of a separate backend. It ships support for Apache Cassandra, Apache HBase, and Oracle Berkeley DB Java Edition. Berkeley DB suits local development; Cassandra or HBase is expected in production.

Getting Started with ArcadeDB

If the single-engine argument is the one that lands, the fastest way to test it is to build a small knowledge graph on your own data and see whether one query really can do all three retrieval modes.

docker run --rm -p 2480:2480 -p 2424:2424 \
  -e JAVA_OPTS="-Darcadedb.server.rootPassword=playwithdata" \
  arcadedata/arcadedb:latest

Then open ArcadeDB Studio at http://localhost:2480 and start modelling. For the concepts behind entity extraction, semantic search, and temporal knowledge, see knowledge graphs on ArcadeDB. If you are building retrieval for an LLM on top of it, GraphRAG covers the hybrid retrieval patterns. If you are migrating an existing graph, the Neo4j migration guide covers Cypher and Bolt compatibility.

ArcadeDB is Apache 2.0, free in production, with no node limits and no Enterprise edition. Benchmark it against whatever you run today and keep whichever wins.

ArcadeDB 26.7.3: Three Security Advisories and a Super-Node Edge-Merge Data-Loss Fix

2026-07-17T00:00:00+00:00

ArcadeDB 26.7.3 is a hotfix on top of 26.7.2. It closes three security advisories, two in the MCP server transport and one in the JavaScript/Java trigger authorization gate, and repairs a data-loss regression in the commutative super-node edge-append merge that 26.7.2 introduced.

Upgrade if you can. It matters most if you expose the MCP server, run in server or multi-tenant mode, or store high-degree (super-node) graphs. There are no breaking changes and no schema migration in this release.

Major Highlights

Security Advisories

All three come from the same internal audit that produced the 26.7.2 fixes, covering MCP-transport and trigger surfaces that round did not reach.

MCP command transport disabled all engine permission checks (GHSA-6x73-v3rc-f57c). The MCP transport never bound the authenticated principal onto the request thread’s DatabaseContext, so the engine permission gates (which are deliberate no-ops when no user is bound) silently passed for every MCP caller. A non-root, MCP-allowed reader could perform arbitrary writes, DDL, and schema or security mutation; the query + js sub-case could execute arbitrary in-JVM JavaScript. The principal is now bound at the single DB-resolution chokepoint and cleared on the pooled worker thread, so the engine per-user gates enforce for MCP exactly as they do for the HTTP, Bolt, PostgreSQL, and gRPC transports.
MCP get_server_settings leaked the HA clusterToken in cleartext (GHSA-p9wc-4fhr-78wm). The MCP get_server_settings tool masked only settings whose key contained "password", so arcadedb.ha.clusterToken was returned raw. That token is the trust anchor for cluster-forwarded authentication, so leaking it enables full root impersonation. This is the MCP sibling of the 26.7.2 fix (GHSA-46hj-24h4-j8gf); the tool now redacts both value and default via GlobalConfiguration.isHidden(), matching GetServerHandler.
JavaScript / Java triggers could escalate a schema admin to server-wide admin (GHSA-38pf-6hp2-pxww). A JAVASCRIPT trigger binds the real database object into a GraalVM context, so its script could call database.getSecurity().createUser(...) and escalate an UPDATE_SCHEMA (schema-admin) user to a server-wide admin; a JAVA trigger runs an arbitrary loaded class. Creating either host-code trigger type now requires UPDATE_SECURITY at the LocalSchema.createTrigger chokepoint, mirroring the DEFINE FUNCTION ... LANGUAGE js gate (GHSA-vwjc-v7x7-cm6g). Declarative SQL triggers keep UPDATE_SCHEMA, and schema reload of existing triggers is unaffected.
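
The clusterToken leak is a good illustration of why masking by key name is fragile. The sketch below is a Python analogy, not ArcadeDB's Java code: the `Setting` class, the `hidden` flag, and the `example.http.port` key are hypothetical stand-ins for the idea behind `GlobalConfiguration.isHidden()`.

```python
from dataclasses import dataclass

MASK = "*****"


@dataclass(frozen=True)
class Setting:
    # Hypothetical model of a server setting; not ArcadeDB's real class.
    key: str
    value: str
    default: str = ""
    hidden: bool = False


SETTINGS = [
    Setting("arcadedb.server.rootPassword", "s3cret", hidden=True),
    Setting("arcadedb.ha.clusterToken", "token-abc", hidden=True),
    Setting("example.http.port", "2480", default="2480"),  # made-up key
]


def redact_by_name(settings):
    """Flawed approach: mask only keys whose name contains 'password'."""
    return {
        s.key: (MASK if "password" in s.key.lower() else s.value)
        for s in settings
    }


def redact_by_flag(settings):
    """Mask both value and default for every setting flagged as hidden."""
    return {
        s.key: {
            "value": MASK if s.hidden else s.value,
            "default": MASK if s.hidden else s.default,
        }
        for s in settings
    }


print(redact_by_name(SETTINGS)["arcadedb.ha.clusterToken"])
print(redact_by_flag(SETTINGS)["arcadedb.ha.clusterToken"]["value"])
```

The name-based filter lets the token through because nothing in its key says "password". The flag-based filter does not depend on naming at all.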

Major Fixes

Graph Engine

Edge-append merge no longer reverts concurrent writes on multi-page edge chunks (#5302). The commutative edge-append merge (GRAPH_EDGE_APPEND_MERGE, introduced in 26.7.2) resolved a commit-time page conflict by re-deriving the conflicted page and replaying the transaction’s tracked appends. For an edge chunk stored as a multi-page record (a chunk that straddles a page boundary) this re-derivation was unsound: the rebase re-read the chunk through the transaction’s stale in-transaction page copy and committed it, silently reverting concurrently committed appends on the continuation page. The observable symptoms were zeroed chunk tails, shifted or aliased pairs, lost edges, and BufferUnderflowException on later traversals of the vertex.

Only records living entirely in place on the conflicted page are now re-derived. Multi-page and indirected (placeholder) chunk records fall back to the standard full-transaction retry. Single-page chunks, which are the vast majority and include all super-node stripe chunks, keep the merge.
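
The new rule can be read as a small decision function. This is a simplified Python sketch of the decision as described above, not the engine's actual code; `ChunkRecord`, `Resolution`, and `resolve_conflict` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Resolution(Enum):
    MERGE = "re-derive the page and replay tracked appends"
    FULL_RETRY = "standard full-transaction retry"


@dataclass(frozen=True)
class ChunkRecord:
    # Hypothetical model: the pages a record occupies, and whether it is
    # an indirected (placeholder) record.
    page_ids: tuple
    is_placeholder: bool = False


def resolve_conflict(records, conflicted_page):
    """Merge only if every affected record lives entirely on the conflicted page."""
    for record in records:
        if conflicted_page not in record.page_ids:
            continue  # record is untouched by this conflict
        if record.is_placeholder or len(record.page_ids) != 1:
            return Resolution.FULL_RETRY
    return Resolution.MERGE


single_page = ChunkRecord(page_ids=(7,))
straddling = ChunkRecord(page_ids=(7, 8))

print(resolve_conflict([single_page], conflicted_page=7).name)
print(resolve_conflict([single_page, straddling], conflicted_page=7).name)
```

One multi-page or placeholder record on the conflicted page is enough to abandon the merge for the whole transaction, which trades some retries for never replaying against a stale continuation page.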

MCP Server

get_schema now builds its schema through a dedicated buildSchema path.
The MCP dispatcher handles a missing resource with a proper MCPResourceNotFoundException instead of a generic failure.

Getting Started with 26.7.3

Docker

docker pull arcadedata/arcadedb:26.7.3

Visit our Docker Hub repository for more information.

Maven

<dependency>
    <groupId>com.arcadedb</groupId>
    <artifactId>arcadedb-engine</artifactId>
    <version>26.7.3</version>
</dependency>

All artifacts are available on Maven Central.

Documentation

For details on features and usage, see the documentation.

Compatibility Note

This release contains no breaking changes and no schema migration. It maintains 100% compatibility with previous database formats, meaning no export/import is required when upgrading. As always, we recommend creating a database backup before upgrading.

Download ArcadeDB 26.7.3 now: GitHub Releases

Thanks to everyone in the community who reported issues, opened PRs, and helped shape this release.

Luca Garulli ArcadeDB Founder

Certified, Not Claimed: ArcadeDB’s Bolt Compatibility With Every Official Neo4j Driver

2026-07-14T00:00:00+00:00

Plenty of databases say they “support the Bolt protocol.” Almost none of them tell you what that sentence leaves out.

ArcadeDB has spoken Bolt for a while, and the honest version of our claim was narrow: the official Neo4j drivers connect, and simple queries work. Useful, but not the same thing as compatibility. The gap between the two is where users get hurt. The driver connects on Monday; on Thursday someone reads a datetime back out of a query and gets a string.

ArcadeDB 26.7.2 closes that gap. It ships a shared conformance spec, all five official drivers under test, fixes for the protocol bugs that testing exposed, and a compatibility matrix that regenerates itself every night and is published where anyone can read it.

The audit

We started by auditing what we actually had (epic #4882). Some of it was reassuring. The full Bolt 3.0/4.0/4.4 message set was implemented, ROUTE included, along with a hand-written PackStream encoder, correct structure tags for Node, Relationship, and Path, TLS via bolt+s, and Neo4j-style structured error codes.

The rest was not:

Bolt 5.x was never advertised. The server negotiated 3.0, 4.0, and 4.4. Modern 5.x drivers worked only by silently downgrading, which nobody had documented or tested as a deliberate stance.
Temporal values went out as strings. Date, Time, LocalDateTime, DateTime and friends were serialized as ISO-8601 text instead of native Bolt structures. Duration and spatial Point had no handling anywhere. The driver dutifully handed your application a String.
Neo.TransientError.* did not exist. Only a handful of ClientError and DatabaseError codes were defined, so the drivers’ managed-transaction retry logic could never fire the way it does against Neo4j.
ROUTE was single-node only. It returned this node’s own address as writer, reader, and router, so neo4j:// routing against a real HA cluster was unproven.
Three of the five official drivers had zero Bolt coverage. The Python and C# e2e suites tested the Postgres wire protocol and HTTP. There was no Go module at all.
Java and JavaScript coverage was shallow. Connect, run a query, one regression test. No transactions, no error paths, no type round-trips.

A compatibility claim you cannot fail is not a compatibility claim. That list is why the epic existed.

The rules we set

Certify depth, not presence. “The driver connects” is not certification. Every driver runs the full feature matrix, and every unsupported cell becomes a documented limitation instead of a silent omission.

Only the real drivers. neo4j-java-driver, neo4j-driver, neo4j, Neo4j.Driver, neo4j-go-driver. No mocks, no bespoke socket clients. If the driver your application imports cannot do it, we do not get to claim it.

One spec, five idiomatic suites. We deliberately did not build a YAML-driven test runner in five languages. The conformance spec is a reference document. Each scenario is hand-written into that language’s native framework (JUnit, jest, pytest, xUnit, go test) and tagged with the scenario ID. The spec owns what gets tested; the code stays polyglot and readable.

“Not supported” is an acceptable answer, as long as it is written down. Byte-for-byte parity with Neo4j server behavior was never the goal. Knowing exactly where we differ was worth more.

The spec: 39 scenarios, 9 areas

The matrix is defined once, in bolt/conformance/spec.yaml, across nine areas taken verbatim from the epic’s feature table:

Area	What it pins down
`connection`	`bolt://`, `bolt+s://` with TLS required and optional, `neo4j://` routing discovery
`auth`	Basic auth success and failure; the `none` scheme being rejected (intentional, now certified as such)
`transactions`	Autocommit, explicit BEGIN/COMMIT/ROLLBACK, managed transaction functions, retry on transient errors
`causal-consistency`	Bookmarks enforcing read-after-write across sessions
`multi-database`	Session database selection and isolation between databases on one driver
`result-handling`	Streaming `PULL`, `PULL n` resumption, `DISCARD`, `ResultSummary` counters
`type-roundtrip`	Node, Relationship, Path, ByteArray, nested collections, nulls, all five temporal types, Duration, Point
`errors`	`Neo.ClientError.*` and `Neo.TransientError.*` so driver retry behavior matches Neo4j
`protocol`	3.0/4.0/4.4 negotiation, 5.x negotiation, `RESET` mid-stream

Each scenario carries a stable ID (TYPE-011, PROTO-002), a fixture, given/when/then steps, and a status. Every test in every language embeds its scenario ID in the test name, so checking coverage is a grep.
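
"Checking coverage is a grep" can be made concrete in a few lines. The sketch below assumes test names embed the ID either as `TYPE-011` or, where hyphens are not legal in identifiers, as `TYPE_011`. That naming convention, the sample test names, and the helper names are assumptions for illustration.

```python
import re

# Matches IDs such as TYPE-011 or TYPE_011 that are not part of a longer token.
SCENARIO_ID = re.compile(r"(?<![A-Z0-9])([A-Z]{2,})[-_](\d{3})(?!\d)")


def covered_ids(test_names):
    """Collect every scenario ID mentioned in a list of test names."""
    found = set()
    for name in test_names:
        for prefix, number in SCENARIO_ID.findall(name):
            found.add(f"{prefix}-{number}")
    return found


def missing_ids(spec_ids, test_names):
    """Scenario IDs declared in the spec but not referenced by any test."""
    return sorted(set(spec_ids) - covered_ids(test_names))


# Hypothetical test names from two different language suites.
tests = [
    "test_TYPE_011_datetime_roundtrip",
    "PROTO-002 resets mid-stream",
    "test_driver_connects",
]
spec = ["TYPE-011", "PROTO-002", "ERR-003"]

print(sorted(covered_ids(tests)))
print(missing_ids(spec, tests))
```

Normalizing both spellings to the hyphenated form lets the same check run across JUnit, jest, pytest, xUnit, and go test output.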

The drivers: 14 pinned versions

Testing against latest tells you about today and nothing about tomorrow. Every language is pinned to a band set, resolved to concrete versions in driver-versions.md:

Driver	Versions under test
Java (`neo4j-java-driver`)	4.4.20, 5.28.5, 6.2.0
JavaScript (`neo4j-driver`)	5.28.3, 6.2.0
Python (`neo4j`)	5.28.4, 6.1.0, 6.2.0
.NET (`Neo4j.Driver`)	5.26.2, 5.28.4, 6.2.1
Go (`neo4j-go-driver`)	5.27.0, 5.28.0, 5.28.4

The latest band deliberately tracks the newest release, so a driver-side release that breaks compatibility trips our nightly run within a day of shipping.

Two of these bands carry compromises. The C# floor is 5.26.2 instead of a 4.x line, because Neo4j.Driver made breaking API changes in 5.0 that the shared suite cannot compile against. JavaScript has no 4.x band for the same reason. The 4.4 wire protocol is still covered through the Java 4.4.20 legacy driver. Both compromises are written down in the file, with the reasoning, so nobody has to reverse-engineer why a column is missing.

What the tests broke

The first full run lit up red, which was more or less the point. Everything from the audit list is now fixed in 26.7.2:

Native temporal structures. Date, Time, LocalTime, LocalDateTime, and DateTime are emitted as native PackStream structures.
Duration and Point. Both round-trip as native Bolt structures in both directions, bound parameters included.
Neo.TransientError.*. Retryable conflicts now map to transient error codes, so the drivers’ executeWrite retry loop works out of the box instead of failing your write on the first optimistic-lock conflict.
HA-aware ROUTE. neo4j:// returns the real cluster topology instead of the local node three times.
Bolt 5.x negotiation. No more silent downgrade.
Populated write counters. ResultSummary counters reflect what actually happened.
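
Why the error-code fix matters is easiest to see from the driver's side. The sketch below is a simplified stand-in for a managed-transaction retry loop, not any official driver's implementation: `BoltError`, `is_retryable`, and `run_with_retry` are hypothetical names, and real drivers apply more elaborate backoff and classification rules.

```python
import time


class BoltError(Exception):
    """Hypothetical error carrying a Neo4j-style status code."""

    def __init__(self, code, message=""):
        super().__init__(f"{code}: {message}")
        self.code = code


def is_retryable(err):
    # Only transient errors are retried; client errors are the caller's problem.
    return err.code.startswith("Neo.TransientError.")


def run_with_retry(work, max_attempts=3, base_delay=0.0):
    """Call work() again on transient errors, with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except BoltError as err:
            if not is_retryable(err) or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


attempts = {"count": 0}


def flaky_write():
    # Simulates a write that hits one conflict and then succeeds.
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise BoltError("Neo.TransientError.Transaction.DeadlockDetected", "simulated")
    return "committed"


print(run_with_retry(flaky_write), "after", attempts["count"], "attempts")
```

If the server reports the same conflict under a non-transient code, `is_retryable` returns False and the write fails on the first attempt, which is the pre-26.7.2 behavior the list describes.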

One breaking change to plan for

The temporal fix changes behavior, and it is the one thing to read before you upgrade. Values that used to arrive as strings now arrive as typed objects:

// Before 26.7.2: this returned an ISO-8601 String
String when = record.get("created").asString();

// 26.7.2 and later: it is a real temporal value
ZonedDateTime when = record.get("created").asZonedDateTime();

This is what every Neo4j driver expects, and it is still a change, so the 26.7.2 release notes call it out explicitly.
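
If a client has to talk to servers on both sides of the upgrade during a rollout, a small shim can absorb the difference. The following is a sketch in Python under stated assumptions: `as_datetime` is a hypothetical helper, and it relies on the driver's temporal object exposing a `to_native()` method, as the Python driver's `neo4j.time.DateTime` does. `FakeDriverDateTime` only stands in for that type so the example runs without the driver installed.

```python
from datetime import datetime, timezone


def as_datetime(value):
    """Accept either the old ISO-8601 string or a typed temporal value."""
    if isinstance(value, datetime):
        return value
    if isinstance(value, str):
        # Note: a trailing "Z" is only parsed by fromisoformat on Python 3.11+.
        return datetime.fromisoformat(value)
    to_native = getattr(value, "to_native", None)
    if callable(to_native):
        return to_native()
    raise TypeError(f"unsupported temporal value: {type(value).__name__}")


class FakeDriverDateTime:
    """Stand-in for a driver temporal type with a to_native() method."""

    def __init__(self, native):
        self._native = native

    def to_native(self):
        return self._native


expected = datetime(2026, 7, 14, 10, 0, tzinfo=timezone.utc)
before_upgrade = as_datetime("2026-07-14T10:00:00+00:00")
after_upgrade = as_datetime(FakeDriverDateTime(expected))
print(before_upgrade == after_upgrade == expected)
```

A shim like this is a bridge for the rollout window, not something to keep: once every server is on 26.7.2 or later, reading the typed value directly is simpler.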

The part that keeps it true

A test suite that runs when someone remembers to run it eventually stops being true. The pipeline matters more than any single bug on that list.

All five language suites gate every pull request. On top of that, a nightly workflow runs the full cross-product: every scenario, every driver, every pinned version. Each language’s JUnit-style report becomes a set of per-cell results, the cells merge into a single bolt-compat-matrix.json, and that JSON is cross-referenced against the expected cell set from driver-versions.md. A cell that should exist and does not is treated exactly like a cell that ran and failed. Silence does not pass.
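
The "silence does not pass" rule is a set comparison. The sketch below shows the idea with two scenarios and two of the pinned driver bands; the dictionary shapes and function names are assumptions, not the real layout of bolt-compat-matrix.json.

```python
from itertools import product

# Two of the pinned bands, copied from the driver table.
DRIVER_VERSIONS = {
    "java": ["4.4.20", "5.28.5", "6.2.0"],
    "javascript": ["5.28.3", "6.2.0"],
}
SCENARIOS = ["TYPE-011", "PROTO-002"]


def expected_cells(scenario_ids, driver_versions):
    """Every (scenario, language:version) pair that must report a result."""
    columns = [
        f"{lang}:{version}"
        for lang, versions in driver_versions.items()
        for version in versions
    ]
    return set(product(scenario_ids, columns))


def merge_matrix(expected, reported):
    """Fill each expected cell; a cell with no report is recorded as such."""
    return {cell: reported.get(cell, "not reported") for cell in expected}


def red_cells(matrix):
    """A missing cell is treated exactly like a failing one."""
    return sorted(c for c, status in matrix.items() if status in ("fail", "not reported"))


expected = expected_cells(SCENARIOS, DRIVER_VERSIONS)
reported = {cell: "pass" for cell in expected}
reported[("TYPE-011", "java:6.2.0")] = "fail"          # a test that ran and failed
del reported[("PROTO-002", "javascript:5.28.3")]       # a test that never ran

matrix = merge_matrix(expected, reported)
print(red_cells(matrix))
```

Driving the comparison from the expected set, rather than from whatever the reports contain, is what stops a suite that silently skipped a version from looking green.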

Three things then happen without anyone touching them:

A regression issue opens itself. Any red cell auto-opens a bolt-compat-regression issue, which closes itself when the matrix goes green again.
The matrix is rendered and committed. A small renderer turns the nightly JSON plus the spec metadata into COMPATIBILITY.md. Rows are scenarios grouped by area, columns are language:version, and cells are pass, fail, known limitation, not applicable, or not reported. Non-passing cells link to their tracking issue. The page carries a “last verified” timestamp and a link to the CI run that produced it.
The badge updates. A shields.io endpoint in the README goes green only when every applicable cell passes.
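
For the badge, shields.io's endpoint format expects a small JSON document with `schemaVersion`, `label`, `message`, and `color`. Here is a minimal sketch of producing one from the cell statuses. Which statuses count as non-blocking, the label text, and the function name are assumptions made for this example.

```python
import json

# Assumption: these statuses do not stop the badge from going green.
NON_BLOCKING = {"pass", "not applicable", "known limitation"}


def badge_payload(statuses):
    """Build a shields.io endpoint payload from a list of cell statuses."""
    blocking = [s for s in statuses if s not in NON_BLOCKING]
    green = not blocking
    return {
        "schemaVersion": 1,
        "label": "bolt compatibility",
        "message": "passing" if green else f"{len(blocking)} failing",
        "color": "brightgreen" if green else "red",
    }


all_good = ["pass", "pass", "not applicable"]
one_missing = ["pass", "not reported", "pass"]

print(json.dumps(badge_payload(all_good)))
print(json.dumps(badge_payload(one_missing)))
```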

The red nights get published the same way the green ones do. A matrix that could only ever come out green would not be worth publishing.

What is not green

Every applicable cell passes today. Two cells are not green, both deliberately:

ERR-003 (unauthenticated request returns Neo.ClientError.Security.Forbidden) is marked not applicable. It cannot be triggered through any official driver’s public API, because every official driver completes HELLO/LOGON internally before it hands your code a session. Reaching it would take a bespoke raw-socket client, which our own rules exclude. The row stays in the matrix with the explanation attached.
CONN-004 (neo4j:// routing reflecting a real multi-node topology) is skipped, because the nightly runs against a single node. The HA-aware ROUTE implementation shipped in 26.7.2; the multi-node nightly harness for it did not. The cell carries a footnote about the HA_SERVER_LIST configuration it depends on, and it stays visible until something actually covers it.

The Java 4.4.20 column is sparse on purpose. It runs the legacy-driver subset that proves 4.4 wire negotiation still works, not the full modern-API suite.

Try it

Point any official Neo4j driver at an ArcadeDB server. Nothing special is required. Here is the same query in all five certified drivers:

// Java (neo4j-java-driver); assumes imports of org.neo4j.driver.* and java.util.Map
try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
         AuthTokens.basic("root", "playwithdata"));
     Session session = driver.session(SessionConfig.forDatabase("beer"))) {

  session.run("MATCH (b:Beer)-[:HasCategory]->(c:Category) "
            + "WHERE c.name = $cat RETURN b.name AS name LIMIT 5",
      Map.of("cat", "Irish Ale"))
      .forEachRemaining(r -> System.out.println(r.get("name").asString()));
}

// JavaScript (neo4j-driver)
import neo4j from 'neo4j-driver';

const driver = neo4j.driver('bolt://localhost:7687',
  neo4j.auth.basic('root', 'playwithdata'));
const session = driver.session({ database: 'beer' });

const result = await session.run(
  'MATCH (b:Beer)-[:HasCategory]->(c:Category) ' +
  'WHERE c.name = $cat RETURN b.name AS name LIMIT 5',
  { cat: 'Irish Ale' }
);
result.records.forEach(r => console.log(r.get('name')));

await session.close();
await driver.close();

# Python (neo4j)
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("root", "playwithdata"))

with driver.session(database="beer") as session:
    result = session.run(
        "MATCH (b:Beer)-[:HasCategory]->(c:Category) "
        "WHERE c.name = $cat RETURN b.name AS name LIMIT 5",
        cat="Irish Ale",
    )
    for record in result:
        print(record["name"])

// C# (Neo4j.Driver); assumes "using Neo4j.Driver;"
await using var driver = GraphDatabase.Driver("bolt://localhost:7687",
    AuthTokens.Basic("root", "playwithdata"));
await using var session = driver.AsyncSession(o => o.WithDatabase("beer"));

var names = await session.ExecuteReadAsync(async tx => {
    var cursor = await tx.RunAsync(
        "MATCH (b:Beer)-[:HasCategory]->(c:Category) " +
        "WHERE c.name = $cat RETURN b.name AS name LIMIT 5",
        new { cat = "Irish Ale" });
    return await cursor.ToListAsync(r => r["name"].As<string>());
});

names.ForEach(Console.WriteLine);

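// Go (neo4j-go-driver); runs inside main and assumes imports of context, fmt, log,
// and github.com/neo4j/neo4j-go-driver/v5/neo4j
ctx := context.Background()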

driver, err := neo4j.NewDriverWithContext("bolt://localhost:7687",
    neo4j.BasicAuth("root", "playwithdata", ""))
if err != nil {
    log.Fatal(err)
}
defer driver.Close(ctx)

session := driver.NewSession(ctx, neo4j.SessionConfig{DatabaseName: "beer"})
defer session.Close(ctx)

result, err := session.Run(ctx,
    "MATCH (b:Beer)-[:HasCategory]->(c:Category) "+
        "WHERE c.name = $cat RETURN b.name AS name LIMIT 5",
    map[string]any{"cat": "Irish Ale"})
if err != nil {
    log.Fatal(err)
}

for result.Next(ctx) {
    fmt.Println(result.Record().Values[0])
}

Managed transactions, bookmarks, neo4j:// routing, and typed temporal and spatial values behave the way the driver documentation says they should. Where they do not, the matrix says so.

Why we bothered

A large part of the graph world evaluates a database by pointing an existing driver at it, running existing Cypher, and seeing what breaks. That deserves a real answer instead of a compatibility adjective.

It is the same instinct behind our Jepsen work: publish the harness, publish the results, publish the parts that do not pass. All of it is open:

Get ArcadeDB 26.7.2: GitHub Releases or docker pull arcadedata/arcadedb:26.7.2

Run a Bolt workload against ArcadeDB and hit a case the matrix does not cover? Open an issue. A scenario we never thought to write is the most useful bug report we can get.

Roberto Franchini Director R&D

ArcadeDB 26.7.2: Deep Engine Audit, Neo4j Bolt Certification & 5 Critical Security Fixes

2026-07-09T00:00:00+00:00

We’re pleased to announce ArcadeDB 26.7.2, a stability and correctness release that resolves over 130 issues. Where 26.7.1 focused on Raft resilience, OpenTelemetry tracing, and standards alignment, 26.7.2 turns the spotlight on the deep internals: a full engine audit across storage, WAL recovery, transactions, LSM indexes, and concurrency. On top of that it completes Neo4j Bolt driver certification, hardens High Availability further, and ships five critical security fixes.

Major Highlights

Deep Engine Audit

The heart of this release is a systematic audit of the storage and execution engine. These are the guarantees you rely on every commit, tightened across the board:

Storage & recovery: WAL append is now the transaction’s point of no return, with pre-write validation. Torn 64KB page writes are repaired during recovery, and the page cache prevents both lost updates and unbounded growth.
LSM index stability: Compaction no longer loses re-inserted keys or leaks orphaned pages, range scans are no longer truncated by tombstoned entries, and non-unique lookups no longer resurrect deleted records.
Transaction safety: Double-indexed updates within a single transaction no longer corrupt indexes, dead-thread cleanup rolls back abandoned transactions properly, and virtual-thread compatibility is restored.
Parallel query safety: A dedicated thread pool for bucket scans prevents deadlocks, each worker gets an isolated command context, and native iterators bound producer offers and close cleanly.

These are the fixes you feel under load and during the unhappy path, not in a benchmark.

Neo4j Bolt Driver Certification

ArcadeDB now completes compatibility certification against the official Neo4j Bolt drivers:

Native temporal PackStream structures: date, time, datetime, and localdatetime.
Native Path, Duration, and Point types.
Retryable conflict mapping so the driver’s auto-retry works out of the box.
HA-aware ROUTE responses and Bolt 5.x negotiation support.
Populated write-result counters.

Security Hardening: 5 Critical Fixes

This release patches five critical vulnerabilities. Upgrading is strongly recommended for internet-facing and multi-tenant deployments:

RCE via JavaScript triggers: java.lang.* classes are removed from trigger allow-lists, which now admit benign packages only.
Arbitrary JavaScript execution: defining js functions now requires UPDATE_SECURITY with restricted polyglot access.
Secret disclosure via API: server settings now redact all hidden configuration keys, including cluster tokens.
Cross-database IDOR: authorization checks added to 14 HTTP handlers to prevent unauthorized database access.
Read-only mutation of schema: missing UPDATE_SCHEMA guards added to schema and config mutators.

High Availability (Raft) Enhancements

Durable Raft storage now defaults to true, preventing follower divergence.
The leader auto-recovers wedged replication channels.
A new TransactionCommittedRemotelyException (HTTP 409) prevents duplicate retries.
Offline bootstrap leadership-transfer now commits a baseline.

Major Fixes

Query Engine: OpenCypher

Dynamic property mutations (SET n[key] = value) are now applied.
Inline relationship filters are enforced in MATCH, exists(), comprehensions, and shortest-path queries.
FOREACH updates are visible to subsequent RETURN clauses.
Pattern comprehension and correlated subquery corrections.
Temporal normalization is applied in index scans.

Query Engine: SQL

FROM is now usable as a property name in DDL and queries.
ORDER BY on non-indexed properties returns correct results.
TRAVERSE with MAXDEPTH corrected; BREADTH_FIRST now performs true breadth-first traversal.
Map indexing and array serialization fixes.

Additional Features

User-defined functions are now persisted and distributed across HA clusters.
Configurable vector quantization for sparse vector indexes.
Commutative append-merge removes retry storms on high-degree vertices.
Extended Cypher write-counter surfacing over HTTP and gRPC.
Point datatype and spatial index support.
Python binding performance improvements.

Dependencies

Notable upgrades include Netty 4.2.16, Jackson 2.22.1, PostgreSQL JDBC 42.7.13, GraalVM 25.1.3, and Neo4j Java driver 6.2.0, plus the usual round of updates.

Breaking Changes

Two behavioral changes to note when upgrading:

Raft storage durability: arcadedb.ha.raftPersistStorage now defaults to true. Ensure the storage directory resides on durable media. Test clusters can opt out explicitly.
Bolt temporal types: temporal values are now transmitted as native PackStream structures instead of ISO-8601 strings. Clients must read native types (e.g., asZonedDateTime()) rather than strings.

Getting Started with 26.7.2

Docker

docker pull arcadedata/arcadedb:26.7.2

Visit our Docker Hub repository for more information.

Maven

<dependency>
    <groupId>com.arcadedb</groupId>
    <artifactId>arcadedb-engine</artifactId>
    <version>26.7.2</version>
</dependency>

All artifacts are available on Maven Central.

Documentation

For detailed information on features and usage, refer to our comprehensive documentation.

Compatibility Note

This release maintains 100% compatibility with previous database formats, meaning no export/import is required when upgrading. As always, we recommend creating a database backup before upgrading.

Download ArcadeDB 26.7.2 now: GitHub Releases

Thanks to everyone in the community who reported issues, opened PRs, and helped shape this release.

Luca Garulli ArcadeDB Founder

ArcadeDB Cloud Observability: OpenTelemetry Tracing, Structured Logging, and Kubernetes Health Probes

2026-07-07T00:00:00+00:00

A database in Kubernetes has to answer two questions the orchestrator keeps asking: are you alive, and are you ready? It also has to answer the one an on-call engineer asks at 2am, which is why a query that normally takes 8ms just took four seconds. ArcadeDB could answer none of them well. The Cloud Observability Architecture work fixes that with four pieces that ship independently: health probes, deeper metrics with OTLP export, structured logging with correlation IDs, and distributed tracing over OpenTelemetry.

All of it is opt-in. Touch no configuration key and your server behaves exactly as it does today.

What was already there, and what wasn’t

Single-node observability was in decent shape: Micrometer metrics with an in-memory registry, an optional /prometheus endpoint, JVM and executor-pool binders, an engine Profiler that tracks cache hit ratio, pages, WAL, transaction counts, and MVCC conflicts, a manual query profiler, and an /api/v1/ready endpoint.

The gaps only show up once you put that server in a cluster. Query, command, and transaction latency lived in the manual profiler, which means you had to go looking for it; there were no always-on RED (Rate, Errors, Duration) timers and no percentile histograms. There were no spans at all, so a request that crossed the query engine, a transaction, and a Raft replication hop was three disconnected mysteries. Logs were text-only, with nothing tying a line back to the request that produced it. And there was a readiness endpoint but no liveness check, which is the one Kubernetes reaches for first.

Instrument once, get two signals

The part of this design I like most is the one that required the least new code. Micrometer’s Observation API lets you instrument a code path once and emit either a timer or a span from it, depending on what happens to be registered at runtime. ArcadeDB already shipped Micrometer, so the core server now wraps its hot paths (HTTP request handling, query and command execution, transaction commit, and Raft replication) in Observation calls and stops there.

With no tracer registered, an Observation is a metrics-only timer: what already happened, under a new name. Install the tracing plugin and those same call sites start producing spans as well. No second instrumentation pass, no parallel set of trace annotations drifting out of sync with the metric ones. If you have ever maintained a codebase where the metrics and the traces disagree about what a “query” is, you know why this matters.
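To make the "instrument once" idea concrete, here is a toy model of it in Python. It is not Micrometer's API: ObservationRegistry, timer_handler, span_handler, and run_query are hypothetical names invented for this sketch. The only point is that the call site stays the same while the set of registered handlers decides which signals come out.

```python
import time
from contextlib import contextmanager


class ObservationRegistry:
    """Toy registry: call sites report to whatever handlers are registered."""

    def __init__(self):
        self.handlers = []

    @contextmanager
    def observe(self, name, **tags):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            # Every registered handler sees the same observation.
            for handler in self.handlers:
                handler(name, tags, elapsed)


timings = []  # what a metrics backend would aggregate
spans = []    # what a tracing backend would export


def timer_handler(name, tags, elapsed):
    timings.append((name, elapsed))


def span_handler(name, tags, elapsed):
    spans.append({"name": name, "tags": tags, "duration_s": elapsed})


registry = ObservationRegistry()
registry.handlers.append(timer_handler)  # metrics are always on


def run_query(sql):
    # The call site is instrumented once and never changes afterwards.
    with registry.observe("query", language="sql"):
        return len(sql)  # stand-in for real work


run_query("SELECT 1")                    # produces a timing only
registry.handlers.append(span_handler)   # stand-in for installing the tracing plugin
run_query("SELECT 2")                    # produces a timing and a span
```

After both calls there are two timings and one span, and run_query was never edited in between.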

Metrics: RED timers and OTLP export

Three always-on timers, with percentile histograms and SLO buckets, exposed through /prometheus and optionally through OTLP:

arcadedb.http.requests, tagged by method, path template, status, and database. The path template is deliberate: tagging by raw URI is how you blow up your cardinality budget and get a call from whoever pays the metrics bill.
arcadedb.query.duration, tagged by language, database, and query type.
arcadedb.tx.duration, tagged by database and outcome (commit or rollback).
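The cardinality argument behind the path template tag is easy to show. The sketch below is illustrative only: the ROUTES table and path_template function are hypothetical, not the server's routing code. It collapses raw request paths into a small, fixed set of tag values.

```python
import re

# Hypothetical route table: (regex over the raw path, template used as the tag value).
ROUTES = [
    (re.compile(r"^/api/v1/query/[^/]+$"), "/api/v1/query/{database}"),
    (re.compile(r"^/api/v1/command/[^/]+$"), "/api/v1/command/{database}"),
]


def path_template(raw_path):
    """Map a raw request path to a bounded tag value."""
    path = raw_path.split("?", 1)[0]  # query strings never belong in a tag
    for pattern, template in ROUTES:
        if pattern.match(path):
            return template
    return "UNMATCHED"  # a single bucket for everything else


raw_paths = [f"/api/v1/query/tenant_{i}?limit=10" for i in range(1000)]
distinct_raw = len(set(raw_paths))                              # one series per tenant
distinct_templates = len({path_template(p) for p in raw_paths})  # one series in total
```

A thousand tenants produce a thousand distinct raw paths but a single template. The database still shows up as its own tag, where its cardinality is bounded by the number of databases rather than by the number of distinct URLs.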

A new EngineMetricsBinder, modeled on the existing PoolMetrics, pulls the engine Profiler’s cache hit ratio, page and WAL statistics, MVCC conflict counts, and the database-level and sparse-vector numbers into Micrometer gauges tagged by database. Set arcadedb.serverMetrics.otlp.enabled=true and the same series push to an OTLP collector. The Prometheus scrape path is not touched.

Tracing, in a module you can leave on the shelf

Tracing lives in a separate, optional tracing module, packaged the same way the metrics module already is: its own Maven module, provided scope on the server, loaded through the ServerPlugin SPI. The OpenTelemetry SDK, micrometer-tracing-bridge-otel, and the OTLP exporter stay inside it. The core server’s compile classpath never sees them, which was a hard requirement: nobody should inherit the OTel dependency tree because they wanted a graph database.

Set arcadedb.serverMetrics.tracing.enabled=true and the plugin registers a bridged tracer into Micrometer’s global ObservationRegistry. The Observations described above start producing spans, nested from HTTP down through query execution and transaction commit. Inbound requests continue an upstream trace through the W3C traceparent header, and outbound Raft RPCs propagate context, so a write is traceable from leader to follower.
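For readers who have not looked inside a traceparent header, here is a minimal Python sketch of parsing one and deriving the header for a child hop. It follows the version 00 layout from the W3C Trace Context specification (version, 32 hex trace ID, 16 hex parent ID, 2 hex flags). The function names are made up for this example, and it is not the code the tracing module runs.

```python
import re
import secrets

# version - trace-id - parent-id - trace-flags, all lowercase hex
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the trace context carried by the header, or None if it is invalid."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are explicitly invalid in the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(context):
    """Build the header for an outbound call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes, 16 hex characters
    flags = "01" if context["sampled"] else "00"
    return f"00-{context['trace_id']}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)
outgoing = child_traceparent(context)
```

The trace ID survives every hop unchanged, which is what lets a backend stitch the leader's span and the follower's span into one trace.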

Leave the jar off the classpath or the flag unset and it is a genuine no-op. No span overhead, no registry, nothing.

Structured logging and correlation IDs

A timer tells you the p99 moved. A span tells you which hop ate the time. Neither tells you about the IOException that got swallowed and retried, and that is usually the thing you actually needed. So the logs have to join the same conversation.

At the start of each request, ArcadeDB populates a diagnostic context with the active trace and span IDs, plus a generated request ID so correlation still works when tracing is off. The context is scoped per request and cleared in a finally block, because the server hands threads back to a worker pool and a leaked MDC entry means the next request inherits someone else’s trace ID. That bug is miserable to find. The finally is not decorative.
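The leak is easy to reproduce in miniature. The Python sketch below uses a thread-local dictionary as a stand-in for the diagnostic context and a one-thread pool so that thread reuse is guaranteed. All names (handle_request, current_context, the requestId and traceId keys) are hypothetical.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor

_context = threading.local()  # stand-in for an MDC: one dict per worker thread


def current_context():
    return getattr(_context, "values", {})


def handle_request(trace_id=None):
    # Populate the context at the start of the request.
    _context.values = {"requestId": uuid.uuid4().hex}
    if trace_id is not None:
        _context.values["traceId"] = trace_id
    try:
        # Whatever is logged during the request would carry these fields.
        return dict(current_context())
    finally:
        # Without this reset, the next request on this thread inherits the old IDs.
        _context.values = {}


with ThreadPoolExecutor(max_workers=1) as pool:  # one worker, so the thread is reused
    first = pool.submit(handle_request, "4bf92f3577b34da6a3ce929d0e0e4736").result()
    between = pool.submit(current_context).result()
    second = pool.submit(handle_request).result()
```

Delete the finally block and between comes back holding the first request's trace ID, on a thread that is about to serve someone else.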

An opt-in JsonLogFormatter (arcadedb.server.logFormat=json) writes one JSON object per line: timestamp, level, logger, thread, message, trace ID, span ID, database, request ID, exception. It is built on ArcadeDB’s existing JSONObject, so it adds no JSON dependency. If you want correlation without moving to JSON, arcadedb.server.logIncludeTrace=true appends a [traceId=...] tag to the text format instead.
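As a rough picture of what one JSON object per line looks like, here is the same idea built on Python's standard logging module. It is a sketch, not JsonLogFormatter: the key spellings (traceId, spanId, requestId) are assumptions, since the post lists the fields but not their exact names.

```python
import json
import logging
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "thread": record.threadName,
            "message": record.getMessage(),
            # Correlation fields are attached to the record by the caller, if present.
            "traceId": getattr(record, "traceId", None),
            "spanId": getattr(record, "spanId", None),
            "database": getattr(record, "database", None),
            "requestId": getattr(record, "requestId", None),
            "exception": self.formatException(record.exc_info) if record.exc_info else None,
        }
        # json.dumps escapes embedded newlines, so the output stays on one line.
        return json.dumps(entry)


def format_line(message, **context):
    """Build a record by hand and format it, to show the output shape."""
    record = logging.LogRecord(
        name="arcadedb.sketch", level=logging.INFO, pathname="sketch.py",
        lineno=0, msg=message, args=None, exc_info=None,
    )
    for key, value in context.items():
        setattr(record, key, value)
    return JsonLineFormatter().format(record)


line = format_line("commit failed\nretrying", traceId="4bf92f3577b34da6a3ce929d0e0e4736",
                   requestId="r-1", database="orders")
```

Because every field is a key rather than a position in a text pattern, a log pipeline can filter on traceId without a regular expression.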

Health probes for Kubernetes

This is the smallest change in the whole set and probably the one most people will use: GET /api/v1/health. It is deliberately cheap. No database I/O, no auth, returns 200 as long as the process and the HTTP layer are up.

Cheapness is the entire point. Liveness must not depend on database readiness. Wire a readiness-style check into livenessProbe and Kubernetes will helpfully kill a node that is still replaying its WAL, then kill the replacement for the same reason, and you will spend an afternoon reading kubelet events before you work out that your health check is the outage.

/api/v1/ready is unchanged by default. It gains one optional behavior: with arcadedb.server.readinessRequiresHA=true and HA active, readiness reports false until the node has joined the Raft group and caught up. That gives clustered deployments a readinessProbe that means something.

livenessProbe:
  httpGet:
    path: /api/v1/health
    port: 2480
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /api/v1/ready
    port: 2480
  initialDelaySeconds: 5
  periodSeconds: 5
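The difference between the two probes can be modelled in a few lines of Python. This is a simplified model under stated assumptions, not the server's handler code: NodeState and its fields are hypothetical, and server_online stands in for whatever the default readiness check already verifies.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    server_online: bool = False      # hypothetical: startup and WAL replay finished
    ha_active: bool = False
    joined_raft_group: bool = False
    caught_up: bool = False


def is_live():
    # Reaching this function at all proves the process and the HTTP layer are up.
    # It deliberately looks at no database state.
    return True


def is_ready(state, readiness_requires_ha=False):
    if not state.server_online:
        return False
    # The HA condition only applies when the flag is set and HA is active.
    if readiness_requires_ha and state.ha_active:
        return state.joined_raft_group and state.caught_up
    return True


replaying = NodeState()                                   # still starting up
joining = NodeState(server_online=True, ha_active=True)   # up, not yet in the Raft group
member = NodeState(server_online=True, ha_active=True,
                   joined_raft_group=True, caught_up=True)
```

A node that is still replaying is live but not ready, so Kubernetes keeps traffic away from it without restarting it.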

Configuration reference

Every new key defaults to off, or to what the server does today:

Key	Default	Effect
`arcadedb.serverMetrics.otlp.enabled`	`false`	Register an OTLP metrics registry alongside Prometheus
`arcadedb.serverMetrics.otlp.endpoint`	`http://localhost:4317`	OTLP metrics endpoint
`arcadedb.serverMetrics.tracing.enabled`	`false`	Activate the tracing plugin (no-op if the jar isn’t present)
`arcadedb.serverMetrics.tracing.endpoint`	`http://localhost:4317`	OTLP trace endpoint
`arcadedb.serverMetrics.tracing.samplingRate`	`0.0`	Parent-based sampling ratio
`arcadedb.server.logFormat`	`text`	`json` selects the structured `JsonLogFormatter`
`arcadedb.server.logIncludeTrace`	`false`	Append `[traceId=...]` to text-mode logs
`arcadedb.server.readinessRequiresHA`	`false`	Make `/api/v1/ready` HA-aware (Raft membership and catch-up)
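The samplingRate row deserves a note, because "parent-based" changes what a ratio of 0.0 means. The Python sketch below shows the general idea of a parent-based ratio sampler: follow the upstream decision when there is one, otherwise decide deterministically from the trace ID. It is written from the concept, not from the OpenTelemetry SDK source, and should_sample is a hypothetical name.

```python
def should_sample(trace_id_hex, ratio, parent_sampled=None):
    """Decide whether to record a span for this trace."""
    # A parent decision always wins, so a trace is never half-recorded.
    if parent_sampled is not None:
        return parent_sampled
    # Root spans: no parent, so the ratio applies.
    if ratio <= 0.0:
        return False
    if ratio >= 1.0:
        return True
    # Deriving the decision from the trace ID keeps it deterministic:
    # the same trace ID gives the same answer wherever it is evaluated.
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < int(ratio * (1 << 64))
```

Under this model, a ratio of 0.0 starts no new traces, but a request arriving with a sampled traceparent is still traced end to end.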

On not breaking anything

The whole design was built against one constraint: with no new configuration, an upgraded deployment behaves exactly as it did before. Existing endpoints and response shapes (/prometheus, /api/v1/server, /api/v1/ready) were added to, never modified. No configuration key was renamed. No new mandatory dependency lands on the core server, which is why the OpenTelemetry SDK sits behind the SPI in its own module rather than in the server pom.xml where it would have been considerably easier to put it.

Upgrade whenever you like, then turn things on one flag at a time.

Pairing it with Grafana

If you already run the ArcadeDB Grafana plugin for BI dashboards over SQL, Cypher, and Gremlin, the OTLP metrics and traces drop into the same stack. Point an OpenTelemetry Collector at the server, fan out to Prometheus and Tempo (or whatever your vendor of the month is), and you can go from a p99 spike on a dashboard to the trace behind it to the log line that explains it, without leaving the cluster.

ArcadeDB 26.7.1: Raft Hardening, OpenTelemetry Tracing, BM25 & GQL

2026-07-01T00:00:00+00:00

We’re pleased to announce ArcadeDB 26.7.1, a large stability, resilience, and security release with 420+ commits resolving 238 issues. Where 26.6.1 added encrypted HA clusters and a durability hardening pass, 26.7.1 goes deeper into cluster resilience: the Raft engine now degrades gracefully instead of halting, diverged followers heal themselves, and the whole system is far easier to observe in production thanks to OpenTelemetry distributed tracing. On top of that come native BM25 full-text scoring, ISO GQL standards alignment, and another broad security sweep.

Major Highlights

Raft & High Availability Hardening

The headline of this release is resilience. A cluster should survive a bad entry, a network blip, or a diverged replica without operator intervention:

A failed apply on one database no longer halts the entire node. The affected database is put into per-database quarantine while every other database on the node keeps serving traffic.
Diverged followers can now self-recover back to the leader’s state instead of getting stuck.
Snapshot integrity is strengthened with fsync on write and manifest verification, so a follower never installs a truncated or corrupt snapshot.
Membership changes are applied as atomic deltas under quorum protection, closing the classic split-brain and lost-quorum windows.
Silent write loss on replication timeouts has been eliminated: a write that cannot be safely replicated is reported, not dropped.
TRUNCATE TYPE and TRUNCATE BUCKET are now HA-safe, follower index correlation is fixed by stable peer-id, and a stalled replica stuck at matchIndex=-1 now auto-recovers.

These are the fixes you feel in production during the unhappy path, not in a benchmark.

OpenTelemetry Distributed Tracing & Structured Logging

Observability is now first-class. A new OpenTelemetry module provides end-to-end distributed tracing with optional OTLP export, so you can follow a request across the wire protocols, the query engine, and the storage layer in your existing tracing backend. Alongside it:

Structured JSON logging with per-request correlation IDs, so a single request is trivial to reconstruct across logs and traces.
Enhanced metrics with RED timers (Rate, Errors, Duration) and engine gauges.

Native BM25 Full-Text Scoring

Full-text search now ships a native BM25 scoring implementation with field boosts and caret (^) boost syntax, so relevance ranking is tunable per field. EXPLAIN and PROFILE now cover search operations too, making it easy to see how a full-text query is planned and executed.
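For anyone who wants the scoring model in front of them, here is a compact Python sketch of textbook Okapi BM25 over a single field. It is not ArcadeDB's implementation and it leaves out field boosts. The defaults k1=1.2 and b=0.75 and the non-negative IDF variant are common conventions, assumed here rather than taken from the release.

```python
import math


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            tf = doc.count(term)
            if tf == 0:
                continue
            containing = sum(1 for d in docs if term in d)
            # Rare terms weigh more; the +1 inside the log keeps the IDF non-negative.
            idf = math.log(1 + (n_docs - containing + 0.5) / (containing + 0.5))
            # Term frequency saturates (k1) and long documents are discounted (b).
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    ["graph", "database"],
    ["graph", "graph", "engine"],
    ["vector", "index"],
]
scores = bm25_scores(["graph"], docs)
```

A per-field boost, in this picture, would be a multiplier applied to one field's contribution before the per-field scores are summed.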

ISO GQL Standards Alignment

ArcadeDB continues moving toward the ISO GQL graph query standard:

Session management statements: SESSION SET / SESSION RESET / SESSION CLOSE.
Transaction control statements: START TRANSACTION / COMMIT / ROLLBACK.
Strict numeric types for standards-compliant arithmetic and comparisons.

Security Hardening

Another broad security sweep tightens the defaults for multi-tenant and internet-facing deployments:

Polyglot scripting now requires administrative privileges.
Centralized gRPC authorization enforcement across all endpoints.
Additional path-traversal protection hardening.
DoS bounding on gRPC result materialization.
Cluster-management endpoints are now restricted to root access.

Major Fixes

Vector Engine

New import/interop formats: MATLAB, MATLAB_COLUMN, JULIA, and NUMPY.
New helper functions: asVector(), asSparse(), and vectorDequantizeBinary.
Improved RRF (Reciprocal Rank Fusion) array input handling.

MongoDB Protocol

Added update, delete, and createIndexes commands.
SASL PLAIN authentication support.
Improved find command data handling.

Kubernetes Operations

Health probe endpoints for liveness, readiness, and startup.
Auto-acquire capability for databases the node has not seen yet.
StatefulSet scale-up support beyond a static peer list.

SQL & Queries

Map/collection key removal now persists correctly in UPDATE.
Improved index selection for composite-key queries.
Three-valued logic for NULL operands in IN / NOT IN.
Correct mixed-type numeric handling in GROUP BY and aggregation functions.
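Three-valued logic for IN is the item here that most often surprises people, so here is a small Python model of the standard SQL rules, with None standing in for both NULL and UNKNOWN. It illustrates the standard semantics. Whether every edge case matches ArcadeDB's behavior is not something the release notes spell out.

```python
def sql_in(value, candidates):
    """Return True, False, or None (UNKNOWN) for `value IN (candidates)`."""
    if value is None:
        # NULL compared with anything is UNKNOWN; an empty list has nothing to compare.
        return None if candidates else False
    saw_null = False
    for candidate in candidates:
        if candidate is None:
            saw_null = True
        elif candidate == value:
            return True  # a definite match beats any NULL in the list
    # No match: a NULL in the list might have matched, so the answer is UNKNOWN.
    return None if saw_null else False


def sql_not_in(value, candidates):
    """NOT flips TRUE and FALSE and leaves UNKNOWN as UNKNOWN."""
    result = sql_in(value, candidates)
    return None if result is None else not result
```

The practical consequence: a WHERE clause keeps only rows where the predicate is TRUE, so `x NOT IN (1, NULL)` filters out every row, including those where x is 2.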

Storage & Recovery

Schema recovery from the backup file when the primary schema is corrupted.
Corrupt hash-index detection and recovery.
Time-series negative-timestamp alignment fixes.
Page I/O locking improvements.

Wire Protocols

gRPC: InsertStream now terminates correctly after a commit failure.
DATETIME precision is preserved across microsecond/nanosecond formats.
OpenCypher now tolerates dangling edges.
Improved client-side nested type hydration.

Dependencies

Notable upgrades include Netty 4.2.15, Lucene 10.5.0, OpenTelemetry BOM 1.63.0, Jackson Databind 2.22.0, and JVector 4.0.0-rc.8, plus the usual round of Studio frontend, e2e harness, and CI updates.

Getting Started with 26.7.1

Docker

docker pull arcadedata/arcadedb:26.7.1

Visit our Docker Hub repository for more information.

Maven

<dependency>
    <groupId>com.arcadedb</groupId>
    <artifactId>arcadedb-engine</artifactId>
    <version>26.7.1</version>
</dependency>

All artifacts are available on Maven Central.

Documentation

For detailed information on features and usage, refer to our comprehensive documentation.

Compatibility Note

This release maintains 100% compatibility with previous database formats, meaning no export/import is required when upgrading. As always, we recommend creating a database backup before upgrading.

Download ArcadeDB 26.7.1 now: GitHub Releases

Thanks to everyone in the community who reported issues, opened PRs, and helped shape this release.

Luca Garulli, ArcadeDB Founder

ArcadeDB 26.6.1: TLS for HA Clusters, Durability Hardening & Security

2026-06-03T00:00:00+00:00

We’re pleased to announce ArcadeDB 26.6.1, a stability, durability, and security focused release with 280+ commits resolving 66 issues. Where 26.5.1 was about new retrieval features, 26.6.1 is about making the engine harder to break: encrypted HA clusters, crash-safe durability, and a broad security hardening pass, on top of a long list of OpenCypher, SQL, vector, and wire-protocol fixes.

Major Highlights

TLS/SSL Across the HA Cluster

The Raft-based High Availability cluster can now run fully encrypted. Inter-node replication traffic supports SSL/TLS, and the snapshot installer was fixed so a follower can download a leader snapshot over the HTTPS listener instead of failing with Unsupported or unrecognized SSL message. Encrypted clustering is now a first-class deployment option for regulated and zero-trust environments.

Durability & Crash-Recovery Hardening

A large batch of fixes closes data-integrity gaps across the storage, WAL, and serialization layers, so committed transactions survive crashes and power loss, and recovery never silently drops data:

The WAL is fsynced on commit by default, and data files are fsynced before WAL files are deleted on a clean close.
Crash recovery now aborts when it finds a WAL version gap and preserves the WAL files, instead of silently skipping the gap.
MutablePage.move no longer mis-tracks the modified range on backward shifts, so defragmentation bytes are never omitted from the WAL.
Binary serialization now writes a property count that matches the bytes written, and handles partial reads via readFully.
Short-write / short-read returns are respected in the paginated component file.
LZ4 compression no longer corrupts data when the source buffer position is non-zero.
The Simple-8b codec no longer silently truncates Long.MAX_VALUE / Long.MIN_VALUE.
migratedFileIds is persisted in schema.json, so compaction no longer silently drops in-flight transactions across a restart.
A NegativeArraySizeException on transaction commit was fixed.
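The ordering in the first item is the part worth internalizing, so here is the pattern reduced to a few lines of Python. This is a generic write-ahead sketch, not ArcadeDB's WAL format or code: the function names and file layout are invented for the example.

```python
import os
import tempfile


def fsync_write(path, data, mode):
    """Write bytes and do not return until the OS has been asked to persist them."""
    with open(path, mode) as handle:
        handle.write(data)
        handle.flush()             # move Python's buffer into the OS page cache
        os.fsync(handle.fileno())  # ask the OS to push its cache to the device


def commit(wal_path, record):
    # The commit is only acknowledged after the WAL record has been fsynced.
    fsync_write(wal_path, record, "ab")


def clean_close(data_path, wal_path):
    with open(wal_path, "rb") as wal:
        pending = wal.read()
    fsync_write(data_path, pending, "ab")  # data must be durable first...
    os.remove(wal_path)                    # ...and only then may the WAL go away


with tempfile.TemporaryDirectory() as tmp:
    wal_file = os.path.join(tmp, "tx.wal")
    data_file = os.path.join(tmp, "data.bin")
    commit(wal_file, b"put k=v\n")
    clean_close(data_file, wal_file)
    with open(data_file, "rb") as handle:
        recovered = handle.read()
    wal_gone = not os.path.exists(wal_file)
```

Reverse the last two steps in clean_close and a power cut between them leaves neither a WAL nor durable data, which is the class of gap this batch of fixes closes.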

These are the kind of fixes you never see in a benchmark but feel in production: the database does what it promised on the unhappy path.

Security Hardening

All schema mutators now require the UPDATE_SCHEMA permission (previously only createProperty was gated).
IMPORT DATABASE now validates its source and requires admin privilege, closing SSRF and local-file-inclusion vectors.
SQL injection in RemoteVertex.newEdge was fixed by switching to parameter binding (which also fixes breakage on apostrophes).
JavaScript injection in the polyglot engine was closed by replacing a “looks-like-JSON” source-concatenation heuristic with a safe Value.execute() call.
A full CodeQL cleanup resolved open Java and JavaScript code-scanning alerts at their true sources (workflow permissions, ReDoS, path-injection).

Major Fixes

High Availability & Clustering

TimeSeries data now replicates correctly across an HA cluster, and a compaction/append deadlock that caused a WAL version gap on Raft followers was eliminated.
Concurrent single-row time-series INSERTs no longer silently lose samples.
Bolt writes to a follower no longer fail with “no authenticated user in the current security context”.
PeerAddressAllowlistFilter no longer rejects legitimate peers during a Kubernetes DNS-resolution race on startup or restart.
New configurable paths for read-only and containerized deployments: arcadedb.ha.raftStorageDirectory, a configurable server log directory, and arcadedb.ha.clusterTokenPath to read the cluster shared secret from a file.
RemoteDatabase no longer reuses a session id across servers on HA failover during an open transaction; a clear TransactionException is raised on server switch instead.
New STICKY strategy pins HTTP transactions to a concrete cluster member.
/api/v1/server?mode=cluster returns the ha section again after the Raft migration.
New “Force Resync” button in Studio to recover a diverged follower from the leader.

OpenCypher

CREATE INDEX now implicitly creates the referenced property (Neo4j-style lazy schema).
nodes(), relationships(), and length() on variable-length path patterns (e.g. [*1..3]) are now implemented.
Records written via SQL are now visible to subsequent Cypher queries (and vice versa) within the same transaction.
EXPLAIN no longer fails with an idempotency error on a multi-statement query containing CREATE.
Label disjunction (n:A|B) no longer returns zero rows.
allShortestPaths() returns all co-shortest paths instead of just one.
MERGE uses a bound anchor as the traversal start instead of a full edge-type scan, and no longer crashes on single-quote property values or rebinds variables from an OPTIONAL MATCH null endpoint.
DATETIME comparison with datetime() no longer returns zero rows, and results are now consistent between parameterized and hard-coded values.
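The allShortestPaths() fix is easier to appreciate with the algorithm in view. The Python sketch below is one standard way to enumerate every co-shortest path on an unweighted graph: a breadth-first search that remembers all predecessors on a shortest route, then walks back from the target. It is a generic illustration, not the query engine's implementation.

```python
from collections import deque


def all_shortest_paths(graph, source, target):
    """Return every shortest path from source to target as a sorted list of node lists."""
    distance = {source: 0}
    predecessors = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break  # every node one level closer has already been expanded
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                predecessors[neighbor] = [node]
                queue.append(neighbor)
            elif distance[neighbor] == distance[node] + 1:
                # Another equally short way to reach this neighbor: keep it too.
                predecessors[neighbor].append(node)
    if target not in distance:
        return []

    def build(node):
        if node == source:
            return [[source]]
        return [path + [node] for pred in predecessors[node] for path in build(pred)]

    return sorted(build(target))


diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
paths = all_shortest_paths(diamond, "a", "d")
```

Keeping only the first predecessor is exactly the bug that returns one path instead of all of them.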

SQL

IN :param with a collection parameter now returns rows when an index is used.
MOVE VERTEX no longer generates an internal error.
expand() projection honors its AS alias instead of always being named value.
IN (SELECT …) no longer always returns empty.
MERGE on a UNIQUE-indexed property no longer throws on a duplicate key when the same key appears twice in a batch (matching Neo4j semantics).
node.* and rel.* functions no longer silently return null from SQL.
TimeSeries timestamps are now returned in queries.
New cypherRID() SQL function and asCypherRID() method for interoperating with Cypher numeric ids.

Vector & Index

TRUNCATE TYPE no longer resets an LSM_VECTOR index dimension to 0, nor leaves UNIQUE indexes in an inconsistent state.
LSMVectorIndex now converts JVector’s EUCLIDEAN return to L2² distance in all search paths, so K-NN no longer returns the worst matches first.
REBUILD INDEX now works for BY ITEM indexes.
vector.fuse() is now recognized as a SQL function.
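The K-NN ordering bug comes down to a similarity being read as a distance. The Python sketch below assumes the Euclidean similarity has the form 1 / (1 + d²), a common normalization in vector libraries. That formula is an assumption of this example rather than something stated in the release notes, and the function name is hypothetical.

```python
def similarity_to_l2_squared(similarity):
    """Invert s = 1 / (1 + d^2) to recover the squared L2 distance d^2."""
    if not 0.0 < similarity <= 1.0:
        raise ValueError("similarity must be in (0, 1]")
    return 1.0 / similarity - 1.0


# Higher similarity means closer, so sorting raw similarities ascending
# as if they were distances puts the farthest vector first.
similarities = {"near": 0.5, "far": 0.2, "exact": 1.0}
wrong_order = sorted(similarities, key=lambda name: similarities[name])
right_order = sorted(similarities, key=lambda name: similarity_to_l2_squared(similarities[name]))
```

After the conversion an ascending sort means nearest first again, and the value reported to the caller is an actual squared distance.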

Wire Protocols

Bolt: parameterized Cypher MATCH queries via the JavaScript neo4j-driver now work; integer property values are no longer coerced to strings after CREATE INDEX.
PostgreSQL: scalar columns are advertised with native OIDs.
gRPC: correct exceptions (NOT_FOUND for missing records), proper LocalDateTime / LocalDate handling, and InsertStream no longer rolls back a whole stream on a commit-time duplicate with CONFLICT_IGNORE.
HTTP: DuplicatedKeyException now returns 409 Conflict instead of 503 Service Unavailable.

Studio & Operations

Optional production-mode Studio, enabled by a global setting on request.
New show/hide toggle for the Appearance section in the graph side panel.
AI assistant flow, database selection, and layout improvements; query profiler “Analyze with AI”; refreshed server and profiler metrics.
New offline build mode for the distribution builder.

Dependencies

Notable upgrades include Netty 4.2.14.Final, Undertow 2.4.1.Final, Protobuf 4.35.0, JLine 4.1.3, JUnit Jupiter 6.1.0, Jackson Databind 2.21.4, Apache Commons Configuration 2.15.1, Swagger 2.2.50, SLF4J 2.0.18, and Logback 1.5.33, plus the usual round of Studio frontend, e2e harness, and CI updates.

Getting Started with 26.6.1

Docker

docker pull arcadedata/arcadedb:26.6.1

Visit our Docker Hub repository for more information.

Maven

<dependency>
    <groupId>com.arcadedb</groupId>
    <artifactId>arcadedb-engine</artifactId>
    <version>26.6.1</version>
</dependency>

All artifacts are available on Maven Central.

Documentation

For detailed information on features and usage, refer to our comprehensive documentation.

Compatibility Note

This release maintains 100% compatibility with previous database formats, meaning no export/import is required when upgrading. As always, we recommend creating a database backup before upgrading.

Download ArcadeDB 26.6.1 now: GitHub Releases

Thanks to everyone in the community who reported issues, opened PRs, and helped shape this release.

Luca Garulli, ArcadeDB Founder

Deploy an ArcadeDB Cluster on Kubernetes with the Official Helm Chart

2026-05-13T00:00:00+00:00

Running ArcadeDB as a single container is easy. Running it as a replicated service on a Kubernetes cluster used to mean writing a fair amount of YAML and reading the HA docs twice. With the official arcadedb-helm chart, it now takes one command.

In this post I walk through the chart, show how to bring up a three-node HA cluster, and point at the companion arcadedb-deployments repository if you want a runnable local example before touching your production cluster.

Why Run ArcadeDB on Kubernetes

ArcadeDB is built around an embedded engine that scales vertically very well. What you get from Kubernetes is the operational layer: rolling upgrades, persistent volumes, automatic restarts when a node dies, horizontal scale for read-heavy workloads, and replication across availability zones.

The Helm chart wraps that into a StatefulSet with stable network identities, a headless service for peer discovery, and probes wired to the /api/v1/ready endpoint. When replicaCount is greater than 1, the chart turns on Raft consensus across the pods. No extra flags, no manual peer lists.

What the Helm Chart Gives You

The chart lives under charts/arcadedb and is published on Artifact Hub. The current chart version is 26.4.2, the same as the ArcadeDB engine version it deploys.

The defaults are sensible. You get a StatefulSet with stable pod names (arcadedb-0, arcadedb-1, …) and ordered rollout, a headless service so each pod resolves its peers via DNS (arcadedb-0.arcadedb.default.svc.cluster.local), and a PersistentVolumeClaim template (8Gi ReadWriteOnce by default) mounted at /home/arcadedb/databases. Liveness and readiness probes hit /api/v1/ready.

Security is also taken care of: the pod runs as non-root UID/GID 1000, all Linux capabilities are dropped, privilege escalation is disabled, and the ServiceAccount token is unmounted because the database does not call the Kubernetes API. A NetworkPolicy can lock the Raft gRPC port down to ArcadeDB pods only, and there is HorizontalPodAutoscaler support that pre-sizes the Raft peer list to maxReplicas so scale-out joins are clean.

The whole chart is small enough to read in a single sitting, which I recommend before you push it to production.

Prerequisites

You need a Kubernetes cluster (1.27 or newer is fine), Helm 3.16 or newer, kubectl pointed at the target cluster, and a storage class that supports ReadWriteOnce. The defaults on EKS, GKE, AKS, and DigitalOcean all work. For local experimentation, kind 0.24 or newer is enough.

The 30-Second Install

helm repo add arcadedb https://helm.arcadedb.com/
helm repo update
helm install my-arcadedb arcadedb/arcadedb

That is it. You now have a single-pod ArcadeDB with a persistent volume and a ClusterIP service.

Port-forward to reach Studio:

kubectl port-forward svc/my-arcadedb 2480:2480

Open http://localhost:2480 in your browser. Done.

For a dev box, a CI fixture, or a smoke test, this is enough. Anything user-facing needs more.

Production Values: a Three-Node HA Cluster

For the multi-node setup, drop the following into a values.yaml:

replicaCount: 3

image:
  repository: arcadedata/arcadedb
  tag: "26.4.2"
  pullPolicy: IfNotPresent

arcadedb:
  rootPassword:
    secret:
      name: arcadedb-credentials
      key: rootPassword

persistence:
  enabled: true
  size: 50Gi
  storageClass: "fast-ssd"

resources:
  requests:
    cpu: "1"
    memory: "4Gi"
  limits:
    cpu: "2"
    memory: "8Gi"

service:
  type: ClusterIP

ingress:
  enabled: true
  className: "nginx"
  hosts:
    - host: arcadedb.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: arcadedb-tls
      hosts:
        - arcadedb.example.com

networkPolicy:
  enabled: true

Create the credentials secret separately, so the password never lives in your Helm values or your Git history:

kubectl create secret generic arcadedb-credentials \
  --from-literal=rootPassword='choose-something-strong'

Then install (or upgrade) the chart:

helm upgrade --install arcadedb arcadedb/arcadedb \
  --namespace arcadedb --create-namespace \
  -f values.yaml --wait --timeout 10m

With replicaCount: 3, the chart wires the StatefulSet for Raft HA. Each pod gets its own PVC, joins the cluster through the headless service, and the three-node quorum elects a leader.

Verifying the Cluster

Watch the pods come up:

kubectl -n arcadedb get pods -w

You should see arcadedb-0, arcadedb-1, and arcadedb-2 reach Running in order. Once all three are ready, ask the cluster who is in charge:

kubectl -n arcadedb port-forward svc/arcadedb 2480:2480 &
curl -u root:choose-something-strong http://localhost:2480/api/v1/server | jq .ha

The response includes the current leader, the list of replicas, and the network status of each peer. If you see three online servers and one of them flagged as leader, you have a working HA cluster.

To prove the failover works, delete the leader pod and watch the cluster re-elect:

kubectl -n arcadedb delete pod arcadedb-0
kubectl -n arcadedb get pods -w

The remaining nodes hold quorum, a new leader is elected within seconds, and Kubernetes brings the missing pod back. Its PVC is reattached, the data is intact, and it rejoins the Raft group as a follower.

Try It Locally First: the arcadedb-deployments Repo

Before opening a PR against your platform team’s repo, run the thing end-to-end on your laptop. The arcadedb-deployments repository has a ready-to-run example under kubernetes/.

The start.sh script creates a kind cluster named arcadedb, runs helm dependency update, installs the chart with --wait, applies a 3-replica values.yaml and the credentials secret, waits for /api/v1/ready to respond on every pod, and sets up a background kubectl port-forward to http://localhost:2480. test.sh then drives an end-to-end smoke test against the cluster.

Clone, run, done:

git clone https://github.com/ArcadeData/arcadedb-deployments.git
cd arcadedb-deployments/kubernetes
./start.sh
./test.sh

When you are finished:

./stop.sh

It is the fastest way to convince yourself (or your team) that the chart behaves the way you expect. Same chart, same values shape, same probes, smaller cluster.

The same repository ships an ha-cluster/ scenario built on Docker Compose if you want to compare the same topology without Kubernetes in the picture.

Operating the Cluster

A few practical notes for day-two operations.

Upgrades

Bump the chart and image tag together, then helm upgrade. The StatefulSet rolls pods one at a time, the readiness probe gates each step, and Raft tolerates the missing follower throughout. Always upgrade in a non-production environment first to validate the engine version.

Scaling

To scale out, increase replicaCount and run helm upgrade. New pods come up, join the Raft group as followers, and start serving reads.

Scale-down needs more care. Never drop below the quorum size of your current cluster, and always remove pods one at a time. Three or five nodes covers most workloads. Seven is the upper end before the Raft commit cost outweighs the redundancy you get back.
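The quorum arithmetic behind those numbers is short enough to keep next to your values.yaml. This is the standard Raft majority rule, written as two small Python helpers with hypothetical names.

```python
def quorum(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """How many nodes can be lost while a majority is still reachable."""
    return cluster_size - quorum(cluster_size)


# size -> (quorum, failures tolerated)
sizing = {n: (quorum(n), tolerated_failures(n)) for n in (3, 4, 5, 7)}
```

Note the row for four nodes: it needs a quorum of three and still tolerates only one failure, the same as three nodes, which is why odd cluster sizes are the usual recommendation.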

Backups

ArcadeDB has built-in automatic database backups. On Kubernetes, point the backup directory at a separate volume (or a CSI driver that snapshots to object storage) so backup data lives outside the database PVC. Take the snapshot at the leader to get a consistent view.

Observability

The chart exposes the standard ArcadeDB metrics on the HTTP port. Scrape them with your existing Prometheus stack and alert on Raft leader changes, replication lag, and PVC capacity.

Security

Change the default root password. Always. Use a Secret, never --set it on the command line. Enable the included NetworkPolicy to keep the Raft port internal to the namespace. If you expose Studio publicly, put it behind your usual ingress, OIDC proxy, or VPN.

Where to Go Next

arcadedb-helm: chart source, values reference, and CI tests
arcadedb-deployments: runnable Kubernetes and Docker Compose examples
ArcadeDB HA Cluster docs: how Raft replication works under the hood
ArcadeDB Academy: free courses, including hands-on labs

If something does not work the way this post describes, open an issue on the chart repo. PRs are welcome too. The chart is actively maintained, the CI pipeline lints every change, and the helm-unittest suite already covers most templates.

ArcadeDB 26.5.1: Sparse Vector Index, Hybrid Retrieval & INT8 End-to-End

2026-05-11T00:00:00+00:00

We’re excited to announce ArcadeDB 26.5.1, a major release with 270+ commits resolving 128 issues. The headline feature is a brand-new sparse vector index with server-side hybrid retrieval and INT8 quantization end-to-end, alongside extensive OpenCypher correctness improvements and query partitioning.

Major New Features

Sparse Vector Index & Hybrid Retrieval

The new LSM_SPARSE_VECTOR index type enables sparse-embedding retrieval (BM25/SPLADE-style) directly inside ArcadeDB.

Dense vectors capture semantic meaning across every dimension; sparse vectors capture exact keyword signals across a much larger vocabulary, with most positions empty. ArcadeDB 26.5.1 supports both, and can fuse them server-side.
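A sparse vector is usually stored as a map from dimension index to weight, and its similarity is a dot product over the indices both sides share. The Python sketch below shows that representation in its simplest form. The dictionary layout and the sparse_dot name are illustrative, not the index's on-disk format or API.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)


# A vocabulary of tens of thousands of terms, but only a handful of non-zero weights.
query = {17: 1.0, 30521: 2.0}
document = {17: 0.5, 812: 0.7, 30521: 1.5}
score = sparse_dot(query, document)  # 1.0 * 0.5 + 2.0 * 1.5
```

The cost depends on the number of non-zero entries, not on the size of the vocabulary, which is what makes very high-dimensional sparse embeddings practical.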

Highlights:

vector.fuse(...) performs server-side result fusion using RRF, DBSF, and LINEAR strategies, so dense + sparse + lexical scores can be combined without round-trips to the client.
vector.neighbors(...) supports groupBy / groupSize options for diversified retrieval with nested-field grouping.
WAND / BlockMax-WAND dynamic pruning scales sparse retrieval to 100M+ documents.
Sparse-vector partitioning allows sharding by tenant or domain.
New reranker SQL functions enable two-stage retrieval pipelines.
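Of the three fusion strategies, RRF is the simplest to reason about because it ignores raw scores and uses only ranks. Here is the textbook formula in Python, with the commonly used constant k=60. This sketches the technique itself, not vector.fuse(...) and not its parameters.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents absent from a list get nothing.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because only ranks matter, the dense and sparse retrievers can use completely different score scales without any normalization step.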

INT8 Quantization for Dense Vectors

INT8 is now supported end-to-end throughout the dense vector pipeline, which dramatically reduces disk usage and resident memory (RSS) by avoiding the FP32 path entirely. A shared 8-bit representation now flows across ingest, storage, and query.
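To give a feel for what an 8-bit representation trades away, here is a generic symmetric scalar quantizer in Python. The release notes do not describe the quantization scheme ArcadeDB uses, so treat this purely as an illustration of the general technique. The function names are hypothetical.

```python
def quantize_int8(vector):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    largest = max(abs(x) for x in vector)
    scale = largest / 127.0 or 1.0  # an all-zero vector gets a harmless scale of 1.0
    return [round(x / scale) for x in vector], scale


def dequantize_int8(codes, scale):
    return [code * scale for code in codes]


original = [0.82, -0.31, 0.0, 0.05]
codes, scale = quantize_int8(original)
restored = dequantize_int8(codes, scale)
# Rounding to the nearest code bounds the per-component error by half a step.
max_error = max(abs(x - y) for x, y in zip(original, restored))
```

Each component shrinks from four bytes to one, at the price of an error of at most half a quantization step per component.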

EXTERNAL Property Storage

A new paired-bucket layout moves heavy property values (vectors, large strings, JSON) to a separate external bucket. The main bucket stays compact, scans stay hot, and large payloads are loaded lazily, only when the row is actually read. The result: significantly cheaper scans on wide records.

Query Partitioning

A partition-aware planner now prunes unnecessary partitions from SQL and Cypher execution plans, with integrity safeguards for partitioned types.

High Availability: Offline Cluster Bootstrap

Fresh HA clusters can now initialize from pre-seeded databases via snapshot-and-restore, eliminating the need for full dataset re-replication when expanding or rebuilding a cluster.

Production-Ready Helm Chart

The Helm chart has been reworked to align with the Raft-based HA subsystem introduced in 26.4.2, and is now suitable for production deployments.

Cypher Administrative Commands

Standard administrative commands SHOW INDEXES and SHOW CONSTRAINTS are now supported in OpenCypher.

SQL: FIND REFERENCES

The OrientDB-compatible FIND REFERENCES command is back, making it easy to locate all records pointing to a given RID, which is particularly useful for migrations from OrientDB.

C# End-to-End Testing

A new C# test suite validates ArcadeDB over the PostgreSQL wire protocol via Npgsql and Testcontainers on every build.

Studio Enhancements

Full-screen graph view mode
Clear query button / textbox
Session reset on token expiration
Persistent error message display
Query history no longer auto-submits
Inherited indexes now visible
HA cluster peer add / remove controls
Human-readable peer names in HA_SERVER_LIST

Major Fixes

OpenCypher Correctness

This release lands an extensive batch of OpenCypher fixes across pattern matching, write clauses, subqueries, and temporal expressions. Among the highlights:

valueType(...) now reports the NOT NULL suffix for non-null values.
point(...) WGS-84-3D exposes .height as a .z alias.
CALL ... YIELD preserves carried WITH variables.
Variable-length patterns no longer re-traverse previously bound relationships.
MERGE with an unbound label-only endpoint creates fresh nodes appropriately.
SET correctly propagates across all aliases for the same node.
Self-referential property updates remain idempotent across row fanout.
Temporal component access on date/datetime values now works correctly.
EXISTS { ... } subqueries correctly evaluate outer-variable expressions.
MATCH immediately after CREATE now sees newly created labeled nodes.
MERGE ... ON MATCH SET returns post-update property values.
MATCH on parent edge types matches sub-typed edges (polymorphic traversal).
shortestPath / allShortestPaths with variable-length alternation match correctly.
WHERE false literal predicates are no longer ignored.

…plus dozens more. See the full release notes for the complete list.

SQL

CONTAINSALL compares lists of Identifiables against RID strings correctly.
Correlated COLLECT { ... } / COUNT { ... } subqueries evaluate with outer-variable access.
SEARCH_INDEX and SEARCH_FIELDS propagate return values in filters and handle wildcards properly.
SELECT with a non-unique LSM index returns rows after partial deletes.
Edge creation with CONTENT no longer ignores properties.
algo.dijkstra yields correct weight calculations.
UPDATE EDGE SET @in / @out correctly rewires vertex edge lists.
point.withinBBox(...) supports cross-meridian bounding boxes.
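The bounding-box fix is easiest to see with a small example. The sketch below assumes "cross-meridian" refers to boxes that wrap past the 180° antimeridian, where the western edge has a larger longitude than the eastern edge. The `within_bbox` helper and its argument order are hypothetical and do not mirror the signature of `point.withinBBox(...)`.

```python
def within_bbox(lat, lon, south, west, north, east):
    """True if (lat, lon) is inside the box; west > east means the box wraps past 180°."""
    if not (south <= lat <= north):
        return False
    if west <= east:
        # Ordinary box: longitude must fall between the two edges.
        return west <= lon <= east
    # Wrapping box: longitude is valid on either side of the antimeridian.
    return lon >= west or lon <= east


# A box spanning 170°E to 170°W, between 20°S and 15°S.
box = dict(south=-20.0, west=170.0, north=-15.0, east=-170.0)

east_of_line = within_bbox(-17.7, 178.0, **box)   # just west of the antimeridian
west_of_line = within_bbox(-17.7, -175.0, **box)  # just east of the antimeridian
far_away = within_bbox(-17.7, 0.0, **box)         # same latitude, Greenwich longitude
```

A naive `west <= lon <= east` test would reject both points near the antimeridian, which is the class of bug a wrap-aware comparison avoids.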

Storage, Indexing & Schema

HASH index lookups return rows with data encryption enabled.
Orphan TypeIndex wrappers are dropped when the last bucket child is removed.
Subclass indexes are no longer incorrectly related to superclass indexes.
Manual index names are respected on creation.
Inherited indexes are now visible in Studio.

High Availability

Schema changes replicate to followers, closing WAL gaps.
Reported cluster inconsistencies after node shutdowns are resolved.
Massive inserts via gRPC replicate correctly.
/api/v1/batch no longer fails on followers with “Error on updating dictionary”.
/batch endpoint no longer returns an HTTP 500 NPE after successful commits.
e2e-ha integration tests stabilized with on-demand Toxiproxy support.

Wire Protocols

PostgreSQL

Empty SELECT results include RowDescription schema.
SHOW server_version returns a proper value for SQLAlchemy.
Cypher WHERE id(n) IN $array round-trips correctly.
Binary array deserialization implemented for JDBC setArray.
Named and positional parameters now work via Npgsql (C#).

Bolt

EXPLAIN / PROFILE plans are included in PULL SUCCESS metadata.
Executor recognizes the new sparse vector type.

gRPC

InsertStream throughput stays consistent after extended executeQuery calls.
Commit-time constraint violations surface as stream-level errors.
DATE columns no longer corrupted via parameter binding.
ARRAY_OF_LONGS and DATETIME preserve precision in parameter binding.

HTTP

INT8 query vectors routed via $bytes / $int8 markers.
RemoteGraphBatch honors unique edge constraints.
Edge DATETIME parser accepts ISO suffixes.

Dependencies

Notable upgrades include Netty 4.2.13.Final, Undertow 2.4.0.Final, PostgreSQL JDBC 42.7.11, Neo4j Java Driver 6.1.0, Jackson Databind 2.21.3, GraalVM 25.0.3, Testcontainers 2.0.5, plus Studio frontend improvements and security updates across the dependency stack.

Getting Started with 26.5.1

Docker

docker pull arcadedata/arcadedb:26.5.1

Visit our Docker Hub repository for more information.

Maven

    <dependency>
        <groupId>com.arcadedb</groupId>
        <artifactId>arcadedb-engine</artifactId>
        <version>26.5.1</version>
    </dependency>

All artifacts are available on Maven Central.

Documentation

For detailed information on features and usage, refer to our comprehensive documentation.

Compatibility Note

This release maintains 100% compatibility with previous database formats, meaning no export/import is required when upgrading. As always, we recommend creating a database backup before upgrading.

Download ArcadeDB 26.5.1 now: GitHub Releases

Thanks to everyone in the community who reported issues, opened PRs, and helped shape this release.

Luca Garulli, ArcadeDB Founder