GraphBatch: Up to 8x Faster Graph Ingestion in ArcadeDB

If you’ve ever loaded millions of edges into a graph database, you know the pain: what should be a straightforward bulk import can take minutes — or even hours — as the transactional overhead stacks up. Today we’re introducing GraphBatch, a new engine-level API in ArcadeDB v26.3.2 that makes large-scale graph ingestion dramatically faster.

Why a New Importer?

ArcadeDB has always offered two ways to load graph data: the standard transactional API (batching operations in explicit transactions) and the GraphImporter (an integration-level helper that manages batching for you). Both work well for moderate workloads, but at scale the transactional overhead becomes a bottleneck.

GraphBatch takes a fundamentally different approach. Instead of wrapping the standard API, it operates directly at the storage engine level, bypassing the transactional layer entirely during bulk import. The result: throughput that scales with your hardware, not your transaction size.

The Benchmark

We ran a series of benchmarks loading graphs of increasing size on the same hardware, measuring edges ingested per second. Here are the results.

1M Vertices, 10M Edges — Light Edges (No Properties)

Method                           Time (ms)   Edges/sec   Speedup
Standard API (tx/1000)             267,140      37,434     1.00x
Old GraphImporter (integration)     97,160     102,923     2.75x
New GraphBatch (engine)             31,842     314,047     8.39x

The new importer is 8.39x faster than the standard API and 3.05x faster than the previous GraphImporter. What previously took nearly 4.5 minutes now completes in about 32 seconds.
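As a quick sanity check, the speedup column follows directly from the timings in the table above (the timings are rounded to whole milliseconds, so the derived edges/sec can differ from the table by a few units):

```python
# Benchmark timings from the table above, in milliseconds, for a 10M-edge run.
timings_ms = {
    "Standard API (tx/1000)": 267_140,
    "Old GraphImporter (integration)": 97_160,
    "New GraphBatch (engine)": 31_842,
}

baseline = timings_ms["Standard API (tx/1000)"]
for method, ms in timings_ms.items():
    edges_per_sec = 10_000_000 / (ms / 1000)  # total edges / elapsed seconds
    speedup = baseline / ms                   # relative to the standard API
    print(f"{method}: {edges_per_sec:,.0f} edges/sec, {speedup:.2f}x")
```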

1M Vertices, 10M Edges — Edges with Properties (int + long)

Method                           Time (ms)   Edges/sec   Speedup
Standard API + props (tx/1000)     267,773      37,345     1.00x
New GraphBatch + props              53,893     185,554     4.97x

Even with properties on every edge, GraphBatch delivers a 4.97x speedup. The additional serialization cost is manageable because the engine-level approach avoids the per-transaction overhead that dominates at scale.

Scaling Behavior

This is where things get really interesting. We compared how each method behaves as the graph size increases:

Scale                       Std API (edges/sec)   GraphBatch (edges/sec)   Speedup
10K vertices / 100K edges               241,644                1,025,019     4.24x
100K vertices / 1M edges                103,027                1,212,756    11.77x
1M vertices / 10M edges                  37,434                  314,047     8.39x

Two things stand out:

  1. The standard API degrades significantly at scale — from 241K edges/sec at 100K edges down to just 37K edges/sec at 10M edges. This is expected: as the graph grows, transaction management, index maintenance, and page cache pressure all increase.

  2. GraphBatch holds up far better — peaking at over 1.2 million edges per second at the 1M-edge scale. At the largest scale (10M edges), memory pressure naturally reduces throughput, but it still maintains 314K edges/sec — a strong result for a single machine.

The sweet spot appears to be the 100K-vertex / 1M-edge scale, where GraphBatch reaches 11.77x the throughput of the standard API.

When to Use GraphBatch

GraphBatch is designed for bulk edge creation — whether that’s during initial data loading or at runtime on an existing database. It doesn’t require an empty database: as long as vertex and edge types exist in the schema and the source/destination vertices have valid RIDs, you’re good to go.

Initial Import Scenarios

  • Data migration — moving graph data from another database into ArcadeDB
  • ETL pipelines — loading large datasets from data warehouses or data lakes
  • Testing and benchmarking — quickly setting up large test graphs

Runtime Scenarios

GraphBatch works on live databases with existing data, making it the right tool whenever you need to create edges in bulk at runtime:

  • Social networks — a user imports their contact list and you need to create thousands of KNOWS edges between existing Person vertices
  • IoT / time series — a periodic job links new sensor readings to their device vertices and chains them in a time series
  • Knowledge graphs — after an NLP pipeline extracts relationships from documents, you materialize thousands of typed edges between existing entity vertices
  • Recommendation engines — nightly rebuild of ALSO_BOUGHT / SIMILAR_TO edges based on updated purchase data
  • Incremental ETL — periodically sync new relationships from an external system into an existing graph

When NOT to Use It

  • Small writes — for fewer than ~100 edges, the standard API is simpler and the importer overhead isn’t worth it
  • Concurrent reads on the same vertices — the importer disables read-your-writes and manages its own transactions, so concurrent readers may see inconsistent state until close()
  • Immediate edge visibility required — in parallel mode, incoming edges aren’t fully connected until close()

For ongoing OLTP workloads with small, frequent writes, the standard transactional API remains the right choice — it provides full ACID guarantees with immediate visibility.

Runtime Usage Examples

Bulk Friend Import (Light Edges)

// Vertices already exist in the database
RID[] personRIDs = lookupExistingPersons(contactIds);

try (GraphBatch batch = database.batch()
    .withBatchSize(50_000)
    .withLightEdges(true)
    .build()) {
  for (int[] pair : contactPairs)
    batch.newEdge(personRIDs[pair[0]], "KNOWS", personRIDs[pair[1]]);
}

IoT Sensor Linkage (with WAL for Crash Safety)

try (GraphBatch batch = database.batch()
    .withBatchSize(100_000)
    .withWAL(true)
    .withCommitEvery(10_000)
    .build()) {
  for (SensorReading r : newReadings) {
    batch.newEdge(r.deviceRID, "HAS_READING", r.rid, "timestamp", r.ts);
    if (r.previousRID != null)
      batch.newEdge(r.rid, "NEXT", r.previousRID);
  }
}

Knowledge Graph Entity Resolution (with Edge Properties)

try (GraphBatch batch = database.batch()
    .withBatchSize(200_000)
    .withParallelFlush(true)
    .build()) {
  for (ExtractedRelation rel : relations)
    batch.newEdge(rel.subjectRID, rel.edgeType, rel.objectRID,
        "confidence", rel.score, "source", rel.docId);
}

Nightly Recommendation Rebuild

// Remove stale edges
database.command("sql", "DELETE EDGE ALSO_BOUGHT");

// Rebuild from recommendation engine output
try (GraphBatch batch = database.batch()
    .withBatchSize(500_000)
    .withLightEdges(true)
    .build()) {
  for (Recommendation rec : recommendations)
    batch.newEdge(rec.productRID, "ALSO_BOUGHT", rec.relatedRID);
}

Incremental Sync from External Database

try (GraphBatch batch = database.batch()
    .withBatchSize(100_000)
    .withWAL(true)
    .build()) {
  try (ResultSet rs = externalDB.executeQuery(deltaQuery)) {
    while (rs.next())
      batch.newEdge(
          lookupRID(rs.getString("from_id")),
          "REPORTS_TO",
          lookupRID(rs.getString("to_id")),
          "since", rs.getDate("start_date"));
  }
}

Tip: For runtime usage on production databases, enable WAL with withWAL(true) for crash safety. For initial imports where you can re-run on failure, leaving WAL off maximizes throughput.

HTTP Batch Endpoint — GraphBatch for Every Language

GraphBatch is a Java API, but not everyone embeds ArcadeDB in a JVM application. That’s why v26.3.2 also ships a new HTTP batch endpoint that exposes the full power of GraphBatch over REST — no Java required.

POST /api/v1/batch/{database}

It supports two input formats: JSONL (newline-delimited JSON) and CSV. Both are streamed — the server never loads the entire payload into memory, so you can push millions of records in a single request.

JSONL Format

{"@type":"vertex","@class":"Person","@id":"t1","name":"Alice","age":30}
{"@type":"vertex","@class":"Person","@id":"t2","name":"Bob","age":25}
{"@type":"edge","@class":"KNOWS","@from":"t1","@to":"t2","since":2020}
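A payload in this shape is easy to generate programmatically. A minimal sketch in Python — the `@type`, `@class`, `@id`, `@from`, and `@to` field names are the ones shown above; the helper function and its input shape are illustrative:

```python
import json

def build_jsonl(vertices, edges):
    """Build a JSONL batch payload: vertices first, then edges.

    vertices: list of (temp_id, class_name, properties) tuples
    edges:    list of (class_name, from_id, to_id, properties) tuples
    """
    lines = []
    for tmp_id, cls, props in vertices:
        lines.append(json.dumps({"@type": "vertex", "@class": cls, "@id": tmp_id, **props}))
    for cls, src, dst, props in edges:
        lines.append(json.dumps({"@type": "edge", "@class": cls, "@from": src, "@to": dst, **props}))
    return "\n".join(lines) + "\n"

payload = build_jsonl(
    vertices=[("t1", "Person", {"name": "Alice", "age": 30}),
              ("t2", "Person", {"name": "Bob", "age": 25})],
    edges=[("KNOWS", "t1", "t2", {"since": 2020})],
)
```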

CSV Format

@type,@class,@id,name,age
vertex,Person,t1,Alice,30
vertex,Person,t2,Bob,25
---
@type,@class,@from,@to,since
edge,KNOWS,t1,t2,2020
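The two-section layout (vertex rows, a `---` separator, then edge rows) can be produced with the standard `csv` module; a sketch, with the same sample data as above:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")

# Vertex section: header row, then one row per vertex.
writer.writerow(["@type", "@class", "@id", "name", "age"])
writer.writerow(["vertex", "Person", "t1", "Alice", 30])
writer.writerow(["vertex", "Person", "t2", "Bob", 25])

# Section separator between vertices and edges.
buf.write("---\n")

# Edge section: its own header row, then one row per edge.
writer.writerow(["@type", "@class", "@from", "@to", "since"])
writer.writerow(["edge", "KNOWS", "t1", "t2", 2020])

csv_payload = buf.getvalue()
```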

In both formats, vertices come first, then edges. Vertices can have temporary IDs (@id) that edges reference via @from/@to. Edges can also reference existing database RIDs directly (e.g., #12:0).

Temporary ID Mapping

The response includes an idMapping object so you know what RIDs were assigned:

{
  "verticesCreated": 2,
  "edgesCreated": 1,
  "elapsedMs": 42,
  "idMapping": {"t1": "#9:0", "t2": "#9:1"}
}
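Keeping that mapping around lets a follow-up batch target the freshly created vertices by their real RIDs, as mentioned above for edges referencing existing records. A sketch using a mocked response (the field names match the example response above):

```python
import json

# Mocked response body, in the shape shown above.
response = json.loads(
    '{"verticesCreated": 2, "edgesCreated": 1, "elapsedMs": 42,'
    ' "idMapping": {"t1": "#9:0", "t2": "#9:1"}}'
)
rid_of = response["idMapping"]

# A later batch can reference the existing vertices by RID directly,
# no temporary IDs needed this time.
follow_up = json.dumps({
    "@type": "edge", "@class": "KNOWS",
    "@from": rid_of["t1"], "@to": rid_of["t2"], "since": 2021,
})
```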

Tuning via Query Parameters

All GraphBatch configuration options are exposed as query parameters:

Parameter              Default   Description
batchSize              100000    Max edges buffered before auto-flush
lightEdges             false     Property-less edges stored as connectivity only (saves ~33% I/O)
wal                    false     Enable Write-Ahead Logging for crash safety
parallelFlush          true      Parallelize edge connection across async threads
preAllocateEdgeChunks  true      Pre-allocate edge segments on vertex creation
edgeListInitialSize    2048      Initial segment size in bytes (64–8192)
bidirectional          true      Connect both outgoing and incoming edges
commitEvery            50000     Edges per sub-transaction within a flush
expectedEdgeCount      0         Hint for auto-tuning batch size

Examples

curl (JSONL):

curl -X POST "http://localhost:2480/api/v1/batch/mydb?lightEdges=true" \
  -u root:password \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @graph-data.jsonl

curl (CSV):

curl -X POST "http://localhost:2480/api/v1/batch/mydb" \
  -u root:password \
  -H "Content-Type: text/csv" \
  --data-binary @graph-data.csv

Python:

import requests

data = (
    '{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}\n'
    '{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}\n'
    '{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}\n'
)

resp = requests.post(
    "http://localhost:2480/api/v1/batch/mydb?lightEdges=true",
    auth=("root", "password"),
    headers={"Content-Type": "application/x-ndjson"},
    data=data,
)
print(resp.json())
# {'verticesCreated': 2, 'edgesCreated': 1, 'elapsedMs': 15, 'idMapping': {'p1': '#9:0', 'p2': '#9:1'}}

JavaScript (Node.js):

const resp = await fetch("http://localhost:2480/api/v1/batch/mydb", {
  method: "POST",
  headers: {
    "Content-Type": "application/x-ndjson",
    Authorization: "Basic " + btoa("root:password"),
  },
  body: [
    '{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}',
    '{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}',
    '{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}',
  ].join("\n"),
});
console.log(await resp.json());

Tip: For maximum throughput, group vertices by type in the input. The endpoint batches consecutive same-type vertices into a single createVertices() call. Interleaving types forces smaller batches.
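Client-side, that grouping can be a simple stable sort before serializing: vertices ahead of edges, and vertices clustered by class. A sketch (record shape as in the JSONL examples above; the ordering rule is derived from the tip):

```python
import json

records = [
    {"@type": "vertex", "@class": "Person", "@id": "p1", "name": "Alice"},
    {"@type": "vertex", "@class": "Product", "@id": "x1", "sku": "A-1"},
    {"@type": "vertex", "@class": "Person", "@id": "p2", "name": "Bob"},
    {"@type": "edge", "@class": "BOUGHT", "@from": "p1", "@to": "x1"},
]

# Vertices must precede edges; within the vertices, cluster by class so
# consecutive same-type records can be batched together server-side.
ordered = sorted(records, key=lambda r: (r["@type"] != "vertex", r["@class"]))
payload = "\n".join(json.dumps(r) for r in ordered)
```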

Tip: The endpoint is NOT atomic by design — GraphBatch commits internally in chunks for maximum throughput. Treat it as a bulk-loading operation, not a transactional one. The response tells you exactly how many records were committed.

Get Started

GraphBatch is available starting from ArcadeDB v26.3.2. Check out the documentation for API details and usage examples.

Download ArcadeDB v26.3.2: GitHub Releases

If you have questions or feedback, join us on Discord or open an issue on GitHub.