GraphBatch: Up to 8x Faster Graph Ingestion in ArcadeDB

If you’ve ever loaded millions of edges into a graph database, you know the pain: what should be a straightforward bulk import can take minutes — or even hours — as the transactional overhead stacks up. Today we’re introducing GraphBatch, a new engine-level API in ArcadeDB v26.3.2 that makes large-scale graph ingestion dramatically faster.

Why a New Importer?

ArcadeDB has always offered two ways to load graph data: the standard transactional API (batching operations in explicit transactions) and the GraphImporter (an integration-level helper that manages batching for you). Both work well for moderate workloads, but at scale the transactional overhead becomes a bottleneck.

GraphBatch takes a fundamentally different approach. Instead of wrapping the standard API, it operates directly at the storage engine level, bypassing the transactional layer entirely during bulk import. The result: throughput that scales with your hardware, not your transaction size.

The Benchmark

We ran a series of benchmarks loading graphs of increasing size on the same hardware, measuring edges ingested per second. Here are the results.

1M Vertices, 10M Edges — Light Edges (No Properties)

Method                           Time (ms)   Edges/sec   Speedup
Standard API (tx/1000)             267,140      37,434     1.00x
Old GraphImporter (integration)     97,160     102,923     2.75x
New GraphBatch (engine)             31,842     314,047     8.39x

The new importer is 8.39x faster than the standard API and 3.05x faster than the previous GraphImporter. What previously took nearly 4.5 minutes now completes in about 32 seconds.
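As a quick sanity check, the speedup column follows directly from the timings in the table above (the timings are rounded to whole milliseconds, so the derived edges/sec can differ from the table by a few units):

```python
# Benchmark timings from the table above, in milliseconds, for a 10M-edge run.
timings_ms = {
    "Standard API (tx/1000)": 267_140,
    "Old GraphImporter (integration)": 97_160,
    "New GraphBatch (engine)": 31_842,
}

baseline = timings_ms["Standard API (tx/1000)"]
for method, ms in timings_ms.items():
    edges_per_sec = 10_000_000 / (ms / 1000)  # total edges / elapsed seconds
    speedup = baseline / ms                   # relative to the standard API
    print(f"{method}: {edges_per_sec:,.0f} edges/sec, {speedup:.2f}x")
```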

1M Vertices, 10M Edges — Edges with Properties (int + long)

Method                           Time (ms)   Edges/sec   Speedup
Standard API + props (tx/1000)     267,773      37,345     1.00x
New GraphBatch + props              53,893     185,554     4.97x

Even with properties on every edge, GraphBatch delivers a 4.97x speedup. The additional serialization cost is manageable because the engine-level approach avoids the per-transaction overhead that dominates at scale.

Scaling Behavior

This is where things get really interesting. We compared how each method behaves as the graph size increases:

Scale                       Std API (edges/sec)   GraphBatch (edges/sec)   Speedup
10K vertices / 100K edges               241,644                1,025,019     4.24x
100K vertices / 1M edges                103,027                1,212,756    11.77x
1M vertices / 10M edges                  37,434                  314,047     8.39x

Two things stand out:

  1. The standard API degrades significantly at scale — from 241K edges/sec at 100K edges down to just 37K edges/sec at 10M edges. This is expected: as the graph grows, transaction management, index maintenance, and page cache pressure all increase.

  2. GraphBatch holds up far better — peaking at over 1.2 million edges per second at the 1M-edge scale. At the largest scale (10M edges), memory pressure naturally reduces throughput, but it still maintains 314K edges/sec — a strong result for a single machine.

The sweet spot appears to be the 100K-vertex / 1M-edge scale, where GraphBatch reaches 11.77x the throughput of the standard API.

When to Use GraphBatch

GraphBatch is designed for bulk edge creation — whether that’s during initial data loading or at runtime on an existing database. It doesn’t require an empty database: as long as vertex and edge types exist in the schema and the source/destination vertices have valid RIDs, you’re good to go.

Initial Import Scenarios

  • Data migration — moving graph data from another database into ArcadeDB
  • ETL pipelines — loading large datasets from data warehouses or data lakes
  • Testing and benchmarking — quickly setting up large test graphs

Runtime Scenarios

GraphBatch works on live databases with existing data, making it the right tool whenever you need to create edges in bulk at runtime:

  • Social networks — a user imports their contact list and you need to create thousands of KNOWS edges between existing Person vertices
  • IoT / time series — a periodic job links new sensor readings to their device vertices and chains them in a time series
  • Knowledge graphs — after an NLP pipeline extracts relationships from documents, you materialize thousands of typed edges between existing entity vertices
  • Recommendation engines — nightly rebuild of ALSO_BOUGHT / SIMILAR_TO edges based on updated purchase data
  • Incremental ETL — periodically sync new relationships from an external system into an existing graph

When NOT to Use It

  • Small writes — for fewer than ~100 edges, the standard API is simpler and the importer overhead isn’t worth it
  • Concurrent reads on the same vertices — the importer disables read-your-writes and manages its own transactions, so concurrent readers may see inconsistent state until close()
  • Immediate edge visibility required — in parallel mode, incoming edges aren’t fully connected until close()

For ongoing OLTP workloads with small, frequent writes, the standard transactional API remains the right choice — it provides full ACID guarantees with immediate visibility.

Runtime Usage Examples

Bulk Friend Import (Light Edges)

// Vertices already exist in the database
RID[] personRIDs = lookupExistingPersons(contactIds);

try (GraphBatch batch = database.batch()
    .withBatchSize(50_000)
    .withLightEdges(true)
    .build()) {
  for (int[] pair : contactPairs)
    batch.newEdge(personRIDs[pair[0]], "KNOWS", personRIDs[pair[1]]);
}

IoT Sensor Linkage (with WAL for Crash Safety)

try (GraphBatch batch = database.batch()
    .withBatchSize(100_000)
    .withWAL(true)
    .withCommitEvery(10_000)
    .build()) {
  for (SensorReading r : newReadings) {
    batch.newEdge(r.deviceRID, "HAS_READING", r.rid, "timestamp", r.ts);
    if (r.previousRID != null)
      batch.newEdge(r.rid, "NEXT", r.previousRID);
  }
}

Knowledge Graph Entity Resolution (with Edge Properties)

try (GraphBatch batch = database.batch()
    .withBatchSize(200_000)
    .withParallelFlush(true)
    .build()) {
  for (ExtractedRelation rel : relations)
    batch.newEdge(rel.subjectRID, rel.edgeType, rel.objectRID,
        "confidence", rel.score, "source", rel.docId);
}

Nightly Recommendation Rebuild

// Remove stale edges
database.command("sql", "DELETE EDGE ALSO_BOUGHT");

// Rebuild from recommendation engine output
try (GraphBatch batch = database.batch()
    .withBatchSize(500_000)
    .withLightEdges(true)
    .build()) {
  for (Recommendation rec : recommendations)
    batch.newEdge(rec.productRID, "ALSO_BOUGHT", rec.relatedRID);
}

Incremental Sync from External Database

try (GraphBatch batch = database.batch()
    .withBatchSize(100_000)
    .withWAL(true)
    .build()) {
  try (ResultSet rs = externalDB.executeQuery(deltaQuery)) {
    while (rs.next())
      batch.newEdge(
          lookupRID(rs.getString("from_id")),
          "REPORTS_TO",
          lookupRID(rs.getString("to_id")),
          "since", rs.getDate("start_date"));
  }
}

Tip: For runtime usage on production databases, enable WAL with withWAL(true) for crash safety. For initial imports where you can re-run on failure, leaving WAL off maximizes throughput.

HTTP Batch Endpoint — GraphBatch for Every Language

GraphBatch is a Java API, but not everyone embeds ArcadeDB in a JVM application. That’s why v26.3.2 also ships a new HTTP batch endpoint that exposes the full power of GraphBatch over REST — no Java required.

POST /api/v1/batch/{database}

It supports two input formats: JSONL (newline-delimited JSON) and CSV. Both are streamed — the server never loads the entire payload into memory, so you can push millions of records in a single request.

JSONL Format

{"@type":"vertex","@class":"Person","@id":"t1","name":"Alice","age":30}
{"@type":"vertex","@class":"Person","@id":"t2","name":"Bob","age":25}
{"@type":"edge","@class":"KNOWS","@from":"t1","@to":"t2","since":2020}
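A payload in this shape is easy to generate programmatically. A minimal sketch in Python — the `@type`, `@class`, `@id`, `@from`, and `@to` field names are the ones shown above; the helper function and its input shape are illustrative:

```python
import json

def build_jsonl(vertices, edges):
    """Build a JSONL batch payload: vertices first, then edges.

    vertices: list of (temp_id, class_name, properties) tuples
    edges:    list of (class_name, from_id, to_id, properties) tuples
    """
    lines = []
    for tmp_id, cls, props in vertices:
        lines.append(json.dumps({"@type": "vertex", "@class": cls, "@id": tmp_id, **props}))
    for cls, src, dst, props in edges:
        lines.append(json.dumps({"@type": "edge", "@class": cls, "@from": src, "@to": dst, **props}))
    return "\n".join(lines) + "\n"

payload = build_jsonl(
    vertices=[("t1", "Person", {"name": "Alice", "age": 30}),
              ("t2", "Person", {"name": "Bob", "age": 25})],
    edges=[("KNOWS", "t1", "t2", {"since": 2020})],
)
```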

CSV Format

@type,@class,@id,name,age
vertex,Person,t1,Alice,30
vertex,Person,t2,Bob,25
---
@type,@class,@from,@to,since
edge,KNOWS,t1,t2,2020
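The two-section layout (vertex rows, a `---` separator, then edge rows) can be produced with the standard `csv` module; a sketch, with the same sample data as above:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")

# Vertex section: header row, then one row per vertex.
writer.writerow(["@type", "@class", "@id", "name", "age"])
writer.writerow(["vertex", "Person", "t1", "Alice", 30])
writer.writerow(["vertex", "Person", "t2", "Bob", 25])

# Section separator between vertices and edges.
buf.write("---\n")

# Edge section: its own header row, then one row per edge.
writer.writerow(["@type", "@class", "@from", "@to", "since"])
writer.writerow(["edge", "KNOWS", "t1", "t2", 2020])

csv_payload = buf.getvalue()
```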

In both formats, vertices come first, then edges. Vertices can have temporary IDs (@id) that edges reference via @from/@to. Edges can also reference existing database RIDs directly (e.g., #12:0).

Temporary ID Mapping

The response includes an idMapping object so you know what RIDs were assigned:

{
  "verticesCreated": 2,
  "edgesCreated": 1,
  "elapsedMs": 42,
  "idMapping": {"t1": "#9:0", "t2": "#9:1"}
}
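Keeping that mapping around lets a follow-up batch target the freshly created vertices by their real RIDs, as mentioned above for edges referencing existing records. A sketch using a mocked response (the field names match the example response above):

```python
import json

# Mocked response body, in the shape shown above.
response = json.loads(
    '{"verticesCreated": 2, "edgesCreated": 1, "elapsedMs": 42,'
    ' "idMapping": {"t1": "#9:0", "t2": "#9:1"}}'
)
rid_of = response["idMapping"]

# A later batch can reference the existing vertices by RID directly,
# no temporary IDs needed this time.
follow_up = json.dumps({
    "@type": "edge", "@class": "KNOWS",
    "@from": rid_of["t1"], "@to": rid_of["t2"], "since": 2021,
})
```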

Tuning via Query Parameters

All GraphBatch configuration options are exposed as query parameters:

Parameter              Default   Description
batchSize              100000    Max edges buffered before auto-flush
lightEdges             false     Property-less edges stored as connectivity only (saves ~33% I/O)
wal                    false     Enable Write-Ahead Logging for crash safety
parallelFlush          true      Parallelize edge connection across async threads
preAllocateEdgeChunks  true      Pre-allocate edge segments on vertex creation
edgeListInitialSize    2048      Initial segment size in bytes (64–8192)
bidirectional          true      Connect both outgoing and incoming edges
commitEvery            50000     Edges per sub-transaction within a flush
expectedEdgeCount      0         Hint for auto-tuning batch size

Examples

curl (JSONL):

curl -X POST "http://localhost:2480/api/v1/batch/mydb?lightEdges=true" \
  -u root:password \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @graph-data.jsonl

curl (CSV):

curl -X POST "http://localhost:2480/api/v1/batch/mydb" \
  -u root:password \
  -H "Content-Type: text/csv" \
  --data-binary @graph-data.csv

Python:

import requests

data = (
    '{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}\n'
    '{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}\n'
    '{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}\n'
)

resp = requests.post(
    "http://localhost:2480/api/v1/batch/mydb?lightEdges=true",
    auth=("root", "password"),
    headers={"Content-Type": "application/x-ndjson"},
    data=data,
)
print(resp.json())
# {'verticesCreated': 2, 'edgesCreated': 1, 'elapsedMs': 15, 'idMapping': {'p1': '#9:0', 'p2': '#9:1'}}

JavaScript (Node.js):

const resp = await fetch("http://localhost:2480/api/v1/batch/mydb", {
  method: "POST",
  headers: {
    "Content-Type": "application/x-ndjson",
    Authorization: "Basic " + btoa("root:password"),
  },
  body: [
    '{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}',
    '{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}',
    '{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}',
  ].join("\n"),
});
console.log(await resp.json());

Tip: For maximum throughput, group vertices by type in the input. The endpoint batches consecutive same-type vertices into a single createVertices() call. Interleaving types forces smaller batches.
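Client-side, that grouping can be a simple stable sort before serializing: vertices ahead of edges, and vertices clustered by class. A sketch (record shape as in the JSONL examples above; the ordering rule is derived from the tip):

```python
import json

records = [
    {"@type": "vertex", "@class": "Person", "@id": "p1", "name": "Alice"},
    {"@type": "vertex", "@class": "Product", "@id": "x1", "sku": "A-1"},
    {"@type": "vertex", "@class": "Person", "@id": "p2", "name": "Bob"},
    {"@type": "edge", "@class": "BOUGHT", "@from": "p1", "@to": "x1"},
]

# Vertices must precede edges; within the vertices, cluster by class so
# consecutive same-type records can be batched together server-side.
ordered = sorted(records, key=lambda r: (r["@type"] != "vertex", r["@class"]))
payload = "\n".join(json.dumps(r) for r in ordered)
```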

Tip: The endpoint is NOT atomic by design — GraphBatch commits internally in chunks for maximum throughput. Treat it as a bulk-loading operation, not a transactional one. The response tells you exactly how many records were committed.

Get Started

GraphBatch is available starting from ArcadeDB v26.3.2. Check out the documentation for API details and usage examples.

Download ArcadeDB v26.3.2: GitHub Releases

If you have questions or feedback, join us on Discord or open an issue on GitHub.