If you’ve ever loaded millions of edges into a graph database, you know the pain: what should be a straightforward bulk import can take minutes — or even hours — as the transactional overhead stacks up. Today we’re introducing GraphBatch, a new engine-level API in ArcadeDB v26.3.2 that makes large-scale graph ingestion dramatically faster.
Why a New Importer?
ArcadeDB has always offered two ways to load graph data: the standard transactional API (batching operations in explicit transactions) and the GraphImporter (an integration-level helper that manages batching for you). Both work well for moderate workloads, but at scale the transactional overhead becomes a bottleneck.
GraphBatch takes a fundamentally different approach. Instead of wrapping the standard API, it operates directly at the storage engine level, bypassing the transactional layer entirely during bulk import. The result: throughput that scales with your hardware, not your transaction size.
The Benchmark
We ran a series of benchmarks loading graphs of increasing size on the same hardware, measuring edges ingested per second. Here are the results.
1M Vertices, 10M Edges — Light Edges (No Properties)
| Method | Time (ms) | Edges/sec | Speedup |
|---|---|---|---|
| Standard API (tx/1000) | 267,140 | 37,434 | 1.00x |
| Old GraphImporter (integration) | 97,160 | 102,923 | 2.75x |
| New GraphBatch (engine) | 31,842 | 314,047 | 8.39x |
The new importer is 8.39x faster than the standard API and 3.05x faster than the previous GraphImporter. What previously took nearly 4.5 minutes now completes in about 32 seconds.
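As a sanity check, the throughput and speedup columns follow directly from the timings (edges/sec = edge count / seconds; speedup = baseline time / method time). This small snippet recomputes them from the raw milliseconds — expect tiny rounding differences from the published figures:

```python
EDGES = 10_000_000

timings_ms = {
    "Standard API (tx/1000)": 267_140,
    "Old GraphImporter (integration)": 97_160,
    "New GraphBatch (engine)": 31_842,
}

baseline_ms = timings_ms["Standard API (tx/1000)"]
for method, ms in timings_ms.items():
    edges_per_sec = EDGES / (ms / 1000)   # throughput
    speedup = baseline_ms / ms            # relative to the standard API
    print(f"{method}: {edges_per_sec:,.0f} edges/sec, {speedup:.2f}x")
```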
1M Vertices, 10M Edges — Edges with Properties (int + long)
| Method | Time (ms) | Edges/sec | Speedup |
|---|---|---|---|
| Standard API + props (tx/1000) | 267,773 | 37,345 | 1.00x |
| New GraphBatch + props | 53,893 | 185,554 | 4.97x |
Even with properties on every edge, GraphBatch delivers a 4.97x speedup. The additional serialization cost is manageable because the engine-level approach avoids the per-transaction overhead that dominates at scale.
Scaling Behavior
This is where things get really interesting. We compared how each method behaves as the graph size increases:
| Scale | Std API (edges/sec) | GraphBatch (edges/sec) | Speedup |
|---|---|---|---|
| 10K vertices / 100K edges | 241,644 | 1,025,019 | 4.24x |
| 100K vertices / 1M edges | 103,027 | 1,212,756 | 11.77x |
| 1M vertices / 10M edges | 37,434 | 314,047 | 8.39x |
Two things stand out:
- The standard API degrades significantly at scale — from 241K edges/sec at 100K edges down to just 37K edges/sec at 10M edges. This is expected: as the graph grows, transaction management, index maintenance, and page cache pressure all increase.
- GraphBatch holds up far better — peaking at over 1.2 million edges per second at the 1M-edge scale. At the largest scale (10M edges), memory pressure naturally reduces throughput, but it still maintains 314K edges/sec — a strong result for a single machine.
The sweet spot appears to be around 100K vertices / 1M edges, where GraphBatch reaches 11.77x the throughput of the standard API.
When to Use GraphBatch
GraphBatch is designed for bulk edge creation — whether that’s during initial data loading or at runtime on an existing database. It doesn’t require an empty database: as long as vertex and edge types exist in the schema and the source/destination vertices have valid RIDs, you’re good to go.
Initial Import Scenarios
- Data migration — moving graph data from another database into ArcadeDB
- ETL pipelines — loading large datasets from data warehouses or data lakes
- Testing and benchmarking — quickly setting up large test graphs
Runtime Scenarios
GraphBatch works on live databases with existing data, making it the right tool whenever you need to create edges in bulk at runtime:
- Social networks — a user imports their contact list and you need to create thousands of KNOWS edges between existing Person vertices
- IoT / time series — a periodic job links new sensor readings to their device vertices and chains them in a time series
- Knowledge graphs — after an NLP pipeline extracts relationships from documents, you materialize thousands of typed edges between existing entity vertices
- Recommendation engines — nightly rebuild of ALSO_BOUGHT / SIMILAR_TO edges based on updated purchase data
- Incremental ETL — periodically sync new relationships from an external system into an existing graph
When NOT to Use It
- Small writes — for fewer than ~100 edges, the standard API is simpler and the importer overhead isn’t worth it
- Concurrent reads on the same vertices — the importer disables read-your-writes and manages its own transactions, so concurrent readers may see inconsistent state until `close()`
- Immediate edge visibility required — in parallel mode, incoming edges aren’t fully connected until `close()`
For ongoing OLTP workloads with small, frequent writes, the standard transactional API remains the right choice — it provides full ACID guarantees with immediate visibility.
Runtime Usage Examples
Bulk Friend Import (Light Edges)
```java
// Vertices already exist in the database
RID[] personRIDs = lookupExistingPersons(contactIds);

try (GraphBatch batch = database.batch()
    .withBatchSize(50_000)
    .withLightEdges(true)
    .build()) {
  for (int[] pair : contactPairs)
    batch.newEdge(personRIDs[pair[0]], "KNOWS", personRIDs[pair[1]]);
}
```
IoT Sensor Linkage (with WAL for Crash Safety)
```java
try (GraphBatch batch = database.batch()
    .withBatchSize(100_000)
    .withWAL(true)
    .withCommitEvery(10_000)
    .build()) {
  for (SensorReading r : newReadings) {
    batch.newEdge(r.deviceRID, "HAS_READING", r.rid, "timestamp", r.ts);
    if (r.previousRID != null)
      batch.newEdge(r.rid, "NEXT", r.previousRID);
  }
}
```
Knowledge Graph Entity Resolution (with Edge Properties)
```java
try (GraphBatch batch = database.batch()
    .withBatchSize(200_000)
    .withParallelFlush(true)
    .build()) {
  for (ExtractedRelation rel : relations)
    batch.newEdge(rel.subjectRID, rel.edgeType, rel.objectRID,
        "confidence", rel.score, "source", rel.docId);
}
```
Nightly Recommendation Rebuild
```java
// Remove stale edges
database.command("sql", "DELETE EDGE ALSO_BOUGHT");

// Rebuild from recommendation engine output
try (GraphBatch batch = database.batch()
    .withBatchSize(500_000)
    .withLightEdges(true)
    .build()) {
  for (Recommendation rec : recommendations)
    batch.newEdge(rec.productRID, "ALSO_BOUGHT", rec.relatedRID);
}
```
Incremental Sync from External Database
```java
try (GraphBatch batch = database.batch()
    .withBatchSize(100_000)
    .withWAL(true)
    .build()) {
  try (ResultSet rs = externalDB.executeQuery(deltaQuery)) {
    while (rs.next())
      batch.newEdge(
          lookupRID(rs.getString("from_id")),
          "REPORTS_TO",
          lookupRID(rs.getString("to_id")),
          "since", rs.getDate("start_date"));
  }
}
```
Tip: For runtime usage on production databases, enable WAL with `withWAL(true)` for crash safety. For initial imports where you can re-run on failure, leaving WAL off maximizes throughput.
HTTP Batch Endpoint — GraphBatch for Every Language
GraphBatch is a Java API, but not everyone embeds ArcadeDB in a JVM application. That’s why v26.3.2 also ships a new HTTP batch endpoint that exposes the full power of GraphBatch over REST — no Java required.
```
POST /api/v1/batch/{database}
```
It supports two input formats: JSONL (newline-delimited JSON) and CSV. Both are streamed — the server never loads the entire payload into memory, so you can push millions of records in a single request.
JSONL Format
```
{"@type":"vertex","@class":"Person","@id":"t1","name":"Alice","age":30}
{"@type":"vertex","@class":"Person","@id":"t2","name":"Bob","age":25}
{"@type":"edge","@class":"KNOWS","@from":"t1","@to":"t2","since":2020}
```
CSV Format
```
@type,@class,@id,name,age
vertex,Person,t1,Alice,30
vertex,Person,t2,Bob,25
---
@type,@class,@from,@to,since
edge,KNOWS,t1,t2,2020
```
In both formats, vertices come first, then edges. Vertices can have temporary IDs (@id) that edges reference via @from/@to. Edges can also reference existing database RIDs directly (e.g., #12:0).
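The ordering and reference rules above are easy to get wrong when generating payloads programmatically. This is a client-side pre-check sketch (not part of the shipped API — the function name and RID pattern are our own) that rejects a JSONL payload if edges precede vertices or reference an unknown temporary ID:

```python
import json
import re

# A RID literal like "#12:0" (assumed format: #bucket:position).
RID_RE = re.compile(r"^#\d+:\d+$")

def validate_jsonl(payload: str) -> None:
    """Raise ValueError if the payload breaks the ordering/reference rules."""
    seen_ids = set()
    edges_started = False
    for line in payload.strip().splitlines():
        rec = json.loads(line)
        if rec["@type"] == "vertex":
            if edges_started:
                raise ValueError("vertices must come before edges")
            if "@id" in rec:
                seen_ids.add(rec["@id"])
        elif rec["@type"] == "edge":
            edges_started = True
            for key in ("@from", "@to"):
                ref = rec[key]
                # Allowed: a previously declared temporary ID, or a raw RID.
                if ref not in seen_ids and not RID_RE.match(ref):
                    raise ValueError(f"unknown reference {ref!r} in {key}")

payload = (
    '{"@type":"vertex","@class":"Person","@id":"t1","name":"Alice"}\n'
    '{"@type":"vertex","@class":"Person","@id":"t2","name":"Bob"}\n'
    '{"@type":"edge","@class":"KNOWS","@from":"t1","@to":"t2","since":2020}\n'
)
validate_jsonl(payload)  # no exception: payload is well-formed
```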
Temporary ID Mapping
The response includes an idMapping object so you know what RIDs were assigned:
```json
{
  "verticesCreated": 2,
  "edgesCreated": 1,
  "elapsedMs": 42,
  "idMapping": {"t1": "#9:0", "t2": "#9:1"}
}
```
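A typical use of idMapping is translating your own local identifiers into the assigned RIDs so a follow-up batch can target those vertices directly. A minimal sketch (the local_records structure is hypothetical):

```python
response = {
    "verticesCreated": 2,
    "edgesCreated": 1,
    "elapsedMs": 42,
    "idMapping": {"t1": "#9:0", "t2": "#9:1"},
}

# Local records keyed by the temporary @id we sent in the payload.
local_records = {"t1": {"name": "Alice"}, "t2": {"name": "Bob"}}

# Build a lookup from our own key (here: name) to the assigned RID,
# ready for a follow-up request that references RIDs directly.
rid_by_name = {
    props["name"]: response["idMapping"][tmp_id]
    for tmp_id, props in local_records.items()
}
print(rid_by_name)  # {'Alice': '#9:0', 'Bob': '#9:1'}
```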
Tuning via Query Parameters
All GraphBatch configuration options are exposed as query parameters:
| Parameter | Default | Description |
|---|---|---|
| `batchSize` | 100000 | Max edges buffered before auto-flush |
| `lightEdges` | false | Property-less edges stored as connectivity only (saves ~33% I/O) |
| `wal` | false | Enable Write-Ahead Logging for crash safety |
| `parallelFlush` | true | Parallelize edge connection across async threads |
| `preAllocateEdgeChunks` | true | Pre-allocate edge segments on vertex creation |
| `edgeListInitialSize` | 2048 | Initial segment size in bytes (64–8192) |
| `bidirectional` | true | Connect both outgoing and incoming edges |
| `commitEvery` | 50000 | Edges per sub-transaction within a flush |
| `expectedEdgeCount` | 0 | Hint for auto-tuning batch size |
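Since these are ordinary query parameters, building a tuned URL from code is straightforward. A sketch of a hypothetical crash-safe runtime load — WAL on, larger sub-transactions, and a size hint for auto-tuning:

```python
from urllib.parse import urlencode

# Assumed tuning choices for this example; pick values for your workload.
params = {
    "wal": "true",
    "commitEvery": 100_000,
    "expectedEdgeCount": 10_000_000,
}
url = "http://localhost:2480/api/v1/batch/mydb?" + urlencode(params)
print(url)
# http://localhost:2480/api/v1/batch/mydb?wal=true&commitEvery=100000&expectedEdgeCount=10000000
```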
Examples
curl (JSONL):
```shell
curl -X POST "http://localhost:2480/api/v1/batch/mydb?lightEdges=true" \
  -u root:password \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @graph-data.jsonl
```
curl (CSV):
```shell
curl -X POST "http://localhost:2480/api/v1/batch/mydb" \
  -u root:password \
  -H "Content-Type: text/csv" \
  --data-binary @graph-data.csv
```
Python:
```python
import requests

data = (
    '{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}\n'
    '{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}\n'
    '{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}\n'
)
resp = requests.post(
    "http://localhost:2480/api/v1/batch/mydb?lightEdges=true",
    auth=("root", "password"),
    headers={"Content-Type": "application/x-ndjson"},
    data=data,
)
print(resp.json())
# {'verticesCreated': 2, 'edgesCreated': 1, 'elapsedMs': 15, 'idMapping': {'p1': '#9:0', 'p2': '#9:1'}}
```
JavaScript (Node.js):
```javascript
const resp = await fetch("http://localhost:2480/api/v1/batch/mydb", {
  method: "POST",
  headers: {
    "Content-Type": "application/x-ndjson",
    Authorization: "Basic " + btoa("root:password"),
  },
  body: [
    '{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}',
    '{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}',
    '{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}',
  ].join("\n"),
});
console.log(await resp.json());
```
Tip: For maximum throughput, group vertices by type in the input. The endpoint batches consecutive same-type vertices into a single `createVertices()` call. Interleaving types forces smaller batches.
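One cheap way to get that grouping is to sort the vertex records by class before serializing them (this is our own client-side preprocessing, not something the endpoint does for you) — edges still go after all vertices:

```python
import json
from itertools import groupby

# Vertex records in arbitrary (interleaved) order.
records = [
    {"@type": "vertex", "@class": "Person", "@id": "p1"},
    {"@type": "vertex", "@class": "Product", "@id": "x1"},
    {"@type": "vertex", "@class": "Person", "@id": "p2"},
]

# A stable sort by @class makes same-type vertices consecutive, so the
# server can batch each run into one createVertices() call.
records.sort(key=lambda r: r["@class"])
payload = "\n".join(json.dumps(r) for r in records)

# After sorting there is exactly one run per class.
runs = [cls for cls, _ in groupby(records, key=lambda r: r["@class"])]
print(runs)  # ['Person', 'Product']
```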
Tip: The endpoint is NOT atomic by design — GraphBatch commits internally in chunks for maximum throughput. Treat it as a bulk-loading operation, not a transactional one. The response tells you exactly how many records were committed.
Get Started
GraphBatch is available starting from ArcadeDB v26.3.2. Check out the documentation for API details and usage examples.
Download ArcadeDB v26.3.2: GitHub Releases
If you have questions or feedback, join us on Discord or open an issue on GitHub.