If you’ve ever loaded millions of edges into a graph database, you know the pain: what should be a straightforward bulk import can take minutes - or even hours - as the transactional overhead stacks up. Today we’re introducing GraphBatch, a new engine-level API in ArcadeDB v26.3.2 that makes large-scale graph ingestion dramatically faster. And with the new HTTP batch endpoint and streaming gRPC API, you can leverage that power from any language.
Why a New Importer?
ArcadeDB has always offered two ways to load graph data: the standard transactional API (batching operations in explicit transactions) and the GraphImporter (an integration-level helper that manages batching for you). Both work well for moderate workloads, but at scale the transactional overhead becomes a bottleneck.
GraphBatch takes a fundamentally different approach. Instead of wrapping the standard API, it operates directly at the storage engine level, bypassing the transactional layer entirely during bulk import. The result: throughput that scales with your hardware, not your transaction size.
The Benchmark
We ran a series of benchmarks loading graphs of increasing size on the same hardware, measuring edges ingested per second. Here are the results.
1M Vertices, 10M Edges — Light Edges (No Properties)
| Method | Time (ms) | Edges/sec | Speedup |
|---|---|---|---|
| Standard API (tx/1000) | 267,140 | 37,434 | 1.00x |
| Old GraphImporter (integration) | 97,160 | 102,923 | 2.75x |
| New GraphBatch (engine) | 31,842 | 314,047 | 8.39x |
The new importer is 8.39x faster than the standard API and 3.05x faster than the previous GraphImporter. What previously took nearly 4.5 minutes now completes in about 32 seconds.
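The throughput and speedup figures follow directly from the timings. A quick arithmetic sanity check of the table above (the helper name `edges_per_sec` is ours, not part of any API):

```python
# Sanity-check the benchmark table:
#   edges/sec = edges / (time_ms / 1000)
#   speedup   = method throughput / baseline throughput
EDGES = 10_000_000

def edges_per_sec(time_ms: int) -> int:
    return round(EDGES / (time_ms / 1000))

standard = edges_per_sec(267_140)   # ~37,434 edges/sec
importer = edges_per_sec(97_160)    # ~102,923 edges/sec
batch    = edges_per_sec(31_842)    # ~314,050 edges/sec (table rounds to 314,047)

print(f"GraphBatch speedup vs standard API: {batch / standard:.2f}x")  # 8.39x
```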
1M Vertices, 10M Edges — Edges with Properties (int + long)
| Method | Time (ms) | Edges/sec | Speedup |
|---|---|---|---|
| Standard API + props (tx/1000) | 267,773 | 37,345 | 1.00x |
| New GraphBatch + props | 53,893 | 185,554 | 4.97x |
Even with properties on every edge, GraphBatch delivers a 4.97x speedup. The additional serialization cost is manageable because the engine-level approach avoids the per-transaction overhead that dominates at scale.
Scaling Behavior
This is where things get really interesting. We compared how each method behaves as the graph size increases:
| Scale | Std API (edges/sec) | GraphBatch (edges/sec) | Speedup |
|---|---|---|---|
| 10K vertices / 100K edges | 241,644 | 1,025,019 | 4.24x |
| 100K vertices / 1M edges | 103,027 | 1,212,756 | 11.77x |
| 1M vertices / 10M edges | 37,434 | 314,047 | 8.39x |
Two things stand out:
- The standard API degrades significantly at scale — from 241K edges/sec at 100K edges down to just 37K edges/sec at 10M edges. This is expected: as the graph grows, transaction management, index maintenance, and page cache pressure all increase.
- GraphBatch holds up far better — peaking at over 1.2 million edges per second at the 1M-edge scale. At the largest scale (10M edges), memory pressure naturally reduces throughput, but it still maintains 314K edges/sec — a strong result for a single machine.
The sweet spot appears to be around the 100K–1M vertex range, where GraphBatch reaches 11.77x the throughput of the standard API.
When to Use GraphBatch
GraphBatch is designed for bulk edge creation — whether that’s during initial data loading or at runtime on an existing database. It doesn’t require an empty database: as long as vertex and edge types exist in the schema and the source/destination vertices have valid RIDs, you’re good to go.
Initial Import Scenarios
- Data migration — moving graph data from another database into ArcadeDB
- ETL pipelines — loading large datasets from data warehouses or data lakes
- Testing and benchmarking — quickly setting up large test graphs
Runtime Scenarios
GraphBatch works on live databases with existing data, making it the right tool whenever you need to create edges in bulk at runtime:
- Social networks — a user imports their contact list and you need to create thousands of KNOWS edges between existing Person vertices
- IoT / time series — a periodic job links new sensor readings to their device vertices and chains them in a time series
- Knowledge graphs — after an NLP pipeline extracts relationships from documents, you materialize thousands of typed edges between existing entity vertices
- Recommendation engines — nightly rebuild of ALSO_BOUGHT / SIMILAR_TO edges based on updated purchase data
- Incremental ETL — periodically sync new relationships from an external system into an existing graph
When NOT to Use It
- Small writes — for fewer than ~100 edges, the standard API is simpler and the importer overhead isn’t worth it
- Concurrent reads on the same vertices — the importer disables read-your-writes and manages its own transactions, so concurrent readers may see inconsistent state until close()
- Immediate edge visibility required — in parallel mode, incoming edges aren’t fully connected until close()
For ongoing OLTP workloads with small, frequent writes, the standard transactional API remains the right choice — it provides full ACID guarantees with immediate visibility.
Runtime Usage Examples
Bulk Friend Import (Light Edges)
// Vertices already exist in the database
RID[] personRIDs = lookupExistingPersons(contactIds);
try (GraphBatch batch = database.batch()
.withBatchSize(50_000)
.withLightEdges(true)
.build()) {
for (int[] pair : contactPairs)
batch.newEdge(personRIDs[pair[0]], "KNOWS", personRIDs[pair[1]]);
}
IoT Sensor Linkage (with WAL for Crash Safety)
try (GraphBatch batch = database.batch()
.withBatchSize(100_000)
.withWAL(true)
.withCommitEvery(10_000)
.build()) {
for (SensorReading r : newReadings) {
batch.newEdge(r.deviceRID, "HAS_READING", r.rid, "timestamp", r.ts);
if (r.previousRID != null)
batch.newEdge(r.rid, "NEXT", r.previousRID);
}
}
Knowledge Graph Entity Resolution (with Edge Properties)
try (GraphBatch batch = database.batch()
.withBatchSize(200_000)
.withParallelFlush(true)
.build()) {
for (ExtractedRelation rel : relations)
batch.newEdge(rel.subjectRID, rel.edgeType, rel.objectRID,
"confidence", rel.score, "source", rel.docId);
}
Nightly Recommendation Rebuild
// Remove stale edges
database.command("sql", "DELETE EDGE ALSO_BOUGHT");
// Rebuild from recommendation engine output
try (GraphBatch batch = database.batch()
.withBatchSize(500_000)
.withLightEdges(true)
.build()) {
for (Recommendation rec : recommendations)
batch.newEdge(rec.productRID, "ALSO_BOUGHT", rec.relatedRID);
}
Incremental Sync from External Database
try (GraphBatch batch = database.batch()
.withBatchSize(100_000)
.withWAL(true)
.build()) {
try (ResultSet rs = externalDB.executeQuery(deltaQuery)) {
while (rs.next())
batch.newEdge(
lookupRID(rs.getString("from_id")),
"REPORTS_TO",
lookupRID(rs.getString("to_id")),
"since", rs.getDate("start_date"));
}
}
Tip: For runtime usage on production databases, enable WAL with withWAL(true) for crash safety. For initial imports where you can re-run on failure, leaving WAL off maximizes throughput.
HTTP Batch Endpoint — GraphBatch for Every Language
GraphBatch is a Java API, but not everyone embeds ArcadeDB in a JVM application. That’s why v26.3.2 also ships a new HTTP batch endpoint that exposes the full power of GraphBatch over the HTTP API — no Java required.
POST /api/v1/batch/{database}
It supports two input formats: JSONL (newline-delimited JSON) and CSV. Both are streamed — the server never loads the entire payload into memory, so you can push millions of records in a single request.
JSONL Format
{"@type":"vertex","@class":"Person","@id":"t1","name":"Alice","age":30}
{"@type":"vertex","@class":"Person","@id":"t2","name":"Bob","age":25}
{"@type":"edge","@class":"KNOWS","@from":"t1","@to":"t2","since":2020}
CSV Format
@type,@class,@id,name,age
vertex,Person,t1,Alice,30
vertex,Person,t2,Bob,25
---
@type,@class,@from,@to,since
edge,KNOWS,t1,t2,2020
In both formats, vertices come first, then edges. Vertices can have temporary IDs (@id) that edges reference via @from/@to. Edges can also reference existing database RIDs directly (e.g., #12:0).
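The reference rule is mechanical: anything in the "#bucket:pos" shape is an existing RID, anything else must match a temporary ID declared by an earlier vertex. A client-side sketch of that resolution (assuming only the RID convention shown above; `resolve_ref` is a hypothetical helper, not part of the API):

```python
def resolve_ref(ref: str, temp_ids: dict) -> str:
    """Resolve an @from/@to reference: '#bucket:pos' is already a RID,
    anything else must be a temp ID declared by an earlier vertex."""
    if ref.startswith("#"):
        return ref
    if ref not in temp_ids:
        raise ValueError(f"edge references unknown temp ID: {ref!r}")
    return temp_ids[ref]

# Temp-ID table as it might look after two vertices were created
temp_ids = {"t1": "#9:0", "t2": "#9:1"}
print(resolve_ref("t1", temp_ids))     # #9:0
print(resolve_ref("#12:0", temp_ids))  # #12:0 (existing RID passes through)
```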
Temporary ID Mapping
The response includes an idMapping object so you know what RIDs were assigned:
{
"verticesCreated": 2,
"edgesCreated": 1,
"elapsedMs": 42,
"idMapping": {"t1": "#9:0", "t2": "#9:1"}
}
Tuning via Query Parameters
All GraphBatch configuration options are exposed as query parameters:
| Parameter | Default | Description |
|---|---|---|
| batchSize | 100000 | Max edges buffered before auto-flush |
| lightEdges | false | Property-less edges stored as connectivity only (saves ~33% I/O) |
| wal | false | Enable Write-Ahead Logging for crash safety |
| parallelFlush | true | Parallelize edge connection across async threads |
| preAllocateEdgeChunks | true | Pre-allocate edge segments on vertex creation |
| edgeListInitialSize | 2048 | Initial segment size in bytes (64–8192) |
| bidirectional | true | Connect both outgoing and incoming edges |
| commitEvery | 50000 | Edges per sub-transaction within a flush |
| expectedEdgeCount | 0 | Hint for auto-tuning batch size |
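These options map directly onto the query string. A small helper that builds the request URL (the endpoint path and parameter names are from the table above; the `batch_url` function itself is ours):

```python
from urllib.parse import urlencode

def batch_url(base: str, database: str, **options) -> str:
    """Build the batch endpoint URL with GraphBatch options as query
    parameters. Booleans are lowered to 'true'/'false' as in the curl
    examples."""
    params = {k: str(v).lower() if isinstance(v, bool) else str(v)
              for k, v in options.items()}
    url = f"{base}/api/v1/batch/{database}"
    return f"{url}?{urlencode(params)}" if params else url

print(batch_url("http://localhost:2480", "mydb", lightEdges=True, batchSize=200_000))
# http://localhost:2480/api/v1/batch/mydb?lightEdges=true&batchSize=200000
```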
Examples
curl (JSONL):
curl -X POST "http://localhost:2480/api/v1/batch/mydb?lightEdges=true" \
-u root:password \
-H "Content-Type: application/x-ndjson" \
--data-binary @graph-data.jsonl
curl (CSV):
curl -X POST "http://localhost:2480/api/v1/batch/mydb" \
-u root:password \
-H "Content-Type: text/csv" \
--data-binary @graph-data.csv
Python:
import requests
data = (
'{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}\n'
'{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}\n'
'{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}\n'
)
resp = requests.post(
"http://localhost:2480/api/v1/batch/mydb?lightEdges=true",
auth=("root", "password"),
headers={"Content-Type": "application/x-ndjson"},
data=data,
)
print(resp.json())
# {'verticesCreated': 2, 'edgesCreated': 1, 'elapsedMs': 15, 'idMapping': {'p1': '#9:0', 'p2': '#9:1'}}
JavaScript (Node.js):
const resp = await fetch("http://localhost:2480/api/v1/batch/mydb", {
method: "POST",
headers: {
"Content-Type": "application/x-ndjson",
Authorization: "Basic " + btoa("root:password"),
},
body: [
'{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}',
'{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}',
'{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}',
].join("\n"),
});
console.log(await resp.json());
Tip: For maximum throughput, group vertices by type in the input. The endpoint batches consecutive same-type vertices into a single createVertices() call. Interleaving types forces smaller batches.
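To get those consecutive same-type runs, a stable sort of the vertex records by @class before serialization is enough (stable, so records of each type keep their relative order). A sketch, with `group_vertices_by_type` being our own helper name:

```python
def group_vertices_by_type(records):
    """Reorder records so all vertices come first, grouped by @class,
    followed by edges in their original order. Python's sort is stable."""
    vertices = [r for r in records if r["@type"] == "vertex"]
    edges = [r for r in records if r["@type"] == "edge"]
    vertices.sort(key=lambda r: r["@class"])
    return vertices + edges

records = [
    {"@type": "vertex", "@class": "Person", "@id": "p1"},
    {"@type": "vertex", "@class": "Product", "@id": "x1"},
    {"@type": "vertex", "@class": "Person", "@id": "p2"},
    {"@type": "edge", "@class": "BOUGHT", "@from": "p1", "@to": "x1"},
]
ordered = group_vertices_by_type(records)
print([r.get("@id", r["@class"]) for r in ordered])  # ['p1', 'p2', 'x1', 'BOUGHT']
```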
Tip: The endpoint is NOT atomic by design - GraphBatch commits internally in chunks for maximum throughput. Treat it as a bulk-loading operation, not a transactional one. The response tells you exactly how many records were committed.
gRPC Streaming API - GraphBatch with Backpressure
For high-throughput pipelines where HTTP overhead matters, v26.3.2 also ships a streaming gRPC endpoint that wraps GraphBatch. It uses client-streaming RPC with built-in flow control, so the server applies backpressure when it’s flushing to disk - your producer never overwhelms the database.
rpc GraphBatchLoad (stream GraphBatchChunk) returns (GraphBatchResult);
The client sends a stream of GraphBatchChunk messages, each containing a batch of vertex or edge records. The first chunk must include the database name and any configuration options. When the stream closes, the server returns a single GraphBatchResult with counts and the temporary ID-to-RID mapping.
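The client-side chunking logic is simple: slice the record stream into fixed-size chunks and attach the database name and options only to the first one. A protocol-agnostic sketch, with plain dicts standing in for the protobuf messages:

```python
def make_chunks(database, options, records, chunk_size=1000):
    """Yield chunks for a client-streaming upload: the first chunk carries
    database and options, later chunks carry only records."""
    for i in range(0, len(records), chunk_size):
        chunk = {"records": records[i:i + chunk_size]}
        if i == 0:
            chunk["database"] = database
            chunk["options"] = options
        yield chunk

records = [{"kind": "VERTEX", "temp_id": f"v{n}"} for n in range(2500)]
chunks = list(make_chunks("mydb", {"light_edges": True}, records, chunk_size=1000))
print(len(chunks))              # 3
print("database" in chunks[1])  # False
```

With real stubs, each yielded dict becomes one GraphBatchChunk message, as in the Python and Go examples below.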
Why gRPC?
| | HTTP Batch | gRPC Streaming |
|---|---|---|
| Protocol | Single HTTP request, streamed body | Client-streaming RPC with backpressure |
| Backpressure | None (server buffers or drops) | Built-in flow control per chunk |
| Format | JSONL or CSV (text) | Protobuf (binary, typed) |
| Best for | Scripts, one-off imports, simple integrations | High-throughput pipelines, microservices, polyglot stacks |
| Language support | Any HTTP client | Go, Python, Java, C++, Rust, Node.js, and more |
Both endpoints expose the same GraphBatch options and deliver the same engine-level performance. Choose gRPC when you need backpressure, binary efficiency, or native code generation from the proto file.
Message Structure
Each GraphBatchChunk contains:
- database — the target database name (required on the first chunk)
- credentials — optional authentication
- options — GraphBatch configuration (same parameters as the HTTP endpoint)
- records — a list of vertex or edge records
Records use the GraphBatchRecord message:
message GraphBatchRecord {
enum Kind { VERTEX = 0; EDGE = 1; }
Kind kind = 1;
string type_name = 2; // vertex or edge type name
string temp_id = 3; // vertex temp ID (for edge references)
string from_ref = 4; // edge source: temp ID or "#bucket:pos"
string to_ref = 5; // edge target: temp ID or "#bucket:pos"
map<string, GrpcValue> properties = 6;
}
Important: all vertex records must appear before any edge records across all chunks. Interleaving is not supported and will result in an error.
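Since the vertices-before-edges rule spans the whole stream, it is cheap to enforce client-side before sending. A sketch (the `check_ordering` helper is ours):

```python
def check_ordering(record_kinds):
    """Verify no vertex appears after the first edge across the whole
    stream. record_kinds is the sequence of 'VERTEX'/'EDGE' kinds, in
    send order across all chunks."""
    seen_edge = False
    for i, kind in enumerate(record_kinds):
        if kind == "EDGE":
            seen_edge = True
        elif seen_edge:
            raise ValueError(f"vertex at position {i} appears after an edge")

check_ordering(["VERTEX", "VERTEX", "EDGE", "EDGE"])  # ok
try:
    check_ordering(["VERTEX", "EDGE", "VERTEX"])
except ValueError as e:
    print(e)  # vertex at position 2 appears after an edge
```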
Python Example (grpcio)
import grpc
from arcadedb_pb2 import *
from arcadedb_pb2_grpc import ArcadeDbServiceStub
channel = grpc.insecure_channel("localhost:2424")
stub = ArcadeDbServiceStub(channel)
def generate_chunks():
# First chunk: database, options, and initial vertices
yield GraphBatchChunk(
database="mydb",
credentials=DatabaseCredentials(username="root", password="password"),
options=GraphBatchOptions(light_edges=True, batch_size=100000),
records=[
GraphBatchRecord(kind=GraphBatchRecord.VERTEX,
type_name="Person", temp_id="p1",
properties={"name": GrpcValue(string_value="Alice")}),
GraphBatchRecord(kind=GraphBatchRecord.VERTEX,
type_name="Person", temp_id="p2",
properties={"name": GrpcValue(string_value="Bob")}),
],
)
# Second chunk: edges referencing temp IDs
yield GraphBatchChunk(
records=[
GraphBatchRecord(kind=GraphBatchRecord.EDGE,
type_name="KNOWS",
from_ref="p1", to_ref="p2"),
],
)
result = stub.GraphBatchLoad(generate_chunks())
print(f"Created {result.vertices_created} vertices, {result.edges_created} edges "
f"in {result.elapsed_ms}ms")
print(f"ID mapping: {dict(result.id_mapping)}")
# Created 2 vertices, 1 edges in 12ms
# ID mapping: {'p1': '#9:0', 'p2': '#9:1'}
Go Example
stream, err := client.GraphBatchLoad(ctx)
if err != nil {
log.Fatal(err)
}
// First chunk with vertices
stream.Send(&pb.GraphBatchChunk{
Database: "mydb",
Credentials: &pb.DatabaseCredentials{Username: "root", Password: "password"},
Options: &pb.GraphBatchOptions{LightEdges: true},
Records: []*pb.GraphBatchRecord{
{Kind: pb.GraphBatchRecord_VERTEX, TypeName: "Person",
TempId: "p1", Properties: map[string]*pb.GrpcValue{
"name": {Value: &pb.GrpcValue_StringValue{StringValue: "Alice"}},
}},
{Kind: pb.GraphBatchRecord_VERTEX, TypeName: "Person",
TempId: "p2", Properties: map[string]*pb.GrpcValue{
"name": {Value: &pb.GrpcValue_StringValue{StringValue: "Bob"}},
}},
},
})
// Second chunk with edges
stream.Send(&pb.GraphBatchChunk{
Records: []*pb.GraphBatchRecord{
{Kind: pb.GraphBatchRecord_EDGE, TypeName: "KNOWS",
FromRef: "p1", ToRef: "p2"},
},
})
result, err := stream.CloseAndRecv()
fmt.Printf("Created %d vertices, %d edges in %dms\n",
result.VerticesCreated, result.EdgesCreated, result.ElapsedMs)
Java Example (generated stubs)
StreamObserver<GraphBatchResult> responseObserver = new StreamObserver<>() {
@Override
public void onNext(GraphBatchResult result) {
System.out.printf("Created %d vertices, %d edges in %dms%n",
result.getVerticesCreated(), result.getEdgesCreated(), result.getElapsedMs());
}
@Override public void onError(Throwable t) { t.printStackTrace(); }
@Override public void onCompleted() { }
};
StreamObserver<GraphBatchChunk> requestStream = stub.graphBatchLoad(responseObserver);
// Send vertices
requestStream.onNext(GraphBatchChunk.newBuilder()
.setDatabase("mydb")
.setCredentials(DatabaseCredentials.newBuilder()
.setUsername("root").setPassword("password"))
.setOptions(GraphBatchOptions.newBuilder().setLightEdges(true))
.addRecords(GraphBatchRecord.newBuilder()
.setKind(GraphBatchRecord.Kind.VERTEX)
.setTypeName("Person").setTempId("p1")
.putProperties("name", GrpcValue.newBuilder().setStringValue("Alice").build()))
.addRecords(GraphBatchRecord.newBuilder()
.setKind(GraphBatchRecord.Kind.VERTEX)
.setTypeName("Person").setTempId("p2")
.putProperties("name", GrpcValue.newBuilder().setStringValue("Bob").build()))
.build());
// Send edges
requestStream.onNext(GraphBatchChunk.newBuilder()
.addRecords(GraphBatchRecord.newBuilder()
.setKind(GraphBatchRecord.Kind.EDGE)
.setTypeName("KNOWS").setFromRef("p1").setToRef("p2"))
.build());
requestStream.onCompleted();
Tip: For very large imports with millions of vertices using temp IDs, the id_mapping in the response may exceed the default gRPC message size limit (4 MB). In that case, increase maxInboundMessageSize on the client, or skip temp IDs when you don’t need the RID mapping back.
Tip: Like the HTTP endpoint, the gRPC streaming API is NOT atomic - GraphBatch commits internally in chunks. If the stream is interrupted mid-flight, records already flushed are committed. Design your pipeline for idempotent re-runs.
Get Started
GraphBatch is available starting from ArcadeDB v26.3.2. Check out the documentation for API details and usage examples.
Download ArcadeDB v26.3.2: GitHub Releases
If you have questions or feedback, join us on Discord or open an issue on GitHub.