If you’ve ever loaded millions of edges into a graph database, you know the pain: what should be a straightforward bulk import can take minutes - or even hours - as the transactional overhead stacks up. Today we’re introducing GraphBatch, a new engine-level API in ArcadeDB v26.3.2 that makes large-scale graph ingestion dramatically faster. And with the new HTTP batch endpoint and streaming gRPC API, you can leverage that power from any language.
Why a New Importer?
ArcadeDB has always offered two ways to load graph data: the standard transactional API (batching operations in explicit transactions) and the GraphImporter (an integration-level helper that manages batching for you). Both work well for moderate workloads, but at scale the transactional overhead becomes a bottleneck.
GraphBatch takes a fundamentally different approach. Instead of wrapping the standard API, it operates directly at the storage engine level, bypassing the transactional layer entirely during bulk import. The result: throughput that scales with your hardware, not your transaction size.
The Benchmark
We ran a series of benchmarks loading graphs of increasing size on the same hardware, measuring edges ingested per second. Here are the results.
1M Vertices, 10M Edges — Light Edges (No Properties)
| Method | Time (ms) | Edges/sec | Speedup |
|---|---|---|---|
| Standard API (tx/1000) | 267,140 | 37,434 | 1.00x |
| Old GraphImporter (integration) | 97,160 | 102,923 | 2.75x |
| New GraphBatch (engine) | 31,842 | 314,047 | 8.39x |
The new importer is 8.39x faster than the standard API and 3.05x faster than the previous GraphImporter. What previously took nearly 4.5 minutes now completes in about 32 seconds.
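The throughput and speedup figures follow directly from the timings. A quick arithmetic sanity check of the table above (the helper name `edges_per_sec` is ours, not part of any API):

```python
# Sanity-check the benchmark table:
#   edges/sec = edges / (time_ms / 1000)
#   speedup   = method throughput / baseline throughput
EDGES = 10_000_000

def edges_per_sec(time_ms: int) -> int:
    return round(EDGES / (time_ms / 1000))

standard = edges_per_sec(267_140)   # ~37,434 edges/sec
importer = edges_per_sec(97_160)    # ~102,923 edges/sec
batch    = edges_per_sec(31_842)    # ~314,050 edges/sec (table rounds to 314,047)

print(f"GraphBatch speedup vs standard API: {batch / standard:.2f}x")  # 8.39x
```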
1M Vertices, 10M Edges — Edges with Properties (int + long)
| Method | Time (ms) | Edges/sec | Speedup |
|---|---|---|---|
| Standard API + props (tx/1000) | 267,773 | 37,345 | 1.00x |
| New GraphBatch + props | 53,893 | 185,554 | 4.97x |
Even with properties on every edge, GraphBatch delivers a 4.97x speedup. The additional serialization cost is manageable because the engine-level approach avoids the per-transaction overhead that dominates at scale.
Scaling Behavior
This is where things get really interesting. We compared how each method behaves as the graph size increases:
| Scale | Std API (edges/sec) | GraphBatch (edges/sec) | Speedup |
|---|---|---|---|
| 10K vertices / 100K edges | 241,644 | 1,025,019 | 4.24x |
| 100K vertices / 1M edges | 103,027 | 1,212,756 | 11.77x |
| 1M vertices / 10M edges | 37,434 | 314,047 | 8.39x |
Two things stand out:
- The standard API degrades significantly at scale — from 241K edges/sec at 100K edges down to just 37K edges/sec at 10M edges. This is expected: as the graph grows, transaction management, index maintenance, and page cache pressure all increase.
- GraphBatch holds up far better — peaking at over 1.2 million edges per second at the 1M-edge scale. At the largest scale (10M edges), memory pressure naturally reduces throughput, but it still maintains 314K edges/sec — a strong result for a single machine.
The sweet spot appears to be around the 100K–1M vertex range, where GraphBatch reaches 11.77x the throughput of the standard API.
When to Use GraphBatch
GraphBatch is designed for bulk edge creation — whether that’s during initial data loading or at runtime on an existing database. It doesn’t require an empty database: as long as vertex and edge types exist in the schema and the source/destination vertices have valid RIDs, you’re good to go.
Initial Import Scenarios
- Data migration — moving graph data from another database into ArcadeDB
- ETL pipelines — loading large datasets from data warehouses or data lakes
- Testing and benchmarking — quickly setting up large test graphs
Runtime Scenarios
GraphBatch works on live databases with existing data, making it the right tool whenever you need to create edges in bulk at runtime:
- Social networks — a user imports their contact list and you need to create thousands of KNOWS edges between existing Person vertices
- IoT / time series — a periodic job links new sensor readings to their device vertices and chains them in a time series
- Knowledge graphs — after an NLP pipeline extracts relationships from documents, you materialize thousands of typed edges between existing entity vertices
- Recommendation engines — nightly rebuild of ALSO_BOUGHT / SIMILAR_TO edges based on updated purchase data
- Incremental ETL — periodically sync new relationships from an external system into an existing graph
When NOT to Use It
- Small writes — for fewer than ~100 edges, the standard API is simpler and the importer overhead isn’t worth it
- Concurrent reads on the same vertices — the importer disables read-your-writes and manages its own transactions, so concurrent readers may see inconsistent state until close()
- Immediate edge visibility required — in parallel mode, incoming edges aren’t fully connected until close()
For ongoing OLTP workloads with small, frequent writes, the standard transactional API remains the right choice — it provides full ACID guarantees with immediate visibility.
Runtime Usage Examples
Bulk Friend Import (Light Edges)
// Vertices already exist in the database
RID[] personRIDs = lookupExistingPersons(contactIds);
try (GraphBatch batch = database.batch()
.withBatchSize(50_000)
.withLightEdges(true)
.build()) {
for (int[] pair : contactPairs)
batch.newEdge(personRIDs[pair[0]], "KNOWS", personRIDs[pair[1]]);
}
IoT Sensor Linkage (with WAL for Crash Safety)
try (GraphBatch batch = database.batch()
.withBatchSize(100_000)
.withWAL(true)
.withCommitEvery(10_000)
.build()) {
for (SensorReading r : newReadings) {
batch.newEdge(r.deviceRID, "HAS_READING", r.rid, "timestamp", r.ts);
if (r.previousRID != null)
batch.newEdge(r.rid, "NEXT", r.previousRID);
}
}
Knowledge Graph Entity Resolution (with Edge Properties)
try (GraphBatch batch = database.batch()
.withBatchSize(200_000)
.withParallelFlush(true)
.build()) {
for (ExtractedRelation rel : relations)
batch.newEdge(rel.subjectRID, rel.edgeType, rel.objectRID,
"confidence", rel.score, "source", rel.docId);
}
Nightly Recommendation Rebuild
// Remove stale edges
database.command("sql", "DELETE EDGE ALSO_BOUGHT");
// Rebuild from recommendation engine output
try (GraphBatch batch = database.batch()
.withBatchSize(500_000)
.withLightEdges(true)
.build()) {
for (Recommendation rec : recommendations)
batch.newEdge(rec.productRID, "ALSO_BOUGHT", rec.relatedRID);
}
Incremental Sync from External Database
try (GraphBatch batch = database.batch()
.withBatchSize(100_000)
.withWAL(true)
.build()) {
try (ResultSet rs = externalDB.executeQuery(deltaQuery)) {
while (rs.next())
batch.newEdge(
lookupRID(rs.getString("from_id")),
"REPORTS_TO",
lookupRID(rs.getString("to_id")),
"since", rs.getDate("start_date"));
}
}
Tip: For runtime usage on production databases, enable WAL with withWAL(true) for crash safety. For initial imports where you can re-run on failure, leaving WAL off maximizes throughput.
HTTP Batch Endpoint — GraphBatch for Every Language
GraphBatch is a Java API, but not everyone embeds ArcadeDB in a JVM application. That’s why v26.3.2 also ships a new HTTP batch endpoint that exposes the full power of GraphBatch over the HTTP API — no Java required.
POST /api/v1/batch/{database}
It supports two input formats: JSONL (newline-delimited JSON) and CSV. Both are streamed — the server never loads the entire payload into memory, so you can push millions of records in a single request.
JSONL Format
{"@type":"vertex","@class":"Person","@id":"t1","name":"Alice","age":30}
{"@type":"vertex","@class":"Person","@id":"t2","name":"Bob","age":25}
{"@type":"edge","@class":"KNOWS","@from":"t1","@to":"t2","since":2020}
CSV Format
@type,@class,@id,name,age
vertex,Person,t1,Alice,30
vertex,Person,t2,Bob,25
---
@type,@class,@from,@to,since
edge,KNOWS,t1,t2,2020
In both formats, vertices come first, then edges. Vertices can have temporary IDs (@id) that edges reference via @from/@to. Edges can also reference existing database RIDs directly (e.g., #12:0).
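The reference rule is mechanical: anything in the "#bucket:pos" shape is an existing RID, anything else must match a temporary ID declared by an earlier vertex. A client-side sketch of that resolution (assuming only the RID convention shown above; `resolve_ref` is a hypothetical helper, not part of the API):

```python
def resolve_ref(ref: str, temp_ids: dict) -> str:
    """Resolve an @from/@to reference: '#bucket:pos' is already a RID,
    anything else must be a temp ID declared by an earlier vertex."""
    if ref.startswith("#"):
        return ref
    if ref not in temp_ids:
        raise ValueError(f"edge references unknown temp ID: {ref!r}")
    return temp_ids[ref]

# Temp-ID table as it might look after two vertices were created
temp_ids = {"t1": "#9:0", "t2": "#9:1"}
print(resolve_ref("t1", temp_ids))     # #9:0
print(resolve_ref("#12:0", temp_ids))  # #12:0 (existing RID passes through)
```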
Temporary ID Mapping
The response includes an idMapping object so you know what RIDs were assigned:
{
"verticesCreated": 2,
"edgesCreated": 1,
"elapsedMs": 42,
"idMapping": {"t1": "#9:0", "t2": "#9:1"}
}
Tuning via Query Parameters
All GraphBatch configuration options are exposed as query parameters:
| Parameter | Default | Description |
|---|---|---|
| batchSize | 100000 | Max edges buffered before auto-flush |
| lightEdges | false | Property-less edges stored as connectivity only (saves ~33% I/O) |
| wal | false | Enable Write-Ahead Logging for crash safety |
| parallelFlush | true | Parallelize edge connection across async threads |
| preAllocateEdgeChunks | true | Pre-allocate edge segments on vertex creation |
| edgeListInitialSize | 2048 | Initial segment size in bytes (64–8192) |
| bidirectional | true | Connect both outgoing and incoming edges |
| commitEvery | 50000 | Edges per sub-transaction within a flush |
| expectedEdgeCount | 0 | Hint for auto-tuning batch size |
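These options map directly onto the query string. A small helper that builds the request URL (the endpoint path and parameter names are from the table above; the `batch_url` function itself is ours):

```python
from urllib.parse import urlencode

def batch_url(base: str, database: str, **options) -> str:
    """Build the batch endpoint URL with GraphBatch options as query
    parameters. Booleans are lowered to 'true'/'false' as in the curl
    examples."""
    params = {k: str(v).lower() if isinstance(v, bool) else str(v)
              for k, v in options.items()}
    url = f"{base}/api/v1/batch/{database}"
    return f"{url}?{urlencode(params)}" if params else url

print(batch_url("http://localhost:2480", "mydb", lightEdges=True, batchSize=200_000))
# http://localhost:2480/api/v1/batch/mydb?lightEdges=true&batchSize=200000
```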
Examples
curl (JSONL):
curl -X POST "http://localhost:2480/api/v1/batch/mydb?lightEdges=true" \
-u root:password \
-H "Content-Type: application/x-ndjson" \
--data-binary @graph-data.jsonl
curl (CSV):
curl -X POST "http://localhost:2480/api/v1/batch/mydb" \
-u root:password \
-H "Content-Type: text/csv" \
--data-binary @graph-data.csv
Python:
import requests
data = (
'{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}\n'
'{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}\n'
'{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}\n'
)
resp = requests.post(
"http://localhost:2480/api/v1/batch/mydb?lightEdges=true",
auth=("root", "password"),
headers={"Content-Type": "application/x-ndjson"},
data=data,
)
print(resp.json())
# {'verticesCreated': 2, 'edgesCreated': 1, 'elapsedMs': 15, 'idMapping': {'p1': '#9:0', 'p2': '#9:1'}}
JavaScript (Node.js):
const resp = await fetch("http://localhost:2480/api/v1/batch/mydb", {
method: "POST",
headers: {
"Content-Type": "application/x-ndjson",
Authorization: "Basic " + btoa("root:password"),
},
body: [
'{"@type":"vertex","@class":"Person","@id":"p1","name":"Alice"}',
'{"@type":"vertex","@class":"Person","@id":"p2","name":"Bob"}',
'{"@type":"edge","@class":"KNOWS","@from":"p1","@to":"p2"}',
].join("\n"),
});
console.log(await resp.json());
Tip: For maximum throughput, group vertices by type in the input. The endpoint batches consecutive same-type vertices into a single createVertices() call. Interleaving types forces smaller batches.
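To get those consecutive same-type runs, a stable sort of the vertex records by @class before serialization is enough (stable, so records of each type keep their relative order). A sketch, with `group_vertices_by_type` being our own helper name:

```python
def group_vertices_by_type(records):
    """Reorder records so all vertices come first, grouped by @class,
    followed by edges in their original order. Python's sort is stable."""
    vertices = [r for r in records if r["@type"] == "vertex"]
    edges = [r for r in records if r["@type"] == "edge"]
    vertices.sort(key=lambda r: r["@class"])
    return vertices + edges

records = [
    {"@type": "vertex", "@class": "Person", "@id": "p1"},
    {"@type": "vertex", "@class": "Product", "@id": "x1"},
    {"@type": "vertex", "@class": "Person", "@id": "p2"},
    {"@type": "edge", "@class": "BOUGHT", "@from": "p1", "@to": "x1"},
]
ordered = group_vertices_by_type(records)
print([r.get("@id", r["@class"]) for r in ordered])  # ['p1', 'p2', 'x1', 'BOUGHT']
```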
Tip: The endpoint is NOT atomic by design - GraphBatch commits internally in chunks for maximum throughput. Treat it as a bulk-loading operation, not a transactional one. The response tells you exactly how many records were committed.
gRPC Streaming API - GraphBatch with Backpressure
For high-throughput pipelines where HTTP overhead matters, v26.3.2 also ships a streaming gRPC endpoint that wraps GraphBatch. It uses client-streaming RPC with built-in flow control, so the server applies backpressure when it’s flushing to disk - your producer never overwhelms the database.
rpc GraphBatchLoad (stream GraphBatchChunk) returns (GraphBatchResult);
The client sends a stream of GraphBatchChunk messages, each containing a batch of vertex or edge records. The first chunk must include the database name and any configuration options. When the stream closes, the server returns a single GraphBatchResult with counts and the temporary ID-to-RID mapping.
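The client-side chunking logic is simple: slice the record stream into fixed-size chunks and attach the database name and options only to the first one. A protocol-agnostic sketch, with plain dicts standing in for the protobuf messages:

```python
def make_chunks(database, options, records, chunk_size=1000):
    """Yield chunks for a client-streaming upload: the first chunk carries
    database and options, later chunks carry only records."""
    for i in range(0, len(records), chunk_size):
        chunk = {"records": records[i:i + chunk_size]}
        if i == 0:
            chunk["database"] = database
            chunk["options"] = options
        yield chunk

records = [{"kind": "VERTEX", "temp_id": f"v{n}"} for n in range(2500)]
chunks = list(make_chunks("mydb", {"light_edges": True}, records, chunk_size=1000))
print(len(chunks))              # 3
print("database" in chunks[1])  # False
```

With real stubs, each yielded dict becomes one GraphBatchChunk message, as in the Python and Go examples below.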
Why gRPC?
| | HTTP Batch | gRPC Streaming |
|---|---|---|
| Protocol | Single HTTP request, streamed body | Client-streaming RPC with backpressure |
| Backpressure | None (server buffers or drops) | Built-in flow control per chunk |
| Format | JSONL or CSV (text) | Protobuf (binary, typed) |
| Best for | Scripts, one-off imports, simple integrations | High-throughput pipelines, microservices, polyglot stacks |
| Language support | Any HTTP client | Go, Python, Java, C++, Rust, Node.js, and more |
Both endpoints expose the same GraphBatch options and deliver the same engine-level performance. Choose gRPC when you need backpressure, binary efficiency, or native code generation from the proto file.
Message Structure
Each GraphBatchChunk contains:
- database — the target database name (required on the first chunk)
- credentials — optional authentication
- options — GraphBatch configuration (same parameters as the HTTP endpoint)
- records — a list of vertex or edge records
Records use the GraphBatchRecord message:
message GraphBatchRecord {
enum Kind { VERTEX = 0; EDGE = 1; }
Kind kind = 1;
string type_name = 2; // vertex or edge type name
string temp_id = 3; // vertex temp ID (for edge references)
string from_ref = 4; // edge source: temp ID or "#bucket:pos"
string to_ref = 5; // edge target: temp ID or "#bucket:pos"
map<string, GrpcValue> properties = 6;
}
Important: all vertex records must appear before any edge records across all chunks. Interleaving is not supported and will result in an error.
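Since the vertices-before-edges rule spans the whole stream, it is cheap to enforce client-side before sending. A sketch (the `check_ordering` helper is ours):

```python
def check_ordering(record_kinds):
    """Verify no vertex appears after the first edge across the whole
    stream. record_kinds is the sequence of 'VERTEX'/'EDGE' kinds, in
    send order across all chunks."""
    seen_edge = False
    for i, kind in enumerate(record_kinds):
        if kind == "EDGE":
            seen_edge = True
        elif seen_edge:
            raise ValueError(f"vertex at position {i} appears after an edge")

check_ordering(["VERTEX", "VERTEX", "EDGE", "EDGE"])  # ok
try:
    check_ordering(["VERTEX", "EDGE", "VERTEX"])
except ValueError as e:
    print(e)  # vertex at position 2 appears after an edge
```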
Python Example (grpcio)
import grpc
from arcadedb_pb2 import *
from arcadedb_pb2_grpc import ArcadeDbServiceStub
channel = grpc.insecure_channel("localhost:2424")
stub = ArcadeDbServiceStub(channel)
def generate_chunks():
# First chunk: database, options, and initial vertices
yield GraphBatchChunk(
database="mydb",
credentials=DatabaseCredentials(username="root", password="password"),
options=GraphBatchOptions(light_edges=True, batch_size=100000),
records=[
GraphBatchRecord(kind=GraphBatchRecord.VERTEX,
type_name="Person", temp_id="p1",
properties={"name": GrpcValue(string_value="Alice")}),
GraphBatchRecord(kind=GraphBatchRecord.VERTEX,
type_name="Person", temp_id="p2",
properties={"name": GrpcValue(string_value="Bob")}),
],
)
# Second chunk: edges referencing temp IDs
yield GraphBatchChunk(
records=[
GraphBatchRecord(kind=GraphBatchRecord.EDGE,
type_name="KNOWS",
from_ref="p1", to_ref="p2"),
],
)
result = stub.GraphBatchLoad(generate_chunks())
print(f"Created {result.vertices_created} vertices, {result.edges_created} edges "
f"in {result.elapsed_ms}ms")
print(f"ID mapping: {dict(result.id_mapping)}")
# Created 2 vertices, 1 edges in 12ms
# ID mapping: {'p1': '#9:0', 'p2': '#9:1'}
Go Example
stream, err := client.GraphBatchLoad(ctx)
if err != nil {
log.Fatal(err)
}
// First chunk with vertices
stream.Send(&pb.GraphBatchChunk{
Database: "mydb",
Credentials: &pb.DatabaseCredentials{Username: "root", Password: "password"},
Options: &pb.GraphBatchOptions{LightEdges: true},
Records: []*pb.GraphBatchRecord{
{Kind: pb.GraphBatchRecord_VERTEX, TypeName: "Person",
TempId: "p1", Properties: map[string]*pb.GrpcValue{
"name": {Value: &pb.GrpcValue_StringValue{StringValue: "Alice"}},
}},
{Kind: pb.GraphBatchRecord_VERTEX, TypeName: "Person",
TempId: "p2", Properties: map[string]*pb.GrpcValue{
"name": {Value: &pb.GrpcValue_StringValue{StringValue: "Bob"}},
}},
},
})
// Second chunk with edges
stream.Send(&pb.GraphBatchChunk{
Records: []*pb.GraphBatchRecord{
{Kind: pb.GraphBatchRecord_EDGE, TypeName: "KNOWS",
FromRef: "p1", ToRef: "p2"},
},
})
result, err := stream.CloseAndRecv()
fmt.Printf("Created %d vertices, %d edges in %dms\n",
result.VerticesCreated, result.EdgesCreated, result.ElapsedMs)
Java Example (generated stubs)
StreamObserver<GraphBatchResult> responseObserver = new StreamObserver<>() {
@Override
public void onNext(GraphBatchResult result) {
System.out.printf("Created %d vertices, %d edges in %dms%n",
result.getVerticesCreated(), result.getEdgesCreated(), result.getElapsedMs());
}
@Override public void onError(Throwable t) { t.printStackTrace(); }
@Override public void onCompleted() { }
};
StreamObserver<GraphBatchChunk> requestStream = stub.graphBatchLoad(responseObserver);
// Send vertices
requestStream.onNext(GraphBatchChunk.newBuilder()
.setDatabase("mydb")
.setCredentials(DatabaseCredentials.newBuilder()
.setUsername("root").setPassword("password"))
.setOptions(GraphBatchOptions.newBuilder().setLightEdges(true))
.addRecords(GraphBatchRecord.newBuilder()
.setKind(GraphBatchRecord.Kind.VERTEX)
.setTypeName("Person").setTempId("p1")
.putProperties("name", GrpcValue.newBuilder().setStringValue("Alice").build()))
.addRecords(GraphBatchRecord.newBuilder()
.setKind(GraphBatchRecord.Kind.VERTEX)
.setTypeName("Person").setTempId("p2")
.putProperties("name", GrpcValue.newBuilder().setStringValue("Bob").build()))
.build());
// Send edges
requestStream.onNext(GraphBatchChunk.newBuilder()
.addRecords(GraphBatchRecord.newBuilder()
.setKind(GraphBatchRecord.Kind.EDGE)
.setTypeName("KNOWS").setFromRef("p1").setToRef("p2"))
.build());
requestStream.onCompleted();
Tip: For very large imports with millions of vertices using temp IDs, the id_mapping in the response may exceed the default gRPC message size limit (4 MB). In that case, increase maxInboundMessageSize on the client, or skip temp IDs when you don’t need the RID mapping back.
Tip: Like the HTTP endpoint, the gRPC streaming API is NOT atomic - GraphBatch commits internally in chunks. If the stream is interrupted mid-flight, records already flushed are committed. Design your pipeline for idempotent re-runs.
Get Started
GraphBatch is available starting from ArcadeDB v26.3.2. Check out the documentation for API details and usage examples.
Download ArcadeDB v26.3.2: GitHub Releases
If you have questions or feedback, join us on Discord or open an issue on GitHub.