Importing a real-world dataset into a graph database usually means writing a custom ETL script: parse the files, resolve foreign keys, batch your transactions, handle edge cases. It works, but it’s tedious, error-prone, and you end up throwing away the script once the import is done.
ArcadeDB v26.3.2 introduces the GraphImporter — a declarative tool that turns CSV, XML, and JSONL files into a fully connected graph using nothing but a JSON configuration file. No code, no custom scripts. Under the hood it uses the GraphBatch engine for maximum throughput.
Let’s see how it works by importing a real dataset: the StackOverflow data dump.
The StackOverflow Graph Model
The StackOverflow data dump is a classic dataset for benchmarking and graph analysis. It ships as a set of XML files, each representing a table in the original relational schema. Here’s the graph model we’ll build:
        ASKED                  TAGGED_WITH
User ──────────────> Question ──────────────> Tag
  │                     │
  │ ANSWERED            │ HAS_ANSWER,
  │                     │ ACCEPTED_ANSWER
  v                     v
Answer <────────────────┘
User ──WROTE_COMMENT──> Comment ──COMMENTED_ON──> Question/Answer
User ──EARNED──> Badge
Question ──LINKED_TO──> Question
Six vertex types, ten edge types, all derived from six XML files. Let’s see how to express this as a single JSON configuration.
The Import Configuration
Here’s the complete JSON file that defines the entire import:
{
  "vertices": [
    {
      "type": "Tag", "file": "Tags.xml", "id": "Id", "nameId": "TagName",
      "properties": { "Id": "int:Id", "TagName": "TagName", "Count": "int:Count" }
    },
    {
      "type": "User", "file": "Users.xml", "id": "Id",
      "properties": {
        "Id": "int:Id", "DisplayName": "DisplayName", "Reputation": "int:Reputation",
        "CreationDate": "CreationDate", "Views": "int:Views",
        "UpVotes": "int:UpVotes", "DownVotes": "int:DownVotes"
      }
    },
    {
      "type": "Question", "file": "Posts.xml", "id": "Id", "filter": "PostTypeId=1",
      "properties": {
        "Id": "int:Id", "Title": "Title", "Body": "Body", "Score": "int:Score",
        "ViewCount": "int:ViewCount", "CreationDate": "CreationDate",
        "AnswerCount": "int:AnswerCount", "CommentCount": "int:CommentCount", "Tags": "Tags"
      },
      "edges": [
        { "attribute": "OwnerUserId", "edge": "ASKED", "target": "User", "direction": "in" },
        { "attribute": "Tags", "edge": "TAGGED_WITH", "target": "Tag", "split": "|" }
      ]
    },
    {
      "type": "Answer", "file": "Posts.xml", "id": "Id", "filter": "PostTypeId=2",
      "properties": {
        "Id": "int:Id", "Body": "Body", "Score": "int:Score",
        "CreationDate": "CreationDate", "CommentCount": "int:CommentCount"
      },
      "edges": [
        { "attribute": "OwnerUserId", "edge": "ANSWERED", "target": "User", "direction": "in" },
        { "attribute": "ParentId", "edge": "HAS_ANSWER", "target": "Question", "direction": "in" }
      ]
    },
    {
      "type": "Comment", "file": "Comments.xml", "id": "Id",
      "properties": {
        "Id": "int:Id", "Score": "int:Score", "CreationDate": "CreationDate", "Text": "Text"
      },
      "edges": [
        { "attribute": "PostId", "edge": "COMMENTED_ON", "target": "Question" },
        { "attribute": "PostId", "edge": "COMMENTED_ON_ANSWER", "target": "Answer" },
        { "attribute": "UserId", "edge": "WROTE_COMMENT", "target": "User", "direction": "in" }
      ]
    },
    {
      "type": "Badge", "file": "Badges.xml", "id": "Id",
      "properties": {
        "Id": "int:Id", "Name": "Name", "Date": "Date",
        "BadgeClass": "int:Class", "TagBased": "bool:TagBased"
      },
      "edges": [
        { "attribute": "UserId", "edge": "EARNED", "target": "User", "direction": "in" }
      ]
    }
  ],
  "edgeSources": [
    {
      "edge": "ACCEPTED_ANSWER", "file": "Posts.xml",
      "from": "Id:Question", "to": "AcceptedAnswerId:Answer"
    },
    {
      "edge": "LINKED_TO", "file": "PostLinks.xml",
      "from": "PostId:Question", "to": "RelatedPostId:Question",
      "properties": { "LinkType": "int:LinkTypeId" }
    }
  ],
  "postImportCommands": [
    {
      "language": "sql",
      "command": "CREATE GRAPH ANALYTICAL VIEW IF NOT EXISTS stackoverflow PROPERTIES (`!Body`, `!Text`) UPDATE MODE SYNCHRONOUS"
    }
  ]
}
That’s it. Six vertex types, ten edge types, and a post-import Graph Analytical View, all in one file. Let’s break down the key patterns.
Key Patterns Explained
Splitting One File into Multiple Vertex Types
StackOverflow stores both questions and answers in the same Posts.xml file, distinguished by PostTypeId. The filter option lets you import them as separate vertex types:
{ "type": "Question", "file": "Posts.xml", "filter": "PostTypeId=1", ... }
{ "type": "Answer", "file": "Posts.xml", "filter": "PostTypeId=2", ... }
The importer reads Posts.xml once per definition, but only creates vertices for rows matching the filter. This is a common pattern when a single source table contains multiple logical entity types.
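Conceptually the filter is just an equality check on the raw attribute value, applied before any vertex is created. Here is a minimal sketch of that predicate — the RowFilter class is illustrative, not part of ArcadeDB’s actual API:

```java
import java.util.Map;

// Illustrative sketch of evaluating an "attribute=value" filter against a
// parsed row; not the actual GraphImporter implementation.
public class RowFilter {
  private final String attribute;
  private final String value;

  public RowFilter(final String expression) {
    final String[] parts = expression.split("=", 2);
    this.attribute = parts[0];
    this.value = parts[1];
  }

  public boolean matches(final Map<String, String> row) {
    return value.equals(row.get(attribute));
  }

  public static void main(String[] args) {
    final RowFilter questions = new RowFilter("PostTypeId=1");
    System.out.println(questions.matches(Map.of("PostTypeId", "1", "Id", "42"))); // true
    System.out.println(questions.matches(Map.of("PostTypeId", "2", "Id", "43"))); // false
  }
}
```

Rows that fail the predicate are simply skipped, which is why reading Posts.xml twice stays cheap: the non-matching half of each pass does no vertex work.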
Foreign Key Resolution
Most edges are derived from foreign key attributes in the source data. The importer needs to know two things: which attribute holds the foreign key, and which vertex type it references.
When the foreign key is in this vertex’s source but the edge should start at the target, set "direction": "in":
{ "attribute": "ParentId", "edge": "HAS_ANSWER", "target": "Question", "direction": "in" }
This means: “read ParentId from each Answer row, find the Question with that ID, and create a HAS_ANSWER edge from the Question to this Answer”. The "direction": "in" flips the edge so the Question is the source (the question has an answer, not the other way around).
Default direction is "out" — the current vertex is the edge source:
{ "attribute": "PostId", "edge": "COMMENTED_ON", "target": "Question" }
This creates an edge from Comment to Question.
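The direction flag only decides which endpoint the edge starts from; the foreign-key lookup itself is the same either way. A sketch of that rule (names are illustrative, not ArcadeDB internals):

```java
// Illustrative: given the current vertex and the resolved FK target,
// "direction" decides which end the edge starts from.
public class EdgeDirection {
  public static String[] endpoints(String current, String target, String direction) {
    // "out" (default): current -> target; "in": target -> current
    return "in".equals(direction)
        ? new String[] { target, current }
        : new String[] { current, target };
  }

  public static void main(String[] args) {
    // Answer row whose ParentId resolves to a Question, with direction "in":
    String[] e = endpoints("Answer#7", "Question#3", "in");
    System.out.println(e[0] + " -> " + e[1]); // Question#3 -> Answer#7
  }
}
```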
Split-Field Edges (Multi-Value Attributes)
StackOverflow stores tags as a single delimited string such as |java|python|sql|. The split option expands this into multiple edges:
{ "attribute": "Tags", "edge": "TAGGED_WITH", "target": "Tag", "split": "|" }
For a question tagged |java|python|sql|, this creates three TAGGED_WITH edges — one to each Tag vertex. The split values are resolved using the target’s nameId attribute (in this case, TagName), not the integer id.
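The expansion itself is simple: split on the literal delimiter and drop empty tokens (the leading and trailing | would otherwise produce empty values). A sketch, with SplitField as a hypothetical helper:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative expansion of a delimited multi-value attribute into one
// edge target per non-empty token, as the "split" option does.
public class SplitField {
  public static List<String> splitValues(String raw, String delimiter) {
    final List<String> values = new ArrayList<>();
    // Pattern.quote: "|" is a regex metacharacter, so split on it literally
    for (String token : raw.split(Pattern.quote(delimiter)))
      if (!token.isEmpty())
        values.add(token);
    return values;
  }

  public static void main(String[] args) {
    System.out.println(splitValues("|java|python|sql|", "|")); // [java, python, sql]
  }
}
```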
Edge-Only Sources
Some relationships live in their own source file rather than as foreign keys in a vertex file. The edgeSources section handles these:
{
"edge": "LINKED_TO", "file": "PostLinks.xml",
"from": "PostId:Question", "to": "RelatedPostId:Question",
"properties": { "LinkType": "int:LinkTypeId" }
}
The compact "attribute:vertexType" syntax tells the importer which attribute to read and which vertex type to resolve against. Both endpoints must already exist (vertex sources are processed first).
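Parsing that compact syntax amounts to splitting on the first colon. A sketch with an illustrative EndpointRef class (not the importer’s actual parser):

```java
// Illustrative parse of the compact "attribute:vertexType" endpoint syntax
// used in edgeSources "from" and "to" fields.
public class EndpointRef {
  public final String attribute;
  public final String vertexType;

  public EndpointRef(String spec) {
    final int colon = spec.indexOf(':');
    this.attribute = spec.substring(0, colon);
    this.vertexType = spec.substring(colon + 1);
  }

  public static void main(String[] args) {
    final EndpointRef from = new EndpointRef("PostId:Question");
    System.out.println(from.attribute + " / " + from.vertexType); // PostId / Question
  }
}
```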
Property Type Mapping
Properties are strings by default. Prefix the source attribute with a type hint for automatic conversion:
| Syntax | Type | Example |
|---|---|---|
| "DisplayName" | String | "name": "DisplayName" |
| "int:Score" | Integer | "score": "int:Score" |
| "bool:TagBased" | Boolean | "tagBased": "bool:TagBased" |
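The conversion can be pictured as a prefix check on the mapping string; anything without a recognized prefix stays a string. A sketch (TypeHint is a made-up name, not the importer’s internal class):

```java
// Illustrative conversion for "int:" / "bool:" prefixed property mappings;
// untyped mappings stay strings.
public class TypeHint {
  public static Object convert(String mapping, String rawValue) {
    if (mapping.startsWith("int:"))
      return Integer.valueOf(rawValue);
    if (mapping.startsWith("bool:"))
      return Boolean.valueOf(rawValue);
    return rawValue;
  }

  public static void main(String[] args) {
    System.out.println(convert("int:Score", "42"));       // 42 (Integer)
    System.out.println(convert("bool:TagBased", "true")); // true (Boolean)
    System.out.println(convert("DisplayName", "Alice"));  // Alice (String)
  }
}
```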
Post-Import Commands
The postImportCommands array runs SQL (or any supported language) after the import completes. In this example, we create a Graph Analytical View that pre-computes the graph structure for fast OLAP queries, excluding large text properties (Body, Text) to keep the view compact:
{
  "language": "sql",
  "command": "CREATE GRAPH ANALYTICAL VIEW IF NOT EXISTS stackoverflow PROPERTIES (`!Body`, `!Text`) UPDATE MODE SYNCHRONOUS"
}
Running the Import
From the Command Line
java com.arcadedb.integration.importer.graph.GraphImporter \
    stackoverflow-import.json \
    /path/to/database \
    /path/to/stackoverflow-data
The importer auto-creates the schema (vertex and edge types) from the JSON config, runs the two-pass import, and executes post-import commands. File paths in the JSON are resolved relative to the data directory (third argument).
From Java
Database database = new DatabaseFactory("/path/to/database").create();
String json = Files.readString(Path.of("stackoverflow-import.json"));

GraphImporter.createSchemaFromConfig(database, new JSONObject(json));

try (GraphImporter importer = GraphImporter.fromJSON(database, json, "/path/to/data")) {
  importer.run();
  System.out.printf("Vertices: %,d  Edges: %,d%n",
      importer.getVertexCount(), importer.getEdgeCount());
}

GraphImporter.executePostImportCommands(database, new JSONObject(json));
Programmatic Builder API
If you prefer code over JSON, the same import can be expressed with the builder:
GraphImporter.builder(database)
    .vertex("Tag", XmlRowSource.from(dataDir, "Tags.xml"), v -> {
      v.id("Id");
      v.idByName("TagName");
      v.property("TagName", "TagName");
      v.intProperty("Count", "Count");
    })
    .vertex("User", XmlRowSource.from(dataDir, "Users.xml"), v -> {
      v.id("Id");
      v.property("DisplayName", "DisplayName");
      v.intProperty("Reputation", "Reputation");
    })
    .vertex("Question", XmlRowSource.from(dataDir, "Posts.xml"), v -> {
      v.filter("PostTypeId", "1");
      v.id("Id");
      v.property("Title", "Title");
      v.intProperty("Score", "Score");
      v.edgeIn("OwnerUserId", "ASKED", "User");
      v.splitEdge("Tags", "TAGGED_WITH", "Tag", "|");
    })
    // ... remaining vertex and edge sources
    .build()
    .run();
How It Works Under the Hood
The GraphImporter uses a two-pass, CSR-first (Compressed Sparse Row) architecture:
Pass 1 — Vertices and topology collection. Each data source is read once. Vertices are created with full properties and flushed to disk immediately. Foreign key values are collected as compressed primitive arrays (int arrays for IDs, bucket/position pairs for RIDs) — no objects, no boxing, minimal GC pressure.
Pass 2 — Edge creation. The collected topology is fed into GraphBatch, which creates all edges with bidirectional traversal support. Each edge type is processed as a single batch for maximum sequential I/O.
This design means vertex data doesn’t stay in memory — only the graph topology does. For a dataset with 8 million vertices and 15 million edges, the in-memory topology is roughly 300 MB.
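To see why primitive arrays keep the footprint this small, here is a rough back-of-the-envelope in plain Java. It covers only the int-pair portion of the topology; RID pairs and per-type bookkeeping add the rest of the quoted total:

```java
// Back-of-the-envelope estimate of the int-pair topology footprint:
// one int per endpoint instead of a boxed object per reference.
public class TopologyFootprint {
  public static long primitiveBytes(long edges) {
    return edges * 2 * Integer.BYTES; // from-id + to-id per edge
  }

  public static void main(String[] args) {
    final long edges = 15_000_000L;
    // 15M edges * 8 bytes = 120 MB of raw ints; boxed Integer pairs would
    // cost several times more once object headers and references are counted.
    System.out.printf("int-pair topology: ~%d MB%n", primitiveBytes(edges) >> 20);
  }
}
```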
Supported Data Sources
| Format | Auto-detected | Notes |
|---|---|---|
| CSV | .csv | Configurable delimiter ("delimiter": ",") and skip lines ("skipLines": 1) |
| JSONL | .jsonl, .ndjson | One JSON object per line |
| XML | .xml | Attribute-based by default (StackOverflow-style <row .../>). Set "element": "book" for child-element parsing |
All sources are streamed — the importer never loads an entire file into memory.
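Streaming attribute-based XML like Posts.xml requires nothing beyond the JDK’s StAX API. This self-contained sketch shows the kind of loop involved; readRows is illustrative, not the importer’s actual reader:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative StAX loop for attribute-based rows (<row .../>): events are
// pulled one at a time, so the file is never fully loaded into memory.
public class RowStream {
  public static List<Map<String, String>> readRows(Reader input) throws Exception {
    final List<Map<String, String>> rows = new ArrayList<>();
    final XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(input);
    while (reader.hasNext())
      if (reader.next() == XMLStreamConstants.START_ELEMENT && "row".equals(reader.getLocalName())) {
        final Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < reader.getAttributeCount(); i++)
          row.put(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
        rows.add(row);
      }
    return rows;
  }

  public static void main(String[] args) throws Exception {
    final String xml = "<posts><row Id=\"1\" PostTypeId=\"1\"/><row Id=\"2\" PostTypeId=\"2\"/></posts>";
    System.out.println(readRows(new StringReader(xml)));
  }
}
```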
Configuration Reference
Vertex Source
| Field | Required | Description |
|---|---|---|
| type | Yes | ArcadeDB vertex type name (auto-created if missing) |
| file | Yes | Source file path, relative to the data directory |
| id | No | Integer primary key attribute for edge resolution |
| nameId | No | String-based secondary key (for split-field edge resolution) |
| filter | No | Row filter: "attribute=value" — only matching rows are imported |
| properties | No | Map of "dbPropertyName": "SourceAttr" (or "int:Attr", "bool:Attr") |
| edges | No | Array of edge definitions derived from foreign keys in this source |
Edge Definition (inside a vertex source)
| Field | Required | Description |
|---|---|---|
| attribute | Yes | Source attribute containing the foreign key value |
| edge | Yes | ArcadeDB edge type name (auto-created if missing) |
| target | Yes | Target vertex type the foreign key references |
| direction | No | "out" (default) or "in" — controls edge direction |
| split | No | Delimiter for multi-value fields (creates one edge per value) |
Edge-Only Source
| Field | Required | Description |
|---|---|---|
| edge | Yes | ArcadeDB edge type name |
| file | Yes | Source file path |
| from | Yes | "attribute:vertexType" — source vertex reference |
| to | Yes | "attribute:vertexType" — target vertex reference |
| properties | No | Map of "dbPropertyName": "int:SourceAttr" |
Get Started
The GraphImporter is available starting from ArcadeDB v26.3.2. Download the StackOverflow data dump, grab the JSON config above, and you’ll have a fully connected graph in minutes.
Download ArcadeDB v26.3.2: GitHub Releases
If you have questions or feedback, join us on Discord or open an issue on GitHub.