Declarative Graph Importer: Import StackOverflow into a Graph with a Single JSON File

Importing a real-world dataset into a graph database usually means writing a custom ETL script: parse the files, resolve foreign keys, batch your transactions, handle edge cases. It works, but it’s tedious, error-prone, and you end up throwing away the script once the import is done.

ArcadeDB v26.3.2 introduces the GraphImporter — a declarative tool that turns CSV, XML, and JSONL files into a fully connected graph using nothing but a JSON configuration file. No code, no custom scripts. Under the hood it uses the GraphBatch engine for maximum throughput.

Let’s see how it works by importing a real dataset: the StackOverflow data dump.

The StackOverflow Graph Model

The StackOverflow data dump is a classic dataset for benchmarking and graph analysis. It ships as a set of XML files, each representing a table in the original relational schema. Here’s the graph model we’ll build:

            ASKED                    TAGGED_WITH
  User ──────────────> Question ──────────────> Tag
    │                    │    │
    │ ANSWERED           │    │ ACCEPTED_ANSWER
    v         HAS_ANSWER │    │
  Answer <───────────────┘    │
    ^                         │
    └─────────────────────────┘

  User ──WROTE_COMMENT──> Comment ──COMMENTED_ON──> Question
                          Comment ──COMMENTED_ON_ANSWER──> Answer
  User ──EARNED──> Badge
  Question ──LINKED_TO──> Question

Six vertex types, ten edge types, all derived from six XML files. Let’s see how to express this as a single JSON configuration.

The Import Configuration

Here’s the complete JSON file that defines the entire import:

{
  "vertices": [
    {
      "type": "Tag", "file": "Tags.xml", "id": "Id", "nameId": "TagName",
      "properties": { "Id": "int:Id", "TagName": "TagName", "Count": "int:Count" }
    },
    {
      "type": "User", "file": "Users.xml", "id": "Id",
      "properties": {
        "Id": "int:Id", "DisplayName": "DisplayName", "Reputation": "int:Reputation",
        "CreationDate": "CreationDate", "Views": "int:Views",
        "UpVotes": "int:UpVotes", "DownVotes": "int:DownVotes"
      }
    },
    {
      "type": "Question", "file": "Posts.xml", "id": "Id", "filter": "PostTypeId=1",
      "properties": {
        "Id": "int:Id", "Title": "Title", "Body": "Body", "Score": "int:Score",
        "ViewCount": "int:ViewCount", "CreationDate": "CreationDate",
        "AnswerCount": "int:AnswerCount", "CommentCount": "int:CommentCount", "Tags": "Tags"
      },
      "edges": [
        { "attribute": "OwnerUserId", "edge": "ASKED", "target": "User", "direction": "in" },
        { "attribute": "Tags", "edge": "TAGGED_WITH", "target": "Tag", "split": "|" }
      ]
    },
    {
      "type": "Answer", "file": "Posts.xml", "id": "Id", "filter": "PostTypeId=2",
      "properties": {
        "Id": "int:Id", "Body": "Body", "Score": "int:Score",
        "CreationDate": "CreationDate", "CommentCount": "int:CommentCount"
      },
      "edges": [
        { "attribute": "OwnerUserId", "edge": "ANSWERED", "target": "User", "direction": "in" },
        { "attribute": "ParentId", "edge": "HAS_ANSWER", "target": "Question", "direction": "in" }
      ]
    },
    {
      "type": "Comment", "file": "Comments.xml", "id": "Id",
      "properties": {
        "Id": "int:Id", "Score": "int:Score", "CreationDate": "CreationDate", "Text": "Text"
      },
      "edges": [
        { "attribute": "PostId", "edge": "COMMENTED_ON", "target": "Question" },
        { "attribute": "PostId", "edge": "COMMENTED_ON_ANSWER", "target": "Answer" },
        { "attribute": "UserId", "edge": "WROTE_COMMENT", "target": "User", "direction": "in" }
      ]
    },
    {
      "type": "Badge", "file": "Badges.xml", "id": "Id",
      "properties": {
        "Id": "int:Id", "Name": "Name", "Date": "Date",
        "BadgeClass": "int:Class", "TagBased": "bool:TagBased"
      },
      "edges": [
        { "attribute": "UserId", "edge": "EARNED", "target": "User", "direction": "in" }
      ]
    }
  ],

  "edgeSources": [
    {
      "edge": "ACCEPTED_ANSWER", "file": "Posts.xml",
      "from": "Id:Question", "to": "AcceptedAnswerId:Answer"
    },
    {
      "edge": "LINKED_TO", "file": "PostLinks.xml",
      "from": "PostId:Question", "to": "RelatedPostId:Question",
      "properties": { "LinkType": "int:LinkTypeId" }
    }
  ],

  "postImportCommands": [
    {
      "language": "sql",
      "command": "CREATE GRAPH ANALYTICAL VIEW IF NOT EXISTS stackoverflow PROPERTIES (`!Body`, `!Text`) UPDATE MODE SYNCHRONOUS"
    }
  ]
}

That’s it. Six vertex types, ten edge types, and a post-import Graph Analytical View — all in one file. Let’s break down the key patterns.

Key Patterns Explained

Splitting One File into Multiple Vertex Types

StackOverflow stores both questions and answers in the same Posts.xml file, distinguished by PostTypeId. The filter option lets you import them as separate vertex types:

{ "type": "Question", "file": "Posts.xml", "filter": "PostTypeId=1", ... }
{ "type": "Answer",   "file": "Posts.xml", "filter": "PostTypeId=2", ... }

The importer reads Posts.xml once per definition, but only creates vertices for rows matching the filter. This is a common pattern when a single source table contains multiple logical entity types.
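To make the `"attribute=value"` filter syntax concrete, here is a minimal sketch of the row-selection logic. `RowFilter` is a hypothetical illustration, not an ArcadeDB class — the real importer applies the same test internally while streaming rows.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch: how a "filter": "PostTypeId=1" clause selects rows.
// RowFilter is a hypothetical helper, not part of the ArcadeDB API.
public class RowFilter {
    private final String attribute;
    private final String value;

    public RowFilter(String expression) {
        // Parse the "attribute=value" syntax used in the JSON config.
        String[] parts = expression.split("=", 2);
        this.attribute = parts[0];
        this.value = parts[1];
    }

    public boolean matches(Map<String, String> row) {
        return value.equals(row.get(attribute));
    }

    public static void main(String[] args) {
        RowFilter questions = new RowFilter("PostTypeId=1");
        List<Map<String, String>> rows = List.of(
            Map.of("Id", "1", "PostTypeId", "1"),   // a question
            Map.of("Id", "2", "PostTypeId", "2"));  // an answer
        long matched = rows.stream().filter(questions::matches).count();
        System.out.println(matched + " row(s) pass the Question filter");
    }
}
```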

Foreign Key Resolution

Most edges are derived from foreign key attributes in the source data. The importer needs to know two things: which attribute holds the foreign key, and which vertex type it references.

Incoming edges — the foreign key is in this vertex’s source, but the edge should point from the target back to this vertex:

{ "attribute": "ParentId", "edge": "HAS_ANSWER", "target": "Question", "direction": "in" }

This means: “read ParentId from each Answer row, find the Question with that ID, and create a HAS_ANSWER edge from the Question to this Answer”. The "direction": "in" flips the edge so the Question is the source (the question has an answer, not the other way around).

Default direction is "out" — the current vertex is the edge source:

{ "attribute": "PostId", "edge": "COMMENTED_ON", "target": "Question" }

This creates an edge from Comment to Question.
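The orientation rule can be summed up in a few lines. This is an illustrative sketch of the described behavior, not the importer’s actual code; `EdgeEnds` is a hypothetical name.

```java
import java.util.Arrays;

// Illustrative sketch of how "direction" orients a foreign-key edge.
// EdgeEnds is a hypothetical helper, not part of the ArcadeDB API.
public class EdgeEnds {
    // Returns {source, destination} for an edge between the current vertex
    // and the vertex referenced by its foreign key attribute.
    public static String[] orient(String current, String referenced, String direction) {
        return "in".equals(direction)
            ? new String[] { referenced, current }   // flipped: referenced vertex is the source
            : new String[] { current, referenced };  // default "out": current vertex is the source
    }

    public static void main(String[] args) {
        // HAS_ANSWER with direction "in": the Question (referenced by ParentId) is the source.
        System.out.println(Arrays.toString(orient("Answer#7", "Question#3", "in")));
        // COMMENTED_ON with the default "out": the Comment is the source.
        System.out.println(Arrays.toString(orient("Comment#9", "Question#3", "out")));
    }
}
```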

Split-Field Edges (Multi-Value Attributes)

StackOverflow stores tags as a single delimited string like |java|python|sql|. The split option expands this into multiple edges:

{ "attribute": "Tags", "edge": "TAGGED_WITH", "target": "Tag", "split": "|" }

For a question tagged |java|python|sql|, this creates three TAGGED_WITH edges — one to each Tag vertex. The split values are resolved using the target’s nameId attribute (in this case, TagName), not the integer id.
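The expansion itself is simple string splitting with empty tokens dropped (leading and trailing delimiters produce them). A rough sketch, with `TagSplitter` as a hypothetical name:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative sketch: expanding a "|"-delimited Tags value into individual
// tag names, one edge per value. Not the actual importer code.
public class TagSplitter {
    public static List<String> split(String raw, String delimiter) {
        if (raw == null || raw.isEmpty()) return List.of();
        // Pattern.quote treats the delimiter literally (String.split expects a regex).
        return Arrays.stream(raw.split(Pattern.quote(delimiter)))
            .filter(s -> !s.isEmpty())  // leading/trailing delimiters yield empty tokens
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // |java|python|sql| → three tag names, each resolved via the Tag nameId.
        System.out.println(split("|java|python|sql|", "|"));
    }
}
```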

Edge-Only Sources

Some relationships live in their own source file rather than as foreign keys in a vertex file. The edgeSources section handles these:

{
  "edge": "LINKED_TO", "file": "PostLinks.xml",
  "from": "PostId:Question", "to": "RelatedPostId:Question",
  "properties": { "LinkType": "int:LinkTypeId" }
}

The compact "attribute:vertexType" syntax tells the importer which attribute to read and which vertex type to resolve against. Both endpoints must already exist (vertex sources are processed first).
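Parsing the compact endpoint syntax splits on the first colon. A sketch for illustration only; `EndpointRef` is a hypothetical name, not an ArcadeDB class:

```java
// Illustrative sketch: parsing the compact "attribute:vertexType" endpoint
// syntax used by edgeSources. EndpointRef is hypothetical, not ArcadeDB API.
public class EndpointRef {
    public final String attribute;
    public final String vertexType;

    public EndpointRef(String compact) {
        int colon = compact.indexOf(':');
        this.attribute = compact.substring(0, colon);   // which attribute to read
        this.vertexType = compact.substring(colon + 1); // which type to resolve against
    }

    public static void main(String[] args) {
        EndpointRef from = new EndpointRef("PostId:Question");
        System.out.println(from.attribute + " resolves against " + from.vertexType);
    }
}
```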

Property Type Mapping

Properties are strings by default. Prefix the source attribute with a type hint for automatic conversion:

Syntax            Type      Example
"DisplayName"     String    "name": "DisplayName"
"int:Score"       Integer   "score": "int:Score"
"bool:TagBased"   Boolean   "tagBased": "bool:TagBased"

Post-Import Commands

The postImportCommands array runs SQL (or any supported language) after the import completes. In this example, we create a Graph Analytical View that pre-computes the graph structure for fast OLAP queries, excluding large text properties (Body, Text) to keep the view compact:

{
  "language": "sql",
  "command": "CREATE GRAPH ANALYTICAL VIEW IF NOT EXISTS stackoverflow PROPERTIES (`!Body`, `!Text`) UPDATE MODE SYNCHRONOUS"
}

Running the Import

From the Command Line

java com.arcadedb.integration.importer.graph.GraphImporter \
    stackoverflow-import.json \
    /path/to/database \
    /path/to/stackoverflow-data

The importer auto-creates the schema (vertex and edge types) from the JSON config, runs the two-pass import, and executes post-import commands. File paths in the JSON are resolved relative to the data directory (third argument).

From Java

Database database = new DatabaseFactory("/path/to/database").create();

String json = Files.readString(Path.of("stackoverflow-import.json"));
GraphImporter.createSchemaFromConfig(database, new JSONObject(json));

try (GraphImporter importer = GraphImporter.fromJSON(database, json, "/path/to/data")) {
    importer.run();
    System.out.printf("Vertices: %,d  Edges: %,d%n",
        importer.getVertexCount(), importer.getEdgeCount());
}

GraphImporter.executePostImportCommands(database, new JSONObject(json));

Programmatic Builder API

If you prefer code over JSON, the same import can be expressed with the builder:

GraphImporter.builder(database)
    .vertex("Tag", XmlRowSource.from(dataDir, "Tags.xml"), v -> {
        v.id("Id");
        v.idByName("TagName");
        v.property("TagName", "TagName");
        v.intProperty("Count", "Count");
    })
    .vertex("User", XmlRowSource.from(dataDir, "Users.xml"), v -> {
        v.id("Id");
        v.property("DisplayName", "DisplayName");
        v.intProperty("Reputation", "Reputation");
    })
    .vertex("Question", XmlRowSource.from(dataDir, "Posts.xml"), v -> {
        v.filter("PostTypeId", "1");
        v.id("Id");
        v.property("Title", "Title");
        v.intProperty("Score", "Score");
        v.edgeIn("OwnerUserId", "ASKED", "User");
        v.splitEdge("Tags", "TAGGED_WITH", "Tag", "|");
    })
    // ... remaining vertex and edge sources
    .build()
    .run();

How It Works Under the Hood

The GraphImporter uses a two-pass, CSR-first (Compressed Sparse Row) architecture:

Pass 1 — Vertices and topology collection. Each data source is read once. Vertices are created with full properties and flushed to disk immediately. Foreign key values are collected as compressed primitive arrays (int arrays for IDs, bucket/position pairs for RIDs) — no objects, no boxing, minimal GC pressure.

Pass 2 — Edge creation. The collected topology is fed into GraphBatch, which creates all edges with bidirectional traversal support. Each edge type is processed as a single batch for maximum sequential I/O.

This design means vertex data doesn’t stay in memory — only the graph topology does. For a dataset with 8 million vertices and 15 million edges, the in-memory topology is roughly 300 MB.
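To illustrate why primitive arrays keep the topology so compact, here is a minimal CSR construction over integer vertex IDs. This is a simplified sketch of the general technique, not ArcadeDB’s implementation (which stores bucket/position RID pairs rather than plain ints):

```java
import java.util.Arrays;

// Illustrative sketch of a CSR (Compressed Sparse Row) topology: all edges
// stored in primitive int arrays — no objects, no boxing, minimal GC pressure.
public class CsrSketch {
    public final int[] offsets;  // offsets[v]..offsets[v+1] index into targets
    public final int[] targets;  // flattened adjacency lists

    public CsrSketch(int vertexCount, int[] sources, int[] destinations) {
        offsets = new int[vertexCount + 1];
        for (int s : sources) offsets[s + 1]++;                              // out-degree per vertex
        for (int v = 0; v < vertexCount; v++) offsets[v + 1] += offsets[v];  // prefix sums
        targets = new int[sources.length];
        int[] cursor = offsets.clone();
        for (int i = 0; i < sources.length; i++)
            targets[cursor[sources[i]]++] = destinations[i];                 // scatter into place
    }

    public int[] neighbors(int v) {
        return Arrays.copyOfRange(targets, offsets[v], offsets[v + 1]);
    }

    public static void main(String[] args) {
        // Edges 0->1, 0->2, 2->1 over 3 vertices: 2 small int arrays hold everything.
        CsrSketch g = new CsrSketch(3, new int[]{0, 0, 2}, new int[]{1, 2, 1});
        System.out.println(Arrays.toString(g.neighbors(0)));
    }
}
```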

Supported Data Sources

Format   Auto-detected     Notes
CSV      .csv              Configurable delimiter ("delimiter": ",") and skip lines ("skipLines": 1)
JSONL    .jsonl, .ndjson   One JSON object per line
XML      .xml              Attribute-based by default (StackOverflow-style <row .../>); set "element": "book" for child-element parsing

All sources are streamed — the importer never loads an entire file into memory.
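Streaming attribute-based XML of the StackOverflow `<row .../>` kind can be done with the JDK’s own StAX parser, one event at a time. This sketch is an assumption about the general approach, not the importer’s actual reader; `XmlRows` is a hypothetical name:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: streaming StackOverflow-style <row .../> elements with
// the JDK's StAX parser — one row at a time, never the whole file in memory.
public class XmlRows {
    public static List<Map<String, String>> read(Reader input) {
        List<Map<String, String>> rows = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(input);
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT && r.getLocalName().equals("row")) {
                    Map<String, String> row = new LinkedHashMap<>();
                    for (int i = 0; i < r.getAttributeCount(); i++)
                        row.put(r.getAttributeLocalName(i), r.getAttributeValue(i));
                    rows.add(row);
                }
            }
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<badges><row Id=\"1\" Name=\"Teacher\"/><row Id=\"2\" Name=\"Student\"/></badges>";
        System.out.println(read(new StringReader(xml)).size() + " rows parsed");
    }
}
```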

Configuration Reference

Vertex Source

Field        Required   Description
type         Yes        ArcadeDB vertex type name (auto-created if missing)
file         Yes        Source file path, relative to the data directory
id           No         Integer primary key attribute for edge resolution
nameId       No         String-based secondary key (for split-field edge resolution)
filter       No         Row filter: "attribute=value" — only matching rows are imported
properties   No         Map of "dbPropertyName": "SourceAttr" (or "int:Attr", "bool:Attr")
edges        No         Array of edge definitions derived from foreign keys in this source

Edge Definition (inside a vertex source)

Field       Required   Description
attribute   Yes        Source attribute containing the foreign key value
edge        Yes        ArcadeDB edge type name (auto-created if missing)
target      Yes        Target vertex type the foreign key references
direction   No         "out" (default) or "in" — controls edge direction
split       No         Delimiter for multi-value fields (creates one edge per value)

Edge-Only Source

Field        Required   Description
edge         Yes        ArcadeDB edge type name
file         Yes        Source file path
from         Yes        "attribute:vertexType" — source vertex reference
to           Yes        "attribute:vertexType" — target vertex reference
properties   No         Map of "dbPropertyName": "int:SourceAttr"

Get Started

The GraphImporter is available starting from ArcadeDB v26.3.2. Download the StackOverflow data dump, grab the JSON config above, and you’ll have a fully connected graph in minutes.

Download ArcadeDB v26.3.2: GitHub Releases

If you have questions or feedback, join us on Discord or open an issue on GitHub.