Declarative Graph Importer: Import StackOverflow into a Graph with a Single JSON File

Importing a real-world dataset into a graph database usually means writing a custom ETL script: parse the files, resolve foreign keys, batch your transactions, handle edge cases. It works, but it’s tedious, error-prone, and you end up throwing away the script once the import is done.

ArcadeDB v26.3.2 introduces the GraphImporter — a declarative tool that turns CSV, XML, and JSONL files into a fully connected graph using nothing but a JSON configuration file. No code, no custom scripts. Under the hood it uses the GraphBatch engine for maximum throughput.

Let’s see how it works by importing a real dataset: the StackOverflow data dump.

The StackOverflow Graph Model

The StackOverflow data dump is a classic dataset for benchmarking and graph analysis. It ships as a set of XML files, each representing a table in the original relational schema. Here’s the graph model we’ll build:

            ASKED                    TAGGED_WITH
  User ──────────────> Question ──────────────> Tag
    │                    │    │
    │ ANSWERED           │    │ ACCEPTED_ANSWER
    v         HAS_ANSWER │    │
  Answer <───────────────┘    │
    ^                         │
    └─────────────────────────┘

  User ──WROTE_COMMENT──> Comment ──COMMENTED_ON──> Question
                          Comment ──COMMENTED_ON_ANSWER──> Answer
  User ──EARNED──> Badge
  Question ──LINKED_TO──> Question

Six vertex types, ten edge types, all derived from six XML files. Let’s see how to express this as a single JSON configuration.

The Import Configuration

Here’s the complete JSON file that defines the entire import:

{
  "vertices": [
    {
      "type": "Tag", "file": "Tags.xml", "id": "Id", "nameId": "TagName",
      "properties": { "Id": "int:Id", "TagName": "TagName", "Count": "int:Count" }
    },
    {
      "type": "User", "file": "Users.xml", "id": "Id",
      "properties": {
        "Id": "int:Id", "DisplayName": "DisplayName", "Reputation": "int:Reputation",
        "CreationDate": "CreationDate", "Views": "int:Views",
        "UpVotes": "int:UpVotes", "DownVotes": "int:DownVotes"
      }
    },
    {
      "type": "Question", "file": "Posts.xml", "id": "Id", "filter": "PostTypeId=1",
      "properties": {
        "Id": "int:Id", "Title": "Title", "Body": "Body", "Score": "int:Score",
        "ViewCount": "int:ViewCount", "CreationDate": "CreationDate",
        "AnswerCount": "int:AnswerCount", "CommentCount": "int:CommentCount", "Tags": "Tags"
      },
      "edges": [
        { "attribute": "OwnerUserId", "edge": "ASKED", "target": "User", "direction": "in" },
        { "attribute": "Tags", "edge": "TAGGED_WITH", "target": "Tag", "split": "|" }
      ]
    },
    {
      "type": "Answer", "file": "Posts.xml", "id": "Id", "filter": "PostTypeId=2",
      "properties": {
        "Id": "int:Id", "Body": "Body", "Score": "int:Score",
        "CreationDate": "CreationDate", "CommentCount": "int:CommentCount"
      },
      "edges": [
        { "attribute": "OwnerUserId", "edge": "ANSWERED", "target": "User", "direction": "in" },
        { "attribute": "ParentId", "edge": "HAS_ANSWER", "target": "Question", "direction": "in" }
      ]
    },
    {
      "type": "Comment", "file": "Comments.xml", "id": "Id",
      "properties": {
        "Id": "int:Id", "Score": "int:Score", "CreationDate": "CreationDate", "Text": "Text"
      },
      "edges": [
        { "attribute": "PostId", "edge": "COMMENTED_ON", "target": "Question" },
        { "attribute": "PostId", "edge": "COMMENTED_ON_ANSWER", "target": "Answer" },
        { "attribute": "UserId", "edge": "WROTE_COMMENT", "target": "User", "direction": "in" }
      ]
    },
    {
      "type": "Badge", "file": "Badges.xml", "id": "Id",
      "properties": {
        "Id": "int:Id", "Name": "Name", "Date": "Date",
        "BadgeClass": "int:Class", "TagBased": "bool:TagBased"
      },
      "edges": [
        { "attribute": "UserId", "edge": "EARNED", "target": "User", "direction": "in" }
      ]
    }
  ],

  "edgeSources": [
    {
      "edge": "ACCEPTED_ANSWER", "file": "Posts.xml",
      "from": "Id:Question", "to": "AcceptedAnswerId:Answer"
    },
    {
      "edge": "LINKED_TO", "file": "PostLinks.xml",
      "from": "PostId:Question", "to": "RelatedPostId:Question",
      "properties": { "LinkType": "int:LinkTypeId" }
    }
  ],

  "postImportCommands": [
    {
      "language": "sql",
      "command": "CREATE GRAPH ANALYTICAL VIEW IF NOT EXISTS stackoverflow PROPERTIES (`!Body`, `!Text`) UPDATE MODE SYNCHRONOUS"
    }
  ]
}

That’s it. Six vertex types, ten edge types, and a post-import Graph Analytical View — all in one file. Let’s break down the key patterns.

Key Patterns Explained

Splitting One File into Multiple Vertex Types

StackOverflow stores both questions and answers in the same Posts.xml file, distinguished by PostTypeId. The filter option lets you import them as separate vertex types:

{ "type": "Question", "file": "Posts.xml", "filter": "PostTypeId=1", ... }
{ "type": "Answer",   "file": "Posts.xml", "filter": "PostTypeId=2", ... }

The importer reads Posts.xml once per definition, but only creates vertices for rows matching the filter. This is a common pattern when a single source table contains multiple logical entity types.
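To make the `"attribute=value"` filter syntax concrete, here is a minimal sketch of the row-selection logic. `RowFilter` is a hypothetical illustration, not an ArcadeDB class — the real importer applies the same test internally while streaming rows.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch: how a "filter": "PostTypeId=1" clause selects rows.
// RowFilter is a hypothetical helper, not part of the ArcadeDB API.
public class RowFilter {
    private final String attribute;
    private final String value;

    public RowFilter(String expression) {
        // Parse the "attribute=value" syntax used in the JSON config.
        String[] parts = expression.split("=", 2);
        this.attribute = parts[0];
        this.value = parts[1];
    }

    public boolean matches(Map<String, String> row) {
        return value.equals(row.get(attribute));
    }

    public static void main(String[] args) {
        RowFilter questions = new RowFilter("PostTypeId=1");
        List<Map<String, String>> rows = List.of(
            Map.of("Id", "1", "PostTypeId", "1"),   // a question
            Map.of("Id", "2", "PostTypeId", "2"));  // an answer
        long matched = rows.stream().filter(questions::matches).count();
        System.out.println(matched + " row(s) pass the Question filter");
    }
}
```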

Foreign Key Resolution

Most edges are derived from foreign key attributes in the source data. The importer needs to know two things: which attribute holds the foreign key, and which vertex type it references.

Incoming edges — the foreign key is in this vertex’s source, but the edge should point from the target back to this vertex:

{ "attribute": "ParentId", "edge": "HAS_ANSWER", "target": "Question", "direction": "in" }

This means: “read ParentId from each Answer row, find the Question with that ID, and create a HAS_ANSWER edge from the Question to this Answer”. The "direction": "in" flips the edge so the Question is the source (the question has an answer, not the other way around).

Default direction is "out" — the current vertex is the edge source:

{ "attribute": "PostId", "edge": "COMMENTED_ON", "target": "Question" }

This creates an edge from Comment to Question.
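The orientation rule can be summed up in a few lines. This is an illustrative sketch of the described behavior, not the importer’s actual code; `EdgeEnds` is a hypothetical name.

```java
import java.util.Arrays;

// Illustrative sketch of how "direction" orients a foreign-key edge.
// EdgeEnds is a hypothetical helper, not part of the ArcadeDB API.
public class EdgeEnds {
    // Returns {source, destination} for an edge between the current vertex
    // and the vertex referenced by its foreign key attribute.
    public static String[] orient(String current, String referenced, String direction) {
        return "in".equals(direction)
            ? new String[] { referenced, current }   // flipped: referenced vertex is the source
            : new String[] { current, referenced };  // default "out": current vertex is the source
    }

    public static void main(String[] args) {
        // HAS_ANSWER with direction "in": the Question (referenced by ParentId) is the source.
        System.out.println(Arrays.toString(orient("Answer#7", "Question#3", "in")));
        // COMMENTED_ON with the default "out": the Comment is the source.
        System.out.println(Arrays.toString(orient("Comment#9", "Question#3", "out")));
    }
}
```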

Split-Field Edges (Multi-Value Attributes)

StackOverflow stores tags as a single delimited string like |java|python|sql|. The split option expands this into multiple edges:

{ "attribute": "Tags", "edge": "TAGGED_WITH", "target": "Tag", "split": "|" }

For a question tagged |java|python|sql|, this creates three TAGGED_WITH edges — one to each Tag vertex. The split values are resolved using the target’s nameId attribute (in this case, TagName), not the integer id.
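The expansion itself is simple string splitting with empty tokens dropped (leading and trailing delimiters produce them). A rough sketch, with `TagSplitter` as a hypothetical name:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative sketch: expanding a "|"-delimited Tags value into individual
// tag names, one edge per value. Not the actual importer code.
public class TagSplitter {
    public static List<String> split(String raw, String delimiter) {
        if (raw == null || raw.isEmpty()) return List.of();
        // Pattern.quote treats the delimiter literally (String.split expects a regex).
        return Arrays.stream(raw.split(Pattern.quote(delimiter)))
            .filter(s -> !s.isEmpty())  // leading/trailing delimiters yield empty tokens
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // |java|python|sql| → three tag names, each resolved via the Tag nameId.
        System.out.println(split("|java|python|sql|", "|"));
    }
}
```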

Edge-Only Sources

Some relationships live in their own source file rather than as foreign keys in a vertex file. The edgeSources section handles these:

{
  "edge": "LINKED_TO", "file": "PostLinks.xml",
  "from": "PostId:Question", "to": "RelatedPostId:Question",
  "properties": { "LinkType": "int:LinkTypeId" }
}

The compact "attribute:vertexType" syntax tells the importer which attribute to read and which vertex type to resolve against. Both endpoints must already exist (vertex sources are processed first).
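Parsing the compact endpoint syntax splits on the first colon. A sketch for illustration only; `EndpointRef` is a hypothetical name, not an ArcadeDB class:

```java
// Illustrative sketch: parsing the compact "attribute:vertexType" endpoint
// syntax used by edgeSources. EndpointRef is hypothetical, not ArcadeDB API.
public class EndpointRef {
    public final String attribute;
    public final String vertexType;

    public EndpointRef(String compact) {
        int colon = compact.indexOf(':');
        this.attribute = compact.substring(0, colon);   // which attribute to read
        this.vertexType = compact.substring(colon + 1); // which type to resolve against
    }

    public static void main(String[] args) {
        EndpointRef from = new EndpointRef("PostId:Question");
        System.out.println(from.attribute + " resolves against " + from.vertexType);
    }
}
```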

Property Type Mapping

Properties are strings by default. Prefix the source attribute with a type hint for automatic conversion:

Syntax            Type      Example
"DisplayName"     String    "name": "DisplayName"
"int:Score"       Integer   "score": "int:Score"
"bool:TagBased"   Boolean   "tagBased": "bool:TagBased"

Post-Import Commands

The postImportCommands array runs SQL (or any supported language) after the import completes. In this example, we create a Graph Analytical View that pre-computes the graph structure for fast OLAP queries, excluding large text properties (Body, Text) to keep the view compact:

{
  "language": "sql",
  "command": "CREATE GRAPH ANALYTICAL VIEW IF NOT EXISTS stackoverflow PROPERTIES (`!Body`, `!Text`) UPDATE MODE SYNCHRONOUS"
}

Running the Import

From the Command Line

java com.arcadedb.integration.importer.graph.GraphImporter \
    stackoverflow-import.json \
    /path/to/database \
    /path/to/stackoverflow-data

The importer auto-creates the schema (vertex and edge types) from the JSON config, runs the two-pass import, and executes post-import commands. File paths in the JSON are resolved relative to the data directory (third argument).

From Java

Database database = new DatabaseFactory("/path/to/database").create();

String json = Files.readString(Path.of("stackoverflow-import.json"));
GraphImporter.createSchemaFromConfig(database, new JSONObject(json));

try (GraphImporter importer = GraphImporter.fromJSON(database, json, "/path/to/data")) {
    importer.run();
    System.out.printf("Vertices: %,d  Edges: %,d%n",
        importer.getVertexCount(), importer.getEdgeCount());
}

GraphImporter.executePostImportCommands(database, new JSONObject(json));

Programmatic Builder API

If you prefer code over JSON, the same import can be expressed with the builder:

GraphImporter.builder(database)
    .vertex("Tag", XmlRowSource.from(dataDir, "Tags.xml"), v -> {
        v.id("Id");
        v.idByName("TagName");
        v.property("TagName", "TagName");
        v.intProperty("Count", "Count");
    })
    .vertex("User", XmlRowSource.from(dataDir, "Users.xml"), v -> {
        v.id("Id");
        v.property("DisplayName", "DisplayName");
        v.intProperty("Reputation", "Reputation");
    })
    .vertex("Question", XmlRowSource.from(dataDir, "Posts.xml"), v -> {
        v.filter("PostTypeId", "1");
        v.id("Id");
        v.property("Title", "Title");
        v.intProperty("Score", "Score");
        v.edgeIn("OwnerUserId", "ASKED", "User");
        v.splitEdge("Tags", "TAGGED_WITH", "Tag", "|");
    })
    // ... remaining vertex and edge sources
    .build()
    .run();

How It Works Under the Hood

The GraphImporter uses a two-pass, CSR-first (Compressed Sparse Row) architecture:

Pass 1 — Vertices and topology collection. Each data source is read once. Vertices are created with full properties and flushed to disk immediately. Foreign key values are collected as compressed primitive arrays (int arrays for IDs, bucket/position pairs for RIDs) — no objects, no boxing, minimal GC pressure.

Pass 2 — Edge creation. The collected topology is fed into GraphBatch, which creates all edges with bidirectional traversal support. Each edge type is processed as a single batch for maximum sequential I/O.

This design means vertex data doesn’t stay in memory — only the graph topology does. For a dataset with 8 million vertices and 15 million edges, the in-memory topology is roughly 300 MB.
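To illustrate why primitive arrays keep the topology so compact, here is a minimal CSR construction over integer vertex IDs. This is a simplified sketch of the general technique, not ArcadeDB’s implementation (which stores bucket/position RID pairs rather than plain ints):

```java
import java.util.Arrays;

// Illustrative sketch of a CSR (Compressed Sparse Row) topology: all edges
// stored in primitive int arrays — no objects, no boxing, minimal GC pressure.
public class CsrSketch {
    public final int[] offsets;  // offsets[v]..offsets[v+1] index into targets
    public final int[] targets;  // flattened adjacency lists

    public CsrSketch(int vertexCount, int[] sources, int[] destinations) {
        offsets = new int[vertexCount + 1];
        for (int s : sources) offsets[s + 1]++;                              // out-degree per vertex
        for (int v = 0; v < vertexCount; v++) offsets[v + 1] += offsets[v];  // prefix sums
        targets = new int[sources.length];
        int[] cursor = offsets.clone();
        for (int i = 0; i < sources.length; i++)
            targets[cursor[sources[i]]++] = destinations[i];                 // scatter into place
    }

    public int[] neighbors(int v) {
        return Arrays.copyOfRange(targets, offsets[v], offsets[v + 1]);
    }

    public static void main(String[] args) {
        // Edges 0->1, 0->2, 2->1 over 3 vertices: 2 small int arrays hold everything.
        CsrSketch g = new CsrSketch(3, new int[]{0, 0, 2}, new int[]{1, 2, 1});
        System.out.println(Arrays.toString(g.neighbors(0)));
    }
}
```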

Supported Data Sources

Format   Auto-detected     Notes
CSV      .csv              Configurable delimiter ("delimiter": ",") and skip lines ("skipLines": 1)
JSONL    .jsonl, .ndjson   One JSON object per line
XML      .xml              Attribute-based by default (StackOverflow-style <row .../>); set "element": "book" for child-element parsing

All sources are streamed — the importer never loads an entire file into memory.
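Streaming attribute-based XML of the StackOverflow `<row .../>` kind can be done with the JDK’s own StAX parser, one event at a time. This sketch is an assumption about the general approach, not the importer’s actual reader; `XmlRows` is a hypothetical name:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: streaming StackOverflow-style <row .../> elements with
// the JDK's StAX parser — one row at a time, never the whole file in memory.
public class XmlRows {
    public static List<Map<String, String>> read(Reader input) {
        List<Map<String, String>> rows = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(input);
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT && r.getLocalName().equals("row")) {
                    Map<String, String> row = new LinkedHashMap<>();
                    for (int i = 0; i < r.getAttributeCount(); i++)
                        row.put(r.getAttributeLocalName(i), r.getAttributeValue(i));
                    rows.add(row);
                }
            }
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<badges><row Id=\"1\" Name=\"Teacher\"/><row Id=\"2\" Name=\"Student\"/></badges>";
        System.out.println(read(new StringReader(xml)).size() + " rows parsed");
    }
}
```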

Configuration Reference

Vertex Source

Field        Required   Description
type         Yes        ArcadeDB vertex type name (auto-created if missing)
file         Yes        Source file path, relative to the data directory
id           No         Integer primary key attribute for edge resolution
nameId       No         String-based secondary key (for split-field edge resolution)
filter       No         Row filter: "attribute=value" — only matching rows are imported
properties   No         Map of "dbPropertyName": "SourceAttr" (or "int:Attr", "bool:Attr")
edges        No         Array of edge definitions derived from foreign keys in this source

Edge Definition (inside a vertex source)

Field       Required   Description
attribute   Yes        Source attribute containing the foreign key value
edge        Yes        ArcadeDB edge type name (auto-created if missing)
target      Yes        Target vertex type the foreign key references
direction   No         "out" (default) or "in" — controls edge direction
split       No         Delimiter for multi-value fields (creates one edge per value)

Edge-Only Source

Field        Required   Description
edge         Yes        ArcadeDB edge type name
file         Yes        Source file path
from         Yes        "attribute:vertexType" — source vertex reference
to           Yes        "attribute:vertexType" — target vertex reference
properties   No         Map of "dbPropertyName": "int:SourceAttr"

Get Started

The GraphImporter is available starting from ArcadeDB v26.3.2. Download the StackOverflow data dump, grab the JSON config above, and you’ll have a fully connected graph in minutes.

Download ArcadeDB v26.3.2: GitHub Releases

If you have questions or feedback, join us on Discord or open an issue on GitHub.