AI/ML Feature Store

Compute graph features, store vector embeddings, track time-series signals, and serve them all in under 10ms — from a single database.

Data Scientists Spend 80% of Their Time on Data, Not Models

The most cited statistic in ML engineering is also the most frustrating: data scientists spend 50-80% of their time on data collection, cleaning, and feature engineering rather than on building models. Teams sink over 40% of their engineering effort into maintaining existing data pipelines instead of creating new features, leaving less than 20% of their time for actual model development.

The root cause is architectural. Production ML models need features from multiple data types:

  • Graph features: PageRank, community membership, centrality, proximity to known entities (computed from a graph database)
  • Vector embeddings: User and item representations from deep learning models (stored in a vector database)
  • Time-series features: Rolling averages, trend slopes, seasonality indicators, lag values (from a time-series store)
  • Document features: Structured metadata, configuration, text features (from a document store)

Today, each feature type lives in a separate system. Feature stores like Feast or Tecton sit on top, coordinating access. But they don't solve the underlying problem: the data itself is scattered across 4-5 databases, connected by fragile ETL pipelines that introduce staleness, skew, and operational complexity.

ArcadeDB stores and computes all four feature types natively. No ETL. No sync. No skew.

Unified Feature Pipeline

[Diagram: graph features (PageRank), vector embeddings (ANN search), time-series signals (rate(), avg()), and document metadata (JSON props) feed a single ArcadeDB feature store, which serves ML model inference and ML model training at <10ms p99. Same features for training and serving = zero skew.]

Store and Query Embeddings

Store embeddings directly on graph vertices and query them with ANN search, combined with structured filters:

-- Find products similar to a
-- given embedding vector
SELECT name, category, price
FROM Product
ORDER BY vectorNeighbors(
    'Product[embedding]',
    [0.9, 0.1, 0.1, 0.1], 10)
  DESC
LIMIT 5

Filtered Vector Search

-- Rank Electronics products by
-- similarity to user preference
SELECT name, price
FROM Product
WHERE category = 'Electronics'
ORDER BY vectorNeighbors(
    'Product[embedding]',
    [0.9, 0.1, 0.1, 0.1], 20)
  DESC
LIMIT 10
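Semantically, the filtered query above restricts candidates by the structured predicate and then ranks the survivors by embedding similarity. A brute-force pure-Python sketch of that semantics (cosine similarity over hypothetical product records — ArcadeDB's JVector index reaches the same result without scanning every row):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def filtered_vector_search(products, query_vec, category, k):
    # 1. Apply the structured filter (WHERE category = ...).
    candidates = [p for p in products if p["category"] == category]
    # 2. Rank the survivors by similarity to the query embedding.
    candidates.sort(key=lambda p: cosine(p["embedding"], query_vec),
                    reverse=True)
    return candidates[:k]

# Hypothetical catalog rows.
products = [
    {"name": "Headphones", "category": "Electronics",
     "embedding": [0.9, 0.1, 0.1, 0.1]},
    {"name": "Toaster", "category": "Kitchen",
     "embedding": [0.9, 0.2, 0.1, 0.0]},
    {"name": "Keyboard", "category": "Electronics",
     "embedding": [0.1, 0.9, 0.1, 0.1]},
]

top = filtered_vector_search(products, [0.9, 0.1, 0.1, 0.1],
                             "Electronics", 2)
```

The brute-force version is O(n) per query; an ANN index trades a small amount of recall for sub-millisecond lookups at scale.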

Vector Embeddings: Powered by JVector

Vector embeddings are the foundation of modern AI. Text, images, user behavior, and product attributes are encoded as high-dimensional vectors (typically 384 to 1536 dimensions), enabling similarity operations that keyword and exact-match search cannot express.

ArcadeDB integrates JVector, a state-of-the-art vector search engine that merges DiskANN and HNSW algorithms into a single graph-based index. JVector uses SIMD-accelerated distance computations via the Panama Vector API and supports incremental inserts without full index rebuilds — essential for production systems with continuous updates.

What JVector enables in ArcadeDB:

  • ANN search: Approximate nearest-neighbor queries at sub-millisecond latency, even at billion-vector scale
  • Hybrid search: Combine vector similarity with structured metadata filters (category, price, date) in a single query
  • Disk-resident indexes: Compressed in-memory representations with full vectors on disk, enabling larger-than-RAM datasets without sacrificing speed
  • Incremental updates: Insert new vectors into an existing index without rebuilding — no downtime, no batch reindexing
  • Embeddings on graph vertices: Vectors are stored directly on nodes, enabling queries that combine graph traversal with vector similarity in a single operation

Unlike standalone vector databases (Pinecone, Weaviate, Qdrant), ArcadeDB's vector support is not isolated. Embeddings live alongside graph relationships, time-series data, and document metadata — all queryable together.
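For intuition, the graph-based search at the heart of HNSW and DiskANN reduces to a greedy walk: start at an entry point and repeatedly hop to whichever neighbor is closest to the query. A toy single-layer sketch (Euclidean distance; real JVector indexes add layering, beam search, quantization, and SIMD):

```python
import math

def dist(a, b):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(vectors, neighbors, entry, query):
    # vectors: id -> vector; neighbors: id -> adjacent ids in the
    # proximity graph. Walk toward the query until no neighbor improves.
    current = entry
    while True:
        best = min(neighbors[current],
                   key=lambda n: dist(vectors[n], query),
                   default=current)
        if dist(vectors[best], query) < dist(vectors[current], query):
            current = best
        else:
            # Local minimum = approximate nearest neighbor.
            return current

# A tiny hypothetical proximity graph: a chain with a branch.
vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [2.0, 0.0], 3: [2.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

nearest = greedy_search(vectors, neighbors, entry=0, query=[2.1, 0.9])
```

The walk stops at a local minimum, which is why these indexes are approximate; beam width and graph construction quality control the recall/latency trade-off.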

Graph Features: The Signal That Tabular Models Miss

Traditional ML features describe individual entities: a customer's age, a transaction's amount, a product's price. But the most predictive signals often lie in relationships — who is connected to whom, how tightly clustered a neighborhood is, how central a node is in a network.

Graph-derived features encode structural and relational information that is completely invisible to tabular approaches:

Centrality Features

PageRank measures influence based on the quality of incoming connections. Betweenness centrality identifies brokers sitting on shortest paths between communities. Degree centrality counts direct connections. These features transform a user's position in a social network or a transaction's role in a financial flow into numeric ML inputs.
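For intuition, PageRank can be sketched as a short power iteration: each node's rank is a damped sum of the ranks flowing in from its predecessors. A minimal illustrative version (not ArcadeDB's implementation):

```python
def pagerank(edges, damping=0.85, iters=50):
    # edges: list of (source, target) pairs.
    nodes = {n for e in edges for n in e}
    out_deg = {n: 0 for n in nodes}
    incoming = {n: [] for n in nodes}
    for s, t in edges:
        out_deg[s] += 1
        incoming[t].append(s)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Each node receives damped rank from its in-neighbors,
        # split across each source's out-degree.
        rank = {
            n: (1 - damping) / len(nodes)
               + damping * sum(rank[s] / out_deg[s] for s in incoming[n])
            for n in nodes
        }
    return rank

# 'hub' receives links from everyone, so it should rank highest.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
ranks = pagerank(edges)
```

The resulting score becomes a single numeric column in the feature vector, exactly like any tabular input.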

Community Features

Community detection (Louvain, Label Propagation) assigns cluster IDs that become categorical features. Triangle count and clustering coefficient measure how tightly connected a node's neighborhood is. In fraud detection, a high clustering coefficient near a flagged account signals a coordinated ring.
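Triangle-based features reduce to counting how many pairs of a node's neighbors are themselves connected. A sketch over hypothetical undirected adjacency sets:

```python
from itertools import combinations

def clustering_coefficient(adj, node):
    # Fraction of pairs of `node`'s neighbors that are directly connected:
    # 1.0 means the neighborhood is a fully connected ring.
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    triangles = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return triangles / (k * (k - 1) / 2)

# A tight ring: every counterparty of 'x' also transacts with the others.
adj = {
    "x": {"a", "b", "c"},
    "a": {"x", "b", "c"},
    "b": {"x", "a", "c"},
    "c": {"x", "a", "b"},
}
```

A coefficient near 1.0 around a flagged account is exactly the coordinated-ring signal described above.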

Similarity and Path Features

Node similarity (Jaccard, Cosine) measures neighborhood overlap between entities. Shortest path length to known entities (e.g., distance to a known fraudster) is a powerful risk feature. Common neighbors count drives link prediction.
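Both features are simple set operations over adjacency lists. A sketch with hypothetical user-product neighborhoods:

```python
def jaccard(adj, a, b):
    # Neighborhood overlap: |N(a) ∩ N(b)| / |N(a) ∪ N(b)|.
    inter = adj[a] & adj[b]
    union = adj[a] | adj[b]
    return len(inter) / len(union) if union else 0.0

def common_neighbors(adj, a, b):
    # Raw common-neighbor count, a standard link-prediction feature.
    return len(adj[a] & adj[b])

# Two users and the products they interacted with.
adj = {"u1": {"p1", "p2", "p3"}, "u2": {"p2", "p3", "p4"}}
```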

Compute Graph Features for ML

Extract structural features from the graph and serve them alongside embeddings for model inference:

-- Graph topology features for
-- a suspect account
SELECT inDeg, outDeg, counterparties
FROM (
  MATCH
    {type: Account,
     where: (accountId = 'a4'),
     as: acct}
  RETURN
    acct.in('TRANSFERRED').size()
      AS inDeg,
    acct.out('TRANSFERRED').size()
      AS outDeg,
    acct.both('TRANSFERRED').size()
      AS counterparties
)
-- Distance to nearest flagged
-- account via transfers
SELECT accountId AS flaggedId, depth
FROM (
  MATCH
    {type: Account,
     where: (accountId = 'a4')}
    .both('TRANSFERRED')
    {while: ($depth < 4),
     as: hop}
  RETURN hop.accountId AS accountId,
    hop.flagged AS flagged,
    $depth AS depth
)
WHERE flagged = true
ORDER BY depth ASC
LIMIT 1

ArcadeDB SQL MATCH computes graph topology features (degree, counterparties) and shortest-path proximity to flagged accounts — key signals for fraud models.
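The proximity query above is a bounded breadth-first search over TRANSFERRED edges. Its logic, restated in plain Python over hypothetical adjacency data:

```python
from collections import deque

def distance_to_flagged(adj, flagged, start, max_depth=4):
    # BFS over transfer edges; returns (account, depth) of the nearest
    # flagged account within max_depth, or None if none is reachable.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node != start and node in flagged:
            return node, depth
        if depth < max_depth:
            for nxt in adj.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return None

# Hypothetical transfer graph: a4 <-> a2 <-> a9, with a9 flagged.
adj = {"a4": ["a2"], "a2": ["a4", "a9"], "a9": ["a2"]}
flagged = {"a9"}
```

Running the traversal inside the database avoids exporting the whole neighborhood just to compute one integer feature.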

Transaction Velocity Features

Aggregate time-bucketed metrics to compute velocity and volume features for ML models:

-- Transaction velocity features
-- per account (fraud signals)
SELECT
  accountId,
  sum(txCount) AS totalTx,
  sum(totalAmount) AS totalAmount,
  avg(totalAmount) AS avgBucketAmount
FROM TransactionMetric
GROUP BY accountId
ORDER BY totalTx DESC
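For reference, the GROUP BY above is equivalent to this client-side roll-up over hypothetical pre-bucketed metric rows — logic that ArcadeDB runs server-side, next to the data:

```python
from collections import defaultdict

def velocity_features(metrics):
    # Roll time-bucketed rows up to per-account totals and averages.
    acc = defaultdict(lambda: {"totalTx": 0, "totalAmount": 0.0,
                               "buckets": 0})
    for m in metrics:
        a = acc[m["accountId"]]
        a["totalTx"] += m["txCount"]
        a["totalAmount"] += m["totalAmount"]
        a["buckets"] += 1
    return {
        k: {"totalTx": v["totalTx"],
            "totalAmount": v["totalAmount"],
            "avgBucketAmount": v["totalAmount"] / v["buckets"]}
        for k, v in acc.items()
    }

# Hypothetical TransactionMetric rows (one per account per time bucket).
metrics = [
    {"accountId": "a4", "txCount": 12, "totalAmount": 900.0},
    {"accountId": "a4", "txCount": 30, "totalAmount": 2100.0},
    {"accountId": "a7", "txCount": 3, "totalAmount": 150.0},
]

feats = velocity_features(metrics)
```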

Sensor Anomaly Detection

-- Detect equipment with
-- anomalous sensor readings
SELECT
  equipmentId,
  avg(temperature) AS avgTemp,
  max(vibration) AS maxVibration,
  avg(pressure) AS avgPressure
FROM SensorReading
GROUP BY equipmentId
ORDER BY avgTemp DESC

Aggregate sensor readings by equipment to identify anomalous temperature, vibration, and pressure patterns — key features for predictive maintenance models.
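Aggregates like these feed directly into an anomaly score. One common follow-up step, sketched here with hypothetical readings, is a z-score: flag values that sit more than a couple of standard deviations from the equipment's mean:

```python
import statistics

def zscore_anomalies(readings, threshold=2.0):
    # Flag readings more than `threshold` standard deviations
    # from the mean of the series.
    mean = statistics.mean(readings)
    stdev = statistics.pstdev(readings)
    if stdev == 0:
        return []
    return [r for r in readings if abs(r - mean) / stdev > threshold]

# Hypothetical temperature readings with one spike.
temps = [70.1, 70.4, 69.8, 70.2, 70.0, 95.0]
```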

Time-Series Features: Capturing Temporal Patterns

Many of the most powerful ML features are temporal. A transaction amount alone is moderately predictive; that same amount compared to the customer's 30-day rolling average is far more informative. A sensor reading is useful; its rate of change over the last hour is critical for predicting failure.

ArcadeDB computes temporal features directly using standard SQL aggregations on time-bucketed document types, eliminating the need for external processing:

  • Rolling aggregates: Sum, average, min/max over configurable time buckets — the backbone of most time-series features
  • Transaction velocity: Aggregate transaction counts and amounts per time bucket to detect acceleration and deceleration patterns
  • Sensor anomaly detection: Group sensor readings by equipment and compute average temperature, max vibration, and pressure statistics to identify anomalous behavior
  • Time-bucketed metrics: Store pre-aggregated metrics in document types (e.g., TransactionMetric, SensorReading) for efficient temporal queries
  • Cross-model joins: Combine time-series document data with graph topology and vector similarity in multi-step feature assembly

These features are computed at query time with the latest data, ensuring zero staleness — eliminating one of the most common sources of training-serving skew in production ML systems.
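The rolling-average comparison described above can be sketched in a few lines (hypothetical amounts; window measured in buckets rather than days):

```python
def rolling_deviation(amounts, window):
    # Ratio of each value to the rolling average of the preceding
    # `window` values — the "is this transaction unusual for this
    # customer?" feature. 1.0 means perfectly in line with history.
    feats = []
    for i, amt in enumerate(amounts):
        prior = amounts[max(0, i - window):i]
        if not prior:
            feats.append(1.0)  # no history yet: neutral value
        else:
            feats.append(amt / (sum(prior) / len(prior)))
    return feats

# Four routine transactions, then one 5x spike.
amounts = [100.0, 100.0, 100.0, 100.0, 500.0]
devs = rolling_deviation(amounts, window=3)
```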

Eliminate Training-Serving Skew

Training-serving skew is the most insidious problem in production ML. It occurs when features are computed differently during training than during inference — different code paths, different libraries, different aggregation windows, different data freshness. The model silently degrades because it's seeing data that doesn't match what it was trained on.

Traditional feature store architectures make skew almost inevitable. Features are computed by Spark batch jobs for training (offline store), then recomputed by different code for inference (online store). Two codebases, two data paths, two opportunities for divergence.

ArcadeDB eliminates skew by using the same query engine for both training and serving. The exact same SQL or Cypher query that extracts features for training data runs at inference time. Same code path, same aggregation logic, same data source.

  • One query, two contexts: Run the feature extraction query against historical data for training, then against live data for inference — same SQL, same results
  • No separate offline/online stores: ArcadeDB serves both batch reads (training) and point lookups (inference) from the same engine
  • Real-time freshness: Features reflect the latest data at query time, not a stale batch snapshot
  • Auditable lineage: The query IS the feature definition. No ambiguity about how a feature was computed

Same Query, Two Contexts

-- Step 1: Graph features
SELECT inDeg, outDeg, counterparties
FROM (
  MATCH {type: Account,
    where: (accountId = 'a4'),
    as: acct}
  RETURN
    acct.in('TRANSFERRED').size() AS inDeg,
    acct.out('TRANSFERRED').size() AS outDeg,
    acct.both('TRANSFERRED').size() AS counterparties)

-- Step 2: Vector similarity
SELECT accountId, flagged
FROM Account
ORDER BY vectorNeighbors(
  'Account[behaviorVec]',
  [0.7, 0.6, 0.2, 0.3], 10) DESC
LIMIT 5

-- Step 3: Time-series velocity
SELECT sum(txCount) AS totalTx,
  sum(totalAmount) AS totalAmount,
  avg(totalAmount) AS avgBucketAmount
FROM TransactionMetric
WHERE accountId = 'a4'

Assemble a complete fraud feature vector in three steps: graph topology, vector similarity to known fraud patterns, and transaction velocity metrics. Same queries work for both training and inference — zero skew.
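The zero-skew pattern boils down to one feature-assembly function shared by training and inference. A sketch with the database stubbed out (in practice `run_query` would execute the three queries above against ArcadeDB; all names here are hypothetical):

```python
def assemble_features(run_query, account_id):
    # One code path for both contexts: the same queries, the same
    # assembly order, regardless of whether training or inference calls it.
    graph = run_query("graph_topology", account_id)
    vector = run_query("behavior_similarity", account_id)
    velocity = run_query("tx_velocity", account_id)
    return [graph["inDeg"], graph["outDeg"], graph["counterparties"],
            vector["fraudSim"], velocity["totalTx"],
            velocity["avgBucketAmount"]]

# A stubbed query runner standing in for a live database connection.
def fake_runner(name, account_id):
    data = {
        "graph_topology": {"inDeg": 3, "outDeg": 5, "counterparties": 6},
        "behavior_similarity": {"fraudSim": 0.91},
        "tx_velocity": {"totalTx": 42, "avgBucketAmount": 1500.0},
    }
    return data[name]

vec = assemble_features(fake_runner, "a4")
```

Because the training pipeline and the serving endpoint both call the same function, any change to a feature definition propagates to both contexts at once.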

Feature Types by Use Case

Use Case                  Graph   Vector   Time Series   Doc
Fraud detection             ✓       ✓          ✓          ✓
Recommendations             ✓       ✓          ✓
Predictive maintenance      ✓                  ✓          ✓
Churn prediction            ✓       ✓          ✓          ✓
Demand forecasting          ✓                  ✓          ✓
Semantic search / RAG       ✓       ✓                     ✓
Customer segmentation       ✓       ✓          ✓          ✓

Every use case benefits from multiple feature types. Only a multi-model database can compute and serve all of them without cross-system ETL.

ML Use Cases Powered by Multi-Model Features

Fraud Detection

Graph: proximity to flagged accounts, community risk scores, transaction ring structure. Vectors: behavioral embedding deviation from baseline. Time series: transaction velocity spikes, rolling average deviations. Document: device metadata, geolocation.

Recommendations

Graph: collaborative filtering through purchase overlap, Personalized PageRank. Vectors: item embedding similarity, user preference vectors. Time series: trending items via rate analysis, seasonal patterns, recency weighting.

Predictive Maintenance

Graph: equipment dependency chains, cascade failure propagation. Time series: sensor rolling averages, vibration trends, temperature deviations. Document: maintenance logs, technician notes, equipment specifications.

Semantic Search & RAG

Vectors: document chunk embeddings for ANN retrieval. Graph: knowledge graph relationships between entities for context enrichment. Document: source metadata, access controls, authorship.

Why ArcadeDB for AI/ML

Building ML feature infrastructure typically requires assembling and maintaining a complex stack: a graph database for relationship features, a vector database for embeddings, a time-series database for temporal features, a document store for metadata, and a feature store layer to coordinate everything.

ArcadeDB replaces this entire stack with a single engine:

  • Graph + Vectors + Time Series + Documents in one database, with cross-model queries
  • JVector-powered ANN search: Sub-millisecond vector similarity at scale, with hybrid structured filtering
  • Zero training-serving skew: Same query engine for batch training and real-time inference
  • Zero ETL: No sync pipelines between systems, no data freshness gaps
  • <10ms feature serving: Fast enough for real-time inference at production scale
  • Three query languages: SQL, Cypher, Gremlin — integrate with any ML framework
  • Incremental vector updates: Add new embeddings without reindexing — no downtime

Apache 2.0 — Forever

ML infrastructure is the foundation of your AI strategy. It must be trustworthy, portable, and free from licensing risk. ArcadeDB is Apache 2.0 forever — no bait-and-switch, no per-vector pricing, no restrictions on embedding or redistributing. Build your AI stack on a foundation that won't shift beneath you.

Platform Comparison

Capability              Typical Stack            ArcadeDB
Graph features          Neo4j GDS                Built-in
Vector embeddings       Pinecone / Weaviate      JVector built-in
Time-series features    TimescaleDB / InfluxDB   Built-in
Document metadata       MongoDB                  Built-in
Feature coordination    Feast / Tecton           Not needed
Cross-model queries     Application code         Single query
Training-serving skew   Risk (dual path)         Zero (same path)
License                 Mixed / proprietary      Apache 2.0

Industries

  • Financial Services: Fraud scoring, credit risk features, AML pattern features
  • E-commerce: Recommendation features, personalization signals, demand forecasting
  • Healthcare: Patient similarity, treatment outcome prediction, clinical trial matching
  • Manufacturing: Predictive maintenance features, quality prediction, supply chain optimization
  • Ad Tech: Lookalike audience creation, click prediction, bid optimization
  • Cybersecurity: Threat detection features, anomaly scoring, network behavior analysis

Client Success Story

"We use ArcadeDB as our feature store for all production ML models. The ability to combine vector embeddings with graph-computed features in real-time dramatically improved our model performance. Feature serving latency dropped from 45ms to under 10ms, enabling true real-time personalization. The multi-model approach eliminated three separate databases from our stack."

— ML Platform Lead, Technology Unicorn
(Company name confidential)

ML Performance Gains:

  • 78% reduction in feature serving latency (45ms → 10ms)
  • 12% improvement in model accuracy with graph features
  • 3 databases consolidated into 1
  • Real-time feature computation at 100K+ req/sec
  • Zero training-serving skew incidents since migration

Ready to Unify Your ML Feature Pipeline?

Compute graph features, store vector embeddings, extract time-series signals, and serve them all in under 10ms — from a single Apache 2.0 database. No ETL, no skew, no vendor lock-in.