AI/ML Feature Store

Compute graph features, store vector embeddings, track time-series signals, and serve them all in under 10ms — from a single database.

Data Scientists Spend 80% of Their Time on Data, Not Models

The most cited statistic in ML engineering is also the most frustrating: data scientists spend 50-80% of their time on data collection, cleaning, and feature engineering rather than on building models. Teams sink over 40% of engineering effort into maintaining data pipelines instead of creating new features, leaving less than 20% of their time for actual model development.

The root cause is architectural. Production ML models need features from multiple data types:

  • Graph features: PageRank, community membership, centrality, proximity to known entities (computed from a graph database)
  • Vector embeddings: User and item representations from deep learning models (stored in a vector database)
  • Time-series features: Rolling averages, trend slopes, seasonality indicators, lag values (from a time-series store)
  • Document features: Structured metadata, configuration, text features (from a document store)

Today, each feature type lives in a separate system. Feature stores like Feast or Tecton sit on top, coordinating access. But they don't solve the underlying problem: the data itself is scattered across 4-5 databases, connected by fragile ETL pipelines that introduce staleness, skew, and operational complexity.

ArcadeDB stores and computes all four feature types natively. No ETL. No sync. No skew.

Unified Feature Pipeline

[Diagram: Graph features (PageRank), vector embeddings (ANN search), time-series signals (rate(), avg()), and document metadata (JSON props) all flow into the ArcadeDB feature store, which serves both ML model training and ML model inference at <10ms p99. The same features power training and serving, yielding zero skew.]

Store and Query Embeddings

Store embeddings directly on graph vertices and query them with ANN search, combined with structured filters:

-- Hybrid search: vector similarity
-- + structured metadata filters
SELECT name, category, price,
  vectorDistance(embedding,
    $queryVector) AS similarity
FROM Product
WHERE vectorDistance(embedding,
    $queryVector) < 0.3
  AND category = 'electronics'
  AND price < 500
ORDER BY similarity ASC
LIMIT 20
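The semantics of this hybrid query can be illustrated with a brute-force client-side sketch (hypothetical field names mirroring the query's columns; Euclidean distance is assumed for simplicity, and the real index is approximate and far faster than this exact scan):

```python
def hybrid_search(products, query_vec, max_dist=0.3, k=20):
    """Brute-force semantics of the hybrid query: filter on structured
    fields first, then rank survivors by vector distance."""
    def dist(a, b):  # Euclidean distance, for the sketch only
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    hits = [(dist(p["embedding"], query_vec), p) for p in products
            if p["category"] == "electronics" and p["price"] < 500]
    hits = [(d, p) for d, p in hits if d < max_dist]
    return sorted(hits, key=lambda t: t[0])[:k]

catalog = [
    {"name": "earbuds", "category": "electronics", "price": 79,
     "embedding": [0.1, 0.2]},
    {"name": "couch", "category": "furniture", "price": 450,
     "embedding": [0.1, 0.2]},
]
print([p["name"] for _, p in hybrid_search(catalog, [0.1, 0.21])])  # ['earbuds']
```

The key point is the ordering of operations: structured predicates prune the candidate set before (or alongside) the distance computation, which a single-engine query planner can exploit and a two-database setup cannot.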

Semantic Search for RAG

-- RAG: retrieve relevant chunks
-- for LLM context injection
SELECT content, source, page,
  vectorDistance(chunk_embedding,
    $questionVector) AS relevance
FROM DocumentChunk
WHERE vectorDistance(chunk_embedding,
    $questionVector) < 0.25
ORDER BY relevance ASC
LIMIT 10

Vector Embeddings: Powered by JVector

Vector embeddings are the foundation of modern AI. Text, images, user behavior, and product attributes are encoded as high-dimensional vectors (typically 384 to 1536 dimensions), enabling similarity-based search that keyword and exact-match techniques cannot express.

ArcadeDB integrates JVector, a state-of-the-art vector search engine that merges DiskANN and HNSW algorithms into a single graph-based index. JVector uses SIMD-accelerated distance computations via the Panama Vector API and supports incremental inserts without full index rebuilds — essential for production systems with continuous updates.
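For intuition, the distance computation such an index accelerates can be shown naively. A pure-Python cosine-distance sketch (cosine is one common embedding metric; the production index replaces this exact scan with SIMD-accelerated, approximate graph search):

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Parallel vectors -> distance 0; orthogonal vectors -> distance 1
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```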

What JVector enables in ArcadeDB:

  • ANN search: Approximate nearest-neighbor queries at sub-millisecond latency, even at billion-vector scale
  • Hybrid search: Combine vector similarity with structured metadata filters (category, price, date) in a single query
  • Disk-resident indexes: Compressed in-memory representations with full vectors on disk, enabling larger-than-RAM datasets without sacrificing speed
  • Incremental updates: Insert new vectors into an existing index without rebuilding — no downtime, no batch reindexing
  • Embeddings on graph vertices: Vectors are stored directly on nodes, enabling queries that combine graph traversal with vector similarity in a single operation

Unlike standalone vector databases (Pinecone, Weaviate, Qdrant), ArcadeDB's vector support is not isolated. Embeddings live alongside graph relationships, time-series data, and document metadata — all queryable together.

Graph Features: The Signal That Tabular Models Miss

Traditional ML features describe individual entities: a customer's age, a transaction's amount, a product's price. But the most predictive signals often lie in relationships — who is connected to whom, how tightly clustered a neighborhood is, how central a node is in a network.

Graph-derived features encode structural and relational information that is completely invisible to tabular approaches:

Centrality Features

PageRank measures influence based on the quality of incoming connections. Betweenness centrality identifies brokers sitting on shortest paths between communities. Degree centrality counts direct connections. These features transform a user's position in a social network or a transaction's role in a financial flow into numeric ML inputs.
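PageRank itself reduces to a short power iteration; a pure-Python sketch over a toy adjacency list (illustrative only, not the in-database implementation):

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [out-neighbors]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in nodes}  # teleport term
        for v, outs in adj.items():
            if not outs:  # dangling node: spread its rank uniformly
                for u in nodes:
                    nxt[u] += damping * rank[v] / n
            else:
                for u in outs:
                    nxt[u] += damping * rank[v] / len(outs)
        rank = nxt
    return rank

# Tiny network: 'a' and 'b' both link to 'hub', so 'hub' accumulates rank
ranks = pagerank({"a": ["hub"], "b": ["hub"], "hub": ["a"]})
assert ranks["hub"] > ranks["a"] > 0
```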

Community Features

Community detection (Louvain, Label Propagation) assigns cluster IDs that become categorical features. Triangle count and clustering coefficient measure how tightly connected a node's neighborhood is. In fraud detection, a high clustering coefficient near a flagged account signals a coordinated ring.
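The clustering coefficient mentioned above is simple to state: of all pairs of a node's neighbors, what fraction are themselves connected? A minimal sketch for an undirected graph stored as adjacency sets:

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Fraction of a node's neighbor pairs that are directly connected."""
    nbrs = graph[node]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in graph[u])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)

# A tight triangle around 'x': its neighbors 'a' and 'b' know each other
g = {"x": {"a", "b"}, "a": {"x", "b"}, "b": {"x", "a"}}
print(clustering_coefficient(g, "x"))  # 1.0 -- fully interconnected neighborhood
```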

Similarity and Path Features

Node similarity (Jaccard, Cosine) measures neighborhood overlap between entities. Shortest path length to known entities (e.g., distance to a known fraudster) is a powerful risk feature. Common neighbors count drives link prediction.
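Jaccard similarity is simply neighborhood-set overlap; a one-function sketch:

```python
def jaccard(neighbors_a, neighbors_b):
    """Neighborhood overlap: |intersection| / |union|."""
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0

# Two users sharing 2 of 4 distinct purchased items
print(jaccard({"p1", "p2", "p3"}, {"p2", "p3", "p4"}))  # 0.5
```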

Compute Graph Features for ML

Extract structural features from the graph and serve them alongside embeddings for model inference:

-- Feature vector for fraud model:
-- graph features + time-series
MATCH (acct:Account {id: $acctId})

-- Degree: how connected?
OPTIONAL MATCH
  (acct)-[r]-()
WITH acct, count(r) AS degree

-- Proximity to flagged accounts
OPTIONAL MATCH
  p = shortestPath(
    (acct)-[*..5]-
    (flagged:Account {flagged: true}))
WITH acct, degree,
  min(length(p)) AS dist_to_fraud

RETURN
  degree,
  dist_to_fraud,
  acct.community_id,
  acct.pagerank,
  acct.clustering_coeff,

  -- Time-series: tx velocity
  ts.rate(acct, 'Transactions',
    'tx_count') AS tx_velocity,

  -- Vector: behavioral embedding
  acct.behavior_embedding

One query returns the complete feature vector: graph features (degree, proximity, community, PageRank, clustering), time-series features (velocity), and the behavioral embedding.
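For illustration, a hypothetical client-side helper that flattens such a result row into a numeric model input (keys mirror the query's aliases; the categorical community_id would be one-hot encoded separately):

```python
def to_feature_vector(row, max_hops=5):
    """Flatten a feature-query result row into a numeric model input.
    Hypothetical client-side helper; keys mirror the query's aliases."""
    dist = row.get("dist_to_fraud")
    if dist is None:          # no flagged account within max_hops:
        dist = max_hops + 1   # impute "beyond horizon" to keep monotonicity
    return [
        float(row["degree"]),
        float(dist),
        float(row["pagerank"]),
        float(row["clustering_coeff"]),
        float(row["tx_velocity"]),
        *[float(x) for x in row["behavior_embedding"]],  # appended verbatim
    ]

row = {"degree": 14, "dist_to_fraud": None, "pagerank": 0.003,
       "clustering_coeff": 0.42, "tx_velocity": 1.7,
       "behavior_embedding": [0.1, -0.2, 0.05]}
vec = to_feature_vector(row)
print(len(vec))  # 8 = 5 scalar features + 3 embedding dimensions
```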

Time-Series Feature Extraction

Compute rolling features, trend indicators, and temporal patterns directly in the database:

-- Temporal features for
-- demand forecasting model
SELECT
  product_id,

  -- Rolling aggregates
  avg(sales) AS avg_7d,
  moving_avg(sales, 30) AS avg_30d,

  -- Trend indicator
  rate(sales) AS sales_velocity,

  -- Volatility
  percentile(sales, 0.95)
    - percentile(sales, 0.05)
    AS volatility_range,

  -- First/last values (open/close)
  first(sales) AS period_open,
  last(sales) AS period_close,
  delta(sales) AS period_change

FROM DailySales
WHERE ts > now() - INTERVAL '7d'
GROUP BY product_id

Correlation Analysis

-- Find features correlated
-- with the target variable
SELECT
  correlate(temperature, failure_rate)
    AS temp_corr,
  correlate(vibration, failure_rate)
    AS vib_corr,
  correlate(pressure, failure_rate)
    AS pres_corr
FROM SensorReading
WHERE ts > now() - INTERVAL '90d'

Pearson correlation helps identify which sensor readings are most predictive of equipment failure — a critical first step in feature selection for predictive maintenance models.
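The statistic behind correlate() is standard Pearson r, which can be sketched directly (illustrative pure-Python version; the in-database function operates over indexed time buckets):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy sensor data: failure rate climbs with temperature
temperature = [60, 65, 70, 75, 80]
failure_rate = [0.01, 0.02, 0.03, 0.05, 0.08]
print(round(pearson(temperature, failure_rate), 2))  # strongly positive, ~0.97
```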

Time-Series Features: Capturing Temporal Patterns

Many of the most powerful ML features are temporal. A transaction amount alone is moderately predictive; that same amount compared to the customer's 30-day rolling average is far more informative. A sensor reading is useful; its rate of change over the last hour is critical for predicting failure.

ArcadeDB's native time-series engine computes temporal features directly, eliminating the need for external processing:

  • Rolling aggregates: Mean, standard deviation, min/max over configurable windows (7-day, 30-day, 90-day) — the backbone of most time-series features
  • Rate of change: The rate() function computes velocity of counters, detecting acceleration and deceleration. A rising request rate is a trend; a 10x spike is an anomaly
  • Moving averages: moving_avg() smooths noisy signals, revealing underlying trends that raw values obscure
  • Percentile distributions: P50, P95, P99 within time buckets capture the shape of distributions, not just the center
  • Delta and first/last: Compute period-over-period changes for open/close analysis (financial, inventory, operational)
  • Correlation: correlate() computes Pearson correlation between any two time-series, enabling automated feature selection

These features are computed at query time with the latest data, ensuring zero staleness — eliminating one of the most common sources of training-serving skew in production ML systems.
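The semantics of the window functions above can be sketched in plain Python (trailing windows, counter rates, and period deltas; the in-database versions run over indexed time buckets rather than in-memory lists):

```python
def moving_avg(series, window):
    """Trailing moving average; windows are shorter at the series start."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def rate(series, interval=1.0):
    """Per-interval rate of change of a monotonically increasing counter."""
    return [(b - a) / interval for a, b in zip(series, series[1:])]

def delta(series):
    """Period change: last value minus first (open/close analysis)."""
    return series[-1] - series[0]

tx_count = [100, 103, 107, 115, 140]  # cumulative transaction counter
print(rate(tx_count))   # [3.0, 4.0, 8.0, 25.0] -- accelerating: anomaly signal
print(delta(tx_count))  # 40
```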

Eliminate Training-Serving Skew

Training-serving skew is the most insidious problem in production ML. It occurs when features are computed differently during training than during inference — different code paths, different libraries, different aggregation windows, different data freshness. The model silently degrades because it's seeing data that doesn't match what it was trained on.

Traditional feature store architectures make skew almost inevitable. Features are computed by Spark batch jobs for training (offline store), then recomputed by different code for inference (online store). Two codebases, two data paths, two opportunities for divergence.

ArcadeDB eliminates skew by using the same query engine for both training and serving. The exact same SQL or Cypher query that extracts features for training data runs at inference time. Same code path, same aggregation logic, same data source.

  • One query, two contexts: Run the feature extraction query against historical data for training, then against live data for inference — same SQL, same results
  • No separate offline/online stores: ArcadeDB serves both batch reads (training) and point lookups (inference) from the same engine
  • Real-time freshness: Features reflect the latest data at query time, not a stale batch snapshot
  • Auditable lineage: The query IS the feature definition. No ambiguity about how a feature was computed

Same Query, Two Contexts

-- This query works for BOTH
-- training (batch) and inference
-- (real-time) -- same code path

MATCH (user:User {id: $userId})

RETURN
  -- Graph features
  user.pagerank,
  user.community_id,
  size((user)-[:PURCHASED]->())
    AS purchase_count,

  -- Time-series features
  ts.rate(user, 'Interactions',
    'event_count') AS engagement,
  ts.last(user, 'Purchases',
    'amount') AS last_purchase,

  -- Vector embedding
  user.preference_vector,

  -- Document features
  user.account_type,
  user.signup_source

For training: iterate over all users and extract features in batch. For inference: call with a specific userId and get features in <10ms. Same query, zero skew.
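The pattern can be sketched in client code (hypothetical names; run_query stands in for whatever driver executes the shared query text):

```python
# The query text IS the feature definition -- shared by both contexts.
# Abbreviated stand-in for the full query above.
FEATURE_QUERY = """
MATCH (user:User {id: $userId})
RETURN user.pagerank AS pagerank, user.community_id AS community_id
"""

def extract_features(run_query, user_id):
    """One feature-extraction function for both training and serving,
    so the two contexts cannot diverge."""
    return run_query(FEATURE_QUERY, {"userId": user_id})

# Stand-in for a real database driver, for illustration only
def run_query(query, params):
    return {"pagerank": 0.01, "community_id": 7}

# Training context: batch over historical users
training_rows = [extract_features(run_query, uid) for uid in ["u1", "u2", "u3"]]
# Serving context: the same function, one user at a time
online_row = extract_features(run_query, "u42")
assert training_rows[0].keys() == online_row.keys()  # identical feature schema
```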

Feature Types by Use Case

Use Case                 Graph   Vector   Time Series   Doc
Fraud detection            ✓       ✓          ✓          ✓
Recommendations            ✓       ✓          ✓
Predictive maintenance     ✓                  ✓          ✓
Churn prediction           ✓       ✓          ✓          ✓
Demand forecasting                            ✓          ✓
Semantic search / RAG      ✓       ✓                     ✓
Customer segmentation      ✓       ✓          ✓

Every use case benefits from multiple feature types. Only a multi-model database can compute and serve all of them without cross-system ETL.

ML Use Cases Powered by Multi-Model Features

Fraud Detection

Graph: proximity to flagged accounts, community risk scores, transaction ring structure. Vectors: behavioral embedding deviation from baseline. Time series: transaction velocity spikes, rolling average deviations. Document: device metadata, geolocation.

Recommendations

Graph: collaborative filtering through purchase overlap, Personalized PageRank. Vectors: item embedding similarity, user preference vectors. Time series: trending items via rate analysis, seasonal patterns, recency weighting.

Predictive Maintenance

Graph: equipment dependency chains, cascade failure propagation. Time series: sensor rolling averages, vibration trends, temperature deviations. Document: maintenance logs, technician notes, equipment specifications.

Semantic Search & RAG

Vectors: document chunk embeddings for ANN retrieval. Graph: knowledge graph relationships between entities for context enrichment. Document: source metadata, access controls, authorship.

Why ArcadeDB for AI/ML

Building ML feature infrastructure typically requires assembling and maintaining a complex stack: a graph database for relationship features, a vector database for embeddings, a time-series database for temporal features, a document store for metadata, and a feature store layer to coordinate everything.

ArcadeDB replaces this entire stack with a single engine:

  • Graph + Vectors + Time Series + Documents in one database, with cross-model queries
  • JVector-powered ANN search: Sub-millisecond vector similarity at scale, with hybrid structured filtering
  • Zero training-serving skew: Same query engine for batch training and real-time inference
  • Zero ETL: No sync pipelines between systems, no data freshness gaps
  • <10ms feature serving: Fast enough for real-time inference at production scale
  • Three query languages: SQL, Cypher, Gremlin — integrate with any ML framework
  • Incremental vector updates: Add new embeddings without reindexing — no downtime

Apache 2.0 — Forever

ML infrastructure is the foundation of your AI strategy. It must be trustworthy, portable, and free from licensing risk. ArcadeDB is Apache 2.0 forever — no bait-and-switch, no per-vector pricing, no restrictions on embedding or redistributing. Build your AI stack on a foundation that won't shift beneath you.

Platform Comparison

Capability               Typical Stack            ArcadeDB
Graph features           Neo4j GDS                Built-in
Vector embeddings        Pinecone / Weaviate      JVector built-in
Time-series features     TimescaleDB / InfluxDB   Built-in
Document metadata        MongoDB                  Built-in
Feature coordination     Feast / Tecton           Not needed
Cross-model queries      Application code         Single query
Training-serving skew    Risk (dual code path)    Zero (same path)
License                  Mixed / proprietary      Apache 2.0

Industries

  • Financial Services: Fraud scoring, credit risk features, AML pattern features
  • E-commerce: Recommendation features, personalization signals, demand forecasting
  • Healthcare: Patient similarity, treatment outcome prediction, clinical trial matching
  • Manufacturing: Predictive maintenance features, quality prediction, supply chain optimization
  • Ad Tech: Lookalike audience creation, click prediction, bid optimization
  • Cybersecurity: Threat detection features, anomaly scoring, network behavior analysis

Client Success Story

"We use ArcadeDB as our feature store for all production ML models. The ability to combine vector embeddings with graph-computed features in real-time dramatically improved our model performance. Feature serving latency dropped from 45ms to under 10ms, enabling true real-time personalization. The multi-model approach eliminated three separate databases from our stack."

— ML Platform Lead, Technology Unicorn
(Company name confidential)

ML Performance Gains:

  • 78% reduction in feature serving latency (45ms → 10ms)
  • 12% improvement in model accuracy with graph features
  • 3 databases consolidated into 1
  • Real-time feature computation at 100K+ req/sec
  • Zero training-serving skew incidents since migration

Ready to Unify Your ML Feature Pipeline?

Compute graph features, store vector embeddings, extract time-series signals, and serve them all in under 10ms — from a single Apache 2.0 database. No ETL, no skew, no vendor lock-in.