Data Scientists Spend Up to 80% of Their Time on Data, Not Models
The most cited statistic in ML engineering is also the most frustrating: data scientists reportedly spend 50-80% of their time on data collection, cleaning, and feature engineering rather than building models. By some estimates, teams spend over 40% of their engineering effort maintaining data pipelines rather than creating new features. At the high end of that range, as little as 20% of a data scientist's time goes to actual model development.
The root cause is architectural. Production ML models need features from multiple data types:
- Graph features: PageRank, community membership, centrality, proximity to known entities (computed from a graph database)
- Vector embeddings: User and item representations from deep learning models (stored in a vector database)
- Time-series features: Rolling averages, trend slopes, seasonality indicators, lag values (from a time-series store)
- Document features: Structured metadata, configuration, text features (from a document store)
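To make the time-series bullet concrete, here is a minimal sketch of three of the features it names (a rolling average, a lag value, and a trend slope) in plain Python. The series `daily_spend`, the feature names, and the helper functions are hypothetical and chosen only for illustration; a real pipeline would read from a time-series store rather than a list.

```python
from statistics import mean

def rolling_mean(values, window):
    """Mean of the last `window` values (or of all values if fewer exist)."""
    return mean(values[-window:])

def lag(values, k):
    """Value observed k steps before the most recent one, or None if unavailable."""
    return values[-1 - k] if len(values) > k else None

def trend_slope(values):
    """Least-squares slope of the values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

# Hypothetical daily series, oldest value first.
daily_spend = [10.0, 12.0, 14.0, 16.0, 18.0]

features = {
    "spend_mean_3d": rolling_mean(daily_spend, 3),
    "spend_lag_1d": lag(daily_spend, 1),
    "spend_slope": trend_slope(daily_spend),
}
print(features)
```

Each value is cheap to compute in isolation; the difficulty the article describes comes from keeping such values consistent with graph, vector, and document features computed elsewhere.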
Today, each feature type lives in a separate system. Feature stores such as Feast or Tecton sit on top and coordinate access, but they do not solve the underlying problem: the data itself is scattered across four or five databases, connected by fragile ETL pipelines that introduce staleness, training-serving skew, and operational complexity.
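The staleness problem described above can be illustrated with a small sketch. It assumes each backing system reports when its feature was last refreshed; the `FeatureRead` type, the source names, and the refresh ages are all hypothetical and do not reflect any particular feature store's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class FeatureRead:
    source: str            # which backing system served the value (hypothetical names)
    value: object          # the feature value itself
    computed_at: datetime  # when the upstream pipeline last refreshed it

def staleness_report(reads, now):
    """Return each source's age and the gap between the freshest and stalest read."""
    ages = {r.source: now - r.computed_at for r in reads}
    spread = max(ages.values()) - min(ages.values())
    return ages, spread

# Fixed clock so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# One feature vector assembled from four separately refreshed systems.
reads = [
    FeatureRead("graph_db", 0.031, now - timedelta(hours=6)),
    FeatureRead("vector_db", [0.1, 0.7], now - timedelta(hours=24)),
    FeatureRead("timeseries_store", 16.0, now - timedelta(minutes=5)),
    FeatureRead("document_store", {"tier": "gold"}, now - timedelta(hours=1)),
]

ages, spread = staleness_report(reads, now)
print(spread)  # gap between the freshest and the stalest feature
```

In this made-up scenario a single feature vector mixes values that differ in age by almost a full day, which is the kind of inconsistency that independent ETL schedules tend to produce.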
ArcadeDB stores and computes all four feature types natively. No ETL. No sync. No skew.