AI-Native Data Infrastructure: What Recent Research Tells Us About Where This Is Heading
A briefing on three recent research papers
If you’re building modern data products—analytics, agents, copilots, personalization, fraud, or operations—you’re running into the same structural problem:
Your “data plane” and your “AI plane” are evolving faster than your infrastructure can keep up.
Teams are being squeezed from both sides:
More real-time demands (freshness expectations measured in seconds, not hours)
Higher reliability requirements (enterprise SLAs, exactly-once guarantees, auditability)
A sharp rise in operational complexity (multiple systems stitched together with connectors)
Compute cost pressure (especially for cloud networking, replication, and GPU utilization)
The result: even strong teams end up shipping systems where every new capability—streaming ingestion, vector search, retrieval-augmented generation (RAG), learned DB optimization—adds another layer of infrastructure and tuning.
The research in this post tackles a shared theme: how to make core data infrastructure more “AI-native”—meaning cheaper to run, easier to scale, and better aligned with the workloads we actually serve today.
The State of the Art — and Its Limits
Today’s industry norm looks roughly like this:
Streaming data lands in Kafka (or equivalent) for operational ingestion.
A patchwork of connectors and pipelines moves data into the lakehouse for analytics.
For AI applications, you bolt on a vector database (or vector index inside a DB) for retrieval.
Meanwhile, databases increasingly use ML internally—but usually via one-off models trained per dataset and per task, which makes them expensive to deploy and maintain.
This stack worked because each layer was independently best-in-class. But at scale it breaks down:
Duplicated infrastructure (streaming + storage + connectors + indexes)
Costly replication and cross-AZ traffic when designs optimized for ultra-low latency are used for data-heavy cloud ingestion
Inefficient RAG serving when retrieval and generation contend for the same compute resources (typically GPUs/CPUs)
High ML training overhead inside DBMSs, where collecting training data can require executing tens of thousands of SQL queries and take hours or days
The papers below propose three different “cuts” at the problem: redesign streaming for the lakehouse, redesign RAG serving hardware for retrieval + generation, and redesign ML-in-databases around reusable foundation models.
Paper 1:
Ursa: A Lakehouse-Native Data Streaming Engine for Kafka
What Problem This Paper Tackles
Lakehouses are popular because they combine low-cost object storage with warehouse-like guarantees. But real-time ingestion often still depends on Kafka plus connectors that copy data into lakehouse tables—adding operational complexity and cost.
A key observation in the paper: traditional leader-based streaming systems were designed for sub-100ms latency, but many lakehouse ingestion workloads only need sub-second latency (hundreds of milliseconds). When you run “ultra-low-latency” architectures for “data-heavy ingestion” in the cloud, cross-AZ disk replication can drive up network traffic and storage overprovisioning.
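To make the cost argument concrete, here is a back-of-envelope model of what leader-based replication costs in cross-AZ network transfer alone. The per-GB price, replication factor, and ingestion volume are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope cost model for cross-AZ replication during ingestion.
# All numbers are illustrative assumptions: a typical cloud charges per GB
# for cross-AZ transfer, and leader-based systems replicate each partition
# to followers placed in other AZs for durability.

def monthly_cross_az_cost(ingest_gb_per_day: float,
                          replication_factor: int = 3,
                          cross_az_price_per_gb: float = 0.02) -> float:
    """Cross-AZ transfer cost of leader->follower replication per month."""
    # Each byte is copied to (replication_factor - 1) followers; assume
    # every copy crosses an AZ boundary.
    cross_az_copies = replication_factor - 1
    return ingest_gb_per_day * 30 * cross_az_copies * cross_az_price_per_gb

# 1 TB/day of ingestion with 3x replication:
leader_based = monthly_cross_az_cost(1000)
# A lakehouse-native write path to object storage avoids this replication
# traffic entirely (object stores replicate internally); this sketch ignores
# object-storage request and storage costs.
print(f"leader-based cross-AZ transfer: ${leader_based:,.0f}/month")
```

Even at this modest scale, the replication traffic alone is a four-figure monthly line item, which is the overhead a lakehouse-native write path sidesteps.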
Core Idea
Ursa proposes a Kafka-compatible streaming engine that is lakehouse-native: instead of writing events to broker disks and then moving them via connectors, it writes directly to open lakehouse tables on object storage.
It also removes a common scaling pain: leader-based replication. Ursa is described as leaderless and “cloud-native,” aiming to reduce the cost and operational footprint while preserving important semantics like exactly-once delivery.
Why This Is Meaningfully Different
Rather than “optimize Kafka” or “optimize connectors,” Ursa changes the integration point: it treats the lakehouse table format + object storage as the destination streaming systems should write to, not an after-the-fact sink.
Practical Implications
For founders and data platform leaders, the promise is straightforward:
Fewer moving parts (less connector sprawl)
Lower cloud cost by avoiding architectures that force heavy cross-AZ replication for ingestion workloads
A simpler path to near-real-time lakehouse analytics with Kafka-compatible producers/consumers
Paper 2:
Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models
What Problem This Paper Tackles
Retrieval-Augmented Language Models (RALMs) pair an LLM with a vector database: retrieve relevant context via vector search, then generate text using that context. This approach can reduce inference cost because the model doesn’t need to “store” all knowledge in parameters; it can fetch knowledge during inference.
The serving challenge: you now have two heavy workloads in the loop—vector search and LLM inference—and they scale differently.
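The two-stage loop can be sketched in a few lines. The corpus, embeddings, and `generate` stub below are toy stand-ins for a vector database and an LLM; what matters is that the two stages are separate components with different compute profiles:

```python
import numpy as np

# Minimal sketch of the retrieve-then-generate loop. The corpus, random
# embeddings, and generate() stub are toy stand-ins; a real system would
# call a vector database and an LLM here.

rng = np.random.default_rng(0)
corpus = ["doc about streaming", "doc about lakehouses", "doc about GPUs"]
doc_vecs = rng.normal(size=(len(corpus), 64))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def retrieve(query_vec, k=2):
    """Stage 1: nearest-neighbor search (brute-force cosine similarity)."""
    scores = doc_vecs @ (query_vec / np.linalg.norm(query_vec))
    top = np.argsort(-scores)[:k]
    return [corpus[i] for i in top]

def generate(prompt, context):
    """Stage 2: stand-in for LLM inference conditioned on retrieved context."""
    return f"answer({prompt!r} | {len(context)} retrieved docs)"

query_vec = rng.normal(size=64)
context = retrieve(query_vec)    # vector-search-shaped work (memory-bound)
answer = generate("what is a lakehouse?", context)  # LLM-shaped work (compute-bound)
print(answer)
```

Retrieval is dominated by memory bandwidth and index traversal, generation by dense matrix math, which is exactly why the paper argues they deserve different hardware.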
Core Idea
Chameleon proposes a serving system that is both:
Heterogeneous: use different accelerators for different parts of the pipeline
Disaggregated: scale retrieval and generation independently
In their prototype, vector search runs on FPGAs and LLM inference runs on GPUs, with CPUs coordinating the cluster.
Why This Is Meaningfully Different
Most teams treat RAG serving as “GPU problem + database problem.” Chameleon argues it’s a systems composition problem: retrieval and generation have different compute profiles, so you shouldn’t force them onto the same architecture.
Empirically, they report up to 2.16× latency reduction and 3.18× throughput speedup versus a hybrid CPU-GPU baseline.
Practical Implications
For enterprise AI teams, the message is: if RAG is a core workload, hardware architecture becomes product architecture.
You may want infrastructure that treats vector search as a first-class accelerated workload (not an afterthought)
Disaggregation offers a cleaner scaling story: retrieval-heavy and generation-heavy workloads can grow independently
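A quick capacity-planning sketch shows why independent scaling matters. The per-node throughput numbers below are illustrative assumptions, not Chameleon's measurements:

```python
import math

# Capacity-planning sketch: monolithic vs disaggregated RAG serving.
# Per-node throughput figures are illustrative assumptions.

RETRIEVAL_QPS_PER_NODE = 400   # e.g., an accelerator node doing vector search
GENERATION_QPS_PER_NODE = 50   # e.g., a GPU node doing LLM inference

def disaggregated_nodes(target_qps):
    """Each pool scales independently to match the target load."""
    return (math.ceil(target_qps / RETRIEVAL_QPS_PER_NODE),
            math.ceil(target_qps / GENERATION_QPS_PER_NODE))

def monolithic_nodes(target_qps):
    """A combined node is bottlenecked by its slowest stage."""
    per_node = min(RETRIEVAL_QPS_PER_NODE, GENERATION_QPS_PER_NODE)
    return math.ceil(target_qps / per_node)

target = 1000
r, g = disaggregated_nodes(target)
m = monolithic_nodes(target)
print(f"disaggregated: {r} retrieval + {g} generation nodes; monolithic: {m} combined nodes")
```

In the monolithic layout, every combined node carries both kinds of hardware but is throttled by its generation stage, leaving most of its retrieval capacity idle; disaggregation buys only as much of each resource as the workload needs.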
Paper 3:
Towards Foundation Database Models (FDBMs)
What Problem This Paper Tackles
ML inside DBMSs has shown promise for tasks like query optimization, cardinality estimation, and cost estimation. But in practice, adoption is limited because most learned components are instance-specific: train a specialized model for a specific task on a specific dataset.
The paper highlights the real cost: collecting training data can require executing tens of thousands of SQL queries, taking hours or days, and each new dataset/task combination requires retraining and maintenance.
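The arithmetic behind that claim is easy to check. Query count and per-query latency below are illustrative assumptions in the range the paper describes:

```python
# Rough arithmetic on training-data collection cost for a learned DB
# component. Both numbers are illustrative assumptions.

queries = 50_000          # SQL queries executed to label training examples
seconds_per_query = 2.0   # average execution time per query
hours = queries * seconds_per_query / 3600
print(f"~{hours:.0f} hours of query execution per dataset/task pair")
```

Multiply that by every dataset and every task, and per-instance training stops being a rounding error and becomes the dominant deployment cost.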
Core Idea
The authors propose “Foundation Database Models”: models that generalize across both datasets and tasks, inspired by how foundation models work in NLP.
Their key architectural idea is a mixture of pre-trained expert models:
A data expert learns embeddings that summarize dataset characteristics (like distributions and correlations) using a transferable encoding that avoids database-specific constants and names
A logical plan expert enriches representations with how relational operations transform data
A physical plan expert adds understanding of physical operators and hardware/runtime factors
Downstream tasks can then be solved by simple “shallow” models on top of these representations, reducing task-specific data needs.
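The composition can be sketched as frozen expert embeddings feeding a cheap task-specific head. The "experts" below are random frozen projections standing in for pre-trained encoders, and the dimensions and target task are illustrative:

```python
import numpy as np

# Sketch of the mixture-of-experts composition: pre-trained experts emit
# frozen embeddings; a downstream task trains only a shallow head on top.
# Random projections stand in for pre-trained encoders here.

rng = np.random.default_rng(42)
D_RAW, D_EMB, N = 20, 8, 200

# Frozen expert encoders (in the paper, pre-trained once and reused).
W_data = rng.normal(size=(D_RAW, D_EMB))       # data expert
W_logical = rng.normal(size=(D_RAW, D_EMB))    # logical plan expert
W_physical = rng.normal(size=(D_RAW, D_EMB))   # physical plan expert

def embed(raw):
    """Concatenate the three experts' frozen embeddings."""
    return np.concatenate([raw @ W_data, raw @ W_logical, raw @ W_physical], axis=1)

# Downstream task (say, cost estimation): fit only a shallow linear head.
raw_features = rng.normal(size=(N, D_RAW))
X = embed(raw_features)                         # (N, 24) frozen representation
y = rng.normal(size=N)                          # synthetic training targets
head, *_ = np.linalg.lstsq(X, y, rcond=None)    # cheap task-specific adaptation
print("embedding dim:", X.shape[1], "head params:", head.size)
```

The point of the design is visible in the parameter counts: the experts are trained once and shared, and each new dataset/task pair only fits a head with a few dozen parameters instead of a full model.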
Why This Is Meaningfully Different
Instead of “train a model per DB,” the proposal is “pretrain reusable experts once, then adapt cheaply.” The mechanism is not magic—it’s modularity + transferable representations.
Practical Implications
If this direction matures, it could make ML-in-DBMS practical in more settings:
Lower training and maintenance overhead for learned DB components
A more realistic deployment story for cloud providers, where training a bespoke model per customer is often impractical
A platform path: DB intelligence becomes a reusable capability, not a bespoke project
How These Papers Relate
These papers are about different layers—streaming ingestion, RAG serving, and DB internals—but they rhyme:
They reduce “one-off integration tax.” Ursa removes connector-heavy pipelines; FDBMs aim to remove per-dataset/per-task model retraining.
They lean into specialization where it matters. Chameleon uses different accelerators for retrieval vs generation; FDBMs use different experts for data vs plans.
They optimize for cloud realities. Ursa explicitly calls out cross-AZ replication cost pitfalls; Chameleon argues for disaggregation to match workload scaling.
The shared theme: the next wave of “data infrastructure” is being redesigned around the dominant workloads of the next decade—real-time lakehouse analytics and retrieval-augmented AI—rather than around yesterday’s assumptions.
What This Unlocks Over Time
If directions like these become mainstream, expect:
Simpler end-to-end data paths (stream → lakehouse table without a second infrastructure tier)
More predictable RAG performance and cost as retrieval becomes a first-class, independently scalable component
DBMS intelligence that feels like a platform capability, not a research deployment—if foundation-style pretraining and reusable embeddings hold up across real enterprise diversity
Adoption barriers remain real: ecosystem compatibility, operational tooling, and proving reliability under messy production workloads. But the trajectory is clear: AI-era systems are pushing us toward integrated, modular, and cost-aware infrastructure designs.
What Founders and Leaders Should Take Away
Three practical questions to ask your team (or your vendors) after reading these papers:
Where are we paying “integration tax” for data movement? (connectors, duplicated storage, multi-tier pipelines)
Are we scaling RAG as one monolith, or as two distinct workloads (retrieval + generation)?
If we’re using ML inside data systems, are we building one-off models—or investing in reusable representations?
The big shift isn’t “more AI.” It’s AI-native system design: fewer bespoke components, more modularity, and architectures that match real cost and scaling behavior.
References
Matteo Merli, Sijie Guo, Penghui Li, Hang Chen, Neng Lu. Ursa: A Lakehouse-Native Data Streaming Engine for Kafka. PVLDB 18(12): 5184–5196, 2025.
Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, Gustavo Alonso. Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models. PVLDB 18(1): 42–52, 2024.
Johannes Wehrstein et al. Towards Foundation Database Models. CIDR 2025. https://www.vldb.org/cidrdb/2025/towards-foundation-database-models.html