You Already Have a Message Bus: Why We Stopped Using Kafka

Written by Alan Dennis | Jun 12, 2026 7:31:52 AM

Every data platform eventually needs to coordinate asynchronous events. A tenant finishes provisioning, the identity service needs to be notified. An ingestion job completes, the data quality check should start. A policy violation is detected, an alert needs to fire without blocking the detection path.

The conventional answer is a message bus: Apache Kafka, AWS SQS, Azure Service Bus, Google Cloud Pub/Sub. You stand one up, configure your topics, write producers and consumers, manage offsets, tune retention policies, monitor your brokers, and hope you hired someone who knows what they're doing when things go wrong.

We did this. Then we asked a question we probably should have asked earlier: why are we running a separate message bus when we already have Delta Lake?

The Insight: Delta Is Already a Log

Apache Kafka's core innovation, the one that made it the dominant distributed log, is treating a message queue as an append-only log with consumer-controlled offset tracking. Producers write records. Consumers read from an offset, advance that offset on success, and retry from the last committed offset on failure. This gives you at-least-once delivery with consumer-controlled progress.

Delta Lake has all of the same properties:

Append semantics. Delta writes are ACID-transactional. Every row appended commits fully or not at all.
Change tracking. Delta's Change Data Feed records row-level changes, including every insert, as versioned change events. A consumer reads CDF from version N forward and gets exactly the events published since its last checkpoint.
Consumer-controlled progress. A consumer stores its last-processed Delta version in a checkpoint table. It reads from version + 1 on the next cycle. If it crashes, it resumes from the last committed checkpoint on restart.
Multi-subscriber support. Each consumer has its own checkpoint, advancing independently. No consumer group coordination.
Permanent, queryable history. Messages are rows. You can SQL-query the full event history at any time.
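
To make these properties concrete, here is a minimal in-memory sketch of a versioned append-only log with per-consumer checkpoints. It is an illustration only, not the DeltaBus implementation: VersionedLog, poll, and the plain dict used for checkpoints are hypothetical stand-ins for a CDF-enabled Delta table and a checkpoint table.

```python
class VersionedLog:
    """In-memory stand-in (hypothetical) for a CDF-enabled Delta table."""

    def __init__(self):
        self._commits = []  # the commit at index i has version i + 1

    def append(self, rows):
        """Commit a batch of rows as one new version and return that version."""
        self._commits.append(list(rows))
        return len(self._commits)

    def latest_version(self):
        return len(self._commits)

    def changes(self, after, upto):
        """Return rows committed in versions v where after < v <= upto."""
        return [row for batch in self._commits[after:upto] for row in batch]


def poll(log, checkpoints, consumer_id, handler):
    """Handle everything since this consumer's checkpoint, then advance it."""
    start = checkpoints.get(consumer_id, 0)  # 0 means nothing processed yet
    target = log.latest_version()
    for row in log.changes(start, target):
        handler(row)  # if this raises, the checkpoint below is never written
    checkpoints[consumer_id] = target
```

Each consumer ID advances independently, and a handler that raises leaves the checkpoint where it was, so the same rows are delivered again on the next poll. That redelivery is what at-least-once means in practice, and it is why handlers should tolerate duplicates.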

The gap between "has the right properties" and "works as a production event bus" is latency and publishing ergonomics. That's what DeltaBus adds.

How DeltaBus Works

DeltaBus is a publish-subscribe event bus implemented entirely on Delta Lake and checkpoint-based Change Data Feed consumption. Its central data structure is a messages table — a CDF-enabled Delta table where each row is an event.

Publishing is dual-mode and transparent to the caller. When a Databricks ZeroBus endpoint is configured, the ZeroBus SDK provides gRPC streaming ingest with ~5-second acknowledged delivery. When it's not (or if the endpoint goes down), an in-memory buffer flushes to Delta every 5 seconds via a single batch write. Both modes produce identical outcomes: a row in the messages table, visible to consumers.
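
The dual-mode behavior described above can be sketched as follows. This is a hedged illustration, not the actual publisher: stream_send is a hypothetical callable standing in for a streaming-ingest client (no real ZeroBus SDK call is shown), write_batch is a hypothetical callable standing in for a single batch append to the messages table, and the injected clock exists only to make the timing testable.

```python
import time


class Publisher:
    """Sketch of a publisher that streams when it can and buffers otherwise."""

    def __init__(self, write_batch, stream_send=None, flush_interval=5.0,
                 clock=time.monotonic):
        self._write_batch = write_batch    # hypothetical: one batch append to Delta
        self._stream_send = stream_send    # hypothetical: streaming-ingest client
        self._flush_interval = flush_interval
        self._clock = clock
        self._buffer = []
        self._last_flush = clock()

    def publish(self, topic, payload):
        msg = {"topic": topic, "payload": payload}
        if self._stream_send is not None:
            try:
                self._stream_send(msg)
                return
            except ConnectionError:
                pass  # endpoint unavailable: fall back to the buffer
        self._buffer.append(msg)
        self.flush_if_due()

    def flush_if_due(self):
        """Flush when the interval has elapsed; a timer could also call this."""
        if self._buffer and self._clock() - self._last_flush >= self._flush_interval:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        batch = list(self._buffer)
        self._write_batch(batch)           # raises on failure, buffer stays intact
        del self._buffer[:len(batch)]
        self._last_flush = self._clock()
```

The caller only ever sees publish(topic, payload); which path the message takes is decided inside the publisher. Note that a buffered message is lost if the process dies before the next flush, so a real implementation would need to decide how much of that risk is acceptable.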

Consuming works through batch CDF polling. A consumer defines a topic filter, a handler function, and a stable consumer ID. On each poll cycle, it reads CDF changes from the last committed checkpoint forward, dispatches messages to the handler, and advances the checkpoint only after successful processing. Crash mid-processing? The consumer resumes from the last committed version on restart. Messages that still fail after the maximum number of retries are dead-lettered: they are preserved with full error context in a dedicated table for inspection and replay.
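
A single poll cycle with retries and dead-lettering might look like the sketch below. It assumes a hypothetical read_changes(start_version) callable that returns (version, message) pairs newer than the checkpoint, and it keeps the checkpoint and dead letters in memory where the real system would use Delta tables. Treating a dead-lettered message as handled, so that one poison message cannot block the topic, is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Tuple


@dataclass
class Consumer:
    consumer_id: str
    topic: str
    handler: Callable[[dict], None]
    max_retries: int = 3
    checkpoint: int = 0                               # stand-in for a checkpoints row
    dead_letters: list = field(default_factory=list)  # stand-in for dead_letters

    def poll_once(self, read_changes: Callable[[int], Iterable[Tuple[int, dict]]]) -> int:
        """Run one poll cycle; return how many messages the handler accepted."""
        handled = 0
        last_version = self.checkpoint
        for version, msg in read_changes(self.checkpoint):
            last_version = max(last_version, version)
            if msg["topic"] != self.topic:
                continue  # topic filter
            if self._dispatch(msg, version):
                handled += 1
        # Advance only once every message was handled or dead-lettered.
        self.checkpoint = last_version
        return handled

    def _dispatch(self, msg: dict, version: int) -> bool:
        error = None
        for _ in range(self.max_retries):
            try:
                self.handler(msg)
                return True
            except Exception as exc:
                error = exc
        self.dead_letters.append({
            "consumer_id": self.consumer_id,
            "version": version,
            "message": msg,
            "error": repr(error),
            "attempts": self.max_retries,
        })
        return False
```

If the process dies partway through poll_once, the checkpoint assignment never runs, so the next cycle re-reads the same versions.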

The entire system uses three Delta tables: messages (the event store), checkpoints (consumer progress), and dead_letters (failed messages). That's it.
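
As a rough sketch of what those three tables could look like, the snippet below issues the DDL through a Spark session. The column names and types are assumptions made for illustration and are not taken from the DeltaBus whitepaper; the one detail that follows from the text is that the messages table has Change Data Feed enabled, which Delta exposes as the delta.enableChangeDataFeed table property.

```python
# Hypothetical schemas: every column name below is an illustrative assumption.
MESSAGES_DDL = """
CREATE TABLE IF NOT EXISTS {schema}.messages (
  message_id   STRING,
  topic        STRING,
  source       STRING,
  payload      STRING,
  published_at TIMESTAMP
) USING DELTA
TBLPROPERTIES (delta.enableChangeDataFeed = true)
"""

CHECKPOINTS_DDL = """
CREATE TABLE IF NOT EXISTS {schema}.checkpoints (
  consumer_id  STRING,
  last_version BIGINT,
  updated_at   TIMESTAMP
) USING DELTA
"""

DEAD_LETTERS_DDL = """
CREATE TABLE IF NOT EXISTS {schema}.dead_letters (
  message_id  STRING,
  consumer_id STRING,
  topic       STRING,
  payload     STRING,
  error       STRING,
  attempts    INT,
  failed_at   TIMESTAMP
) USING DELTA
"""


def create_tables(spark, schema):
    """Create the three tables in `schema` using an existing Spark session."""
    for ddl in (MESSAGES_DDL, CHECKPOINTS_DDL, DEAD_LETTERS_DDL):
        spark.sql(ddl.format(schema=schema))
```

Only the messages table needs the change feed, since it is the only table consumers read incrementally. Checkpoints and dead letters are ordinary Delta tables.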

The Economics Are Hard to Ignore

Here's a comparison at 10 million events per day — a moderate platform operations volume:

Solution	Estimated Monthly Cost
Self-managed Kafka	$2,800–8,000
Managed queue service (SQS, Service Bus)	$650–3,000
DeltaBus	~$50

DeltaBus costs approximately $50/month in Delta storage, with no per-message ingestion fee and no idle compute. Kafka requires broker clusters, coordination services, and continuous monitoring. Managed queue services eliminate the operational burden but charge per-message and apply TTL-based retention policies that delete event history — making compliance reporting and forensics significantly harder.

DeltaBus retains events permanently. That 10-million-event-per-day history is queryable by SQL at any time. "Which components published the most events last week?" is a five-line SQL query against the messages table, not a pipeline into an external observability tool.
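
For a sense of what such a query looks like, here is a small sketch that runs it locally. It uses Python's built-in sqlite3 as a stand-in for a SQL warehouse, and the source and published_at columns are assumed names, not a documented DeltaBus schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Five-line query: event counts per publishing component since a cutoff.
TOP_PUBLISHERS = """
SELECT source, COUNT(*) AS events
FROM messages
WHERE published_at >= :since
GROUP BY source
ORDER BY events DESC
"""


def top_publishers(conn, since):
    """Return (source, event_count) rows for events published at or after `since`."""
    return conn.execute(TOP_PUBLISHERS, {"since": since.isoformat()}).fetchall()


# Local demo data in an in-memory database (timestamps stored as ISO 8601 text).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (source TEXT, topic TEXT, published_at TEXT)")
now = datetime(2026, 6, 12, tzinfo=timezone.utc)
sample = [
    ("ingestion", "ingest.completed", now - timedelta(days=1)),
    ("ingestion", "ingest.completed", now - timedelta(days=2)),
    ("provisioning", "tenant.provisioned", now - timedelta(days=3)),
    ("ingestion", "ingest.completed", now - timedelta(days=30)),  # outside the window
]
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [(source, topic, ts.isoformat()) for source, topic, ts in sample],
)
last_week = top_publishers(conn, now - timedelta(days=7))
```

Against a real messages table the cutoff would be a timestamp comparison rather than a text comparison, but the shape of the query is the same.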

What DeltaBus Is and Isn't

DeltaBus is the right choice for platform operations events: tenant provisioning, workflow coordination, ingestion lifecycle, audit logging, compliance tracking. For these workloads, 5–10 second end-to-end latency is entirely acceptable, permanent event history is a feature (not a liability), and Unity Catalog governance over the event stream is essential.

DeltaBus is not the right choice for sub-second latency requirements, billions of events per second, cross-workspace federation, or request-reply patterns. These workloads call for dedicated streaming infrastructure. DeltaBus doesn't try to compete with Kafka at its own game — it eliminates Kafka from the 80% of the workload where Kafka is operational overhead without corresponding benefit.

Governance That Extends to the Event Stream

One of the less-discussed costs of an external message bus is the governance gap. Events published to Kafka or SQS exist outside your Unity Catalog governance boundary. They can't be governed by table-level ACLs, included in data lineage, or queried by your SQL tools.

DeltaBus events live inside your Databricks workspace as rows in a Delta table. Unity Catalog ACLs apply to the messages table exactly like any other data asset. The event stream participates in the same governance model as your Bronze, Silver, and Gold tables. For regulated industries where every data access needs to be auditable, this matters.

Production Reality

Covasant's Auraa platform runs DeltaBus in production across all platform operations workflows. Tenant provisioning commands, ingestion lifecycle events, data quality notifications, and audit records all flow through DeltaBus. The messaging infrastructure has consumed effectively zero engineering time after initial setup, because there is no broker or cluster to manage. The Delta tables are governed by Unity Catalog. The SQL Warehouse is managed by the workspace. The operational surface area is the messages table, the checkpoints table, and the dead_letters table.

Before adopting DeltaBus, the honest question to ask is: does my workload have requirements that justify a dedicated message bus? For platform operations at lakehouse scale, the answer is almost always no.

Read the full technical whitepaper.

DeltaBus: A Lakehouse-Native Event Bus Pattern for Data Platforms covers the complete architecture, code examples, cost analysis, fault tolerance model, and when to choose a dedicated bus instead.


Frequently asked questions

Do I still need Kafka or a separate message bus if I'm already running Delta Lake on Databricks?

For most platform operations workloads, no. Delta Lake already behaves like an append-only log with consumer-controlled offset tracking, which is the same core property that made Kafka the default. The DeltaBus pattern adds the publishing ergonomics and polling layer on top, so events become rows in a governed Delta table. Kafka still wins for sub-second latency, billions of events per second, cross-workspace federation, or request-reply patterns. For tenant provisioning, ingestion lifecycle, audit logging, and compliance events, a Delta-native bus removes infrastructure you would otherwise have to staff and monitor.

We bought Databricks but a year later our data still isn't AI-ready. What's wrong?

The usual cause is that modernization was scoped as a multi-quarter, SI-dependent program where each source gets mapped by hand. Standing up the lakehouse is only part of it; getting clean, connected, agent-ready data on top is where the time goes. Auraa is built to compress that by porting your sources into Databricks natively rather than rebuilding the mapping work source by source. It's positioned as the fastest path to AI-ready data on Databricks.

What is the fastest way to get AI-ready data on Databricks?

The bottleneck is rarely the lakehouse itself; it's connecting and conforming each source so agents and models can reason over it. Auraa is a Databricks-native product built specifically for this step. Because it runs natively on Databricks, schema and structural changes propagate without a separate integration project, which is the difference between a productized path and a hand-built one.

Is DeltaBus a product I can buy, or a pattern?

DeltaBus is an architectural pattern, a publish-subscribe event bus implemented entirely on Delta Lake and checkpoint-based Change Data Feed consumption. It uses three Delta tables: messages, checkpoints, and dead_letters. It runs in production inside Auraa, Covasant's Databricks-native data readiness platform. The full technical whitepaper covers the architecture, fault tolerance model, and cost analysis.

How much does running a message bus on Delta Lake cost versus Kafka?

At roughly 10 million events per day, a Delta-native bus runs around $50 per month in storage, against an estimated $2,800 to $8,000 for self-managed Kafka and $650 to $3,000 for managed queue services. There's no per-message ingestion fee and no idle compute. Managed queues also apply retention policies that delete event history, while a Delta-native approach keeps the full event history queryable by SQL.

How do I keep my event stream inside Unity Catalog governance?

Events published to an external bus like Kafka or SQS sit outside your Unity Catalog boundary, so they can't be governed by table-level ACLs, included in lineage, or queried by your SQL tools. When events live as rows in a Delta table, Unity Catalog ACLs apply to them exactly like any other data asset, and the event stream participates in the same governance model as your Bronze, Silver, and Gold tables. For regulated industries that need every data access to be auditable, this closes the governance gap.

What latency can a Delta Lake event bus actually deliver?

With a Databricks ZeroBus endpoint configured, the pattern gives roughly 5-second acknowledged delivery via gRPC streaming ingest. Without it, an in-memory buffer flushes to Delta every 5 seconds. Both produce the same outcome: a committed row visible to consumers. For platform operations events, 5 to 10 second end-to-end latency is acceptable. If you need sub-second delivery, use dedicated streaming infrastructure instead.

How does a Delta-native bus handle failures and retries?

Consumers track progress with a checkpoint storing the last-processed Delta version and advance it only after successful processing. A consumer that crashes mid-processing resumes from its last committed version on restart, giving at-least-once delivery. Messages that exhaust their retries are preserved with full error context in a dead_letters table for inspection and replay.

Auraa, ARIIA, CAMS, Data Nexus: which one do I need?

Auraa is for enterprises on Databricks that want AI-ready data faster; it's the data readiness product. ARIIA is the reasoning and intelligence platform you run when you want to skip building a traditional data lake and go straight to a knowledge graph and reasoning layer with agents on top. CAMS is the agent lifecycle management platform. Data Nexus is the agentic master data management product that keeps records clean. If you already have Databricks, the entry point is Auraa.
