
Apache Iceberg: Unifying Batch and Streaming in the Data Lakehouse

 

Introduction

The line between batch and streaming processing is blurring. Apache Iceberg, an open table format, is at the forefront of this shift, promising to unify these paradigms in a single, scalable framework. But how does it achieve this? Can it replace traditional streaming systems like Kafka? And how do tools like Flink and Spark Streaming leverage Iceberg for streaming workloads? Let’s dive in.

What is Apache Iceberg?

Apache Iceberg is a table format designed for data lakehouses, built to manage massive datasets on storage systems like S3, HDFS, or ADLS. Unlike traditional file-based formats, Iceberg abstracts file management with a rich metadata layer, enabling features like ACID transactions, snapshot isolation, and schema evolution. Its ability to handle both batch and streaming workloads on the same table makes it a game-changer.

Unifying Batch and Streaming: How Iceberg Does It

Iceberg bridges the batch-streaming divide with several key features:

  1. Unified Table Format: Iceberg provides a single table structure that both batch and streaming jobs can read and write. Whether you’re running a nightly batch ETL with Spark or a real-time streaming pipeline with Flink, it’s the same dataset—no duplication required.

  2. Snapshot Isolation: Every write creates a new snapshot, a point-in-time view of the table. Batch jobs can process historical data from a specific snapshot, while streaming jobs append updates incrementally, all with consistent reads.

  3. Incremental Processing: Iceberg supports reading only new or changed data since the last snapshot, perfect for streaming engines like Flink or Spark Structured Streaming. Options like stream-from-timestamp make this efficient.

  4. Merge-on-Read (MoR): For streaming, MoR lets frequent row-level changes (inserts, updates, deletes) be written quickly as data and delete files, deferring the merge to read time. This contrasts with Copy-on-Write (CoW), which rewrites affected data files on every change and suits batch jobs better.

  5. Engine Agnosticism: Iceberg integrates with Spark, Flink, Trino, and more, letting you mix and match engines for batch and streaming on the same table.

  6. File Management: Features like compaction ensure small files from streaming don’t degrade performance, keeping the table optimized for both workloads.
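The snapshot and incremental-read ideas above can be sketched in a few lines. This is a toy model, not the real Iceberg API: the class and method names are hypothetical, and a real snapshot also tracks manifests, schema, and partition metadata. It only illustrates how one table serves a batch reader (pinned to a snapshot) and a streaming reader (reading only what is new) at the same time.

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    snapshot_id: int
    added_files: list  # files added by this commit

@dataclass
class ToyIcebergTable:
    snapshots: list = field(default_factory=list)

    def commit(self, new_files):
        # Each write produces a new immutable snapshot.
        snap_id = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(snap_id, new_files))
        return snap_id

    def read_snapshot(self, snapshot_id):
        """Batch read: everything visible as of a point-in-time snapshot."""
        files = []
        for snap in self.snapshots:
            if snap.snapshot_id <= snapshot_id:
                files.extend(snap.added_files)
        return files

    def read_incremental(self, after_snapshot_id):
        """Streaming read: only files committed after a given snapshot."""
        files = []
        for snap in self.snapshots:
            if snap.snapshot_id > after_snapshot_id:
                files.extend(snap.added_files)
        return files

table = ToyIcebergTable()
table.commit(["clicks-000.parquet"])   # snapshot 1 (e.g., batch load)
table.commit(["clicks-001.parquet"])   # snapshot 2 (streaming append)

print(table.read_snapshot(1))      # batch job pinned to snapshot 1
print(table.read_incremental(1))   # streaming job sees only what's new
```

The key property is that both readers hit the same metadata: the batch reader gets a consistent historical view while the streaming reader tails new commits, with no copies of the data.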

Imagine an e-commerce platform: a streaming job ingests clickstream data into an Iceberg table every minute, while a batch job aggregates sales trends hourly. Both operate seamlessly on the same table, simplifying your architecture.
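Merge-on-Read itself can be condensed into one function. This is a deliberately simplified sketch: real Iceberg tracks deletes as positional or equality delete files, and all names below are illustrative. The point is only the shape of the trade: writers append cheaply, and the reader pays for the merge.

```python
# Writer side: cheap appends, no rewriting of existing files.
base = {1: "alice", 2: "bob", 3: "carol"}   # rows in base data files
appended = {4: "dave"}                      # new data file from a stream
deleted_keys = {2}                          # contents of a delete file

def merged_view():
    """Reader-side merge: apply deletes, then overlay appends."""
    rows = {k: v for k, v in base.items() if k not in deleted_keys}
    rows.update(appended)
    return rows

print(sorted(merged_view()))  # surviving keys after the merge
```

Copy-on-Write would instead rewrite the data file containing row 2 at commit time, making writes heavier but reads trivial, which is why it favors batch workloads.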

Can Iceberg Replace Kafka?

A common question is whether Iceberg’s streaming capabilities can eliminate the need for Kafka. The answer? It depends.

  • Kafka’s Strength: Kafka is a distributed streaming platform for real-time event ingestion and distribution. It’s a message broker, excelling at low-latency, high-throughput event streams (e.g., millions of events/sec with sub-second delivery).

  • Iceberg’s Role: Iceberg is a storage and processing layer, not an ingestion system. It unifies data downstream but doesn’t replace Kafka’s role as a buffer or pub-sub system.

  • When to Skip Kafka: If your producers can write directly to storage (e.g., S3) and your streaming engine (e.g., Flink) can process from there, you might bypass Kafka. This simplifies the stack but sacrifices Kafka’s buffering, fault tolerance, and multi-consumer distribution.

  • Performance Trade-Off: Kafka + Iceberg delivers lower latency (seconds) and higher throughput; Iceberg alone typically lands in the tens of seconds, with throughput capped by object-storage commit rates. For real-time needs, Kafka remains king.

In short, Iceberg complements Kafka rather than replaces it—unless your use case tolerates simpler, slower streaming.

Streaming with Flink and Spark on Iceberg

Flink and Spark Structured Streaming can “listen” to changes in Iceberg tables, but it’s not quite like Kafka’s event-driven streaming. Here’s how they work:

Flink with Iceberg

  • How It Works: Flink’s Iceberg source polls the table’s metadata (e.g., every 5 seconds) for new snapshots, processing only the new data. Options like monitor-interval and startSnapshotId configure this.

  • Streaming Feel: It’s near real-time, with latency tied to snapshot commit frequency and polling intervals—think seconds, not milliseconds.
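The poll-for-new-snapshots pattern described above reduces to a small loop. This is a toy version, not Flink's actual source implementation; the function name and snapshot ids are made up. Each tick, the source compares the table's snapshot list against the last snapshot it processed and emits only the new ones.

```python
def poll_once(all_snapshot_ids, last_processed):
    """One monitor-interval tick: emit snapshots newer than the last seen."""
    new = [s for s in all_snapshot_ids if s > last_processed]
    return new, (new[-1] if new else last_processed)

snapshots_on_table = [101, 102]   # committed before the job starts
last_processed = 0

emitted, last_processed = poll_once(snapshots_on_table, last_processed)
print(emitted)                    # backfill: both existing snapshots

snapshots_on_table.append(103)    # a writer commits while we run
emitted, last_processed = poll_once(snapshots_on_table, last_processed)
print(emitted)                    # only the new snapshot

emitted, _ = poll_once(snapshots_on_table, last_processed)
print(emitted)                    # empty tick: nothing new
```

This is why latency is bounded below by commit frequency plus the monitor interval: a snapshot is invisible to the loop until it has been committed and the next poll fires.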

Spark Structured Streaming with Iceberg

  • How It Works: Spark polls Iceberg via streaming queries, using triggers (e.g., every 10 seconds) to check for new snapshots or data. Options like stream-from-timestamp define the starting point.

  • Streaming Feel: Micro-batch processing, with latency in the seconds-to-minutes range.
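Wired together, the read side looks roughly like the configuration sketch below. It assumes a running Spark session with the Iceberg runtime and a configured catalog; the table name, timestamp value, and checkpoint path are hypothetical, and the snippet is not runnable on its own.

```python
# Sketch only: assumes `spark` is a SparkSession with Iceberg configured.
df = (spark.readStream
      .format("iceberg")
      .option("stream-from-timestamp", "1711000000000")  # start point, epoch ms
      .load("demo.db.clicks"))                           # hypothetical table

query = (df.writeStream
         .format("console")
         .trigger(processingTime="10 seconds")           # micro-batch interval
         .option("checkpointLocation", "/tmp/clicks-ckpt")  # hypothetical path
         .start())
```

The trigger interval is the "poll" here: every 10 seconds Spark checks for snapshots committed after the last processed one, which is what puts latency in the seconds-to-minutes range.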

Kafka Comparison

Unlike Kafka’s push-based, event-level streaming, Iceberg’s pull-based model processes snapshot-level changes. Kafka delivers sub-second latency; Iceberg with Flink/Spark is near real-time. For use cases like minute-by-minute analytics, this works well—but for sub-second needs, Kafka still shines.

Practical Example

Consider a pipeline tracking user clicks:

  • Kafka + Iceberg: Kafka ingests clicks at 1M events/sec, Flink writes to Iceberg every 5 seconds, and analytics see data in 10 seconds.

  • Iceberg Alone: Producers write to S3, Spark polls every 10 seconds, commits to Iceberg, and analytics see data in 30 seconds. Throughput drops to ~100K events/sec.

Conclusion: Iceberg’s Place in Your Stack

Apache Iceberg unifies batch and streaming by offering a single, ACID-compliant table format that simplifies data architecture. It doesn’t replace Kafka’s real-time ingestion but can reduce complexity downstream, especially when paired with Flink or Spark Streaming. For modest streaming needs, you might skip Kafka entirely—just ensure your latency and throughput requirements align with Iceberg’s strengths.

As of March 2025, Iceberg’s adoption is growing, and its ability to streamline data lakehouses makes it a must-know tool. Whether you’re unifying workloads or rethinking your streaming stack, Iceberg is worth a closer look.

