The data lake landscape is undergoing a fundamental transformation. Traditional Hive tables are giving way to a new generation of open table formats—Apache Iceberg, Apache Hudi, Delta Lake, and emerging contenders like DuckLake—each promising to solve the inherent challenges of managing massive datasets at scale.
But which format fits your architecture? This session cuts through the marketing noise to deliver practical insights for data architects and engineers navigating this critical decision. We’ll explore how these formats tackle schema evolution, time travel, ACID transactions, and metadata management differently, and what these differences mean for your data platform’s performance, reliability, and total cost of ownership.
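As a taste of the kind of comparison the session covers, here is a minimal sketch of two of those features, schema evolution and time travel, expressed against an Apache Iceberg table from PySpark. The catalog name `demo`, the table `demo.db.events`, and the timestamp are illustrative assumptions, not taken from any particular deployment:

```python
# Minimal sketch: schema evolution and time travel on an Apache Iceberg
# table via Spark SQL. Catalog/table names and the timestamp are
# illustrative; a real session needs an Iceberg catalog configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

# Schema evolution: adding a column is a metadata-only operation in
# Iceberg -- existing data files are not rewritten.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN country STRING")

# Time travel: read the table as it existed at an earlier point in time.
spark.sql(
    "SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()
```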
Drawing from real-world implementations, you’ll discover the hidden complexities, unexpected benefits, and common pitfalls of each approach. Whether you’re modernizing legacy Hive infrastructure, building greenfield data lakes, or evaluating lakehouse architectures, you’ll leave with a clear framework for choosing and implementing the right open table format for your specific use case—and the confidence to justify that decision to stakeholders.
Highlights:
- Format Face-Off: Direct comparison of Hive, Iceberg, Hudi, Delta Lake, and DuckLake capabilities across critical dimensions including ACID guarantees, partition evolution, and query performance optimization
- Real-World Battle Scars: Lessons learned from production deployments including migration strategies, performance tuning insights, and cost implications at petabyte scale
- Ecosystem Integration Deep-Dive: How each format plays with modern compute engines (e.g., Spark, Flink, Trino, Presto, DuckDB) and cloud platforms, plus vendor lock-in considerations (see the cross-engine sketch after this list)
- The Hidden Costs: Beyond storage and compute—examining operational overhead, team expertise requirements, and long-term maintenance implications of your format choice
- Decision Framework: A practical methodology for evaluating which open table format aligns with your organization’s data architecture, workload patterns, and strategic goals
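To make the ecosystem-integration point concrete, here is a hedged sketch of cross-engine access: an Iceberg table written by one engine (such as Spark above) being read from DuckDB through its iceberg extension. The table path is an illustrative placeholder:

```python
# Hedged sketch: reading an Iceberg table from DuckDB via the iceberg
# extension, with no Spark dependency. The table path is a placeholder.
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# iceberg_scan reads the table's metadata and data files directly --
# one concrete measure of how portable a format is across engines.
count = con.execute(
    "SELECT count(*) FROM iceberg_scan('warehouse/db/events')"
).fetchone()
print(count)
```

That a second engine can query the same files without coordination through the writer is exactly the kind of portability trade-off the decision framework weighs.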