Open table format optimized for incremental processing and CDC ingestion. Apache-2.0. Copy-on-write and merge-on-read storage modes, record-level upserts, change streams. Pioneering lakehouse format from Uber.
Apache Hudi is an open table format optimized for incremental processing and CDC ingestion (Apache-2.0, originated at Uber). Where Iceberg + Delta optimize for analytical reads with strong ACID semantics, Hudi optimizes for CDC-driven writes: record-level upserts and change streams as first-class primitives. Pick Hudi when the primary use case is streaming/CDC ingestion with downstream analytical reads; Uber, Robinhood, and Walmart run Hudi at scale for exactly this pattern. Pick Iceberg for greenfield analytical lakehouses; pick Hudi for CDC-driven pipelines.
Hudi's design center is record-level mutation: UPDATEs and DELETEs against Parquet files at scale, which is harder than the append-only pattern Iceberg/Delta optimize for. From a Trust Before Intelligence lens, this matters for two compliance scenarios: GDPR right-to-be-forgotten (record-level DELETE is native, not a partition rewrite), and CDC source-of-truth (the lakehouse mirrors the operational source's state, including deletes). The trade-off: Hudi's metadata model is more complex than Iceberg's, and operational maturity requires investment.
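To make the write path concrete, here is a minimal PySpark sketch of a record-level upsert. The table name, path, and field names (`user_id`, `updated_at`) are illustrative assumptions, not part of this analysis; the `hoodie.*` options are standard Hudi datasource configs.

```python
# Minimal PySpark sketch of a Hudi record-level upsert.
# Table name, path, and fields are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-upsert-sketch")
    # Hudi's Spark bundle must be on the classpath (version omitted here).
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

updates = spark.createDataFrame(
    [("u-42", "ada@example.com", "2024-01-02T00:00:00Z")],
    ["user_id", "email", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.recordkey.field": "user_id",      # record-level key
    "hoodie.datasource.write.precombine.field": "updated_at",  # newest version wins per key
    "hoodie.datasource.write.operation": "upsert",             # rewrite existing keys, insert new ones
}

# mode("append") is correct for upserts; Hudi resolves keys, not Spark.
updates.write.format("hudi").options(**hudi_options).mode("append").save("s3a://lake/users")
```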
Read latency depends on storage mode: copy-on-write reads are fast; merge-on-read trades read latency for write throughput. Cap rule N/A.
Engine-agnostic SQL via Spark/Flink/Hive/Trino. Cap rule N/A.
Catalog-level ACLs only; no native row-level security. Cap rule applied.
Multi-engine, multi-cloud.
Timeline metadata captures every commit, deletion, compaction. Strong C.
Timeline = full audit trail. Per-commit metadata. Strong T.
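Because every commit is an instant on the timeline, readers can consume the table as a change stream. A minimal incremental-query sketch, reusing the `spark` session from the upsert example; the begin instant is an assumed commit timestamp.

```python
# Incremental read: only records committed after the given timeline instant.
# "20240102000000" is an assumed commit timestamp, not a real one.
changes = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240102000000")
    .load("s3a://lake/users")
)
changes.show()
```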
G1=N, G2=Y (timeline audit), G4=Y (rollback via timeline), 2/6 -> 3.
Timeline visibility lifts O. 2/6 -> 3.
Streaming-optimized writes; multi-engine reads. 5/6 -> 4.
Schema + partition metadata. 1/6 -> 3 lenient.
ACID + record-level mutations + replication. 5/6 -> 4.
Best suited for
Compliance certifications
Apache Hudi (the format) holds no compliance certifications — compliance lives in the underlying storage substrate (S3/GCS/ADLS) and the catalog provider. For regulated workloads, choose substrate + catalog combinations matching your compliance gate.
Use with caution for
Iceberg has broader engine support + simpler metadata model + larger community. Hudi wins on CDC/streaming-write optimization + record-level mutation. Greenfield analytical lakehouse: Iceberg. CDC-driven pipelines: Hudi.
Delta is Databricks-native + has UniForm interop. Hudi wins on streaming-write semantics; Delta wins on Databricks ecosystem fit.
Role: L1 Lakehouse Format optimized for streaming/CDC writes + record-level mutation. Specialized peer to Iceberg + Delta.
Upstream: Receives writes from L2 streaming (Flink, Spark Structured Streaming, DeltaStreamer with Debezium CDC).
Downstream: Read by L1 query engines (Spark, Flink, Trino, Hive) and warehouses with Hudi connectors.
Mitigation: Match storage mode to the write/read ratio. CoW for read-heavy; MoR for write-heavy. Test both on a representative workload before committing.
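The storage mode is a per-table write option; everything else about the write path stays the same. A sketch, extending the `hudi_options` dict from the upsert example above:

```python
# CoW vs MoR is one write option; test both against the same workload.
cow_options = {**hudi_options, "hoodie.datasource.write.table.type": "COPY_ON_WRITE"}  # read-optimized
mor_options = {**hudi_options, "hoodie.datasource.write.table.type": "MERGE_ON_READ"}  # write-optimized
```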
Mitigation: Schedule async compaction. Monitor file count vs row count. Document the compaction cadence in the ops runbook.
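One way to express that in write options, assuming the MoR table from the sketch above; the trigger value is a starting point to tune, not a recommendation:

```python
# Defer compaction off the write path; a background job merges log files later.
async_compaction_options = {
    **mor_options,
    "hoodie.compact.inline": "false",                     # don't compact inside the writer commit
    "hoodie.datasource.compaction.async.enable": "true",  # background compaction for streaming writes
    "hoodie.compact.inline.max.delta.commits": "10",      # schedule every N delta commits (assumed value)
}
```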
Mitigation: Use Hudi's record-level DELETE API for GDPR. Verify physical removal post-delete. Test the deletion workflow in CI.
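A minimal deletion sketch, assuming the `users` table from above; depending on table config, the partition path and precombine fields may also be required on the delete DataFrame:

```python
# Right-to-be-forgotten: issue a record-level delete, then verify.
to_forget = spark.createDataFrame([("u-42",)], ["user_id"])

(
    to_forget.write.format("hudi")
    .options(**hudi_options)
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save("s3a://lake/users")
)

# A snapshot read should no longer return the record. Note: physical removal
# from older file versions completes only after cleaning/compaction runs.
remaining = spark.read.format("hudi").load("s3a://lake/users").filter("user_id = 'u-42'")
assert remaining.count() == 0
```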
Mitigation: Use a catalog with concurrency control (Glue, or Hive Metastore with locking). Don't rely on a file-system-only catalog for production multi-writer setups.
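A sketch of the relevant write options, using optimistic concurrency control with a DynamoDB lock provider (Zookeeper and Hive Metastore providers also exist); the lock table and region values are assumptions:

```python
# Multi-writer safety: optimistic concurrency plus an external lock provider.
multi_writer_options = {
    **hudi_options,
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",  # required with OCC
    "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi-locks",  # assumed lock table
    "hoodie.write.lock.dynamodb.region": "us-east-1",  # assumed region
}
```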
Mitigation: Validate cross-engine compatibility on representative tables. Pin compatible Hudi versions across writers + readers.
DeltaStreamer (Hudi) ingests Debezium CDC events. Record-level upserts maintain Postgres state in the lakehouse. Analytical readers (Trino, Spark) query a consistent snapshot.
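DeltaStreamer itself is launched as a Spark job; the same pattern can be sketched in plain Structured Streaming. This is not DeltaStreamer's code, and the broker, topic, and Debezium payload fields are assumptions:

```python
# Structured Streaming sketch of the CDC pattern: read Debezium JSON from
# Kafka, upsert into the Hudi table so it mirrors the Postgres source.
from pyspark.sql import functions as F

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
    .option("subscribe", "pg.public.users")            # assumed Debezium topic
    .load()
)

# Debezium envelope: the post-image lives under payload.after.
# Delete events (after == null) need separate handling; omitted in this sketch.
val = F.col("value").cast("string")
events = raw.select(
    F.get_json_object(val, "$.payload.after.user_id").alias("user_id"),
    F.get_json_object(val, "$.payload.after.email").alias("email"),
    F.get_json_object(val, "$.payload.ts_ms").alias("updated_at"),
)

(
    events.writeStream.format("hudi")
    .options(**hudi_options)  # upsert keeps the lakehouse in step with the source
    .option("checkpointLocation", "s3a://lake/checkpoints/users")
    .outputMode("append")
    .start("s3a://lake/users")
)
```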
Hudi's native DELETE API handles right-to-be-forgotten. Audit trail in timeline. Compliance team can verify physical deletion.
Iceberg simpler + broader engine support. Use Hudi when CDC IS the value proposition.
This analysis is AI-generated using the INPACT and GOALS frameworks from "Trust Before Intelligence." Scores and assessments are algorithmic and may not reflect the vendor's complete capabilities. Always validate with your own evaluation.