Open table format for huge analytic datasets. Apache-2.0. Schema evolution, hidden partitioning, time travel, multi-engine compatibility (Spark, Trino, Flink, Snowflake, Databricks). Industry-standard lakehouse format alongside Delta Lake and Hudi.
Apache Iceberg is the open table format that won the lakehouse format wars: Apache-2.0, the broadest engine support (Spark, Trino, Flink, Snowflake, Databricks, BigQuery, Redshift, Athena), and the most mature schema-evolution, time-travel, and hidden-partitioning story among the table formats. Pick Iceberg when you're building or modernizing a lakehouse stack and want a vendor-neutral table format with the broadest tooling reach. Pick Delta Lake if Databricks is the primary engine; pick Hudi if streaming/CDC is the primary use case. The market has consolidated around Iceberg for greenfield deployments.
Iceberg's trust posture is rooted in its design: tables are not directories of Parquet files; they are explicit, metadata-tracked snapshots with full transaction history. From a Trust Before Intelligence lens, this is the quietest but deepest trust-relevant feature in modern data architecture. Every write produces a new snapshot; the lineage of what changed when is queryable. Time-travel queries enable forensic analysis of data state at any point. Schema evolution is principled (additive by default; renames don't break readers, because columns are tracked by ID rather than by name). The compliance-relevant property: GDPR right-to-be-forgotten can be implemented as a partition rewrite plus snapshot expiration, with the audit trail captured in metadata. Iceberg makes the lakehouse genuinely auditable in a way that 'directories of Parquet' never was.
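What this looks like in practice, as a minimal sketch: assume a Spark session wired to an Iceberg catalog named `lake` and a hypothetical `db.orders` table (names are placeholders, not from this analysis).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Full transaction history: every snapshot, when it was committed,
# and which operation (append, overwrite, delete) produced it.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM lake.db.orders.snapshots
    ORDER BY committed_at
""").show(truncate=False)

# Time travel: reproduce the exact table state at a past timestamp or snapshot.
spark.sql("SELECT * FROM lake.db.orders TIMESTAMP AS OF '2024-01-01 00:00:00'").show()
spark.sql("SELECT * FROM lake.db.orders VERSION AS OF 5481374097509968580").show()
```

The snapshots metadata table is ordinary SQL, so the audit trail is queryable by the same engines that query the data; no side-channel log is required.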
Read latency depends on engine + manifest pruning quality; sub-second on filtered queries with good partition design. Cap rule N/A — engine determines latency, not format.
Engine-agnostic SQL via Spark/Trino/Flink/Snowflake/etc. The format itself isn't a query language, but every major engine speaks Iceberg natively. Cap rule N/A.
Catalog-level ACLs (Glue, Polaris, Nessie, REST). ABAC via Polaris policies + engine-level identity propagation. Cap rule N/A.
Multi-cloud, multi-engine, multi-catalog. Strongest A in the catalog — Iceberg is engine-neutral by design.
Manifest files capture exhaustive metadata: schema history, partition spec, snapshot lineage, file statistics. Strongest C in catalog. Cap rule N/A.
Time travel via snapshot IDs. Manifest list inspection. Partition stats. The history and snapshots metadata tables expose the full transaction log. Cap rule N/A.
G1=Y (catalog ACLs), G2=Y (snapshot history is full audit), G3=N, G4=Y (snapshot rollback + time-travel reproduce historical state), G5=N, G6=N. 3/6 -> 3.
O1=Y (engine metrics + catalog metadata), O2=N, O3=N (cost is engine-attributed not format-attributed), O4=Y (snapshot-diff alarms), O5=N, O6=N. 2/6 -> 3 lenient (snapshot inspection enables MTTD).
A1=Y (manifest pruning enables sub-second filtered reads), A2=Y (writes produce new snapshots; readers see a consistent view), A3=N, A4=Y (multi-region replication via S3/GCS), A5=Y (in production at hyperscale: Apple, Netflix, LinkedIn), A6=Y (parallel scans via manifest + file partitioning). 5/6 -> 4.
L1=N, L2=N, L3=N, L4=N, L5=Y (column metadata + partition spec + tags as terminology), L6=N. 1/6 -> 2 strict; bumped to 3 for schema-evolution history depth.
S1=Y, S2=Y, S3=Y (ACID via an atomic metadata-pointer swap in the catalog; concurrent writers serialize through optimistic concurrency), S4=Y (schema enforced + evolution rules principled), S5=N (Iceberg doesn't validate content quality), S6=Y (snapshot diffs flag anomalies). 5/6 -> 5 (peer with PostgreSQL on data integrity).
Best suited for
Compliance certifications
Apache Iceberg (the format) holds no compliance certifications — compliance lives in the underlying storage substrate (S3, GCS, ADLS — see those rows for FedRAMP/HIPAA/SOC 2 posture) and the catalog provider (Polaris/Glue/Nessie). For regulated workloads, choose substrate + catalog combinations matching your compliance gate; Iceberg the format inherits substrate compliance.
Use with caution for
Choose Delta when Databricks is the primary engine. Iceberg wins on multi-engine reach + vendor-neutral governance; Delta wins on Databricks-native integration + UniForm (which now provides Iceberg interop). The market has consolidated around Iceberg for greenfield; Delta is typical when Databricks is already chosen.
Choose Hudi when streaming/CDC ingestion is the primary use case; Hudi optimizes for incremental processing more aggressively than Iceberg. Iceberg wins on broad engine support; Hudi wins on CDC-driven workloads.
Snowflake is a managed warehouse that increasingly speaks Iceberg natively. Iceberg wins on data sovereignty + multi-engine; Snowflake wins on managed compliance + zero-ops + integrated governance. Many stacks use both: Snowflake as compute, Iceberg as storage format.
Role: L1 Lakehouse Format. Open table format for analytical datasets stored on object storage (S3, GCS, ADLS). Read/written by L1 query engines and warehouses; managed via L1 catalogs (Glue, Polaris, Nessie, REST).
Upstream: Receives writes from L2 streaming (Flink, Spark Structured Streaming, Kafka Connect), L3 transformation (dbt, SQLMesh via Spark/Trino), and direct application writes via Iceberg client libraries.
Downstream: Read by L1 query engines (Trino, Spark, DuckDB) and warehouses (Snowflake, BigQuery, Redshift, Athena via Iceberg connectors). Catalog metadata feeds L3 lineage tools (DataHub, Marquez via OpenLineage).
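For orientation, a minimal write-side sketch under the same assumptions (Spark session, hypothetical catalog `lake`, hypothetical table names). `PARTITIONED BY (days(ts))` is the hidden partitioning mentioned above: the partition value is derived from `ts`, and readers filter on `ts` without ever naming a partition column.

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        event_id bigint,
        user_id  bigint,
        ts       timestamp,
        payload  string
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Any engine attached to the same catalog sees the same table. A plain
# filter on ts prunes partitions via manifest statistics; the query never
# mentions partitioning.
spark.sql("SELECT count(*) FROM lake.db.events WHERE ts >= DATE '2024-06-01'").show()
```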
Mitigation: Use a catalog with proper concurrency control (Glue, Polaris, Nessie, REST). Don't use a file-system-only catalog in production. Test concurrent-write scenarios.
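A sketch of the catalog wiring, assuming a Spark deployment; the catalog name `lake` and the endpoint URI are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Iceberg SQL extensions enable ALTER TABLE ... and CALL ... procedures.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    # A REST (or Glue/Nessie) catalog arbitrates concurrent commits;
    # a file-system-only catalog can't guarantee atomic swaps on all stores.
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com")
    .getOrCreate()
)
```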
Mitigation: Schedule expire_snapshots + remove_orphan_files maintenance. Document the retention policy + ops cadence. Monitor orphan-file ratio.
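Both maintenance jobs are built-in Iceberg Spark procedures; a sketch, with the retention values as placeholders to be replaced by your documented policy.

```python
# Expire snapshots past the retention window, keeping the last 30 for rollback.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'db.orders',
        older_than => TIMESTAMP '2024-06-01 00:00:00',
        retain_last => 30
    )
""")

# Remove files no snapshot references (aborted jobs, failed writes).
spark.sql("CALL lake.system.remove_orphan_files(table => 'db.orders')")

# Compact small files so manifest pruning stays effective.
spark.sql("CALL lake.system.rewrite_data_files(table => 'db.orders')")
```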
Mitigation: For GDPR: rewrite affected partitions to delete data; expire snapshots within the GDPR window; verify cleanup. Build an automated GDPR fire drill that exercises the deletion path.
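A sketch of the deletion path such a fire drill should exercise (hypothetical `lake.db.users` table and subject ID). The point to internalize: a DELETE alone is logical; the rows physically persist in old data files until those files are rewritten and every snapshot that can reach them expires.

```python
from datetime import datetime, timezone

# 1. Logical delete: a new snapshot without the data subject's rows.
spark.sql("DELETE FROM lake.db.users WHERE user_id = 42")

# 2. Rewrite data files so the deleted rows no longer exist on disk.
spark.sql("CALL lake.system.rewrite_data_files(table => 'db.users')")

# 3. Expire every snapshot that could still reach the old files.
now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL lake.system.expire_snapshots(
        table => 'db.users',
        older_than => TIMESTAMP '{now}'
    )
""")

# 4. Verify the subject is gone from the current table state.
assert spark.sql(
    "SELECT count(*) FROM lake.db.users WHERE user_id = 42"
).first()[0] == 0
```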
Mitigation: Plan catalog migration capability — REST catalog spec is the portability bridge. Test cross-catalog reads if multi-catalog is a future need.
Mitigation: Establish schema-change governance. CI gate: schema PRs must validate downstream consumer compatibility. Test type promotions on representative consumers before applying.
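What the governed changes look like in Iceberg DDL, assuming the same hypothetical `lake.db.orders` table and the Iceberg SQL extensions; a CI gate can replay exactly these statements against a staging copy and run each downstream consumer's smoke queries before merge.

```python
# Additive change: metadata-only, safe by default, no data files rewritten.
spark.sql("ALTER TABLE lake.db.orders ADD COLUMN discount_pct double")

# Rename: also metadata-only, and existing readers keep working because
# Iceberg tracks columns by ID rather than by name.
spark.sql("ALTER TABLE lake.db.orders RENAME COLUMN amount TO gross_amount")

# Widening type promotion (int -> bigint) is allowed; narrowing is rejected.
spark.sql("ALTER TABLE lake.db.orders ALTER COLUMN quantity TYPE bigint")
```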
Iceberg gives you ACID + schema evolution + time travel that Hive lacks. Multi-engine support means one format across Spark batch + Trino interactive + Flink streaming.
Snapshot-based audit trail + principled deletion enable compliance-grade lifecycle. Time-travel queries support forensic analysis. Catalog ACLs + Polaris policies provide authorization.
Delta Lake is Databricks-native and may be operationally simpler. Iceberg via UniForm is fine but adds complexity if Databricks is the only engine.
This analysis is AI-generated using the INPACT and GOALS frameworks from "Trust Before Intelligence." Scores and assessments are algorithmic and may not reflect the vendor's complete capabilities. Always validate with your own evaluation.