Open-source metadata platform for data discovery, observability, and governance.
DataHub provides open-source metadata management for data discovery and lineage tracking, addressing the L3 trust problem of 'what data exists and how it connects.' The key tradeoff is comprehensive metadata coverage versus operational complexity: it captures rich lineage and business glossaries, but it requires significant engineering investment to keep metadata fresh and prevent metadata drift.
L3 catalog failures trigger the S→L→G cascade — bad metadata (Solid) corrupts semantic understanding (Lexicon) which creates governance violations (Governance). When DataHub's metadata becomes stale or incomplete, agents confidently retrieve wrong data or miss critical context. Since trust is binary, a catalog that's 85% accurate is effectively 0% trusted — agents must be certain they're accessing the right entities with complete lineage visibility.
Elasticsearch-based search typically delivers sub-500ms metadata queries for <10M entities, but lineage graph traversal can hit 3-8 seconds for complex datasets with 6+ hops. No intelligent caching for common lineage queries. Cold starts after service restart take 30-45 seconds for index warming.
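Where repeated lineage lookups dominate latency, a small client-side cache with a time-to-live is one possible workaround. The sketch below is an assumption-laden illustration, not part of DataHub: `fetch` stands in for whatever slow lineage call a client makes, and the `TTLCache` name and 300-second default are hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Cache results of a slow lookup for a fixed number of seconds."""

    def __init__(
        self,
        fetch: Callable[[str], List[str]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # hypothetical slow lineage call, keyed by URN
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be exercised in tests
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, urn: str) -> List[str]:
        now = self._clock()
        hit = self._store.get(urn)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]        # entry is still fresh: skip the slow call
        result = self._fetch(urn)
        self._store[urn] = (now, result)
        return result
```

A cache like this trades latency for staleness, so the TTL should stay well below the metadata freshness the consuming agents need.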
GraphQL API with decent documentation, but requires learning DataHub's entity model (datasets, schemas, charts, dashboards). No natural language search — users must understand URN syntax for programmatic access. Python SDK helps but adds complexity for non-engineering teams.
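To make the URN requirement concrete, here is a minimal standard-library sketch of building and parsing a dataset URN in the `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)` form. The helper names `build_dataset_urn` and `parse_dataset_urn` are hypothetical and written for illustration; where the Python SDK's own URN helpers are available, those are the safer choice.

```python
import re
from typing import NamedTuple


class DatasetUrn(NamedTuple):
    platform: str
    name: str
    env: str


# Platform has no commas; the name may, so the environment is matched last.
_URN_RE = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([A-Z_]+)\)$"
)


def build_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Assemble a dataset URN string from its three parts."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def parse_dataset_urn(urn: str) -> DatasetUrn:
    """Split a dataset URN into platform, name, and environment."""
    match = _URN_RE.match(urn)
    if match is None:
        raise ValueError(f"not a dataset URN: {urn!r}")
    return DatasetUrn(*match.groups())
```

For example, `build_dataset_urn("snowflake", "analytics.public.orders")` yields `urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)`.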
RBAC-only with platform policies and resource-level permissions. No ABAC support for contextual access (time, location, purpose). Missing column-level security. Service account authentication via JWTs but no fine-grained delegation controls. Cannot evaluate who/what/when/where/why access patterns required for HIPAA minimum-necessary compliance.
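The gap between RBAC and ABAC can be illustrated with a toy decision function. This is a sketch under stated assumptions: the roles, purposes, and business-hours rule are invented for illustration and do not describe DataHub's policy model or any HIPAA-approved rule set.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Request:
    role: str
    action: str
    purpose: str   # hypothetical attribute, e.g. "treatment" or "marketing"
    hour_utc: int  # 0-23, hypothetical time-of-access attribute


# RBAC: the decision depends only on role and action.
ROLE_GRANTS: Dict[str, Set[str]] = {
    "analyst": {"read"},
    "steward": {"read", "edit"},
}


def rbac_allows(req: Request) -> bool:
    return req.action in ROLE_GRANTS.get(req.role, set())


def abac_allows(req: Request) -> bool:
    """RBAC plus contextual attributes (illustrative rules only)."""
    return (
        rbac_allows(req)
        and req.purpose in {"treatment", "operations"}
        and 8 <= req.hour_utc < 18
    )
```

An analyst reading data for marketing passes `rbac_allows` but fails `abac_allows`, which is the kind of purpose-based distinction a role-only model cannot express.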
Multi-cloud deployment possible but complex — requires managing Kafka, Elasticsearch, MySQL/PostgreSQL separately. No built-in drift detection for metadata quality degradation. Plugin architecture exists but limited ecosystem. Migration between instances requires custom ETL processes.
Strong lineage tracking across systems with column-level and table-level relationships. Integrates with 50+ sources including Snowflake, Databricks, Airflow, dbt. However, real-time lineage updates require custom development — most integrations are batch-based with 4-24 hour lag.
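Given the batch lag described above, a simple freshness check can flag entities whose metadata has aged past a threshold. The sketch assumes you can obtain a last-ingestion timestamp per dataset from your own ingestion logs; the function name `stale_datasets` and the 4-hour default are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_datasets(
    last_ingested: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=4),
) -> List[str]:
    """Return dataset names whose last ingestion is older than max_age.

    Results are ordered oldest ingestion first, so the worst offenders lead.
    """
    stale = [
        (ts, name) for name, ts in last_ingested.items() if now - ts > max_age
    ]
    return [name for ts, name in sorted(stale)]
```

The returned list can feed whatever alerting channel is already in place; the threshold should match the freshness the downstream consumers actually require.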
Comprehensive audit logs for metadata changes with user attribution and timestamps. Lineage graphs show data flow but no cost attribution per query or downstream impact analysis. Activity feed shows 'who changed what' but missing 'why' context and business impact assessment.
No automated policy enforcement — purely metadata management without data access controls. Cannot prevent unauthorized data access, only document it. No integration with policy engines like OPA. Data classification exists but doesn't trigger governance actions automatically.
Basic metrics on metadata freshness and user adoption. No LLM-specific observability for semantic search quality or embedding drift. Grafana dashboards available but require manual setup. Missing cost attribution and performance optimization insights for downstream AI workloads.
No formal SLA offered. OSS deployment requires self-managed high availability with typical RTO 15-30 minutes depending on infrastructure. Acryl Cloud offers better availability but no published SLA. Disaster recovery depends entirely on underlying infrastructure choices.
Supports business glossaries and tags for terminology consistency. Can import ontologies but no native SNOMED CT or ICD-10 support — requires custom development. Strong schema evolution tracking but limited semantic relationship modeling between business concepts.
5+ years in market with LinkedIn provenance. Thousands of stars on GitHub and an active community. However, major version upgrades have introduced breaking changes (0.8.x to 0.9.x required significant migration effort). Data quality depends on ingestion pipeline reliability, with no built-in data quality guarantees.
Best suited for
Compliance certifications
SOC2 Type II available for Acryl Cloud. No HIPAA BAA, FedRAMP, or PCI DSS certifications. ISO 27001 not specified.
Use with caution for
Splink excels at probabilistic entity resolution but lacks comprehensive metadata management. Choose Splink when entity deduplication accuracy is critical, DataHub when you need full data catalog capabilities with basic entity linking.
AWS Entity Resolution offers managed-service simplicity and better availability SLAs but locks you into the AWS ecosystem. Choose AWS for simple entity matching needs, DataHub for complex multi-cloud lineage tracking.
Tamr provides ML-powered data mastering with better entity confidence scores but at significantly higher cost. Choose Tamr for complex master data management, DataHub for metadata-focused catalog needs.
Role: Provides business context and entity relationships to transform raw data into semantically meaningful information for agent consumption
Upstream: Ingests metadata from L1 storage systems (Snowflake, S3, databases) and L2 pipelines (Airflow, dbt, Kafka) via batch connectors
Downstream: Feeds semantic context to L4 RAG systems and L5 governance tools for policy enforcement and access control decisions
Mitigation: Implement real-time CDC from source systems and monitor metadata freshness with alerting
Mitigation: Layer L5 governance tools like Privacera or Immuta for fine-grained access controls
Mitigation: Automated schema validation and drift detection with L6 observability tools
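The schema validation and drift-detection mitigation can be sketched as a comparison between an expected schema and the one observed at ingestion time. This is a minimal illustration assuming schemas are available as plain `{column: type}` mappings; the function name `schema_drift` and the report keys are hypothetical, and a real deployment would more likely rely on an observability tool.

```python
from typing import Dict, List


def schema_drift(
    expected: Dict[str, str], observed: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare two {column: type} mappings and report the differences."""
    return {
        # Columns present now that the expected schema does not list.
        "added": sorted(observed.keys() - expected.keys()),
        # Columns the expected schema lists that have disappeared.
        "removed": sorted(expected.keys() - observed.keys()),
        # Columns present in both whose declared type has changed.
        "retyped": sorted(
            col
            for col in expected.keys() & observed.keys()
            if expected[col] != observed[col]
        ),
    }
```

Any non-empty list in the report is a candidate alert; whether a change counts as breaking is a policy decision left to the caller.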
RBAC-only permissions cannot enforce minimum-necessary access controls required by HIPAA. Missing patient data lineage tracking for consent management.
Strong audit trails and lineage tracking support SOX requirements, but requires additional governance layer for access controls and data classification enforcement.
Batch-based metadata updates create 4-24 hour lag for product catalog changes, breaking real-time recommendation accuracy.
This analysis is AI-generated using the INPACT and GOALS frameworks from "Trust Before Intelligence." Scores and assessments are algorithmic and may not reflect the vendor's complete capabilities. Always validate with your own evaluation.