Open-source SQL-first data transformation framework. The OSS Python package that powers dbt Cloud — same compilation, testing, documentation, and lineage. Does NOT include dbt Cloud's Semantic Layer, IDE, hosted scheduler, or observability suite. Apache-2.0 license.
dbt Core is the open-source SQL-first transformation framework that powers dbt Cloud — same compilation, testing, documentation, and lineage engine, distributed as the Apache 2.0 Python package. It's THE OSS standard for analytics engineering: most data teams either use dbt Core directly via their own orchestrator (Airflow, Dagster, Prefect, GitHub Actions) or use dbt Cloud (the managed wrapper) on top of it. Choosing Core over Cloud is a commitment to operate the orchestration yourself in exchange for full control and zero managed-service dependency. The 2-point GOALS gap to dbt Cloud reflects two missing features: the Semantic Layer and the Cloud-only observability suite — both of which can be approximated with OSS alternatives if needed.
dbt Core's defining trust property is **transformation transparency through source control**. Where commercial transformation tools (Looker LookML in semantic mode, AtScale, Informatica) hide transformation logic in proprietary metadata stores, dbt Core puts every model definition, test, and macro in plain text in a Git repo. Every transformation is reviewable, auditable, version-controlled, and reversible. Combined with dbt's tests (which execute as SQL against the warehouse, not against simulated data), this gives data agents a substrate where 'why does this column have this value?' can always be answered by reading the model file. The trust trade-off: dbt Core is build-time only — it doesn't enforce at runtime, so warehouse-level access controls and freshness monitoring still need to be in place. dbt's tests catch broken assumptions; they don't prevent broken queries from running.
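As a concrete sketch of why the tests are auditable (model and column names here are hypothetical), a dbt test is declared in plain YAML in the repo and compiles to a SQL query that executes against the warehouse:

```yaml
# models/schema.yml — hypothetical model; every test compiles to warehouse SQL
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

The `not_null` test compiles to roughly `select * from fct_orders where order_id is null` and fails if any rows come back — the assertion runs against real warehouse data, not a simulation.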
Same as dbt Cloud — execution speed is determined by the warehouse, not dbt itself. dbt orchestration overhead is tens of milliseconds per model; the actual SELECT or CREATE TABLE runs at warehouse speed. Cap rule N/A. Compilation is fast (sub-second for typical projects), and parallel execution via --threads scales linearly to warehouse-cluster limits.
Jinja + SQL templates are the well-known mental model for data transformation. {{ ref('upstream_model') }}, {{ source('schema', 'table') }}, and {{ var('environment') }} feel natural to anyone who's written SQL. Macros provide reusable abstractions without leaving the SQL paradigm. Cap rule N/A — Jinja is a standard templating language, not a 'proprietary query language' in the methodology sense.
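A minimal model sketch (table and column names are illustrative, not from any real project) showing the three Jinja constructs together:

```sql
-- models/fct_orders.sql — hypothetical mart model
select
    o.order_id,
    c.customer_name,
    o.amount
from {{ ref('stg_orders') }} as o          -- ref() builds the DAG edge
join {{ ref('stg_customers') }} as c
  on o.customer_id = c.customer_id
-- var() lets the same project compile differently per environment
where o.order_date >= '{{ var("backfill_start", "2024-01-01") }}'
```

A staging model upstream would read raw data via `{{ source('raw_shop', 'orders') }}`; dbt resolves every ref/source to a fully qualified warehouse name at compile time.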
RBAC at warehouse level (Snowflake roles, BigQuery IAM, Postgres GRANT); dbt orchestrates but doesn't enforce its own permission model. Cap rule applied: 'RBAC-only without ABAC -> cap at 3.' For ABAC over transformed data, push policy enforcement to L5 (OPA, Cedar) or use warehouse RLS at L1 (e.g., Postgres RLS, Snowflake row access policies).
Runs on any orchestrator: GitHub Actions for CI/CD, Airflow for production scheduling, Dagster for asset-aware orchestration, Prefect for workflow management, Kubernetes CronJobs for simple cases, even cron + bash. Cap rule N/A. dbt Cloud (A=4) pins you to their managed runtime — dbt Core gives you orchestration sovereignty.
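As one hedged example of the CI/CD path (repo layout, adapter choice, and secret names are all assumptions — adapt to your warehouse):

```yaml
# .github/workflows/dbt.yml — hypothetical workflow; assumes profiles.yml
# is provided via DBT_PROFILES_DIR or a checked-in ci profile
name: dbt-ci
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      - run: pip install dbt-snowflake   # pick the adapter for your warehouse
      - run: dbt build --target ci       # models + tests, fails the PR on errors
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
```

The same invocation ports unchanged to Airflow's BashOperator, a Kubernetes CronJob, or plain cron — the orchestrator only needs Python and warehouse credentials.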
Many warehouse adapters: Snowflake, BigQuery, Redshift, Postgres, Spark, Databricks, DuckDB, Trino, MS SQL Server, Synapse, Materialize, dozens more via community packages. Cross-warehouse models work. Cap rule N/A.
Run results metadata is comprehensive (every model run logs duration, rows affected, errors, dependencies), and dbt's lineage graph provides full transformation transparency. But no native cost attribution — dbt doesn't know how much each model run costs. Cap rule N/A. For cost transparency, integrate with warehouse-side accounting (Snowflake QUERY_HISTORY, BigQuery INFORMATION_SCHEMA) or a layer like SELECT.dev.
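A hedged sketch of warehouse-side cost attribution on Snowflake (assumes you've configured dbt's query tagging so each query carries the model name; columns are per Snowflake's ACCOUNT_USAGE views):

```sql
-- Approximate per-model runtime over the last week from Snowflake query history.
-- Assumes query_tag is set to the dbt model name via dbt's query-tag config.
select
    query_tag                        as dbt_model,
    count(*)                         as runs,
    sum(total_elapsed_time) / 1000   as total_seconds
from snowflake.account_usage.query_history
where query_tag is not null
  and start_time >= dateadd(day, -7, current_timestamp())
group by 1
order by total_seconds desc;
```

Elapsed time is a proxy, not a dollar figure; multiplying by warehouse credit rates (or using a tool like SELECT.dev) gets you closer to true cost per model.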
G1=N (no runtime ABAC — dbt is build-time only, runtime authorization is the warehouse's responsibility), G2=Y (run logs cover every transformation execution with full input/output metadata), G3=Y (PR review IS HITL — dbt's Git-first workflow means every transformation change goes through human review before merge), G4=Y (dbt is built around model versioning — sources, exposures, packages all carry semantic versioning), G5=N (no AI threat modeling in transformation scope), G6=Y (deployment-mapped compliance via the warehouse — Snowflake's SOC 2 covers transformation outputs, Postgres on RDS covers HIPAA via BAA). 4/6 -> 4.
O1=Y (run metrics: rows, duration, success rates exposed via run_results.json), O2=N (no native distributed tracing — though manifest.json + run_results.json give run-level visibility), O3=N (LLM cost tracking N/A; warehouse query cost is the analog and isn't tracked natively), O4=Y (run failures surface immediately via nonzero exit codes and run_results.json statuses; integrates with PagerDuty, Slack, etc.), O5=Y (freshness checks built-in via dbt source freshness — explicit drift detection at the data layer), O6=Y (lineage graph IS explainability — every column in every model can be traced back through dependencies). dbt Core lacks Cloud's enhanced observability suite (continuous CI metadata, semantic layer telemetry). 4/6 -> 4.
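The freshness checks behind O5 are declared per source; a minimal sketch (source name, field, and thresholds are illustrative):

```yaml
# models/sources.yml — hypothetical source; thresholds are examples, tune per SLA
version: 2
sources:
  - name: raw_shop
    schema: raw
    loaded_at_field: _loaded_at          # timestamp column the loader writes
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
```

`dbt source freshness` compares max(_loaded_at) against these thresholds and reports warn/error per table — drift detection without any extra tooling.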
A1=Y (sub-2s p95 — dbt operational metadata commands are fast), A2=Y (freshness checks built-in for data freshness SLA enforcement), A3=Y (warehouse cache: Snowflake, BigQuery materialized views, etc., reduce repeat-query costs), A4=Y (CI/CD reliability — dbt Core in GitHub Actions or equivalent achieves >99.9% pipeline reliability with proper testing), A5=N (rarely 10x-load-tested by most teams — capacity planning is the warehouse's responsibility), A6=Y (--threads parallel execution scales linearly to warehouse cluster limits). 5/6 -> 4.
L1=Y (dbt models normalize entities — same customer across silos becomes one canonical model), L2=Y (sources + exposures + tests document the data dictionary, lenient interpretation — the sources file IS the glossary), L3=N (no NL disambiguation — dbt is SQL-first, not natural-language-first), L4=N (no continuous learning — dbt is rule-based transformation), L5=Y (dbt enforces canonical naming via project conventions, schema YAML, naming patterns), L6=Y (PR review IS human evaluation of every transformation change). dbt Core lacks the Semantic Layer (Cloud+ tier feature) — that's the 1-point gap to dbt Cloud's L=5. 4/6 -> 4.
S1=Y (dbt tests assert accuracy: not_null, unique, accepted_values, custom assertions), S2=Y (dbt tests check completeness via not_null and source freshness), S3=Y (cross-system consistency enforced via ref() and source() — single source of truth per model), S4=Y (schema tests + dbt-expectations + dbt-utils provide schema validation), S5=Y (dbt's test framework as 3-stage gate: source tests + intermediate model tests + final mart tests), S6=N (anomaly detection via integrations like dbt-expectations or external tools, not native). 5/6 -> 4.
Best suited for
Compliance certifications
The dbt Core project itself does not hold compliance certifications. Compliance comes from: (a) dbt Cloud (the commercial offering holds SOC 2 and has signed BAAs with enterprise customers), (b) the warehouse it runs against (Snowflake HIPAA BAA, BigQuery FedRAMP, etc.), (c) the orchestrator hosting it (e.g., Airflow on AWS Multi-AZ + AWS BAA). dbt Core itself processes only metadata; transformation outputs land in the warehouse, where compliance applies.
Use with caution for
Choose dbt Cloud when you want the IDE, hosted scheduler, semantic layer, CI integrations, and observability suite as a managed service. dbt Core wins on cost (free), orchestration flexibility (any orchestrator), and zero managed-service dependency. dbt Cloud wins on team productivity (web IDE, hosted docs, Slack integration) and comes with the Semantic Layer that Core lacks. Most teams pick Core for cost, then upgrade to Cloud when team size makes the productivity tools worth it.
Choose Cube when you need a semantic layer specifically — Cube is purpose-built for metric definitions and consumption APIs. dbt Core wins on transformation breadth (Cube isn't a transformation tool), but Cube wins on the Semantic Layer use case that dbt Core doesn't directly support (you'd need MetricFlow or dbt Cloud).
Choose Looker when you need a BI tool with built-in semantic modeling, dashboards, and end-user consumption. dbt Core wins on cost and OSS posture; Looker wins on end-user-facing analytics (dbt docs is for engineers). They're complementary: dbt for transformation, Looker for consumption — many stacks have both.
Choose AtScale when you need a Universal Semantic Layer that abstracts queries across multiple warehouses and BI tools. dbt Core wins on transformation use cases; AtScale wins on the multi-warehouse universal-semantic-layer use case. Different scope — most teams pick dbt for transformation regardless of whether they also use AtScale.
Role: L3 transformation engine. Compiles SQL templates against the warehouse, runs tests, produces lineage. The build-time substrate for analytics engineering.
Upstream: Receives raw data from L2 ingestion (Fivetran, Airbyte, custom CDC) into source schemas. Configuration via profiles.yml + dbt_project.yml + sources.yml.
Downstream: Outputs transformed marts to consumption: BI tools (Looker, Tableau, Superset) at L3+, agent retrieval pipelines at L4 (RAG over transformed data), L5 governance (filtered views per role).
Mitigation: Run dbt build in every PR via CI/CD before merge. Use dbt deferral to test against production state without full warehouse rebuild. Block merges if tests fail. dbt's test framework is foundational — skipping it eliminates 80% of dbt's value.
Mitigation: Integrate dbt's run_results.json with PagerDuty, Slack, or your incident management. Alert on test failures, not just run failures. Use dbt source freshness in CI to catch upstream data delays before they break downstream models.
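Alerting on test failures (not just run failures) means reading statuses out of run_results.json. A minimal sketch of the parsing step — field names follow dbt's run_results artifact schema, the example payload is invented, and the actual PagerDuty/Slack call is left out:

```python
def failed_nodes(run_results: dict) -> list[dict]:
    """Return unique_id and message for every errored or failed node in a
    dbt run_results.json payload. In CI you would load the dict with
    json.load(open("target/run_results.json"))."""
    bad = {"error", "fail"}  # "fail" = test failure, "error" = run error
    return [
        {"unique_id": r["unique_id"], "message": r.get("message")}
        for r in run_results.get("results", [])
        if r.get("status") in bad
    ]

# Invented payload, trimmed to the fields used above
example = {
    "results": [
        {"unique_id": "model.shop.stg_orders", "status": "success",
         "message": None},
        {"unique_id": "test.shop.not_null_orders_id", "status": "fail",
         "message": "Got 3 results, configured to fail if != 0"},
    ]
}
print(failed_nodes(example))
# → [{'unique_id': 'test.shop.not_null_orders_id', 'message': 'Got 3 results, configured to fail if != 0'}]
```

Feed the returned list to whatever pages you — the key design choice is alerting on the `fail` status, which a run-level exit-code check alone would also catch, but without telling you which assertion broke.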
Mitigation: dbt is build-time only. Authorization happens at the warehouse layer (Snowflake roles, BigQuery IAM, Postgres RLS). Build dbt models that produce filtered views per role; rely on warehouse RLS for runtime enforcement.
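One hedged pattern for the "filtered views per role" approach (Snowflake syntax assumed; model, role, and column names are hypothetical — warehouse RLS or masking policies are the more robust option):

```sql
-- models/customers_masked.sql — hypothetical view; enforcement is still the
-- warehouse's job, this just shapes what each role sees
{{ config(materialized='view') }}
select
    customer_id,
    case when current_role() = 'PII_READER'
         then email
         else '***masked***'
    end as email,
    region
from {{ ref('stg_customers') }}
```

dbt builds the view once; the warehouse evaluates current_role() at query time, so the runtime decision never depends on dbt being in the loop.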
Mitigation: Code-review every custom macro. Test macros via dbt's unit testing framework. Prefer well-known packages (dbt-utils, dbt-expectations) over custom code where possible. Sanitize any string concatenation that touches column names or table names.
Mitigation: Use dbt's env_var() Jinja macro for all secrets — never hardcode. Configure profiles.yml outside the repo (~/.dbt/profiles.yml or via env vars in CI). Use Vault, AWS Secrets Manager, or equivalent at L5 for warehouse credentials. Never commit profiles.yml to the repo.
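A minimal profiles.yml sketch showing the env_var() pattern (project, account, and schema names are illustrative):

```yaml
# ~/.dbt/profiles.yml — lives outside the repo; secrets come from the environment
my_project:
  target: prod
  outputs:
    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: analytics
      warehouse: transforming
      schema: marts
      threads: 8
```

In CI, the same file works unchanged because the orchestrator injects the variables from Vault, AWS Secrets Manager, or the CI secret store.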
Mitigation: Use the same dbt project (Git tag) in staging and production. Tag releases. Use dbt deferral so staging tests run against production-like state. Document the deployment cadence and stick to it.
Every dbt model in Git, every transformation reviewed via PR, every change traceable. Tests assert PHI columns are correctly masked. Run logs satisfy HIPAA access logging when warehouse + orchestrator are BAA-covered. Best-fit use case.
Same dbt project compiles for both warehouses (with adapter-specific macros where needed). Single source of transformation logic. Avoids dual-team-with-dual-tools fragmentation.
dbt is batch-oriented. For real-time, use Materialize, RisingWave, or Apache Flink. dbt can complement (real-time for hot path, dbt for daily reconciliation) but isn't the right primary tool.
dbt Core is fast to start (pip install, dbt init), but the productivity tools (IDE, scheduler, CI integrations) need separate setup. dbt Cloud gets you to a first model hours faster. Choose dbt Core when you have orchestration already; Cloud when you don't.
This analysis is AI-generated using the INPACT and GOALS frameworks from "Trust Before Intelligence." Scores and assessments are algorithmic and may not reflect the vendor's complete capabilities. Always validate with your own evaluation.