OQEN

Blog

What breaks before a schema change? How to check first

Learn how to check downstream dependencies, views, and database objects before schema changes so teams can avoid broken pipelines and BI reports.


You're about to drop a column. Or rename a table, or change a type from varchar to text. The change itself takes thirty seconds. The fear is everything you can't see: the nightly job three teams over that still selects that column, the dashboard nobody opens until the end of the quarter, the stored procedure that quietly depends on the old name.

That fear is rational. In a mature warehouse, a single table can be read by dozens of pipelines, models, and reports, and most of those dependencies are not written down anywhere. Here's how to find them before you ship — starting with the manual methods that work, where they fall short, and what to do about the gaps.

What can break during a schema change?

A schema change rarely breaks one thing — it breaks a category of things. Before you alter or drop anything, it helps to know the shapes the breakage takes:

  • Readers — queries, views, and jobs that SELECT the column or table. These fail loudly (missing column) or, worse, silently (a join that now returns nothing).
  • Writers — anything that INSERTs or UPDATEs the object. A dropped or retyped column here can corrupt data, not just error.
  • Views and derived models — a view, dbt model, or materialized table built on top of the object; the breakage cascades to everything built on them.
  • Jobs and pipelines — scheduled DAGs, ingestion scripts, and exports, often on a nightly or end-of-quarter cadence you won't notice today.
  • Dashboards and reports — BI tiles that hit the column directly or through a model.
  • Exports and reverse ETL — syncs that push the column out to a CRM or another system, where the failure surfaces far from the warehouse.
  • Embedded SQL — queries built inside application or pipeline code that the database never catalogs (more on these below).

The first three the database can often tell you about. The last four are where surprises live.

What counts as a downstream dependency?

A downstream dependency is anything that consumes the object you're about to change. It's worth splitting them, because they don't fail the same way:

  • Readers consume the data — break them and a query errors or returns wrong results.
  • Writers produce the data — break them and you can corrupt or lose rows.
  • Derived objects (views, models, snapshots) are both: they read the source and become a new dependency themselves, so the impact chains outward.
  • Jobs are readers or writers wrapped in a schedule — the breakage shows up at run time, not deploy time.
  • External consumers (exports, reverse-ETL, downstream apps) sit outside the warehouse entirely, so nothing in the database knows they exist.

"Find the dependencies" really means "find all five kinds" — including the ones the warehouse can't see.

How to find dependencies before a schema change?

The method is three passes, cheapest first.

Step 1: Search the obvious places

Before anything clever, grep. Your transformation code (dbt models, SQL files), your orchestration DAGs (Airflow, Dagster), and your BI definitions are all text. Searching them for the table or column name catches the majority of direct references in a few minutes:

  • the table name, and the column name on its own
  • common aliases the table is given in joins
  • both quoted and unquoted forms

This is fast and it's worth doing every time. It will not, however, find everything — which is the whole problem.

Step 2: Ask the database what it knows

Your warehouse already tracks some dependencies for you. information_schema (and equivalents like Postgres pg_depend or Snowflake's object_dependencies view) can list views and constraints that reference an object. dbt's own lineage graph and exposures will show you model-to-model dependencies and the dashboards you've declared.

These are reliable for what they cover. The catch is that they only cover what the database or your tooling can see directly — and a lot of real dependencies live outside that view.

Step 3: Find the dependencies the easy methods miss

This is where most "I thought I checked" incidents come from. Grep and information_schema both go blind in the same places:

  • SQL embedded in application or pipeline code. A query built inside a Python task, an Airflow PythonOperator, or a Spark job is just a string until it runs. Your warehouse never sees it, so it isn't in information_schema. Grep finds it only if the name appears as a plain literal.
  • Dynamic SQL. Anything assembled with an f-string, string concatenation, or .format()f"select {cols} from {table}" — can't be matched by a simple text search, because the table name isn't in the source as a literal.
  • Cross-system hops. The column feeds a model, which feeds an export, which feeds a reverse-ETL sync into a downstream tool. Each hop is in a different repo or system, so no single search sees the whole chain.

The honest way to handle these today is tedious: read the pipeline code by hand, trace each string back to the table, and follow the chain across systems. On a small change that's doable. On a real migration touching core tables, it's hours of careful work — and the cost of missing one reference is a 2 a.m. page.

A practical pre-change checklist

Whatever tooling you use, the method is the same:

  1. List the direct readers — grep code + query information_schema/pg_depend.
  2. List the indirect readers — trace SQL embedded in pipeline code, including dynamic SQL, and follow cross-system hops one level at a time.
  3. Classify each dependency — does it read the column, or write to it? Reads usually just break; writes can corrupt.
  4. Decide per dependency — update, deprecate with a grace window, or confirm it's dead.
  5. Stage the change — add-then-backfill-then-remove beats a single destructive ALTER.

The slow part is steps 1–2: building the full picture of what actually depends on the thing you're about to change.

Safer rollout patterns for schema changes

Once you know what depends on the object, stage the change so a mistake is recoverable instead of destructive. The pattern that holds up:

  1. Add, don't replace. Introduce the new column or table alongside the old one rather than altering in place.
  2. Backfill. Populate the new shape from the old, and verify the numbers match.
  3. Update consumers. Point readers and writers at the new shape one at a time, not all at once.
  4. Run in parallel. Keep both shapes live long enough to confirm nothing downstream regressed.
  5. Monitor usage. Watch whether anything still reads the old object — that's how you catch the dependency you missed.
  6. Deprecate, then remove. Only after a quiet grace window do you drop the old column or table.

A destructive single ALTER is fast, but it has no undo. Add-backfill-remove turns one irreversible change into a sequence of reversible ones.

Where a tool helps

This is the part automation is genuinely good at: parsing your SQL and your pipeline code to map which tables and columns get read and written, then showing you the blast radius of a proposed change — including the embedded and dynamic SQL that grep can't see.

That's what we built OQEN to do. You can paste a SQL or Python pipeline snippet and see its lineage and impact map in a few seconds. It's advisory — it shows you the dependencies and leaves the decision to you; it never changes your schema. If you're staring down a migration and want a second pair of eyes on what depends on the table you're about to touch, try it on your own SQL.

However you do it — by hand or with help — the rule holds: find every reader before you change the thing they read. The thirty-second ALTER is never the risky part. The part you couldn't see is.

FAQ

What is a downstream dependency in a database?

Anything that consumes the object — but the useful split is by failure mode, not by label. Readers (queries, views, models, dashboards) break or silently return wrong results; writers (INSERT/MERGE targets, upserts) can corrupt data; derived objects are both and chain the impact outward. The ones that bite are outside the catalog's view: SQL embedded in application code, reverse-ETL syncs, and any object whose name is built dynamically — none of which appear in information_schema.

How do I find column dependencies before dropping a column?

Three layers, because none is complete on its own. Catalog: information_schema.view_column_usage, Postgres pg_depend, or Snowflake ACCOUNT_USAGE.OBJECT_DEPENDENCIES (hours of latency, object-level only). Project graph: dbt's compiled DAG and exposures. Code: grep is necessary but blind to dynamic SQL and to column use hidden behind SELECT *. For column-level certainty you need something that parses the SQL and resolves * against the schema, since most catalogs track object-level edges, not columns.

What happens when you drop a column with dependencies?

It depends on the engine and how the dependency was written. Postgres pins view columns at creation, so a bare DROP COLUMN errors unless you CASCADE — which then silently drops the dependent views. Snowflake and BigQuery are laxer and let you drop a column a view still references, so it fails only at the next query. Writers are the dangerous case: a dropped or retyped target can fail a load or, under permissive casts, corrupt rows instead of erroring. SELECT * readers are the quiet ones — they don't error, they just change shape downstream.

What is the safest way to drop or rename a table or column?

Expand–contract (parallel change): add the new shape, dual-write or backfill, migrate readers one at a time behind a view or alias, verify parity, then contract. Treat a rename as add-new + backfill + cut over, never an in-place RENAME that every consumer must change at once. Drive the "safe to remove yet?" call from access history (Snowflake ACCESS_HISTORY, BigQuery INFORMATION_SCHEMA.JOBS), not from belief — leftover readers show up there.


Related reading