Blog

Find the SQL hidden inside your Python and Airflow pipelines

Q: How do I keep lineage traceable when SQL lives in separate .sql files?

Externalizing SQL into .sql files keeps it reviewable, but it splits each dependency in two: the table names are in the file, the task that runs it is in the Python. Lineage then hinges on resolving which task loads which file — trivial when it's open("daily.sql"), hard when the path is built from a variable or a templated directory. Keep it shallow (one task → one named file, paths as literals) and it stays traceable; hide file selection behind a dynamic loader and you've recreated the dynamic-SQL problem one level up.

June 14, 202610 min read

See how SQL hides inside Python files, Airflow operators, sensors, and ETL jobs, and why this matters for accurate lineage and impact analysis.

Ask a data warehouse what depends on a table and it can only tell you about the SQL it actually ran: views, constraints, declared dependencies. But in most teams, a huge share of the real dependencies never reach the warehouse as clean SQL. They live as strings inside Python — in an Airflow PythonOperator, a Spark job, a custom ingestion script — and get assembled and executed at runtime.

That's why lineage from a catalog tool so often looks suspiciously thin: it can't see the queries that were built in code. If you're trying to understand a pipeline before you change something, you have to go find that hidden SQL yourself. Here's how.

How SQL and Python intersect in modern data pipelines

In most data platforms, Python and SQL split the work: SQL does the actual data manipulation, and Python is the frame around it — orchestrating tasks, building queries, passing connection context, and moving results between steps. An ETL job written in Python and SQL might read a config, assemble a SELECT, hand it to the warehouse, then load the result somewhere else.

That division is productive, but it's also where lineage goes to hide. The dependency — which table this job reads, which one it writes — lives at the seam between the two languages: a SQL string held by Python. The warehouse sees the query only at execution time and keeps no record of how it was built, so the connection between "this Python task" and "that table" exists nowhere you can grep for reliably.

Where the SQL actually hides

In a typical Python pipeline, table dependencies show up in three forms, in rough order of how easy they are to find:

Explicit helpers. Calls like read_table("raw.orders") or write_table("mart.daily_orders"). The table name is right there as a literal string.
Embedded SQL literals. A query written as a string constant and handed to a cursor or a PostgresOperator/SnowflakeOperator: "INSERT INTO mart.x SELECT ... FROM raw.y". The names are present, just wrapped in a longer string.
Dynamic SQL. The query is built at runtime — f"SELECT * FROM {schema}.{table} WHERE dt = '{run_date}'", or string concatenation, or .format(...). The table name may not exist anywhere as a literal at all.

Forms 1 and 2 are findable with patience. Form 3 is where most "I didn't know that job read this table" surprises come from.

Technical approaches to managing SQL inside Python

How the SQL is stored in your code changes how findable its dependencies are. Two patterns dominate.

Reading raw SQL files

A common, clean pattern keeps SQL out of the Python entirely: queries live in .sql files and the code loads them at runtime — open("queries/daily_orders.sql").read() and hand the string to the driver. It keeps SQL reviewable and reusable. For lineage, it adds one hop: the table names are in the .sql files, but which task runs which file is back in the Python. You have to read both halves to connect a query to the job that issues it.

Dynamic generation and query builders

The other pattern builds queries in code — a Python SQL query builder, an ORM expression, or plain f-strings and .format(). Parameterized queries (WHERE id = %s) are good practice for safety, and they keep the structure visible. But once table or column names themselves come from variables, config, or a builder call, the name isn't in the source as a literal anymore — and text search for it returns nothing.

The manual approach (and where it runs out)

The honest first pass is text search. Grep your DAGs and Python modules for table names and for SQL keywords (from, join, insert into, update, merge). For explicit helpers and embedded literals, this gets you most of the way.

Then you hit the limits:

Dynamic SQL defeats text search. If the table name is injected from a variable, a config, or an f-string, searching for the name finds nothing — the name isn't in the source.
Task boundaries get lost. A single DAG file can define a dozen tasks. Knowing which task issues which query — and therefore what runs when — means reading the surrounding Python, not just the SQL line.
Indirection. Queries loaded from .sql files, built by helper functions, or parameterized through Jinja templates are scattered across several places.

You can do all of this by hand. On one script it's fine. Across a DAG-heavy platform it's hours of careful reading, and the cost of missing one query is a broken downstream job.

The Airflow complication: operators and sensors

Airflow adds its own wrinkle, because SQL gets embedded in tasks in a few different shapes. A SQL operator (the PostgresOperator/SnowflakeOperator family) carries a query — inline, from a .sql file, or templated with Jinja using the run's context. A SQL sensor does the opposite: it runs a query and waits until the result meets a condition before the DAG continues, so it reads tables too, just to gate the flow.

To map dependencies in an Airflow estate you therefore have to know, per task, which operator or sensor runs which query — and resolve the Jinja templating that fills in dates and table names at run time. The DAG file tells you the wiring; the SQL tells you the tables; you need both, joined, to know what actually touches what.

A more systematic way: parse the Python, not just the text

The reliable approach is to treat the Python as code rather than as text — read its structure (its syntax tree), find the calls and string constants that carry SQL, and tie each one back to the task it lives in. That:

pulls table reads and writes out of read_table / write_table-style calls;
extracts SQL embedded in string constants and parses that for its own reads and writes;
and, crucially, flags the dynamic SQL it can't fully resolve instead of silently dropping it — so an f-string-built query shows up as "there's a dependency here, and it's dynamic," rather than vanishing.

That last point matters more than it sounds. The dangerous failure isn't "the tool missed something" — it's "the tool told you the dependency list was complete when it wasn't." Honest tooling surfaces the unresolved bits so you know where to look manually.

Where this fits

This is exactly what we built OQEN to do: paste a Python pipeline (or a SQL snippet) and it maps the reads and writes — including SQL embedded in the code — and marks the dynamic queries it can't pin down so you can review them. It's advisory; it shows you the dependencies and leaves the decisions to you. If you've got an Airflow task you're not 100% sure about, try it on a pipeline file.

Once you can see the SQL hiding in your Python, the next question is usually the scary one: what actually breaks if you change one of those tables? That's covered in what breaks before a schema change. And if your buried SQL is Oracle PL/SQL, that's its own special kind of hard — more on parsing PL/SQL here.

FAQ

How do I keep lineage traceable when SQL lives in separate .sql files?

Externalizing SQL into .sql files keeps it reviewable, but it splits each dependency in two: the table names are in the file, the task that runs it is in the Python. Lineage then hinges on resolving which task loads which file — trivial when it's open("daily.sql"), hard when the path is built from a variable or a templated directory. Keep it shallow (one task → one named file, paths as literals) and it stays traceable; hide file selection behind a dynamic loader and you've recreated the dynamic-SQL problem one level up.

What is the difference between an Airflow SQL Operator and an Airflow SQL Sensor?

An operator executes SQL as a step — the modern unified one is SQLExecuteQueryOperator; the dialect-specific PostgresOperator/SnowflakeOperator are now deprecated wrappers. A sensor runs a query repeatedly and blocks the DAG until a condition holds; use mode="reschedule" or a deferrable sensor so it frees the worker slot while waiting. Both read tables, so both are real dependencies — the sensor's are just the ones people forget. Inspect the actual SQL through the task's rendered template, since Jinja fills in tables and dates at run time.

Can a standard Python SQL parser library automatically resolve dynamic f-string queries?

Not in general — doing so is equivalent to running the program. The parser sees f"... {table} ..." as a template, not a value; partial evaluation can resolve names from constants or simple config, but anything driven by runtime state (a variable, a lookup, an Airflow Variable) can't be pinned statically. The right behavior is to surface it as an unresolved dynamic dependency — a map that silently omits what it couldn't evaluate is worse than one that flags it.

How can I build a maintainable data pipeline using Python and SQL without losing data lineage?

Bias toward making dependencies statically visible: keep identifiers as literals near the task that uses them, one logical read/write per task, and parameterize values rather than structure. If you must build SQL in code, do it through something parseable (e.g. sqlglot) instead of raw concatenation, and emit run-time lineage via OpenLineage so the dynamic cases static analysis can't reach are captured from execution. Static plus runtime lineage covers what neither does alone.

What are the best practices for handling a Python parameterized SQL query in Airflow?

Bind values; template identifiers deliberately. Parameter binding (%(x)s, :x) protects values and blocks injection but can't parameterize a table or column name — those come through Jinja templating or explicit, validated string building. Keep templated identifiers on an allow-list, never interpolate untrusted input into structure, and confirm what actually ran via the Rendered Template view (or task.render_template_fields). That keeps the query safe and its real dependencies inspectable.

Why does text search fail when searching for SQL queries in Python codebases?

Because grep matches literals, and pipelines rarely store identifiers as literals. The name is assembled at runtime (f-strings, .format(), "schema." + tbl), aliased (FROM orders o), injected from config or an Airflow Variable, or split across a loader and a .sql file. Search also can't separate a live query from a commented-out one or attribute it to the task that issues it. Fine as a first pass, unreliable as a source of truth.

How do I choose the right Python SQL parser library for tracking data lineage?

Judge it on dialect coverage for your actual warehouses (Snowflake/BigQuery/Postgres/T-SQL), column-level resolution (does it expand SELECT * against a schema or stop at table level?), CTE and subquery scope handling, the ability to extract SQL embedded in Python rather than only parse clean SQL, and how clearly it reports what it couldn't resolve. sqlglot is the common parsing/transpilation building block; the lineage layer built on top is where tools diverge most.

What are the main limitations of using a Python SQL parser for automated SQL lineage mapping?

Static parsing can't resolve identifiers that exist only at runtime, can't expand SELECT * or unqualified columns without the live catalog, loses cross-file assembly, and never sees indirection through reverse-ETL or application code. Templating layers (dbt and Airflow Jinja) have to be compiled first, or the parser reads the template instead of the SQL. Use the static map as the backbone and reconcile the gaps with runtime lineage and the catalog — not as a complete answer by itself.