Blog

Why parsing PL/SQL is hard, and how to map what it touches

June 14, 20268 min read

Understand why SQL and PL/SQL parsing is difficult, how parse trees support lineage, and how to map tables, columns, and downstream dependencies.

If you've ever tried to feed a real Oracle package into a lineage or SQL-analysis tool, you've probably watched it choke. Plain SELECT/INSERT parsing is a solved problem; tools handle it all day. PL/SQL is a different animal, and most tools that claim "SQL support" quietly fall apart the moment they meet a CREATE PACKAGE BODY. If you work in an Oracle-heavy shop — and a lot of banks and insurers still do — this is a real and lonely problem.

Here's why it's hard, and what a reliable approach actually has to do.

How a standard SQL parser builds a SQL parse tree

Before getting into why PL/SQL is hard, it helps to see what "parsing SQL" normally means. A SQL parser works in two stages. First, lexing (tokenization) breaks the raw text into tokens — keywords like SELECT and FROM, identifiers (table and column names), operators, and literals. Then parsing arranges those tokens according to the language's grammar into a tree — a SQL parse tree, or abstract syntax tree (AST) — that captures structure: this is a SELECT, these are its columns, this is the FROM clause, that's a join.

That tree is the thing analysis runs on. Once you have it, "which tables does this query read?" becomes a structured question — walk the FROM and JOIN nodes — instead of a text-matching guess. The trouble starts when the language doesn't fit the grammar the parser expects.

Why PL/SQL breaks ordinary SQL parsers

A regular SQL parser expects statements: one query, a semicolon, the next query. PL/SQL isn't shaped like that.

It's procedural, not just declarative. A package body is a program — variables, loops, IF branches, exception handlers, nested procedures and functions — with SQL woven through it. You can't just split on semicolons; semicolons appear inside blocks.
Block boundaries are genuinely ambiguous. Where does a procedure end? END; versus END procedure_name; versus the END package_name; that closes the whole thing — getting this wrong means you either merge two procedures or cut one in half.
The same keywords mean different things. SET could be a transaction control or a session setting. ALTER SESSION is not ALTER TABLE. A naive parser misclassifies these constantly.
Dynamic SQL is everywhere. EXECUTE IMMEDIATE 'INSERT INTO ' || v_table || ' ...' builds a statement at runtime. The table name isn't in the code as a literal — it's in a variable.
Oracle-specific syntax. MERGE, CONNECT BY, BULK COLLECT, conditional compilation flags, LOG ERRORS INTO, and a pile of SQL*Plus directives that aren't really statements at all.

Most tools handle a slice of this and give up on the rest — usually silently, which is the worst outcome, because you can't tell what they missed.

Extracting automated data lineage from SQL queries

Data lineage from a SQL query falls out of that parse tree. Walk the AST and you can separate sources from targets: tables under FROM, JOIN, and USING are reads; the table behind INSERT INTO, UPDATE, or MERGE INTO is a write. Follow column references through the SELECT list and you can push the same source-to-target mapping down to the column level — this output column came from that input column, through this transformation.

Do it across every statement in a script and the individual edges compose into a graph: raw tables feed staging models, which feed marts, which feed reports. That graph is the lineage — and it's only as complete as the parse underneath it, which is exactly why a language the parser chokes on leaves holes in the map.

What a reliable approach has to do

To actually map what a PL/SQL block reads and writes, you need to parse it the way Oracle does — with a real grammar for the language, not pattern-matching:

Walk the real parse tree. Use an actual PL/SQL grammar so procedures, packages, triggers, and anonymous DECLARE ... BEGIN blocks are understood as structures, with correct boundaries.
Classify every statement honestly. Separate data changes (INSERT/UPDATE/MERGE) from definitions (CREATE/ALTER), transaction control (COMMIT/ROLLBACK), access control (GRANT), and session/utility commands — so the dependency map reflects what actually reads and writes data, not noise.
Derive reads vs writes from context. A table in INSERT INTO ... is a write target; the same table in a FROM or a MERGE ... USING is a read. Getting the direction right is what makes the map trustworthy, because writes are the dangerous ones.
Be honest about dynamic SQL. When a statement is built from variables at runtime, a parser fundamentally cannot see inside the string. The right behavior is to flag it as a dynamic dependency that needs a human's eye — not to drop it and pretend the map is complete.

That fourth point is the integrity test for any tool in this space. PL/SQL leans on dynamic SQL heavily, and no parser can fully resolve a table name that only exists as a runtime variable. A tool you can trust tells you where it couldn't be sure.

The importance of SQL lineage visualization

A lineage graph is only useful if a human can read it under pressure. When you're assessing the blast radius of a change, a dense table of source → target rows is hard to reason about; a visual graph of how tables and columns connect lets you see, at a glance, what sits downstream of the object you're about to touch and how far the impact reaches.

That's the practical value of SQL lineage visualization: it turns "trace every dependency by hand" into "look at the picture and find the ones that matter." The map has to be trustworthy first — but once it is, seeing it beats reading it.

Where this fits

We built OQEN on a real Oracle/PL/SQL grammar for exactly this reason: to take a procedure, package, or anonymous block and map the tables it reads and writes — classifying statements correctly and flagging the dynamic SQL it can't pin down rather than hiding it. It's advisory: it shows you the dependency picture and leaves the call to you; it never changes anything. If you've got a gnarly Oracle package you'd like a second read on, try it on a PL/SQL block.

PL/SQL is just one place SQL hides, though. If yours lives inside application code instead of the database, see finding the SQL hidden in Python and Airflow. And once you can see what a block touches, the practical question is what a change would break — that's here.

FAQ

What is a SQL parse tree and why is it important for data lineage?

It's the grammar-structured form of a query — the parser builds a concrete syntax tree, the analyzer walks an AST. But the tree alone isn't enough: lineage also needs a resolution (binding) phase that ties identifiers to real objects — qualifying unqualified columns, expanding SELECT * against the schema, resolving CTE and alias scope. Parsing gives you the shape; binding tells you which actual table or column each node means, and lineage needs both.

How do automated SQL lineage tools extract data flow from a complex SQL query?

They parse to a tree, resolve scopes (CTEs, derived tables, aliases, subqueries), then propagate columns from sources (FROM/JOIN) to targets (INSERT/UPDATE/MERGE), composing per-statement edges into a graph. The hard parts are SELECT * and unqualified columns (need the catalog to resolve), set operations and window functions (column identity across branches), and MERGE (it reads and writes the same table). Whatever can't be resolved should be flagged, not dropped.

Why do standard data lineage tools with SQL support often fail when parsing PL/SQL packages?

Because they're built for statement-list SQL, and PL/SQL is a procedural language: package specs and bodies, nested procedures, cursors, %TYPE/%ROWTYPE, exception handlers, conditional compilation ($IF), and dynamic execution via EXECUTE IMMEDIATE / DBMS_SQL. A grammar that handles SELECT/INSERT chokes on block boundaries and control flow, so tools commonly parse the embedded straight SQL and skip the procedural envelope — which is exactly where many of the reads and writes live.

How does SQL lineage visualization help in database impact analysis?

Impact analysis is a transitive-closure question — not "what reads this column" but "what, several hops downstream, ultimately depends on it." A graph makes that closure legible and lets you trace blast radius and spot the load-bearing nodes far faster than reading dependency rows. Column-level graphs are the ones worth having; object-level graphs over-report, implying impact whenever any column of a table changes.

What features should you look for when choosing data lineage tools with SQL support?

Real grammar coverage for your dialects (including PL/SQL or T-SQL if you have them), column-level resolution with SELECT * expansion, extraction of SQL embedded in code, correct read/write direction including MERGE, compilation of templating layers (dbt/Airflow Jinja) before parsing, and explicit reporting of what it couldn't resolve. That last one is the integrity test — a tool that never says "I'm not sure" is hiding its misses.