
Concepts

Overview

Understanding clpipe comes down to three core concepts:

  1. How we build the graph - From your SQL files to complete lineage
  2. What you can do with it - Table dependencies, pipeline execution, and orchestration
  3. How to document it - Automatic metadata extraction from SQL comments

These concepts work together to give you complete control over your data pipelines.


Core Concepts

From SQL to Lineage Graph

How clpipe works under the hood.

You write SQL files. We parse them into an AST (Abstract Syntax Tree) and extract everything:

  • Tables and dependencies
  • Columns and how they flow
  • Transformations (SUM, JOIN, CASE, etc.)
  • Complete lineage graph

from clpipe import Pipeline

# Point to your SQL
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# Everything is in the Pipeline
# - pipeline.table_graph  # Table dependencies
# - pipeline.columns      # Column-level lineage
# - pipeline.edges        # Lineage relationships

What you'll learn:

  • The three-step process: SQL → AST → Graph
  • What gets extracted from your SQL
  • How to query the lineage graph
  • Why the complete graph matters
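
The parse step can be pictured with a toy extractor that pulls source tables out of a statement with a regex. clpipe builds a full AST with a dialect-aware parser, so this is only a sketch of the idea; the SQL and table names below are made up, and a regex like this breaks on subqueries, CTEs, and quoting:

```python
import re

def source_tables(sql: str) -> set[str]:
    """Rough sketch: collect table names that appear after FROM or JOIN.

    Illustrative only - real lineage extraction needs a proper SQL parser.
    """
    return set(re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE))

# One edge per (target, source) pair is what the table graph is made of.
sql = """
CREATE TABLE daily_revenue AS
SELECT o.user_id, SUM(o.amount) AS total
FROM orders o
JOIN users u ON o.user_id = u.user_id
GROUP BY o.user_id
"""
print(sorted(source_tables(sql)))  # ['orders', 'users']
```

Each parsed statement contributes edges like `daily_revenue → orders` and `daily_revenue → users`; the union of those edges over all your SQL files is the lineage graph.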

Read more →


Table Lineage & Orchestration

What you can do with the graph.

Once you have the graph, you can:

  • Understand dependencies - Which tables depend on which
  • Execute pipelines - Run SQL in topologically sorted order
  • Generate Airflow DAGs - Automatic task generation with dependencies
  • Split large pipelines - Create subpipelines by schedule or team

# Execute locally
results = pipeline.run(executor=my_executor, max_workers=4)

# Or generate Airflow DAG
dag = pipeline.to_airflow_dag(
    executor=my_executor,
    dag_id="my_pipeline",
    schedule="@daily"
)

# Or split by schedule
subpipelines = pipeline.split(
    sinks=[["realtime_tables"], ["daily_tables"]]
)
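
Under the hood, ordered execution is a topological sort of the table graph. A minimal sketch with the standard library (the table names are illustrative; the real pipeline.run additionally parallelizes independent tables up to max_workers):

```python
from graphlib import TopologicalSorter

# Hypothetical table graph: each table maps to the tables it reads from.
graph = {
    "staging_orders": set(),
    "staging_users": set(),
    "daily_revenue": {"staging_orders", "staging_users"},
    "weekly_report": {"daily_revenue"},
}

# static_order() yields every table after all of its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order[-1])  # weekly_report
```

Running SQL in this order guarantees every table's inputs exist before it is built.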

What you'll learn:

  • Table lineage and dependency tracking
  • Synchronous and asynchronous execution
  • Airflow integration
  • Pipeline splitting strategies
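
Splitting can be pictured as taking each sink together with all of its upstream ancestors. A small sketch over a dependency dict (table names are made up; the real pipeline.split also deals with tables shared between subpipelines):

```python
def ancestors(graph: dict, table: str) -> set:
    """All tables a given table depends on, transitively."""
    seen, stack = set(), [table]
    while stack:
        for dep in graph[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

graph = {
    "realtime_tables": {"staging_events"},
    "daily_tables": {"staging_orders"},
    "staging_events": set(),
    "staging_orders": set(),
}

# Each subpipeline is a sink plus everything it needs upstream.
subpipelines = {
    sink: {sink} | ancestors(graph, sink)
    for sink in ["realtime_tables", "daily_tables"]
}
print(sorted(subpipelines["daily_tables"]))  # ['daily_tables', 'staging_orders']
```

The two resulting subpipelines can then be scheduled independently, e.g. one on a realtime cadence and one @daily.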

Read more →


Metadata from Comments

Document your data where you define it.

Add structured metadata directly in your SQL comments:

SELECT
  user_id,  -- User identifier [pii: false]
  email,    -- Email address [pii: true, owner: data-team]
  SUM(revenue) as total  /* Total revenue [tags: metric finance] */
FROM users
GROUP BY user_id, email

clpipe automatically extracts:

  • Column descriptions
  • PII flags (with automatic propagation)
  • Ownership information
  • Tags and custom metadata
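
The bracketed key: value format is simple enough that its core can be sketched with a regex. clpipe's actual parser attaches each comment to the right column in the AST; this sketch only shows how a comment string splits into a description and metadata fields:

```python
import re

def parse_metadata(comment: str) -> tuple[str, dict]:
    """Split 'Description [key: value, key: value]' into its two parts."""
    m = re.match(r"\s*(.*?)\s*\[(.+)\]\s*$", comment)
    if not m:
        return comment.strip(), {}  # plain description, no metadata
    desc, body = m.group(1), m.group(2)
    fields = {}
    for pair in body.split(","):
        key, _, value = pair.partition(":")
        fields[key.strip()] = value.strip()
    return desc, fields

desc, fields = parse_metadata("Email address [pii: true, owner: data-team]")
print(desc)    # Email address
print(fields)  # {'pii': 'true', 'owner': 'data-team'}
```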

# Access metadata via Pipeline API
email_col = pipeline.columns["query.email"]
print(email_col.description)  # "Email address"
print(email_col.pii)          # True
print(email_col.owner)        # "data-team"

# Find columns by metadata
pii_columns = pipeline.get_pii_columns()
metrics = pipeline.get_columns_by_tag("metric")

What you'll learn:

  • Comment format and syntax
  • Supported metadata fields
  • Metadata propagation rules
  • Integration with LLM documentation
  • Best practices
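
Automatic PII propagation can be thought of as a flood fill along lineage edges: any column derived from a PII column is PII too. A minimal sketch over an edge list (the column names are illustrative; the real edges come from pipeline.edges):

```python
from collections import deque

# (source_column, target_column) lineage edges, illustrative names.
edges = [
    ("users.email", "staging.email"),
    ("staging.email", "report.contact_email"),
    ("users.user_id", "report.user_id"),
]

def propagate_pii(edges: list, pii_seeds: set) -> set:
    """BFS downstream: everything derived from a PII column is PII."""
    downstream = {}
    for src, dst in edges:
        downstream.setdefault(src, []).append(dst)
    pii, queue = set(pii_seeds), deque(pii_seeds)
    while queue:
        for dst in downstream.get(queue.popleft(), []):
            if dst not in pii:
                pii.add(dst)
                queue.append(dst)
    return pii

print(sorted(propagate_pii(edges, {"users.email"})))
# ['report.contact_email', 'staging.email', 'users.email']
```

Note that report.user_id stays non-PII: it is derived only from users.user_id, which was never flagged.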

Read more →


Why We Built This

The current state of lineage tracking.

Most data teams track lineage in one of three ways:

  1. Manual documentation - Spreadsheets, wikis, or diagrams that become outdated immediately
  2. Custom-built parsers - Internal tools that break with every SQL dialect change
  3. Vendor platforms - Enterprise solutions with steep learning curves and vendor lock-in

None of these approaches is on the critical path of delivering business value, yet all of them consume significant engineering resources.

Common pitfalls with existing tooling:

  • Runtime-only tracking - Most tools only track lineage during execution, not during development or before deployment
  • Steep learning curve - New abstractions and DSLs to learn before tracking lineage
  • Vendor lock-in - Proprietary formats and APIs that make migration costly
  • High maintenance cost - Requires dedicated infrastructure and ongoing operational overhead
  • Diverted engineering resources - Teams spend time on lineage infrastructure instead of data products
  • Not on the critical path - Lineage tooling doesn't directly contribute to business logic or platform optimization

The problem isn't that lineage is unimportant. The problem is that it shouldn't require this much investment.

We built clpipe to solve this systematically.

By leveraging static analysis on your existing SQL files, you get complete lineage automatically - no infrastructure to maintain, no steep learning curve, no vendor lock-in. Focus your engineering resources on what actually matters: your data products and business logic.


Limitations

We made opinionated decisions to keep clpipe simple and effective.

clpipe is not designed to replace runtime-based lineage systems. Instead, we provide a lightweight alternative that covers most common use cases without the operational overhead. For teams that need runtime statistics, query profiling, or dynamic execution tracking, runtime-based systems remain the better choice.

Runtime Information is Not Tracked

clpipe performs static analysis on SQL files. This means:

  • Unresolved template variables - SELECT * FROM ${table_name} where the variable isn't resolved before analysis
  • Database-side dynamic SQL - SQL generated during execution (stored procedures, database templating)
  • Query statistics - Execution times, row counts, data volumes

This is an intentional design decision.

Static analysis provides significant advantages:

  • Development-time feedback - Catch issues before deployment
  • Fast graph building - No database execution required
  • Version control friendly - Lineage changes tracked in git
  • Pre-deployment validation - Test lineage locally without infrastructure

For most teams, static lineage provides 95% of the value at 5% of the complexity.

Star Notation Requires Database Connection

When your SQL uses SELECT * or SELECT table.*, clpipe needs schema information to expand the columns:

# Requires database connection for star expansion
pipeline = Pipeline.from_sql_files(
    "queries/",
    dialect="bigquery",
    credentials="path/to/credentials.json"
)

Without a connection, star notation is tracked as a single * column in the lineage graph. This is the only case where clpipe needs database access.

Why? Star notation doesn't explicitly list columns in SQL, so we query the database schema to resolve them.
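
The schema lookup behind star expansion can be illustrated with sqlite3 from the standard library (clpipe queries your warehouse's catalog instead, e.g. BigQuery's INFORMATION_SCHEMA; the table here is a made-up stand-in):

```python
import sqlite3

# In-memory stand-in for a warehouse schema catalog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, email TEXT, signup_date TEXT)")

def expand_star(conn: sqlite3.Connection, table: str) -> list:
    """Resolve SELECT * into explicit column names from the catalog."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]  # row[1] is the column name

print(expand_star(conn, "users"))  # ['user_id', 'email', 'signup_date']
```

With the columns resolved, SELECT * FROM users contributes the same per-column lineage edges as an explicit column list would.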


Quick Navigation

New to clpipe? Start with From SQL to Lineage Graph to understand how it works.

Ready to use it? Jump to Table Lineage & Orchestration to learn execution and deployment.

Need to document? Read Metadata from Comments to learn automatic metadata extraction.

Want to try it? Head to Quick Start for hands-on examples.


The Big Picture

graph LR
    A[Your SQL Files] -->|Parse| B[AST]
    B -->|Extract| C[Lineage Graph]
    C --> D[Table Dependencies]
    C --> E[Column Lineage]
    D --> F[Pipeline Execution]
    D --> G[Airflow DAGs]
    E --> H[Metadata Propagation]
    E --> I[Impact Analysis]
    E --> J[LLM Documentation]

Your SQL already describes the lineage. We extract it. You use it.


Next Steps