Build with complete SQL lineage
clpipe (Column-Level Pipeline) delivers complete lineage by default. It parses your SQL once and automatically builds the full graph: tables, columns, transformations, and all dependencies.
Want column-level lineage? It's already built in.
Your SQL Already Contains Everything
You write SQL files. Tables, columns, transformations, joins - it's all there in your code.
We parse it once. You get the complete graph.
```python
from clpipe import Pipeline

# Point to your SQL files
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# That's it. Complete graph built automatically.
```
What's in the graph:
- Tables and their dependencies
- Columns and how they flow
- Transformations (SUM, JOIN, CASE, COALESCE)
- Expressions and calculations
- Metadata from inline comments (descriptions, PII, ownership)
- Everything your SQL already describes
Column lineage built-in. No upgrades. No extra cost.
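For intuition, the resulting graph can be pictured as edges between fully qualified columns. The snippet below is a simplified, hand-written illustration of that structure, not clpipe's internal representation; all names are invented.

```python
# Simplified picture of a column-level lineage graph: each target column
# maps to the source columns that feed it and the transformation applied.
graph = {
    "analytics.revenue.total": {
        "sources": ["raw.orders.amount"],
        "transform": "SUM(amount)",
    },
    "dashboard.metrics.total_revenue": {
        "sources": ["analytics.revenue.total"],
        "transform": "direct",
    },
}

# Table-level lineage falls out of the column edges for free.
tables = {col.rsplit(".", 1)[0] for col in graph}
print(sorted(tables))  # ['analytics.revenue', 'dashboard.metrics']
```

Because every edge carries its transformation, the same structure answers both "where does this come from?" and "what happens to it on the way?".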
Your Lineage, Your Control
Other tools lock your lineage in their platform. We give it to you.
```python
from clpipe import Pipeline

# Your lineage lives in your code
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# It's yours - use it however you want
lineage_json = pipeline.to_json()        # Export anywhere
lineage_df = pipeline.to_dataframe()     # Analyze in pandas
metadata = pipeline.columns              # Query directly

# Integrate with any tool you choose
dag = pipeline.to_airflow_dag(...)       # Airflow
# Custom integration with dbt, Dagster, or your own tools
```
No SaaS lock-in. No forced subscriptions. No vendor dependency.
You choose your tools:
- ✅ Airflow - Built-in support
- ✅ Your orchestrator - Bring your own (dbt, Dagster, Prefect)
- ✅ Your data catalog - Export and integrate
- ✅ Your custom tooling - Full API access
Your lineage stays with you, not locked in our platform.
What You Get Automatically
Question: "Where does dashboard.metrics.total_revenue come from?"
What you know:
dashboard.metrics ← analytics.revenue + dim.customers
analytics.revenue ← staging.orders
staging.orders ← raw.orders
What you DON'T know:
- Which columns flow through?
- What transformations happened?
- Where did PII originate?
- What breaks if I change a column?
Time to find answers: Hours of manual SQL reading
Question: "Where does dashboard.metrics.total_revenue come from?"
With clpipe:
```python
from clpipe import Pipeline

pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# Trace any column instantly
sources = pipeline.trace_column_backward(
    "dashboard.metrics", "total_revenue"
)

# Output:
# raw.orders.amount → SUM(amount) → analytics.revenue.total → dashboard.metrics.total_revenue
```
What you ALSO know (built-in):
- ✅ Ultimate source: raw.orders.amount
- ✅ Transformation: SUM(amount)
- ✅ All intermediate steps
- ✅ PII propagation path
- ✅ Exact impact of changes
Time: 0.003 seconds
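Under the hood, a backward trace is just a reverse walk over column-level edges until a column with no recorded sources remains. Here is a minimal stdlib sketch of that idea; the edge data and the single-source simplification are made up for illustration.

```python
# Each target column maps to (source columns, transformation applied).
edges = {
    "dashboard.metrics.total_revenue": (["analytics.revenue.total"], "direct"),
    "analytics.revenue.total": (["staging.orders.amount"], "SUM(amount)"),
    "staging.orders.amount": (["raw.orders.amount"], "direct"),
}

def trace_backward(column):
    """Return the chain of (column, transform) hops back to the raw source."""
    path = [(column, None)]
    while column in edges:
        sources, transform = edges[column]
        column = sources[0]  # single-source chain, for simplicity
        path.append((column, transform))
    return path

for col, transform in trace_backward("dashboard.metrics.total_revenue"):
    print(col, transform or "")
```

A real trace handles multi-source columns (joins, unions) by branching, but the core operation is this same walk.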
Beyond Traditional Lineage
The complete SQL graph unlocks capabilities far beyond just tracing columns:
Automatic Metadata Propagation
Document where you define. Extract automatically.
```python
# Add metadata in your SQL comments
sql = """
SELECT
    user_id,              -- User ID [pii: false]
    email,                -- Email [pii: true, owner: data-team]
    SUM(revenue) as total /* Total revenue [tags: metric] */
FROM users
GROUP BY user_id, email
"""

pipeline = Pipeline([("query", sql)], dialect="bigquery")

# Metadata extracted automatically
email = pipeline.columns["query.email"]
print(email.description)  # "Email"
print(email.pii)          # True
print(email.owner)        # "data-team"

# Or set programmatically and propagate
pipeline.columns["raw.users.phone"].pii = True
pipeline.propagate_all_metadata()

# Query anywhere
pii_columns = pipeline.get_pii_columns()
metrics = pipeline.get_columns_by_tag("metric")
```
Governance that scales with your pipeline, not against it.
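The inline-comment convention (`description [key: value, ...]`) amounts to a small parsing step. Below is a rough sketch of how such comments could be extracted, not clpipe's actual parser; the function name and regex are illustrative.

```python
import re

def parse_comment(comment):
    """Split 'description [key: value, ...]' into (description, metadata dict)."""
    match = re.match(r"\s*(?P<desc>[^\[]*?)\s*(?:\[(?P<meta>[^\]]*)\])?\s*$", comment)
    desc = match.group("desc")
    meta = {}
    if match.group("meta"):
        for pair in match.group("meta").split(","):
            key, _, value = pair.partition(":")
            meta[key.strip()] = value.strip()
    return desc, meta

desc, meta = parse_comment("Email [pii: true, owner: data-team]")
print(desc)           # Email
print(meta["pii"])    # true
print(meta["owner"])  # data-team
```

Once extracted, the flags ride along the same column edges that carry lineage, which is what makes propagation automatic.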
Precise Impact Analysis
Know exactly what breaks. Not "probably everything".
# Trace forward from ANY column
affected = pipeline.trace_column_forward(
"raw.orders", "amount"
)
# Returns exact downstream impact:
# - staging.orders.total_amount (SUM aggregation)
# - analytics.revenue.total_revenue (direct flow)
# - dashboard.metrics.avg_order_value (calculation input)
Change with confidence.
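Forward impact analysis is the mirror image of a backward trace: a transitive walk over downstream edges, collecting every column reachable from the one you plan to change. A toy sketch, with invented column names:

```python
# downstream[col] lists the columns that read directly from col.
downstream = {
    "raw.orders.amount": ["staging.orders.total_amount"],
    "staging.orders.total_amount": [
        "analytics.revenue.total_revenue",
        "dashboard.metrics.avg_order_value",
    ],
}

def impact(column):
    """Return every column transitively affected by a change to `column`."""
    affected, stack = set(), [column]
    while stack:
        for child in downstream.get(stack.pop(), []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected

print(sorted(impact("raw.orders.amount")))
```

The result is an exact set, not a guess: a column is in it only if an actual SQL edge connects it to the change.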
Context-Aware LLM Documentation
The graph provides context. LLMs provide clarity.
```python
from langchain_openai import ChatOpenAI

pipeline.llm = ChatOpenAI(model="gpt-4")
pipeline.generate_all_descriptions()

# LLM sees:
# - Source columns and types
# - Transformations (SUM, JOIN, CASE)
# - Filter conditions (WHERE clauses)
# - Aggregation logic (GROUP BY)
```
Documentation that understands your data flow.
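What makes the generated docs accurate is the context string itself: the graph already knows sources, transformations, and filters. A rough sketch of assembling that briefing for one column (the field names here are invented, not clpipe's schema):

```python
# Lineage facts the graph would supply for one column.
column = {
    "name": "analytics.revenue.total_revenue",
    "sources": ["raw.orders.amount"],
    "transform": "SUM(amount)",
    "filters": ["status = 'completed'"],
    "group_by": ["customer_id"],
}

# The prompt context handed to the LLM alongside the documentation request.
context = "\n".join([
    f"Column: {column['name']}",
    f"Sources: {', '.join(column['sources'])}",
    f"Transformation: {column['transform']}",
    f"Filters: {'; '.join(column['filters'])}",
    f"Grouped by: {', '.join(column['group_by'])}",
])
print(context)
```

The model never has to guess where a number comes from; the graph states it.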
Text-to-SQL Without Hallucination
The graph provides real context. No more made-up table names.
```python
# Your graph knows:
# - Actual table names
# - Actual column names
# - Actual relationships
# - Actual transformations

# LLM query with context
query = "Show me total revenue by customer"

# Graph provides:
# - Tables: raw.orders, staging.orders, analytics.customer_metrics
# - Columns: customer_id, amount, total_amount
# - Relationships: customer_id joins, amount → SUM(amount)

# Result: Accurate SQL, no hallucinated columns
```
Context-aware queries. Zero hallucination.
Pipeline Execution
Graph → DAG. Deploy anywhere.
```python
# Synchronous execution
results = pipeline.run(executor=my_executor, max_workers=4)

# Async execution
results = await pipeline.async_run(executor=my_async_executor)

# Airflow DAG
dag = pipeline.to_airflow_dag(
    executor=my_executor,
    dag_id="my_pipeline",
    schedule="@daily",
)
```
Write once. Execute everywhere.
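Conceptually, turning the graph into a runnable DAG means executing tables in dependency order. A minimal stdlib sketch using `graphlib` (the table names are invented; a real executor would run each table's query at the marked point):

```python
from graphlib import TopologicalSorter

# Table-level dependencies: each table lists the tables it reads from.
deps = {
    "staging.orders": {"raw.orders"},
    "analytics.revenue": {"staging.orders"},
    "dashboard.metrics": {"analytics.revenue"},
}

# static_order() yields a valid execution order; an orchestrator would
# invoke each table's query here instead of just collecting the names.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['raw.orders', 'staging.orders', 'analytics.revenue', 'dashboard.metrics']
```

Independent branches of the graph have no ordering constraint between them, which is what lets `max_workers` parallelize safely.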
Real-World Impact
- 90% Time Savings: Hours → Seconds for tracing data issues
- 10x Faster Documentation: LLM-powered generation with lineage context
- 100% Compliance: Automatic PII propagation
- ~0% Additional Cost: Point to your existing SQL, get everything
Get Started
```python
from clpipe import Pipeline

# Parse your SQL pipeline
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# The complete graph is built automatically
# - Table lineage: pipeline.table_graph
# - Column lineage: pipeline.columns, pipeline.edges

# Trace any column
sources = pipeline.trace_column_backward("final_table", "metric")
affected = pipeline.trace_column_forward("source_table", "field")
pii_columns = pipeline.get_pii_columns()
```
What Makes clpipe Different?
| Feature | Traditional Tools | clpipe |
|---|---|---|
| Table lineage | ✅ | ✅ |
| Column lineage | ❌ or Premium | ✅ By Default |
| Transformation tracking | ❌ | ✅ |
| Metadata propagation | ❌ | ✅ |
| PII tracking | Manual | Automatic |
| Impact analysis | Vague | Exact |
| LLM documentation | ❌ | ✅ |
| Additional cost | High | ~0% |
Built by data engineers, for data engineers.