
Getting Started

Welcome to clpipe! Get up and running in minutes.


What You'll Learn

This guide will help you:

  1. Install clpipe - Set up in your environment
  2. Parse SQL - Build your first lineage graph
  3. Explore lineage - Understand table and column dependencies
  4. Execute pipelines - Run SQL or generate Airflow DAGs
  5. Use advanced features - Metadata propagation, LLM documentation, pipeline splitting

Installation

Install from GitHub repository:

pip install git+https://github.com/clpipe/clpipe.git

Takes 2 minutes. Full guide →


Quick Start

5-minute tutorial to build your first pipeline:

from clpipe import Pipeline

# Parse SQL
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# Explore lineage
tables = pipeline.table_graph.tables
sources = pipeline.trace_column_backward("table", "column")

# Execute
results = pipeline.run(executor=my_executor)

Full tutorial →


Examples

Real-world use cases:

  • PII Compliance Audit - Find all sensitive data
  • Impact Analysis - Know what breaks before making changes
  • Multi-Schedule Pipelines - Different frequencies for different tables
  • LLM Documentation - Auto-generate descriptions
  • Root Cause Analysis - Trace data issues back to source

See all examples →
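To build intuition for what impact analysis involves, here is a minimal, library-free sketch (not clpipe's API; the table names and edges are invented) that walks a toy table-dependency graph forward to find everything downstream of a changed table:

```python
from collections import deque

# Toy table-dependency graph: table -> tables that read from it.
# These edges are invented for illustration only.
DOWNSTREAM = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["analytics.customer_metrics", "analytics.revenue"],
    "analytics.customer_metrics": ["reporting.dashboard"],
}

def impacted_tables(changed: str) -> set[str]:
    """Breadth-first walk to collect every table downstream of `changed`."""
    seen, queue = set(), deque([changed])
    while queue:
        for nxt in DOWNSTREAM.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(impacted_tables("raw.orders")))
# ['analytics.customer_metrics', 'analytics.revenue',
#  'reporting.dashboard', 'staging.orders_clean']
```

clpipe builds this graph for you by parsing your SQL; the sketch only shows the traversal idea behind "know what breaks before making changes."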


Learning Path

For New Users

  1. Install - Get clpipe set up
  2. Quick Start - Build your first pipeline
  3. Concepts: From SQL to Lineage Graph - Understand how it works

For Existing Projects

  1. Install - Add to your project
  2. Examples - Find your use case
  3. Concepts: Table Lineage & Orchestration - Learn execution patterns

For Production Deployment

  1. Quick Start: Generate Airflow DAG - Create production DAG
  2. Examples: Multi-Schedule Pipeline - Split by frequency
  3. Examples: Team-Based Split - Organize by ownership

Key Features

Built-In Column Lineage

# Trace any column to its source
sources = pipeline.trace_column_backward(
    "analytics.customer_metrics",
    "avg_order_value"
)

# Output: Complete path from raw data to final metric

No configuration required. Column lineage is built automatically when you parse SQL.
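Conceptually, backward tracing is a reverse walk over column-level edges. A minimal stand-in (not clpipe itself; the lineage edges below are invented for illustration) looks like:

```python
# Column-level edges: column -> columns it is derived from.
# Hypothetical lineage for illustration; clpipe extracts this from your SQL.
PARENTS = {
    "analytics.customer_metrics.avg_order_value": [
        "staging.orders.order_total",
        "staging.orders.customer_id",
    ],
    "staging.orders.order_total": ["raw.orders.amount"],
    "staging.orders.customer_id": ["raw.orders.customer_id"],
}

def trace_backward(column: str) -> list[str]:
    """Return every upstream column contributing to `column`, depth-first."""
    sources = []
    for parent in PARENTS.get(column, []):
        sources.append(parent)
        sources.extend(trace_backward(parent))
    return sources

print(trace_backward("analytics.customer_metrics.avg_order_value"))
# ['staging.orders.order_total', 'raw.orders.amount',
#  'staging.orders.customer_id', 'raw.orders.customer_id']
```

The real `trace_column_backward` call does this over the graph parsed from your SQL, with no edges to maintain by hand.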

Automatic Metadata Propagation

# Set PII once at source
pipeline.columns["raw.users.email"].pii = True

# Propagates through entire pipeline
pipeline.propagate_all_metadata()

# Query anywhere
pii_columns = pipeline.get_pii_columns()

Tags, ownership, and PII markers flow automatically through joins and transformations.
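The propagation idea itself is simple: a flag set on a source column flows to every column derived from it. A minimal, library-free sketch (invented edges, not clpipe's implementation):

```python
# Derivation edges: column -> columns computed from it (invented example).
CHILDREN = {
    "raw.users.email": ["staging.users.email_clean"],
    "staging.users.email_clean": ["analytics.user_profile.contact_email"],
}

def propagate_pii(source: str) -> set[str]:
    """Mark the source column and everything derived from it as PII."""
    pii = {source}
    stack = [source]
    while stack:
        for child in CHILDREN.get(stack.pop(), []):
            if child not in pii:
                pii.add(child)
                stack.append(child)
    return pii

print(sorted(propagate_pii("raw.users.email")))
# ['analytics.user_profile.contact_email', 'raw.users.email',
#  'staging.users.email_clean']
```

`propagate_all_metadata()` applies this kind of walk across every tagged column at once, using the lineage graph parsed from your SQL.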

Multiple Execution Modes

# Local execution
results = pipeline.run(executor=my_executor, max_workers=4)

# Async execution
results = await pipeline.async_run(executor=my_async_executor)

# Airflow DAG
dag = pipeline.to_airflow_dag(executor=my_executor, dag_id="pipeline")

Write once, deploy anywhere.

No Vendor Lock-In

# Your lineage lives in your code
lineage_json = pipeline.to_json()

# Export to any format
lineage_df = pipeline.to_dataframe()

# Integrate with any tool

You own the graph. Not locked into any platform.


Common Questions

Do I need to change my SQL?

No. clpipe works with your existing SQL files. No annotations, no special syntax.

What databases are supported?

BigQuery, Snowflake, PostgreSQL, DuckDB, Redshift, and many more.

Can I use it with Airflow?

Yes. Generate Airflow DAGs automatically with pipeline.to_airflow_dag().

Does it work with large pipelines?

Yes. Tested on 1,000+ queries and 10,000+ columns, with parse times under 5 seconds.

Is it open source?

Yes. MIT license. View on GitHub.



Next Steps

Ready to dive in?

Or explore the fundamentals: