Skip to content

How to Write SQL for Perfect Column Lineage

The Core Principle: Column lineage should be obvious from reading the SQL code itself—no need to check table schemas or guess auto-generated column names.

Why it matters: When your SQL is lineage-friendly, anyone (including tools) can trace data flow without database access. This makes code reviews faster, debugging easier, and enables powerful static analysis.


The Problem with Ambiguous SQL

Consider this query:

SELECT
    user_id,
    COUNT(*),
    total,
    id + 1
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY user_id

Quick quiz: What columns are in the output?

Answers:

  • user_id - OK, but from which table? users or orders?
  • COUNT(*) - What's the column name? count, COUNT(*), or something else?
  • total - From orders? Or computed somewhere?
  • id + 1 - What's this column called? id + 1? ?column??

The result: To understand this query, you need to:

  1. Check if both tables have a user_id column
  2. Guess what database auto-names COUNT(*) as
  3. Figure out if total is ambiguous
  4. Know how your database names expressions

This is terrible for code review and impossible for static analysis.


The Solution: 9 Rules for Lineage-Friendly SQL

These rules make column lineage explicit, unambiguous, and traceable without database access.

Note

These are opinionated rules based on our experience building lineage tools and working with complex SQL pipelines. Not all teams will agree with every rule, and that's OK! The goal is to spark conversation about SQL code quality and make lineage analysis easier.

Rule 1: No Duplicate Column Names

Error

Bad:

SELECT
    u.id,
    o.id
FROM users u
JOIN orders o ON u.id = o.user_id

Problem: Which id is which? Most databases error, but even if allowed, it's ambiguous.

Good:

SELECT
    u.id as user_id,
    o.id as order_id
FROM users u
JOIN orders o ON u.id = o.user_id

Why: Each output column has a unique, meaningful name.


Rule 2: Explicit Aliases for Aggregates

Error

Bad:

SELECT
    user_id,
    COUNT(*),
    SUM(total),
    AVG(amount)
FROM orders
GROUP BY user_id

Problem: What are the aggregate columns called? Database-dependent names like count, sum, avg or worse: ?column?

Good:

SELECT
    user_id,
    COUNT(*) as order_count,
    SUM(total) as total_revenue,
    AVG(amount) as avg_order_amount
FROM orders
GROUP BY user_id

Why: Column names are explicit and self-documenting.


Rule 3: Explicit Aliases for Window Functions

Error

Bad:

SELECT
    employee_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)
FROM employees

Problem: What's the window function column called?

Good:

SELECT
    employee_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees

Why: Makes the purpose clear and lineage traceable.


Rule 4: Qualified Columns in JOINs

Warning

Bad:

SELECT
    id,
    name,
    total
FROM users u
JOIN orders o ON u.id = o.user_id

Problem: Which table does each column come from? Requires checking schemas.

Good:

SELECT
    u.id as user_id,
    u.name as user_name,
    o.total as order_total
FROM users u
JOIN orders o ON u.id = o.user_id

Even better (if you like qualified names):

SELECT
    u.id,
    u.name,
    o.total
FROM users u
JOIN orders o ON u.id = o.user_id

Why: Lineage is explicit—you can trace each column to its source table without database access.


Rule 5: Explicit Aliases for CASE Expressions

Warning

Bad:

SELECT
    product_id,
    CASE
        WHEN price > 100 THEN 'expensive'
        WHEN price > 50 THEN 'moderate'
        ELSE 'cheap'
    END
FROM products

Problem: What's this column called? case, CASE, or ?column??

Good:

SELECT
    product_id,
    CASE
        WHEN price > 100 THEN 'expensive'
        WHEN price > 50 THEN 'moderate'
        ELSE 'cheap'
    END as price_category
FROM products

Why: Self-documenting and traceable.


Rule 6: Explicit Aliases for Scalar Subqueries

Warning

Bad:

SELECT
    u.id,
    u.name,
    (SELECT COUNT(*) FROM orders WHERE user_id = u.id)
FROM users u

Problem: What's the subquery column called?

Good:

SELECT
    u.id,
    u.name,
    (SELECT COUNT(*) FROM orders WHERE user_id = u.id) as order_count
FROM users u

Why: Clear intent and traceable lineage.


Rule 7: Explicit Aliases for Literals

Warning

Bad:

SELECT
    'v1.0',
    CURRENT_TIMESTAMP(),
    1
FROM table1

Problem: What are these columns called?

Good:

SELECT
    'v1.0' as version,
    CURRENT_TIMESTAMP() as query_timestamp,
    1 as partition_id
FROM table1

Why: Documents intent and makes lineage clear.


Rule 8: Explicit Aliases for Arithmetic

Warning

Bad:

SELECT
    price * 1.1,
    quantity * price,
    revenue - cost
FROM products

Problem: Auto-generated names like price * 1.1 are ugly and unclear.

Good:

SELECT
    price * 1.1 as price_with_tax,
    quantity * price as line_total,
    revenue - cost as profit
FROM products

Why: Meaningful names that explain the business logic.


Rule 9: Explicit Aliases for Type Casts (Optional)

Warning

Bad:

SELECT
    CAST(created_at AS DATE),
    SAFE_CAST(amount AS INT64)
FROM orders

Problem: Column names might be auto-generated.

Good:

SELECT
    CAST(created_at AS DATE) as order_date,
    SAFE_CAST(amount AS INT64) as amount_int
FROM orders

Why: Explicit intent (though some teams allow unaliased casts if the base column name is preserved).


The Benefits: Why This Matters

1. Better Code Reviews

Before (ambiguous SQL):

SELECT id, COUNT(*), total + 10 FROM ...
Reviewer: "Wait, which id? What's COUNT(*) called? What does total + 10 represent?"

After (lineage-friendly):

SELECT u.id as user_id, COUNT(*) as order_count, total + 10 as adjusted_total FROM ...
Reviewer: "Clear!"

2. Static Analysis Without Database

Lineage-friendly SQL enables:

  • Impact analysis without running queries
  • Automatic documentation generation
  • Safe refactoring (know exactly what depends on what)
  • Faster CI/CD (no database needed for validation)

3. Self-Documenting Code

Compare:

-- Bad: What does this do?
SELECT id, COUNT(*), AVG(x), y + z FROM ...

-- Good: Obvious intent
SELECT
    user_id,
    COUNT(*) as total_orders,
    AVG(order_amount) as avg_order_value,
    revenue + tax as total_income
FROM ...

4. Easier Debugging

When production breaks:

Ambiguous SQL: "Where does this column come from? Let me check 5 different tables..."

Lineage-friendly SQL: "It's orders.total, traced through monthly_totals.sum_total to output.revenue. Found the bug!"


Real-World Example: Before & After

Before (Ambiguous)

WITH base AS (
    SELECT *, COUNT(*) FROM products p JOIN sales s ON p.id = s.product_id GROUP BY p.id
),
enriched AS (
    SELECT id, COUNT(*), SUM(amount) FROM base JOIN users ON id = user_id
)
SELECT * FROM enriched

Problems:

  • Which id in enriched? From base or users?
  • What's COUNT(*) called in each CTE?
  • Which columns pass through SELECT *?
  • Impossible to trace lineage without database

After (Lineage-Friendly)

WITH base AS (
    SELECT
        p.id as product_id,
        p.name as product_name,
        p.category,
        COUNT(*) as sale_count
    FROM products p
    JOIN sales s ON p.id = s.product_id
    GROUP BY p.id, p.name, p.category
),
enriched AS (
    SELECT
        b.product_id,
        b.product_name,
        b.category,
        b.sale_count,
        COUNT(u.id) as user_count,
        SUM(u.amount) as total_user_amount
    FROM base b
    JOIN users u ON b.product_id = u.favorite_product_id
    GROUP BY b.product_id, b.product_name, b.category, b.sale_count
)
SELECT
    product_id,
    product_name,
    category,
    sale_count,
    user_count,
    total_user_amount
FROM enriched

Benefits:

  • Every column explicitly named
  • Clear source for each column
  • Perfect lineage without database access
  • Self-documenting intent

Severity Levels: How Serious Are These Rules?

Not all rules are equally important. Here's how we think about their severity.

Critical Issues (Must Fix)

  • Duplicate column names
  • Missing aliases for aggregates
  • Missing aliases for window functions

Why critical: Breaks most databases or creates genuinely ambiguous output that's impossible to trace.

Important Issues (Should Fix)

  • Unqualified columns in JOINs
  • Missing aliases for CASE expressions
  • Missing aliases for scalar subqueries
  • Missing aliases for literals
  • Missing aliases for arithmetic

Why important: Works in databases, but makes lineage analysis harder and code less readable. Reviewers need to check schemas to understand the query.

Minor Issues (Optional)

  • Missing aliases for simple casts
  • Style consistency

Why minor: Debatable best practices, team preference.


Example: Applying These Rules

Let's walk through fixing a problematic query step-by-step:

Original (problematic):

SELECT
    id,
    COUNT(*),
    total + 10
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY id

Issues identified:

  1. Line 2 - id is unqualified: Which table? users or orders?
  2. Line 3 - COUNT(*) has no alias: What will this column be called?
  3. Line 4 - total + 10 has no alias: Auto-generated name will be unclear
  4. Line 6 - GROUP BY id: Ambiguous, which id?

Fixed version:

SELECT
    u.id as user_id,
    COUNT(*) as order_count,
    o.total + 10 as adjusted_total
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id

What improved:

  • Every column is explicitly named and qualified
  • Lineage is traceable without database access
  • Code review is straightforward
  • Intent is self-documenting

Best Practices Summary

The Golden Rule: Every expression in SELECT should have a clear, explicit name.

Quick checklist for code review:

  • [ ] All aggregates aliased?
  • [ ] All window functions aliased?
  • [ ] Columns qualified in JOINs?
  • [ ] CASE expressions aliased?
  • [ ] Arithmetic/expressions aliased?
  • [ ] No duplicate column names?

Adapting These Rules to Your Team

These rules are opinionated! Here's how to adapt them:

  • Strict teams: Treat all recommendations as mandatory
  • Pragmatic teams: Require fixes for truly ambiguous SQL, encourage fixes for readability issues
  • Lenient teams: Only enforce no-duplicates rule, treat others as suggestions
  • Custom teams: Pick and choose the rules that make sense for your context

The key insight: Whatever rules you choose, be consistent. Consistency makes lineage predictable and code reviewable.


Why This Approach Works

Traditional approach: Write SQL however, then use tools to reverse-engineer lineage from query logs

Lineage-first approach: Write SQL so clearly that lineage is obvious from the code itself

The difference:

  • Speed: Immediate understanding vs hours of analysis
  • Accuracy: Perfect lineage vs best-guess lineage
  • Cost: No tooling needed vs expensive monitoring systems
  • Reliability: Works for all SQL vs breaks on edge cases
  • Collaboration: Anyone can trace lineage vs requires specialized tools

What's Next?

Once your SQL is lineage-friendly:

  • Trace columns through complex pipelines instantly
  • Do impact analysis in seconds, not days
  • Enable auto-maintained data catalogs
  • Power AI-driven SQL generation with perfect context

Start Applying These Principles

Take action today:

  1. Pick one SQL file from your codebase
  2. Review it against these 9 rules
  3. Refactor to make lineage explicit
  4. Share the before/after with your team
  5. Discuss which rules make sense for your context

Get Started with clpipe View on GitHub


Previous: Why Column Lineage Should Be Built-In, Not Bolted-On