How to Write SQL for Perfect Column Lineage
The Core Principle: Column lineage should be obvious from reading the SQL code itself—no need to check table schemas or guess auto-generated column names.
Why it matters: When your SQL is lineage-friendly, anyone (including tools) can trace data flow without database access. This makes code reviews faster, debugging easier, and enables powerful static analysis.
The Problem with Ambiguous SQL
Consider this query:
SELECT
user_id,
COUNT(*),
total,
id + 1
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY user_id
Quick quiz: What columns are in the output?
Answers:
user_id- OK, but from which table?usersororders?COUNT(*)- What's the column name?count,COUNT(*), or something else?total- Fromorders? Or computed somewhere?id + 1- What's this column called?id + 1??column??
The result: To understand this query, you need to:
- Check if both tables have a
user_idcolumn - Guess what database auto-names
COUNT(*)as - Figure out if
totalis ambiguous - Know how your database names expressions
This is terrible for code review and impossible for static analysis.
The Solution: 9 Rules for Lineage-Friendly SQL
These rules make column lineage explicit, unambiguous, and traceable without database access.
Note
These are opinionated rules based on our experience building lineage tools and working with complex SQL pipelines. Not all teams will agree with every rule, and that's OK! The goal is to spark conversation about SQL code quality and make lineage analysis easier.
Rule 1: No Duplicate Column Names
Error
Bad:
Problem: Which id is which? Most databases error, but even if allowed, it's ambiguous.
Good:
Why: Each output column has a unique, meaningful name.
Rule 2: Explicit Aliases for Aggregates
Error
Bad:
Problem: What are the aggregate columns called? Database-dependent names like count, sum, avg or worse: ?column?
Good:
SELECT
user_id,
COUNT(*) as order_count,
SUM(total) as total_revenue,
AVG(amount) as avg_order_amount
FROM orders
GROUP BY user_id
Why: Column names are explicit and self-documenting.
Rule 3: Explicit Aliases for Window Functions
Error
Bad:
SELECT
employee_id,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)
FROM employees
Problem: What's the window function column called?
Good:
SELECT
employee_id,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees
Why: Makes the purpose clear and lineage traceable.
Rule 4: Qualified Columns in JOINs
Warning
Bad:
Problem: Which table does each column come from? Requires checking schemas.
Good:
SELECT
u.id as user_id,
u.name as user_name,
o.total as order_total
FROM users u
JOIN orders o ON u.id = o.user_id
Even better (if you like qualified names):
Why: Lineage is explicit—you can trace each column to its source table without database access.
Rule 5: Explicit Aliases for CASE Expressions
Warning
Bad:
SELECT
product_id,
CASE
WHEN price > 100 THEN 'expensive'
WHEN price > 50 THEN 'moderate'
ELSE 'cheap'
END
FROM products
Problem: What's this column called? case, CASE, or ?column??
Good:
SELECT
product_id,
CASE
WHEN price > 100 THEN 'expensive'
WHEN price > 50 THEN 'moderate'
ELSE 'cheap'
END as price_category
FROM products
Why: Self-documenting and traceable.
Rule 6: Explicit Aliases for Scalar Subqueries
Warning
Bad:
Problem: What's the subquery column called?
Good:
Why: Clear intent and traceable lineage.
Rule 7: Explicit Aliases for Literals
Warning
Bad:
Problem: What are these columns called?
Good:
Why: Documents intent and makes lineage clear.
Rule 8: Explicit Aliases for Arithmetic
Warning
Bad:
Problem: Auto-generated names like price * 1.1 are ugly and unclear.
Good:
SELECT
price * 1.1 as price_with_tax,
quantity * price as line_total,
revenue - cost as profit
FROM products
Why: Meaningful names that explain the business logic.
Rule 9: Explicit Aliases for Type Casts (Optional)
Warning
Bad:
Problem: Column names might be auto-generated.
Good:
Why: Explicit intent (though some teams allow unaliased casts if the base column name is preserved).
The Benefits: Why This Matters
1. Better Code Reviews
Before (ambiguous SQL):
Reviewer: "Wait, whichid? What's COUNT(*) called? What does total + 10 represent?"
After (lineage-friendly):
Reviewer: "Clear!"2. Static Analysis Without Database
Lineage-friendly SQL enables:
- Impact analysis without running queries
- Automatic documentation generation
- Safe refactoring (know exactly what depends on what)
- Faster CI/CD (no database needed for validation)
3. Self-Documenting Code
Compare:
-- Bad: What does this do?
SELECT id, COUNT(*), AVG(x), y + z FROM ...
-- Good: Obvious intent
SELECT
user_id,
COUNT(*) as total_orders,
AVG(order_amount) as avg_order_value,
revenue + tax as total_income
FROM ...
4. Easier Debugging
When production breaks:
Ambiguous SQL: "Where does this column come from? Let me check 5 different tables..."
Lineage-friendly SQL: "It's orders.total, traced through monthly_totals.sum_total to output.revenue. Found the bug!"
Real-World Example: Before & After
Before (Ambiguous)
WITH base AS (
SELECT *, COUNT(*) FROM products p JOIN sales s ON p.id = s.product_id GROUP BY p.id
),
enriched AS (
SELECT id, COUNT(*), SUM(amount) FROM base JOIN users ON id = user_id
)
SELECT * FROM enriched
Problems:
- Which
idin enriched? Frombaseorusers? - What's
COUNT(*)called in each CTE? - Which columns pass through
SELECT *? - Impossible to trace lineage without database
After (Lineage-Friendly)
WITH base AS (
SELECT
p.id as product_id,
p.name as product_name,
p.category,
COUNT(*) as sale_count
FROM products p
JOIN sales s ON p.id = s.product_id
GROUP BY p.id, p.name, p.category
),
enriched AS (
SELECT
b.product_id,
b.product_name,
b.category,
b.sale_count,
COUNT(u.id) as user_count,
SUM(u.amount) as total_user_amount
FROM base b
JOIN users u ON b.product_id = u.favorite_product_id
GROUP BY b.product_id, b.product_name, b.category, b.sale_count
)
SELECT
product_id,
product_name,
category,
sale_count,
user_count,
total_user_amount
FROM enriched
Benefits:
- Every column explicitly named
- Clear source for each column
- Perfect lineage without database access
- Self-documenting intent
Severity Levels: How Serious Are These Rules?
Not all rules are equally important. Here's how we think about their severity.
Critical Issues (Must Fix)
- Duplicate column names
- Missing aliases for aggregates
- Missing aliases for window functions
Why critical: Breaks most databases or creates genuinely ambiguous output that's impossible to trace.
Important Issues (Should Fix)
- Unqualified columns in JOINs
- Missing aliases for CASE expressions
- Missing aliases for scalar subqueries
- Missing aliases for literals
- Missing aliases for arithmetic
Why important: Works in databases, but makes lineage analysis harder and code less readable. Reviewers need to check schemas to understand the query.
Minor Issues (Optional)
- Missing aliases for simple casts
- Style consistency
Why minor: Debatable best practices, team preference.
Example: Applying These Rules
Let's walk through fixing a problematic query step-by-step:
Original (problematic):
Issues identified:
- Line 2 -
idis unqualified: Which table?usersororders? - Line 3 -
COUNT(*)has no alias: What will this column be called? - Line 4 -
total + 10has no alias: Auto-generated name will be unclear - Line 6 -
GROUP BY id: Ambiguous, whichid?
Fixed version:
SELECT
u.id as user_id,
COUNT(*) as order_count,
o.total + 10 as adjusted_total
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id
What improved:
- Every column is explicitly named and qualified
- Lineage is traceable without database access
- Code review is straightforward
- Intent is self-documenting
Best Practices Summary
The Golden Rule: Every expression in SELECT should have a clear, explicit name.
Quick checklist for code review:
- [ ] All aggregates aliased?
- [ ] All window functions aliased?
- [ ] Columns qualified in JOINs?
- [ ] CASE expressions aliased?
- [ ] Arithmetic/expressions aliased?
- [ ] No duplicate column names?
Adapting These Rules to Your Team
These rules are opinionated! Here's how to adapt them:
- Strict teams: Treat all recommendations as mandatory
- Pragmatic teams: Require fixes for truly ambiguous SQL, encourage fixes for readability issues
- Lenient teams: Only enforce no-duplicates rule, treat others as suggestions
- Custom teams: Pick and choose the rules that make sense for your context
The key insight: Whatever rules you choose, be consistent. Consistency makes lineage predictable and code reviewable.
Why This Approach Works
Traditional approach: Write SQL however, then use tools to reverse-engineer lineage from query logs
Lineage-first approach: Write SQL so clearly that lineage is obvious from the code itself
The difference:
- Speed: Immediate understanding vs hours of analysis
- Accuracy: Perfect lineage vs best-guess lineage
- Cost: No tooling needed vs expensive monitoring systems
- Reliability: Works for all SQL vs breaks on edge cases
- Collaboration: Anyone can trace lineage vs requires specialized tools
What's Next?
Once your SQL is lineage-friendly:
- Trace columns through complex pipelines instantly
- Do impact analysis in seconds, not days
- Enable auto-maintained data catalogs
- Power AI-driven SQL generation with perfect context
Start Applying These Principles
Take action today:
- Pick one SQL file from your codebase
- Review it against these 9 rules
- Refactor to make lineage explicit
- Share the before/after with your team
- Discuss which rules make sense for your context
Get Started with clpipe View on GitHub
Previous: Why Column Lineage Should Be Built-In, Not Bolted-On