Introduction

SQL joins are a fundamental concept when working with relational databases. They allow you to combine data from multiple tables based on a related column. In this blog, we will walk through different types of SQL joins using DuckDB and Python, all within a Colab-compatible notebook. We will use synthetic customer and order data, then explore inner joins, left joins, right joins, full outer joins, and even self joins, showing how to extract matching and non-matching records between tables.

Open In Colab

Setting Up: Importing Libraries and Synthesizing Data

We begin by importing the necessary libraries and creating some sample data for customers and orders. This step simulates real-world scenarios where data exists in separate tables, but we need to combine it for analysis.

import duckdb
import pandas as pd

# Synthesizing customer data
customers_data = {
    'customer_id': [1, 2, 3, 4],
    'customer_name': ['Alice', 'Bob', 'Charlie', 'David'],
    'location': ['New York', 'Los Angeles', 'Chicago', 'Los Angeles']
}

# Synthesizing orders data
orders_data = {
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 4, 5],
    'amount': [250, 450, 320, 150, 200]
}

# Creating data frames
customers_df = pd.DataFrame(customers_data)
orders_df = pd.DataFrame(orders_data)

# Display the data frames
print("Customers Data:")
display(customers_df)

print("Orders Data:")
display(orders_df)

# Loading data into DuckDB
duckdb.sql("CREATE TABLE customers AS SELECT * FROM customers_df")
duckdb.sql("CREATE TABLE orders AS SELECT * FROM orders_df")

Inner Join: Returning Only Matching Records

An inner join returns only the rows where there is a match between the two tables. In our case, we’ll match the customer_id from the customers table with the orders table.

# Inner Join: Return only matching records
inner_join_query = """
SELECT customers.customer_name, customers.location, orders.amount
FROM customers
INNER JOIN orders
ON customers.customer_id = orders.customer_id
"""
inner_join_result = duckdb.sql(inner_join_query).df()
print("Inner Join Result:")
display(inner_join_result)

Left Join: Returning All Records from the Left Table

A left join returns all records from the left table (customers) and the matching records from the right table (orders). If there is no match, NULL values are returned.

# Left Join: Return all records from the left table (customers) and matching from right table (orders)
left_join_query = """
SELECT customers.customer_name, customers.location, orders.amount
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id
"""
left_join_result = duckdb.sql(left_join_query).df()
print("Left Join Result:")
display(left_join_result)

Left Join: Non-Matching Records Only

To extract only the non-matching rows from a left join, we add a WHERE clause to filter rows where the right table (orders) has no corresponding match.

# Left Join: Return only non-matching records from the left table (customers)
left_join_non_matches_query = """
SELECT customers.customer_name, customers.location, orders.amount
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id
WHERE orders.customer_id IS NULL
"""
left_join_non_matches_result = duckdb.sql(left_join_non_matches_query).df()
print("Left Join - Non-Matching Rows Only:")
display(left_join_non_matches_result)

Right Join: Returning All Records from the Right Table

A right join returns all records from the right table (orders) and the matching records from the left table (customers). If there is no match, NULL values are returned for the left table.

# Right Join: Return all records from the right table (orders) and matching from left table (customers)
right_join_query = """
SELECT customers.customer_name, customers.location, orders.amount
FROM customers
RIGHT JOIN orders
ON customers.customer_id = orders.customer_id
"""
right_join_result = duckdb.sql(right_join_query).df()
print("Right Join Result:")
display(right_join_result)

Right Join: Non-Matching Records Only

Similarly, we can use a right join to extract only non-matching rows from the right table by filtering for rows where the left table (customers) has no match.

# Right Join: Return only non-matching records from the right table (orders)
right_join_non_matches_query = """
SELECT customers.customer_name, customers.location, orders.amount
FROM customers
RIGHT JOIN orders
ON customers.customer_id = orders.customer_id
WHERE customers.customer_id IS NULL
"""
right_join_non_matches_result = duckdb.sql(right_join_non_matches_query).df()
print("Right Join - Non-Matching Rows Only:")
display(right_join_non_matches_result)

Full Outer Join: Returning All Records from Both Tables

A full outer join returns all records when there is a match in either left or right table. If there is no match, NULL values are shown for non-matching rows.

# Full Outer Join: Return all records from both tables, showing NULL for non-matches
full_outer_join_query = """
SELECT customers.customer_name, customers.location, orders.amount
FROM customers
FULL OUTER JOIN orders
ON customers.customer_id = orders.customer_id
"""
full_outer_join_result = duckdb.sql(full_outer_join_query).df()
print("Full Outer Join Result:")
display(full_outer_join_result)

Full Outer Join: Non-Matching Records Only

To return only the rows that have no matching records between the two tables, we can add a WHERE clause filtering for NULLs in either table.

# Full Outer Join: Return only non-matching records from both tables
full_outer_join_non_matches_query = """
SELECT customers.customer_name, customers.location, orders.amount
FROM customers
FULL OUTER JOIN orders
ON customers.customer_id = orders.customer_id
WHERE customers.customer_id IS NULL OR orders.customer_id IS NULL
"""
full_outer_join_non_matches_result = duckdb.sql(full_outer_join_non_matches_query).df()
print("Full Outer Join - Non-Matching Rows Only:")
display(full_outer_join_non_matches_result)

Self Join: Comparing Customers in the Same Location

A self join is used when we want to compare rows within the same table. In this example, we use a self join to find customers who are located in the same city.

# Self Join: Comparing customers from the same location
self_join_query = """
SELECT a.customer_name AS customer_1, b.customer_name AS customer_2, a.location
FROM customers a
JOIN customers b
ON a.location = b.location
WHERE a.customer_id <> b.customer_id
"""
self_join_result = duckdb.sql(self_join_query).df()
print("Self Join Result (Customers from the same location):")
display(self_join_result)

Conclusion

This blog demonstrated how to perform different types of joins using DuckDB and Pandas in Python, including inner, left, right, full outer, and self joins. These techniques are critical for combining and filtering data in relational databases, helping analysts draw insights from multiple data sources.