When PostgreSQL runs a query that involves multiple tables, it needs to decide how to join those tables efficiently. The PostgreSQL query planner evaluates different strategies, called join algorithms, to determine the most cost-effective way to execute the query. Choosing the right join algorithm can significantly improve query performance and resource utilization.
In this blog, we will break down the three primary join algorithms used by PostgreSQL, understand how the database engine selects one over the others, explore the configuration parameters that influence these decisions, identify the relevant files in the PostgreSQL source code, and go through SQL examples with explanations.
What Are Join Algorithms?
Join algorithms are methods used by database engines to retrieve and combine rows from two or more tables based on a related column. PostgreSQL has three major join algorithms in its toolkit:
1. Nested Loop Join
2. Merge Join
3. Hash Join
Each of these algorithms has different performance characteristics and is selected based on various factors like data size, indexes, and sort order.
When a query is executed, the PostgreSQL planner evaluates different combinations of join strategies and selects the one that is expected to be the most efficient, based on internal cost estimates.
PostgreSQL Configuration Parameters
PostgreSQL allows users to influence the planner's decisions using parameters in the `postgresql.conf` file. Three such parameters control whether certain join algorithms can be considered at all
Enable or disable specific join types
enable_nestloop = on # Nested Loop Join
enable_mergejoin = on # Merge Join
enable_hashjoin = on # Hash Join
Purpose of Each Parameter:
- enable_nestloop: Controls whether nested loop joins are considered. Useful to disable when you suspect that nested loops are causing slow queries due to large data sets.
- enable_mergejoin: If set to off, merge joins will not be considered by the planner, even if they would have been the most efficient method.
- enable_hashjoin: Similar to the above, disabling this forces PostgreSQL to rely on nested loop or merge join alternatives.
These parameters are especially useful for testing, debugging, or forcing a certain behavior in performance tuning exercises.
Join Algorithm Breakdown
1. Nested Loop Join
How it works:
- For each row in the outer (left) table, PostgreSQL scans the inner (right) table for matching rows. This process is repeated for all rows in the outer table.
Best for:
- Queries involving small tables
- Join operations where the inner table has an index on the join key
Strengths:
- Simple to implement and understand
- Effective when indexes are available on the join column
- Useful when one of the tables is very small
Weaknesses:
- Can be very slow for large tables without indexes
- Performs O(n*m) operations in the worst case
- Best for small tables
- Needs an index on the inner (right) table
- Does not require sorting
- Not efficient on large data
- Simple but becomes very slow with large datasets
Source Code Location: `src/backend/executor/nodeNestloop.c`
-- Example: Nested Loop Join
CREATE TABLE authors (id INT, name TEXT);
CREATE TABLE books (id INT, author_id INT, title TEXT);
INSERT INTO authors VALUES (1, 'jhon'), (2, 'mike');
INSERT INTO books VALUES (1, 1, 'PostgreSQL Guide'), (2, 2, 'SQL Basics');
EXPLAIN ANALYZE
SELECT a.name, b.title
FROM authors a
JOIN books b ON a.id = b.author_id;
2. Merge Join
How it works:
- Sort both tables by the join key and then merge the sorted lists. Similar to the merge step in the merge sort algorithm.
Best for:
- Queries with large data sets that are already sorted or can be efficiently sorted
Strengths:
- Very efficient when inputs are pre-sorted
- Scales well with large, sorted data
Weaknesses:
- Requires sorting if the data is not already ordered
- Might consume more memory if sorting large inputs
- Best for sorted or large datasets
- Index is helpful, but not mandatory
- Requires both inputs to be sorted
- Very fast on large datasets
- Works great when sorted data is already available
Source Code Location: `src/backend/executor/nodeMergejoin.c`
sql
-- Example: Merge Join with indexing
CREATE INDEX ON authors(id);
CREATE INDEX ON books(author_id);
EXPLAIN ANALYZE
SELECT a.name, b.title
FROM authors a
JOIN books b ON a.id = b.author_id;
3. Hash Join
How it works:
- Build a hash table of the smaller table using the join key, then scan the larger table and look up matches in the hash table.
Best for:
- Medium to large datasets with no appropriate indexes
Strengths:
- Does not require indexes or sorted data
- Very efficient in memory for large joins
Weaknesses:
- May consume a lot of memory
- Less predictable performance if memory spills to disk
- Best for medium to large datasets without indexes
- Does not need indexes
- Does not require sorting
- Efficient on large data
- Builds hash table in memory; spills to disk if memory is low
Source Code Location: `src/backend/executor/nodeHashjoin.c`
sql
-- Example: Forcing a Hash Join
SET enable_mergejoin = off;
SET enable_nestloop = off;
EXPLAIN ANALYZE
SELECT a.name, b.title
FROM authors a
JOIN books b ON a.id = b.author_id;
How PostgreSQL Chooses Which Join to Use
PostgreSQL uses a cost-based optimizer that estimates the cost of each possible query plan. It does this by considering:
- The size of each table (number of rows and pages)
- Availability and usefulness of indexes
- Memory and CPU cost estimates
- Data distribution and column statistics
- Estimated selectivity of join conditions
The planner computes a cost for each join method and chooses the plan with the lowest estimated total cost.
You can view which plan was chosen using the `EXPLAIN` and `EXPLAIN ANALYZE` commands:
EXPLAIN ANALYZE
SELECT a.name, b.title
FROM authors a
JOIN books b ON a.id = b.author_id;
Conclusion
Understanding PostgreSQL's join algorithms is essential for writing high-performance SQL queries and optimizing your database's behavior. By learning how each join strategy works, when it is chosen, and how to influence those choices using PostgreSQL's configuration, you gain deeper insight into the query execution process.
Remember to use tools like `EXPLAIN ANALYZE` and temporarily change `enable_*` settings to see how different join methods affect performance.
Want to go deeper into the PostgreSQL source code for planner logic? Explore these files:
- `src/backend/executor/nodeHashjoin.c`: Executes hash joins.
- `src/backend/executor/nodeMergejoin.c`: Executes merge joins.
- `src/backend/executor/nodeNestloop.c`: Executes nested loop joins.