How PostgreSQL Chooses Join Algorithms

When PostgreSQL runs a query that involves multiple tables, it needs to decide how to join those tables efficiently. The PostgreSQL query planner evaluates different strategies, called join algorithms, to determine the most cost-effective way to execute the query. Choosing the right join algorithm can significantly improve query performance and resource utilization.

In this blog, we will break down the three primary join algorithms used by PostgreSQL, understand how the database engine selects one over the others, explore the configuration parameters that influence these decisions, identify the relevant files in the PostgreSQL source code, and go through SQL examples with explanations. 

What Are Join Algorithms?

Join algorithms are methods used by database engines to retrieve and combine rows from two or more tables based on a related column. PostgreSQL has three major join algorithms in its toolkit:

1. Nested Loop Join

2. Merge Join

3. Hash Join

Each of these algorithms has different performance characteristics and is selected based on various factors like data size, indexes, and sort order.

When a query is executed, the PostgreSQL planner evaluates different combinations of join strategies and selects the one that is expected to be the most efficient, based on internal cost estimates.

PostgreSQL Configuration Parameters

PostgreSQL allows users to influence the planner's decisions using parameters in the `postgresql.conf` file. Three such parameters control whether certain join algorithms can be considered at all

Enable or disable specific join types

enable_nestloop = on      # Nested Loop Join
enable_mergejoin = on     # Merge Join
enable_hashjoin  = on     # Hash Join

Purpose of Each Parameter:

- enable_nestloop: Controls whether nested loop joins are considered. Useful to disable when you suspect that nested loops are causing slow queries due to large data sets.

- enable_mergejoin: If set to off, merge joins will not be considered by the planner, even if they would have been the most efficient method.

- enable_hashjoin: Similar to the above, disabling this forces PostgreSQL to rely on nested loop or merge join alternatives.

These parameters are especially useful for testing, debugging, or forcing a certain behavior in performance tuning exercises.

 Join Algorithm Breakdown

 1. Nested Loop Join

How it works:

- For each row in the outer (left) table, PostgreSQL scans the inner (right) table for matching rows. This process is repeated for all rows in the outer table.

Best for:

- Queries involving small tables

- Join operations where the inner table has an index on the join key

Strengths:

- Simple to implement and understand

- Effective when indexes are available on the join column

- Useful when one of the tables is very small

Weaknesses:

- Can be very slow for large tables without indexes

- Performs O(n*m) operations in the worst case

- Best for small tables  

- Needs an index on the inner (right) table  

- Does not require sorting  

- Not efficient on large data  

- Simple but becomes very slow with large datasets  

Source Code Location: `src/backend/executor/nodeNestloop.c`

-- Example: Nested Loop Join
CREATE TABLE authors (id INT, name TEXT);
CREATE TABLE books (id INT, author_id INT, title TEXT);
INSERT INTO authors VALUES (1, 'jhon'), (2, 'mike');
INSERT INTO books VALUES (1, 1, 'PostgreSQL Guide'), (2, 2, 'SQL Basics');
EXPLAIN ANALYZE
SELECT a.name, b.title
FROM authors a
JOIN books b ON a.id = b.author_id;

2. Merge Join

How it works:

- Sort both tables by the join key and then merge the sorted lists. Similar to the merge step in the merge sort algorithm.

Best for:

- Queries with large data sets that are already sorted or can be efficiently sorted

Strengths:

- Very efficient when inputs are pre-sorted

- Scales well with large, sorted data

Weaknesses:

- Requires sorting if the data is not already ordered

- Might consume more memory if sorting large inputs

- Best for sorted or large datasets  

- Index is helpful, but not mandatory  

- Requires both inputs to be sorted  

- Very fast on large datasets  

- Works great when sorted data is already available  

Source Code Location: `src/backend/executor/nodeMergejoin.c`

sql

-- Example: Merge Join with indexing
CREATE INDEX ON authors(id);
CREATE INDEX ON books(author_id);
EXPLAIN ANALYZE
SELECT a.name, b.title
FROM authors a
JOIN books b ON a.id = b.author_id;

3. Hash Join

How it works:

- Build a hash table of the smaller table using the join key, then scan the larger table and look up matches in the hash table.

Best for:

- Medium to large datasets with no appropriate indexes

Strengths:

- Does not require indexes or sorted data

- Very efficient in memory for large joins

Weaknesses:

- May consume a lot of memory

- Less predictable performance if memory spills to disk

- Best for medium to large datasets without indexes  

- Does not need indexes  

- Does not require sorting  

- Efficient on large data  

- Builds hash table in memory; spills to disk if memory is low  

Source Code Location: `src/backend/executor/nodeHashjoin.c`

sql
-- Example: Forcing a Hash Join
SET enable_mergejoin = off;
SET enable_nestloop = off;
EXPLAIN ANALYZE
SELECT a.name, b.title
FROM authors a
JOIN books b ON a.id = b.author_id;

How PostgreSQL Chooses Which Join to Use

PostgreSQL uses a cost-based optimizer that estimates the cost of each possible query plan. It does this by considering:

- The size of each table (number of rows and pages)

- Availability and usefulness of indexes

- Memory and CPU cost estimates

- Data distribution and column statistics

- Estimated selectivity of join conditions

The planner computes a cost for each join method and chooses the plan with the lowest estimated total cost.

You can view which plan was chosen using the `EXPLAIN` and `EXPLAIN ANALYZE` commands:

EXPLAIN ANALYZE
SELECT a.name, b.title
FROM authors a
JOIN books b ON a.id = b.author_id;

Conclusion

Understanding PostgreSQL's join algorithms is essential for writing high-performance SQL queries and optimizing your database's behavior. By learning how each join strategy works, when it is chosen, and how to influence those choices using PostgreSQL's configuration, you gain deeper insight into the query execution process.

Remember to use tools like `EXPLAIN ANALYZE` and temporarily change `enable_*` settings to see how different join methods affect performance.

Want to go deeper into the PostgreSQL source code for planner logic? Explore these files:

- `src/backend/executor/nodeHashjoin.c`: Executes hash joins.

- `src/backend/executor/nodeMergejoin.c`: Executes merge joins.

- `src/backend/executor/nodeNestloop.c`: Executes nested loop joins.

whatsapp_icon
location

Calicut

Cybrosys Technologies Pvt. Ltd.
Neospace, Kinfra Techno Park
Kakkancherry, Calicut
Kerala, India - 673635

location

Kochi

Cybrosys Technologies Pvt. Ltd.
1st Floor, Thapasya Building,
Infopark, Kakkanad,
Kochi, India - 682030.

location

Bangalore

Cybrosys Techno Solutions
The Estate, 8th Floor,
Dickenson Road,
Bangalore, India - 560042

Send Us A Message