PostgreSQL provides several indexing mechanisms that help improve query performance. The most commonly used index type is the B-tree, which is the default indexing method. However, PostgreSQL also supports additional index access methods through extensions. One such extension is bloom, which provides the Bloom index access method.
Bloom indexes are useful when queries filter multiple columns together. Instead of storing the exact column values, a Bloom index stores hashed signatures of those values. This design reduces the size of the index and allows PostgreSQL to quickly filter potential matching rows. In this article, we will walk through a complete example that demonstrates how to use a Bloom index and compare it with a traditional B-tree index.
The example includes creating a large dataset, building both index types, and analyzing their behavior during query execution.
Enabling the Bloom Extension
Before using a Bloom index, the extension must be installed in the database.
Connect to PostgreSQL as the postgres user and create the extension.
CREATE EXTENSION bloom;
This registers the Bloom index access method and its related operator classes.
\dx+ bloom
Result:
Objects in extension "bloom"
Object description
--------------------------------------------------
access method bloom
function blhandler(internal)
operator class int4_ops for access method bloom
operator class text_ops for access method bloom
operator family int4_ops for access method bloom
operator family text_ops for access method bloom
(6 rows)
Creating a Schema
To organize the demonstration objects, create a separate schema.
CREATE SCHEMA analytics;
Creating a Large Test Table
Next, create a table that contains several integer columns and a text column. These columns will later be indexed.
CREATE TABLE analytics.transactions (
col_a int,
col_b int,
col_c int,
col_d int,
col_e int,
col_f int,
col_g int,
col_h int,
description text,
region_code int
);
This table structure contains multiple columns so that we can test how different indexes behave when filtering across several attributes.
Generating a Large Dataset
To properly evaluate index performance, the table must contain a large number of rows. The following statement inserts one hundred million rows with randomly generated values.
INSERT INTO analytics.transactions
SELECT
(random() * 1000000)::int,
(random() * 1000000)::int,
(random() * 1000000)::int,
(random() * 1000000)::int,
(random() * 1000000)::int,
(random() * 1000000)::int,
(random() * 1000000)::int,
(random() * 1000000)::int,
md5(g::text),
floor(random() * (20000 - 9999 + 1) + 9999)
FROM generate_series(1, 100000000) g;
After the data is inserted, the table size becomes several gigabytes.
You can confirm this with:
\dt+ analytics.transactions
Result:
List of tables
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
-----------+--------------+-------+----------+-------------+---------------+---------+-------------
analytics | transactions | table | postgres | permanent | heap | 9647 MB |
(1 row)
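After a bulk load of this size, it is worth refreshing planner statistics before running any experiments, since the examples below depend on the planner's cost estimates. A minimal step:

```sql
-- Reclaim space, set visibility-map bits, and refresh planner statistics
-- after the bulk load. Accurate statistics matter for the plans shown below.
VACUUM ANALYZE analytics.transactions;
```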
Creating a Multi-Column B-Tree Index
The first indexing approach uses a traditional B-tree index.
CREATE INDEX idx_btree_transactions
ON analytics.transactions
(col_a, col_b, col_c, col_d, col_e, col_f, col_g, region_code);
After the index is created, its size can be inspected.
\di+ analytics.idx_btree_transactions
Result:
List of indexes
Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description
-----------+------------------------+-------+----------+--------------+-------------+---------------+---------+-------------
analytics | idx_btree_transactions | index | postgres | transactions | permanent | btree | 4723 MB |
(1 row)
In a large dataset, this type of index may occupy several gigabytes because B-tree indexes store actual column values.
Testing Query Performance with the B-Tree Index
Now run a query that filters using columns that are not leading columns in the index.
EXPLAIN ANALYZE
SELECT *
FROM analytics.transactions
WHERE col_d = 295294
AND region_code = 13266;
Even though a B-tree index exists, PostgreSQL may choose a parallel sequential scan instead of using the index. This happens because the query filters on col_d and region_code, neither of which is the leading column of the index; a multi-column B-tree index can only be searched efficiently when the query constrains its leading columns.
A typical execution plan may look like this:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..1860818.00 rows=2500 width=68) (actual time=5411.588..5418.546 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15707 read=1218861
-> Parallel Seq Scan on transactions (cost=0.00..1859568.00 rows=1042 width=68) (actual time=5396.882..5396.883 rows=0.00 loops=3)
Filter: ((col_d = 295294) AND (region_code = 13266))
Rows Removed by Filter: 33333333
Buffers: shared hit=15707 read=1218861
Planning:
Buffers: shared hit=5 read=1
Planning Time: 0.186 ms
JIT:
Functions: 6
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.556 ms (Deform 0.335 ms), Inlining 88.957 ms, Optimization 51.289 ms, Emission 32.547 ms, Total 174.350 ms
Execution Time: 5418.848 ms
(16 rows)
In this scenario, PostgreSQL scans the entire table because the B-tree index is not optimal for this query pattern.
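For contrast, a query that does constrain the leading column should be able to use the B-tree index. The exact plan depends on your data distribution and settings, but a sketch along these lines (the col_a value is arbitrary) should produce an index-based scan on idx_btree_transactions rather than a sequential scan:

```sql
-- Filtering on col_a, the leading column of idx_btree_transactions,
-- lets PostgreSQL search the B-tree efficiently.
EXPLAIN ANALYZE
SELECT *
FROM analytics.transactions
WHERE col_a = 295294;
```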
Creating a Bloom Index
Next, create a Bloom index that includes several columns.
CREATE INDEX idx_bloom_transactions
ON analytics.transactions
USING bloom(col_a, col_b, col_c, col_d, col_e, col_f, col_g, region_code);
Once the index is created, check its size.
\di+ analytics.idx_bloom_transactions
Result:
List of indexes
Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description
-----------+------------------------+-------+----------+--------------+-------------+---------------+---------+-------------
analytics | idx_bloom_transactions | index | postgres | transactions | permanent | bloom | 1532 MB |
(1 row)
You will notice that the Bloom index is significantly smaller than the B-tree index. This happens because Bloom indexes store compact hash signatures instead of full values.
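The bloom access method also accepts storage parameters: length sets the signature size in bits per tuple, and col1 through colN set the number of bits generated for each indexed column (the defaults are length = 80 and 2 bits per column). Tuning these trades index size against the false-positive rate. As a hedged sketch, with illustrative values and a hypothetical index name:

```sql
-- Larger signature (128 bits) and 4 bits per column: fewer false
-- positives at the cost of a bigger index than the defaults produce.
CREATE INDEX idx_bloom_tuned
ON analytics.transactions
USING bloom (col_a, col_b, col_c, col_d, col_e, col_f, col_g, region_code)
WITH (length = 128, col1 = 4, col2 = 4, col3 = 4, col4 = 4,
      col5 = 4, col6 = 4, col7 = 4, col8 = 4);
```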
Testing Query Performance with the Bloom Index
Now run another query that filters two columns included in the Bloom index.
EXPLAIN ANALYZE
SELECT *
FROM analytics.transactions
WHERE col_f = 326756
AND col_g = 597560;
The execution plan now shows PostgreSQL using the Bloom index.
A typical execution plan appears as:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on transactions (cost=1784320.62..1794013.03 rows=2500 width=68) (actual time=3099.199..3099.204 rows=0.00 loops=1)
Recheck Cond: ((col_f = 326756) AND (col_g = 597560))
Rows Removed by Index Recheck: 27015421
Heap Blocks: exact=87996 lossy=332197
Buffers: shared hit=16077 read=600195 written=11890
-> Bitmap Index Scan on idx_bloom_transactions (cost=0.00..1784320.00 rows=2500 width=0) (actual time=737.585..737.587 rows=512240.00 loops=1)
Index Cond: ((col_f = 326756) AND (col_g = 597560))
Index Searches: 1
Buffers: shared hit=16077 read=180002
Planning:
Buffers: shared hit=4 read=1
Planning Time: 0.170 ms
JIT:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.244 ms (Deform 0.132 ms), Inlining 1.346 ms, Optimization 11.471 ms, Emission 6.835 ms, Total 19.896 ms
Execution Time: 3099.564 ms
(17 rows)
Instead of scanning the entire table, PostgreSQL uses the Bloom index to quickly identify candidate rows.
Understanding the Recheck Step
One important part of the execution plan is the Recheck Cond step.
Bloom indexes use probabilistic hashing, which means the index may return rows that appear to match the query condition but actually do not (false positives). In addition, when the bitmap grows beyond work_mem, PostgreSQL degrades parts of it to lossy, page-level entries, so every row on those pages must be re-examined.
PostgreSQL verifies all such candidate rows against the actual table data. This verification step ensures correct query results while still benefiting from the speed of the Bloom index.
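In the plan above, the line Heap Blocks: exact=87996 lossy=332197 shows that the bitmap exceeded work_mem and many pages were tracked in lossy mode, forcing a recheck of every row on those pages. Raising work_mem for the session may keep the bitmap exact; a sketch, with an illustrative value:

```sql
-- Allow a larger in-memory bitmap so fewer heap blocks become lossy.
SET work_mem = '256MB';

EXPLAIN ANALYZE
SELECT *
FROM analytics.transactions
WHERE col_f = 326756
  AND col_g = 597560;
```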
Comparing Index Sizes
A noticeable difference between the two index types is their size.
The B-tree index stores the actual indexed values and therefore consumes more disk space when many columns are included. Bloom indexes store hashed signatures, which greatly reduces storage requirements.
For very large datasets with many indexed columns, this difference can be significant.
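The two index sizes can also be compared directly from SQL, without psql meta-commands; a sketch using pg_relation_size:

```sql
-- Compare the on-disk sizes of the B-tree and Bloom indexes.
SELECT relname,
       pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('idx_btree_transactions', 'idx_bloom_transactions');
```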
Observations from the Experiment
This experiment demonstrates several important characteristics of both index types.
A B-tree index is highly efficient when queries filter using the leading columns of the index. However, if queries filter columns that appear later in the index definition, PostgreSQL may not be able to use the index effectively.
Bloom indexes provide a flexible alternative when queries frequently filter on different combinations of columns. Because all indexed columns are encoded into a compact signature, the index can still help filter rows even if the query uses a subset of those columns.
Although Bloom indexes may produce false positives, PostgreSQL resolves this by rechecking the rows in the table. Keep in mind that Bloom indexes support only equality comparisons, and the extension ships operator classes only for int4 and text columns.
Conclusion
PostgreSQL offers a powerful indexing system that allows developers to choose the most suitable strategy for their workload. B-tree indexes remain the standard option for most applications because they provide precise lookups and support a wide range of queries.
Bloom indexes are designed for scenarios where many columns are involved in filtering conditions, and the index size must remain compact. By storing hashed signatures rather than full values, Bloom indexes reduce storage overhead and allow PostgreSQL to quickly identify candidate rows.
Understanding these indexing techniques can help database engineers design more efficient systems when working with large datasets and complex query patterns.