How to Use Table Sampling in PostgreSQL

When working with large datasets in PostgreSQL, you often need to analyze a subset of data rather than processing entire tables. Whether you're performing exploratory data analysis, testing queries on production data, or creating representative samples for statistical analysis, PostgreSQL's table sampling functionality provides an efficient solution.

What is Table Sampling?

Table sampling allows you to retrieve a random subset of rows from a table without reading the entire dataset. PostgreSQL implements the SQL standard TABLESAMPLE clause, which can significantly improve query performance when working with large tables where a representative sample is sufficient for your analysis.

PostgreSQL Sampling Methods

PostgreSQL provides two built-in sampling methods:

1. BERNOULLI Sampling

The BERNOULLI method performs row-level sampling, where each row has an independent probability of being selected. This method scans the entire table but only returns the sampled rows.

-- Sample approximately 10% of rows
SELECT * FROM large_table 
TABLESAMPLE BERNOULLI(10);

Characteristics:

* Each row is independently selected with the specified probability

* More random, but potentially slower for large tables

* Sample size can vary significantly from the expected percentage

* Provides true random sampling at the row level

2. SYSTEM Sampling

The SYSTEM method performs block-level sampling, randomly selecting entire data pages (blocks) and returning all rows from those pages. This is typically faster but less random than BERNOULLI.

-- Sample approximately 10% of data blocks
SELECT * FROM large_table 
TABLESAMPLE SYSTEM(10);

Characteristics:

* Samples entire data blocks rather than individual rows

* Generally faster, especially for large tables

* Less random than BERNOULLI (spatial clustering)

* Sample size depends on data distribution across blocks

Syntax and Parameters

The basic syntax for table sampling is:

SELECT columns
FROM table_name 
TABLESAMPLE sampling_method(percentage) 
[REPEATABLE(seed)]
[WHERE conditions]
[ORDER BY columns];

Parameters Explained

* sampling_method: Either BERNOULLI or SYSTEM

* percentage: A value between 0 and 100 representing the sampling percentage

* REPEATABLE(seed): Optional parameter to make sampling deterministic using a seed value

* seed: An integer value that ensures reproducible results

Practical Examples

Basic Sampling

-- Get a 1% sample using BERNOULLI method
SELECT customer_id, order_date, total_amount
FROM orders 
TABLESAMPLE BERNOULLI(1);
-- Get a 5% sample using SYSTEM method
SELECT product_id, category, price
FROM products 
TABLESAMPLE SYSTEM(5);

Reproducible Sampling

-- Use REPEATABLE to get consistent results
SELECT * FROM sales_data 
TABLESAMPLE BERNOULLI(2) REPEATABLE(12345);

Stratified Sampling with Window Functions

-- Sample proportionally from each category
WITH sampled_products AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY category ORDER BY RANDOM()) as rn,
         COUNT(*) OVER (PARTITION BY category) as category_count
  FROM products
)
SELECT product_id, product_name, category, price
FROM sampled_products
WHERE rn <= GREATEST(1, category_count * 0.1);  -- 10% from each category

Performance Considerations

When to Use BERNOULLI vs SYSTEM

Use BERNOULLI when:

* You need truly random sampling

* Row-level randomness is important

* Working with smaller to medium-sized tables

* Statistical accuracy is critical

Use SYSTEM when:

* Performance is the primary concern

* Working with very large tables

* Approximate randomness is acceptable

* You need faster query execution

Conclusion

Table sampling in PostgreSQL is a powerful feature for working efficiently with large datasets. By understanding the differences between BERNOULLI and SYSTEM sampling methods, you can choose the right approach for your specific use case. Whether you're performing quick data quality checks, developing and testing queries, or conducting statistical analysis, table sampling can significantly improve performance while providing representative results.

Remember to consider the trade-offs between randomness and performance, use the REPEATABLE clause when consistency is important, and always validate that your samples are representative of your complete dataset. With these techniques, you can effectively analyze large PostgreSQL tables without the overhead of processing entire datasets. Explore the complete lifecycle of data removal in PostgreSQL with our guide, "How to Understand Tuple Deletion in PostgreSQL: From Query to Cleanup." This article demystifies the process, from the initial SQL DELETE command to the essential background tasks that reclaim disk space and maintain database health.

whatsapp_icon
location

Calicut

Cybrosys Technologies Pvt. Ltd.
Neospace, Kinfra Techno Park
Kakkancherry, Calicut
Kerala, India - 673635

location

Kochi

Cybrosys Technologies Pvt. Ltd.
1st Floor, Thapasya Building,
Infopark, Kakkanad,
Kochi, India - 682030.

location

Bangalore

Cybrosys Techno Solutions
The Estate, 8th Floor,
Dickenson Road,
Bangalore, India - 560042

Send Us A Message