When working with large datasets in PostgreSQL, full-table scans can be resource-intensive and slow. In many analytical and exploratory tasks, users often don’t need to scan the entire table, just a small sample may be sufficient. This is where TABLESAMPLE comes into play.
PostgreSQL provides two popular sampling methods through extensions: tsm_system_rows and tsm_system_time. While they both serve the same purpose, sampling data, they operate in fundamentally different ways. This article explores how these two sampling methods work, their strengths, limitations, and how to choose the right one.
What is TABLESAMPLE in PostgreSQL?
The TABLESAMPLE clause in SQL allows you to retrieve a subset of rows from a table in a fast and lightweight manner. This is useful for:
- Quick exploratory data analysis
- Performance testing
- Building ML prototypes
- Running approximate queries
The TABLESAMPLE clause requires a sampling method. PostgreSQL supports system-level methods like:
- SYSTEM (built-in)
- BERNOULLI (built-in)
- SYSTEM_ROWS (via tsm_system_rows extension)
- SYSTEM_TIME (via tsm_system_time extension)
1. tsm_system_rows: Sample a Fixed Number of Rows
The tsm_system_rows extension allows you to sample approximately N rows from a table:
SELECT * FROM my_table TABLESAMPLE SYSTEM_ROWS(100);
How it works:
- Internally, it selects random blocks (pages) from the table.
- From these blocks, it extracts tuples (rows).
- It continues until approximately N rows have been collected.
- It does not scan all rows or use row-level randomness.
Is it repeatable?
No. Each call may return different rows. PostgreSQL does not remember which pages were picked last time, and there's no fixed seed by default.
Pros:
- Extremely fast, even for large tables.
- No need to compute total row counts.
- Great for sampling in OLAP or dashboard tools.
Cons:
- Not repeatable.
- Not statistically uniform, some rows may have a higher selection probability if stored in earlier blocks.
- Sensitive to the physical layout of the table.
2. tsm_system_time: Sample for a Fixed Amount of Time
The tsm_system_time extension allows you to scan rows for a specific duration (in milliseconds):
SELECT * FROM my_table TABLESAMPLE SYSTEM_TIME(500);
How it works:
- Starts scanning blocks sequentially.
- Continues until the allotted time budget is reached.
- Aborts gracefully when the timeout is hit.
Use Case:
You want to limit CPU usage or get whatever rows you can in N milliseconds, regardless of count.
Pros:
- Useful when you want to cap query duration.
- Adaptable to table size, small tables may return many rows, big tables fewer.
- Fast and approximate.
Cons:
- May return very different row counts for each run.
- Highly dependent on system load and I/O speed.
- Also, not repeatable.
Comparison Table
| Feature | tsm_system_rows | tsm_system_time |
| Sample type | Approximate fixed row count | Rows scanned within time |
| Repeatable? | No | No |
| Fast for large tables? | Yes | Yes |
| Time-bounded? | No | Yes |
| Row-uniform sampling? | No | No |
| Control by time? | No | Yes |
| Requires extension? | Yes (tsm_system_rows) | Yes (tsm_system_time) |
Installing the Extensions
To use either sampling method, you must first install the respective extension:
-- Enable tsm_system_rows
CREATE EXTENSION tsm_system_rows;
-- Enable tsm_system_time
CREATE EXTENSION tsm_system_time;
Best Practices
- Use tsm_system_rows when you care about row count but not repeatability.
- Use tsm_system_time when you want to control query cost and time.
- Don’t use either for statistically accurate sampling. Use ORDER BY random() LIMIT n with a seed for that (though it’s slower).
Avoid using sampled rows for critical decisions, use them for approximation, testing, or prototyping.
PostgreSQL’s tsm_system_rows and tsm_system_time give developers and data engineers powerful tools for quick sampling from large datasets. While both are fast and low-cost, they are inherently non-deterministic and not suitable for high-precision statistical analysis.
However, in real-world applications, especially in large-scale systems, these extensions can drastically reduce load, improve response time for dashboards, and help developers gain quick insights. Whether you're debugging a data issue or building a fast AI pipeline prototype, knowing when and how to use these sampling methods can save you hours of work.