In PostgreSQL, a cluster is a single PostgreSQL server instance that manages multiple databases using one data directory (PGDATA).
Initializing a PostgreSQL cluster looks simple on the surface. You run one command:
initdb -D mydata
And PostgreSQL replies with a few friendly messages like:
creating directory ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
But internally, PostgreSQL performs many carefully ordered steps involving filesystem setup, locale handling, a special bootstrap backend, and system catalog creation.
This post explains what happens internally when PostgreSQL initializes a new cluster, in a way that beginners can follow while staying accurate to the PostgreSQL source code.
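The progress messages shown earlier all share a "step ... status" shape, which makes them easy to check from a script. Below is a minimal sketch that splits such lines into pairs. The function name parse_progress and the regular expression are illustrative assumptions, not part of PostgreSQL, and real initdb output also contains lines that do not follow this shape.

```python
import re

# Matches initdb-style progress lines such as "running bootstrap script ... ok".
PROGRESS_RE = re.compile(r"^(?P<step>.+?) \.\.\. (?P<status>\S+)$")


def parse_progress(output: str) -> list[tuple[str, str]]:
    """Return (step, status) pairs for every line that looks like a progress line."""
    steps = []
    for line in output.splitlines():
        match = PROGRESS_RE.match(line.strip())
        if match:
            steps.append((match.group("step"), match.group("status")))
    return steps


# Sample text copied from the messages quoted in this post.
sample = """creating directory ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok"""

for step, status in parse_progress(sample):
    print(f"{step}: {status}")
```

A wrapper script could use the returned pairs to report the first step whose status is not "ok".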
What Does “Initialize a Cluster” Mean in PostgreSQL?
In PostgreSQL, a cluster is:
- A data directory
- A set of system catalogs
- A shared WAL (write-ahead log) and control structure
- A collection of databases (template0, template1, postgres, and user DBs)
Running initdb does not start a server.
It creates the physical and logical foundation required for PostgreSQL to run.
How initdb Works Internally
Internally, initdb works in two worlds:
- Frontend program (initdb.c)
- Backend bootstrap server (postgres --boot)
Think of it like this:
initdb (frontend)
|
|-- prepares filesystem, config, scripts
|
|-- starts postgres in bootstrap mode
| |
| |-- creates system catalogs
|
|-- runs post-bootstrap SQL
|
|-- creates template databases
The 8 Internal Stages of Initializing a New Cluster in PostgreSQL
Even though initdb.c is over 3000 lines long, the entire process can be understood in 8 clear stages.
1. Locale and Environment Setup
Before PostgreSQL writes anything to disk, it must ensure that:
- Locale names are valid
- Encoding matches locale
- ICU or libc collation is consistent
Relevant source code functions (from initdb.c):
save_global_locale()
restore_global_locale()
check_locale_name()
check_locale_encoding()
check_icu_locale_encoding()
setlocales()
set_info_version()
What happens here?
- Reads environment variables like LC_ALL, LANG
- Validates encoding compatibility (UTF-8, etc.)
- Prepares collation rules
- Stores version information for Information Schema
At this stage:
- No directories exist
- No files are created
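To make the environment lookup concrete, here is a small sketch of the POSIX precedence rule for locale variables: LC_ALL overrides the category-specific variable, which overrides LANG. initdb itself does this work in C through setlocale() and its own checks, and command-line options such as --locale take priority over the environment, so treat this only as a model. The helper names effective_locale and encoding_hint are hypothetical.

```python
from typing import Optional


def effective_locale(env: dict[str, str], category: str = "LC_COLLATE") -> str:
    """Resolve one locale category: LC_ALL, then the category variable, then LANG."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty variables are both skipped
            return value
    return "C"  # POSIX default when nothing is set


def encoding_hint(locale_name: str) -> Optional[str]:
    """Return the codeset part of a name like 'en_US.UTF-8', or None if absent."""
    if "." not in locale_name:
        return None
    codeset = locale_name.split(".", 1)[1]
    return codeset.split("@", 1)[0]  # drop modifiers such as '@euro'


env = {"LANG": "en_US.UTF-8", "LC_COLLATE": "C"}
print(effective_locale(env, "LC_COLLATE"))
print(encoding_hint(effective_locale(env, "LC_CTYPE")))
```

The real check goes further than reading a suffix: it asks the operating system which encoding the locale implies and compares that with the requested database encoding.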
2. Creating the Data Directory
Now PostgreSQL creates the cluster directory structure.
Key functions:
create_data_directory()
initialize_data_directory()
setup_data_file_paths()
What is created?
$PGDATA/
+-- base/
+-- global/
+-- pg_xact/
+-- pg_multixact/
+-- pg_commit_ts/
+-- pg_subtrans/
+-- pg_tblspc/
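And the most critical file, which is actually written a little later by the bootstrap backend (stage 5) rather than by the directory-creation code: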
global/pg_control
pg_control stores:
- System identifier
- WAL state
- Next transaction ID
PostgreSQL cannot start without this file.
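Once initdb has finished, the layout above can be verified with a few filesystem checks. The sketch below only tests for the entries named in this post; a real data directory contains more files and directories than these, and the function name missing_entries is an assumption made for illustration.

```python
import tempfile
from pathlib import Path

# Entries named in this post; a real data directory contains more than these.
EXPECTED_DIRS = ["base", "global", "pg_xact", "pg_multixact",
                 "pg_commit_ts", "pg_subtrans", "pg_tblspc"]
EXPECTED_FILES = ["global/pg_control"]


def missing_entries(pgdata: str) -> list[str]:
    """List the expected directories and files that are absent under pgdata."""
    root = Path(pgdata)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# Demonstration on a throwaway directory instead of a real cluster.
with tempfile.TemporaryDirectory() as tmp:
    for d in EXPECTED_DIRS:
        (Path(tmp) / d).mkdir()
    print(missing_entries(tmp))  # only pg_control is missing at this point
```

An empty result means the directory at least looks like a cluster; it says nothing about whether the contents are valid.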
3. WAL Directory Initialization
PostgreSQL now prepares the Write-Ahead Log (WAL).
Source code function:
create_xlog_or_symlink()
Result:
pg_wal/
(or a symbolic link, if configured)
Creating this location first means the bootstrap backend has a place to write the initial WAL segment before any catalogs exist.
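Whether pg_wal is a plain directory or a symbolic link can be checked from outside the server. The sketch below builds a throwaway layout in which pg_wal is a symlink, similar to what initdb's --waldir option produces, and then resolves it. The directory names and the helper describe_wal_dir are hypothetical, and the demonstration assumes a platform where unprivileged symlinks work.

```python
import os
import tempfile


def describe_wal_dir(pgdata: str) -> tuple[bool, str]:
    """Return (is_symlink, resolved_path) for the pg_wal entry under pgdata."""
    wal_path = os.path.join(pgdata, "pg_wal")
    return os.path.islink(wal_path), os.path.realpath(wal_path)


with tempfile.TemporaryDirectory() as tmp:
    pgdata = os.path.join(tmp, "mydata")
    wal_target = os.path.join(tmp, "separate_wal")  # hypothetical WAL location
    os.makedirs(pgdata)
    os.makedirs(wal_target)
    os.symlink(wal_target, os.path.join(pgdata, "pg_wal"))

    is_link, resolved = describe_wal_dir(pgdata)
    print(is_link, resolved == os.path.realpath(wal_target))
```

Placing WAL on a separate device is the usual reason for the symlink, since WAL writes are sequential and benefit from their own storage.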
4. Configuration File Creation
Now PostgreSQL creates minimal configuration files so the backend can start.
Source code functions:
choose_dsm_implementation()
setup_config()
Files created:
- postgresql.conf
- pg_hba.conf
- pg_ident.conf
This matches the output:
creating configuration files ... ok
At this point:
- PostgreSQL still has no system catalogs
- But it is ready to start a backend process
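The generated postgresql.conf is a plain "name = value" file with "#" comments, so its active settings can be read with very little code. The parser below is a deliberately simplified sketch: it ignores include directives, settings written without "=", and "#" characters inside quoted values, all of which the real parser handles. The sample text is a hypothetical excerpt, not the exact file initdb writes.

```python
def parse_simple_conf(text: str) -> dict[str, str]:
    """Collect active name = value settings, skipping comments and blank lines."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # naive: '#' in quotes not handled
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # this sketch skips lines without '='
        settings[name.strip()] = value.strip().strip("'")
    return settings


# Hypothetical excerpt in postgresql.conf style.
sample = """
max_connections = 100            # (change requires restart)
shared_buffers = 128MB
#port = 5432
dynamic_shared_memory_type = posix
"""

for name, value in parse_simple_conf(sample).items():
    print(name, "=", value)
```

The commented-out port line is skipped, which mirrors how the server falls back to its built-in default for any setting that stays commented out.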
5. Bootstrap Backend – System Catalog Creation
This is the most important stage.
PostgreSQL has a problem:
System catalogs are tables, but tables cannot exist until catalogs exist.
The solution is bootstrap mode.
Frontend function (in initdb.c):
bootstrap_template1()
What this function does:
- Prepares a BKI script (postgres.bki)
- Starts the backend in bootstrap mode: postgres --boot
- Sends catalog definitions to the backend
Backend execution (server source code)
File:
src/backend/bootstrap/bootstrap.c
Main function:
BootstrapModeMain()
This backend:
- Runs as a single process
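- Writes no WAL records for the catalog rows it inserts
- Still uses shared memory and buffers, but with no other processes attached
- Reads BKI commands instead of SQL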
- No background workers
What catalogs are created here?
Examples:
- pg_class
- pg_attribute
- pg_type
- pg_proc
- pg_namespace
- pg_database
These definitions come from:
src/include/catalog/*.h
Example:
CATALOG(pg_class, 1259, RelationRelationId)
Physical storage is created using:
heap_create_with_catalog()
At the end of this stage:
- PostgreSQL finally has working system catalogs
- A real database (template1) now exists
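The chicken-and-egg problem is easier to see in a toy model. In the sketch below, a pg_class-like table is filled with rows that describe the core catalogs, including a row for pg_class itself. Because each OID is fixed in advance, no lookup is needed before the table exists. This is an illustration only: real catalogs are heap tables on disk, and apart from 1259 for pg_class (quoted above) the OIDs listed are given from memory of the catalog headers and should be checked against src/include/catalog.

```python
# Toy model only: real catalogs are heap tables on disk, not Python dicts.
# 1259 comes from the CATALOG line quoted in this post; the rest are assumptions
# to verify against the catalog headers.
FIXED_OIDS = {
    "pg_type": 1247,
    "pg_attribute": 1249,
    "pg_proc": 1255,
    "pg_class": 1259,
}


def bootstrap_catalogs() -> dict[int, dict[str, str]]:
    """Build a pg_class-like table whose rows describe the catalogs, itself included."""
    pg_class = {}
    for relname, oid in FIXED_OIDS.items():
        # No catalog lookup is possible yet, so the OID must be known up front.
        pg_class[oid] = {"relname": relname, "relkind": "r"}
    return pg_class


def lookup_oid(pg_class: dict[int, dict[str, str]], relname: str) -> int:
    """After bootstrap, names can be resolved by scanning the table like any other."""
    for oid, row in pg_class.items():
        if row["relname"] == relname:
            return oid
    raise KeyError(relname)


catalog = bootstrap_catalogs()
print(lookup_oid(catalog, "pg_class"))  # the table contains a row describing itself
```

Once the self-describing rows exist, every later table can be created the normal way, by inserting new rows into tables that are already there.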
6. Post-Bootstrap Catalog Population
Now PostgreSQL can execute normal SQL.
The frontend generates SQL and pipes it to a backend running in single-user mode (postgres --single).
Source code functions:
setup_depend()
setup_description()
setup_collation()
setup_privileges()
setup_schema()
setup_run_file()
What is loaded?
- pg_collation
- Initial privileges
- Information Schema
- System views
From files like:
- information_schema.sql
- system_views.sql
- system_functions.sql
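The mechanism here is a parent process writing commands into a child's standard input. The sketch below shows that pattern with a stand-in child: a tiny Python process that counts the lines it receives. It is not a PostgreSQL backend, and pipe_commands is a hypothetical helper; the point is only how the frontend can drive another program through a pipe.

```python
import subprocess
import sys


def pipe_commands(commands: list[str]) -> str:
    """Feed commands to a child process over stdin and return what it prints."""
    # Stand-in consumer: a child that counts the lines it receives on stdin.
    child_code = "import sys; print(sum(1 for _ in sys.stdin))"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        input="\n".join(commands) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with a non-zero status
    )
    return result.stdout.strip()


print(pipe_commands(["SELECT 1;", "SELECT 2;"]))
```

Because the child only sees a stream of text, a failure in the middle shows up as a non-zero exit status, which is why the parent checks the result before continuing.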
7. Creating Default Databases
PostgreSQL now creates the standard databases.
Source code functions:
make_template0()
make_postgres()
Actual order:
- template1 – created during bootstrap
- template0 – copied from template1 and frozen
- postgres – copied from template1
Database creation is done by filesystem copying, not SQL row-by-row inserts.
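This has two practical consequences:
- template0 is kept unmodified, so a pristine copy source is always available
- Creating a new database is fast, because files are copied rather than catalogs being rebuilt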
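Copying a database at the file level amounts to duplicating one subdirectory of base/. The sketch below imitates that with shutil.copytree on a throwaway directory. It is a simplified model: the real operation also records the new database in pg_database and takes care of details this code ignores. The directory name "1" stands for template1, while the target name "90001" and the helper clone_database_dir are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def clone_database_dir(base: Path, source_oid: str, target_oid: str) -> Path:
    """Copy base/<source_oid> to base/<target_oid> and return the new path."""
    source = base / source_oid
    target = base / target_oid
    shutil.copytree(source, target)  # fails if the target already exists
    return target


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "base"
    template = base / "1"  # stands in for template1's directory
    template.mkdir(parents=True)
    (template / "1259").write_bytes(b"toy catalog bytes")

    new_db = clone_database_dir(base, "1", "90001")  # hypothetical target OID
    print(sorted(p.name for p in new_db.iterdir()))
```

The copy cost depends only on the size of the template, which is small right after initdb.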
8. Final Sync and Completion
The final step ensures durability.
Source code function:
fsync_pgdata() (named sync_pgdata() in recent releases)
This:
- Flushes all data to disk
- Makes the cluster crash-safe
Output:
syncing data to disk ... ok
At this point, the PostgreSQL cluster is ready.
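The idea behind the final sync can be sketched as a recursive walk that calls fsync on every file and on every directory. This is a rough model under POSIX assumptions (directories can be opened read-only and synced), not the routine PostgreSQL uses, which also deals with tablespaces, symlinks, and error reporting. The function name fsync_tree is hypothetical.

```python
import os
import tempfile


def fsync_tree(root: str) -> int:
    """fsync every file and directory under root; return how many were synced."""
    synced = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            fd = os.open(os.path.join(dirpath, name), os.O_RDONLY)
            try:
                os.fsync(fd)  # push file contents to stable storage
            finally:
                os.close(fd)
            synced += 1
        # Sync the directory itself so newly created entries are durable too.
        dir_fd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
        synced += 1
    return synced


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "global"))
    with open(os.path.join(tmp, "global", "pg_control"), "wb") as f:
        f.write(b"toy control data")
    print(fsync_tree(tmp))
```

Syncing directories as well as files matters because a file whose contents are on disk can still vanish after a crash if its directory entry was never flushed.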
Final Execution Flow (Simplified)
initdb
+-- locale checks
+-- data directory creation
+-- WAL setup
+-- config file creation
+-- bootstrap backend
| +-- system catalogs created
+-- post-bootstrap SQL
+-- template database creation
+-- disk sync
Why Understanding initdb Matters
Understanding how initdb works helps you:
- Modify PostgreSQL system catalogs safely
- Debug cluster initialization failures
- Understand why catalog OIDs are fixed
- Work confidently with PostgreSQL source code
- Build strong fundamentals in database internals
Learning how PostgreSQL initializes a cluster is a solid starting point for developers who want to explore the PostgreSQL source code. Everything PostgreSQL stores lives in the data directory, so knowing how the system catalogs and the WAL directory come into being is valuable for database administrators as well.
A clear idea of the cluster initialization process helps in understanding how PostgreSQL manages metadata, storage, and recovery from the very beginning. This knowledge also makes it easier to debug low-level issues and confidently work with PostgreSQL internals.