The pgml extension for PostgreSQL, part of PostgresML, enables machine learning capabilities directly within your database, such as generating text embeddings for semantic search. Running PostgresML in a Docker container simplifies setup and ensures a consistent environment. This guide walks you through pulling the PostgresML Docker image, configuring a container, enabling the pgml and vector extensions, and using them to embed and search article data, a pattern that suits applications such as e-commerce content search. We assume Docker is already installed on your system (e.g., Ubuntu 24.04).
Features of the pgml Extension
The pgml extension empowers PostgreSQL with advanced machine learning capabilities, allowing seamless integration of AI functionalities using SQL. Here are key features that enhance its utility for developers and data professionals:
* Embedded Machine Learning: Execute ML models directly in PostgreSQL, eliminating external dependencies and simplifying data workflows for tasks like predictions and embeddings.
* Text Embedding Creation: Generate semantic vector embeddings from text using models like sentence-transformers/all-MiniLM-L6-v2, enabling intelligent search and recommendation systems.
* Efficient Vector Search: Leverage pgvector integration to store and query embeddings with fast similarity searches, optimized for large-scale datasets.
* NLP Capabilities: Perform tasks like sentiment analysis, text classification, and translation using pre-trained models, accessible via simple SQL functions.
* High-Speed Processing: Utilize GPU acceleration for rapid model inference and embedding generation, boosting performance for real-time applications.
* Scalable and Secure: Handle high transaction volumes with PostgreSQL’s robustness while keeping data secure by processing within the database environment.
These features make pgml a versatile tool for building AI-driven applications, such as enhanced search functionalities and personalized content delivery, directly within PostgreSQL.
Step 1: Pull the PostgresML Docker Image
Start by pulling the official PostgresML Docker image from GitHub Container Registry. The 2.7.12 tag is stable and includes PostgreSQL 15 with the pgml extension pre-installed.
docker pull ghcr.io/postgresml/postgresml:2.7.12
Verify the image:
docker images
You should see:
REPOSITORY TAG IMAGE ID CREATED SIZE
ghcr.io/postgresml/postgresml 2.7.12 xxxxxxxxxxxx X days ago 1.2GB
Step 2: Create a Docker Volume
To persist PostgreSQL data across container restarts, create a Docker volume named pgml_data.
docker volume create pgml_data
Check the volume:
docker volume ls
Output:
DRIVER VOLUME NAME
local pgml_data
Step 3: Run the PostgresML Container
Launch the container, mounting the volume at /var/lib/postgresql and publishing PostgreSQL on host port 5434 (to avoid conflicts with a local PostgreSQL on 5432) and the PostgresML dashboard on port 8000. The password cool used below is a placeholder for local testing; choose a strong one for anything else.
docker run -d \
--name pgml_postgres \
-v pgml_data:/var/lib/postgresql \
-p 5434:5432 \
-p 8000:8000 \
-e POSTGRES_PASSWORD=cool \
ghcr.io/postgresml/postgresml:2.7.12
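The password cool in the command above is only a placeholder. As a minimal sketch, assuming a Linux host where /dev/urandom, head, od, and tr are available, a random password could be generated beforehand and kept in a shell variable (PGML_PASSWORD is a hypothetical name chosen for this example, not something the image defines):

```shell
# Read 16 random bytes and render them as 32 lowercase hex characters.
PGML_PASSWORD="$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')"
export PGML_PASSWORD

# Print only the length, so the secret itself does not end up in the terminal history.
echo "Generated a password of ${#PGML_PASSWORD} characters"
```

With the variable set, the run command would pass -e POSTGRES_PASSWORD="$PGML_PASSWORD" in place of the literal value.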
Verify the container is running:
docker ps
Expected:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
xxxxxxxxxxxx ghcr.io/postgresml/postgresml:2.7.12 "/usr/local/bin/dock..." Seconds ago Up seconds 0.0.0.0:5434->5432/tcp, 0.0.0.0:8000->8000/tcp pgml_postgres
Check logs to confirm startup:
docker logs pgml_postgres
Look for:
Starting PostgreSQL 15 database server ... done.
Starting dashboard
If the container exits shortly after starting (for example, docker ps -a shows Exited (128)), recreate it with a command that keeps it running:
docker rm -f pgml_postgres
docker run -d \
--name pgml_postgres \
-v pgml_data:/var/lib/postgresql \
-p 5434:5432 \
-p 8000:8000 \
-e POSTGRES_PASSWORD=cool \
ghcr.io/postgresml/postgresml:2.7.12 \
tail -f /dev/null
The tail -f /dev/null command will keep the container running indefinitely.
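PostgreSQL may need a few moments after the container starts before it accepts connections. The following is a sketch of a small retry helper that could be used to wait for it. The name wait_for is hypothetical, and the commented example assumes that pg_isready (a standard PostgreSQL client utility) is available inside the container:

```shell
# wait_for ATTEMPTS DELAY COMMAND...
# Re-runs COMMAND until it succeeds or ATTEMPTS tries have been made.
# Returns 0 on success, 1 if every attempt failed.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" >/dev/null 2>&1 && return 0
    # Pause only when another attempt is still to come.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (assumes the container from Step 3 is running):
# wait_for 30 2 docker exec pgml_postgres pg_isready -U postgres && echo "database is ready"
```

Keeping the probe command as an argument means the same helper can wrap other checks, such as a psql query, without changes.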
Step 4: Connect to the Database
Connect to the default postgres database using psql:
psql -h localhost -p 5434 -U postgres -d postgres
Enter the password cool when prompted.
Create a dedicated database for machine learning tasks:
CREATE DATABASE postgresml;
\q
Connect to the new database:
psql -h localhost -p 5434 -U postgres -d postgresml
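In scripts, it can be convenient to hold the connection details in one place. psql accepts a connection URI in place of the separate -h, -p, -U, and -d flags, so a helper along these lines could build it. This is a sketch; pgml_uri is a hypothetical name, and its defaults simply mirror the values used in this guide:

```shell
# pgml_uri [USER] [HOST] [PORT] [DATABASE]
# Prints a PostgreSQL connection URI; defaults match this guide's setup.
pgml_uri() {
  user="${1:-postgres}"
  host="${2:-localhost}"
  port="${3:-5434}"
  db="${4:-postgresml}"
  printf 'postgresql://%s@%s:%s/%s\n' "$user" "$host" "$port" "$db"
}

pgml_uri   # prints postgresql://postgres@localhost:5434/postgresml

# Example (assumes the container is running; psql prompts for the password):
# psql "$(pgml_uri)" -c 'SELECT version();'
```

The password is deliberately left out of the URI so that it does not appear in shell history or process listings.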
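If authentication fails, check the container’s pg_hba.conf. If the file is not at the path shown below, running SHOW hba_file; in psql as the postgres user inside the container reports its actual location: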
docker exec -it pgml_postgres bash
cat /var/lib/postgresql/data/pg_hba.conf
Ensure it includes:
host all all 0.0.0.0/0 md5
If not, add it, then reload:
su - postgres
psql -c "SELECT pg_reload_conf();"
exit
exit
Step 5: Enable pgml and vector Extensions
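In the postgresml database, enable the extensions (IF NOT EXISTS makes the commands safe to re-run):
CREATE EXTENSION IF NOT EXISTS pgml;
CREATE EXTENSION IF NOT EXISTS vector;
Verify that both are installed:
\dx
Expected: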
List of installed extensions
Name | Version | Schema | Description
---------+---------+------------+-------------------------------------------------
pgml | 2.7.12 | public | PostgresML - Machine Learning in PostgreSQL
vector | 0.5.0 | public | Vector data type and similarity search
...
The pgml extension provides machine learning functions such as pgml.embed, while vector provides the VECTOR type for storing embeddings.
Step 6: Example: Create an Articles Table
Create a table to store articles with a VECTOR(384) column for embeddings, suitable for models like sentence-transformers/all-MiniLM-L6-v2:
CREATE TABLE articles (
id SERIAL PRIMARY KEY,
title VARCHAR(255) NOT NULL,
summary TEXT,
publication_date DATE,
summary_embedding VECTOR(384)
);
Insert 20 sample articles for an e-commerce context:
INSERT INTO articles (title, summary, publication_date) VALUES
('Top Wireless Headphones of 2024', 'A review of the best wireless headphones, focusing on sound quality, battery life, and noise cancellation.', '2024-01-15'),
('How to Choose the Perfect Smartphone', 'Guide to selecting a smartphone based on camera, performance, and budget.', '2024-02-10'),
('Running Shoes for Every Terrain', 'Comparing top running shoes for road, trail, and marathon running.', '2024-03-05'),
('Smart Home Gadgets to Watch in 2024', 'Exploring the latest smart home devices, including speakers and thermostats.', '2024-04-20'),
('Best Laptops for Remote Work', 'Review of laptops with long battery life and fast processors for professionals.', '2024-05-12'),
('The Rise of Wearable Tech', 'How smartwatches and fitness trackers are shaping consumer trends.', '2024-06-08'),
('Coffee Makers for Home Baristas', 'A roundup of coffee makers with programmable features and sleek designs.', '2024-07-14'),
('Gaming Monitors for Competitive Play', 'Top monitors with high refresh rates and low latency for gamers.', '2024-08-19'),
('Eco-Friendly Fashion Trends', 'Sustainable footwear and clothing brands leading the market.', '2024-09-03'),
('Bluetooth Speakers for Outdoor Adventures', 'Portable speakers with waterproof designs and robust sound.', '2024-10-11'),
('5G Technology and Its Impact', 'How 5G is transforming smartphones and IoT devices.', '2024-11-25'),
('Best Air Fryers for Healthy Cooking', 'Reviews of air fryers with large capacities and smart features.', '2024-12-07'),
('Wireless Charging Explained', 'A deep dive into wireless charging technology for phones and earbuds.', '2025-01-09'),
('Top Cameras for Content Creators', 'Comparing mirrorless and DSLR cameras for video and photography.', '2025-02-14'),
('Fitness Gear for Home Workouts', 'Essential equipment like shoes and trackers for home fitness.', '2025-03-22'),
('The Future of E-Commerce Tech', 'Trends in AI and AR shaping online shopping experiences.', '2025-04-10'),
('Portable Projectors for Home Theater', 'Compact projectors with 4K support and easy setup.', '2025-04-28'),
('Smart Kitchen Appliances', 'Innovative appliances like smart ovens and blenders.', '2025-05-03'),
('Guide to Noise-Canceling Earbuds', 'Best earbuds for travel and work with superior noise cancellation.', '2025-05-10'),
('Sustainable Tech Gadgets', 'Eco-friendly tech products, from chargers to headphones.', '2025-05-14');
Step 7: Generate Embeddings with pgml.embed
Use the pgml.embed function to generate embeddings for the summary column. The sentence-transformers/all-MiniLM-L6-v2 model is fast, produces 384-dimensional embeddings, and is reliable for small datasets:
UPDATE articles
SET summary_embedding = pgml.embed('sentence-transformers/all-MiniLM-L6-v2', summary);
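The first run may take from a few seconds to a few minutes, because the model (about 22.7 million parameters, roughly 90 MB of weights) is downloaded from Hugging Face on first use. The UPDATE is a single statement, so other sessions see no embeddings until it completes. Once it finishes, confirm that all 20 rows have an embedding: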
SELECT count(*) FROM articles WHERE summary_embedding IS NOT NULL;
Verify embeddings:
SELECT id, title, summary_embedding IS NOT NULL AS has_embedding
FROM articles
LIMIT 5;
Expected:
 id |                title                 | has_embedding
----+--------------------------------------+---------------
  1 | Top Wireless Headphones of 2024      | t
  2 | How to Choose the Perfect Smartphone | t
  3 | Running Shoes for Every Terrain      | t
  4 | Smart Home Gadgets to Watch in 2024  | t
  5 | Best Laptops for Remote Work         | t
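If embedding is slow or fails for this model (for example, because its download is interrupted), try an alternative such as BAAI/bge-small-en-v1.5, which also produces 384-dimensional embeddings and therefore fits the same column. If you switch, use the same model for the query in Step 8, since embeddings from different models are not comparable: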
UPDATE articles
SET summary_embedding = pgml.embed('BAAI/bge-small-en-v1.5', summary);
The summary_embedding column now holds an embedding of each article's summary.
Step 8: Perform Semantic Search with pgvector
Use pgvector’s cosine distance operator (<=>) to perform a semantic search. For example, find articles related to “wireless technology trends” using the same model:
SELECT id, title, summary, publication_date
FROM articles
ORDER BY summary_embedding <=> pgml.embed('sentence-transformers/all-MiniLM-L6-v2', 'wireless technology trends')::vector
LIMIT 3;
Expected (the exact ranking may differ slightly):
 id |              title              |                       summary                       | publication_date
----+---------------------------------+-----------------------------------------------------+------------------
 11 | 5G Technology and Its Impact    | How 5G is transforming smartphones and IoT devices. | 2024-11-25
 13 | Wireless Charging Explained     | A deep dive into wireless charging technology...    | 2025-01-09
  1 | Top Wireless Headphones of 2024 | A review of the best wireless headphones...         | 2024-01-15
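The <=> operator returns cosine distance, which is 1 minus the cosine similarity of the two vectors. Values range from 0 (same direction) to 2 (opposite direction), which is why an ascending ORDER BY puts the closest matches first. The sketch below reproduces that arithmetic with awk on tiny hand-written vectors. It only illustrates the formula and is not how pgvector computes it internally. The function name cosine_distance is hypothetical, and non-zero vectors are assumed:

```shell
# cosine_distance "x1 x2 ..." "y1 y2 ..."
# Prints 1 - (x . y) / (|x| * |y|) to four decimal places.
# Assumes both vectors are non-zero and have the same number of components.
cosine_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " ")
    m = split(b, y, " ")
    if (n != m || n == 0) {
      print "dimension mismatch" > "/dev/stderr"
      exit 1
    }
    for (i = 1; i <= n; i++) {
      dot += x[i] * y[i]   # running dot product
      na  += x[i] * x[i]   # squared length of x
      nb  += y[i] * y[i]   # squared length of y
    }
    printf "%.4f\n", 1 - dot / sqrt(na * nb)
  }'
}

cosine_distance "1 0" "0 1"     # orthogonal vectors: prints 1.0000
cosine_distance "3 4" "3 4"     # same direction:     prints 0.0000
cosine_distance "1 0" "-1 0"    # opposite direction: prints 2.0000
```

Because only the direction of the vectors matters, scaling an embedding does not change its cosine distance to a query.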
Step 9: Optimize and Scale
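For larger datasets, create an approximate nearest-neighbor index to speed up searches. An ivfflat index is best built after the table has been populated, and lists = 100 is a placeholder suited to tables far larger than this 20-row example: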
CREATE INDEX ON articles USING ivfflat (summary_embedding vector_cosine_ops) WITH (lists = 100);
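As a starting point for the lists parameter, the pgvector README suggests rows / 1000 for tables of up to one million rows and sqrt(rows) beyond that. A sketch of that rule of thumb follows. The function name suggest_lists is hypothetical, the floor of 1 is an addition for very small tables, and the result is a first guess to tune rather than a guarantee:

```shell
# suggest_lists ROWS
# Starting value for the ivfflat "lists" parameter:
# rows / 1000 up to 1M rows, sqrt(rows) above that, never less than 1.
suggest_lists() {
  rows="$1"
  if [ "$rows" -le 1000000 ]; then
    lists=$((rows / 1000))
  else
    lists="$(awk -v r="$rows" 'BEGIN { printf "%d", sqrt(r) }')"
  fi
  [ "$lists" -lt 1 ] && lists=1
  echo "$lists"
}

suggest_lists 20         # prints 1
suggest_lists 500000     # prints 500
suggest_lists 4000000    # prints 2000
```

For the 20-row sample table the rule yields 1, a sign that an index is not yet worthwhile and a sequential scan is sufficient.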
Access the PostgresML dashboard at http://localhost:8000 to visualize queries and data. If it’s not loading:
docker logs pgml_postgres | grep dashboard
sudo ss -tuln | grep 8000
Conclusion
Using the pgml extension in PostgreSQL with Docker enables powerful machine learning features like text embeddings and semantic search. By pulling the PostgresML image, setting up a container, enabling extensions, and embedding article data, you can enhance e-commerce applications with intelligent search capabilities. For production, use a strong password, consider PostgresML’s cloud service, and add indexes for scalability. Explore the dashboard and experiment with models to optimize your use case!