If you’re building a RAG pipeline or any application that needs semantic search, you’ll eventually need to decide where to store your embeddings. Dedicated vector databases (Pinecone, Qdrant, Weaviate) are one option. But for most teams — especially those already running PostgreSQL — pgvector is the faster, cheaper, and operationally simpler path.
pgvector is an open-source PostgreSQL extension that adds a vector data type, similarity search operators, and HNSW/IVFFlat indexes to your existing Postgres database. Your embeddings live in the same database as your application data, queryable with standard SQL, backed by Postgres ACID guarantees.
This tutorial covers everything from installation to production-ready RAG queries — with real code throughout.
What you’ll build: A working semantic search system that stores document embeddings in PostgreSQL and retrieves the most relevant chunks for a given query — the retrieval layer of a RAG pipeline.
Stack:
- PostgreSQL 15+ with pgvector 0.7+
- Python 3.11
openaiorollamafor embeddingspsycopg2/asyncpgfor database access
What is pgvector and why use it?
A vector embedding is a list of floating-point numbers — typically 768 to 3,072 dimensions — that encodes the semantic meaning of a piece of text, image, or other data. Two pieces of text with similar meaning produce vectors that are close together in that high-dimensional space.
pgvector adds the ability to store these vectors in Postgres and run distance calculations across them efficiently:
| Operation | SQL |
|---|---|
| Cosine similarity | embedding <=> query_vector |
| L2 distance | embedding <-> query_vector |
| Inner product | embedding <#> query_vector |
When pgvector is the right choice:
- You’re already running PostgreSQL
- Your vector count is under ~5 million
- You want to JOIN embeddings with relational data (user IDs, document metadata, access control)
- You need ACID transactions across your app data and embeddings
- You want to avoid managing another infrastructure component
When to use a dedicated vector DB instead:
- 10M+ vectors with sub-10ms latency requirements
- You need advanced filtering at massive scale (Qdrant handles this better)
- Your team has no existing Postgres infrastructure
For most ASEAN and Japan enterprise RAG projects — internal document search, contract retrieval, SOC alert enrichment — pgvector is sufficient and significantly simpler to operate.
Architecture overview
flowchart TD
A["Documents\n(PDF, text, DB records)"] --> B["Chunking\n(split into ~500 token segments)"]
B --> C["Embedding model\n(OpenAI / Ollama / local)"]
C --> D["pgvector\n(PostgreSQL)"]
E["User query"] --> F["Embed query"]
F --> G["Similarity search\n(cosine / L2)"]
D --> G
G --> H["Top-k chunks"]
H --> I["LLM\n(GPT / Claude / local)"]
I --> J["Answer with citations"]
Step 1 — Install pgvector
Option A: Docker (recommended for development)
docker run -d \
--name pgvector-dev \
-e POSTGRES_PASSWORD=secret \
-e POSTGRES_DB=ragdb \
-p 5432:5432 \
pgvector/pgvector:pg16
This image ships with pgvector pre-installed. No extension build needed.
Option B: Install on existing PostgreSQL
Ubuntu / Debian:
sudo apt install postgresql-16-pgvector
macOS (Homebrew):
brew install pgvector
From source (any platform):
cd /tmp
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
Enable the extension in your database
-- Connect to your target database
CREATE EXTENSION IF NOT EXISTS vector;
-- Verify installation
SELECT * FROM pg_extension WHERE extname = 'vector';
Step 2 — Create the schema
-- Documents table: stores source text and metadata
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
title TEXT NOT NULL,
source_url TEXT,
lang VARCHAR(5) DEFAULT 'en',
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Chunks table: stores text segments and their embeddings
CREATE TABLE document_chunks (
id BIGSERIAL PRIMARY KEY,
document_id BIGINT REFERENCES documents(id) ON DELETE CASCADE,
chunk_index INTEGER NOT NULL,
content TEXT NOT NULL,
token_count INTEGER,
embedding vector(1536), -- dimension must match your embedding model
created_at TIMESTAMPTZ DEFAULT NOW()
);
Embedding dimensions by model:
| Model | Dimensions |
|---|---|
| OpenAI text-embedding-3-small | 1536 |
| OpenAI text-embedding-3-large | 3072 |
| OpenAI text-embedding-ada-002 | 1536 |
| Ollama nomic-embed-text | 768 |
| Ollama mxbai-embed-large | 1024 |
| BGE-M3 (multilingual) | 1024 |
The vector(1536) column type must match exactly. You cannot mix dimensions in the same column.
Step 3 — Ingest documents and generate embeddings
import os
import psycopg2
from openai import OpenAI
import tiktoken
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
DB_URL = "postgresql://postgres:secret@localhost:5432/ragdb"
def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
"""Split text into overlapping chunks by token count."""
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(text)
chunks = []
start = 0
while start < len(tokens):
end = min(start + max_tokens, len(tokens))
chunk_tokens = tokens[start:end]
chunks.append(enc.decode(chunk_tokens))
start += max_tokens - overlap # overlap for context continuity
return chunks
def get_embedding(text: str) -> list[float]:
"""Get embedding vector from OpenAI."""
response = client.embeddings.create(
model="text-embedding-3-small",
input=text.replace("\n", " ") # newlines reduce embedding quality
)
return response.data[0].embedding
def ingest_document(title: str, content: str, source_url: str = None, lang: str = "en"):
"""Chunk a document, embed each chunk, and store in PostgreSQL."""
conn = psycopg2.connect(DB_URL)
cur = conn.cursor()
try:
# Insert document record
cur.execute(
"INSERT INTO documents (title, source_url, lang) VALUES (%s, %s, %s) RETURNING id",
(title, source_url, lang)
)
doc_id = cur.fetchone()[0]
# Chunk and embed
chunks = chunk_text(content)
print(f"Ingesting '{title}': {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
embedding = get_embedding(chunk)
cur.execute(
"""
INSERT INTO document_chunks (document_id, chunk_index, content, token_count, embedding)
VALUES (%s, %s, %s, %s, %s)
""",
(doc_id, i, chunk, len(chunk.split()), embedding)
)
conn.commit()
print(f"Done: document_id={doc_id}")
return doc_id
except Exception as e:
conn.rollback()
raise e
finally:
cur.close()
conn.close()
# Usage
ingest_document(
title="Company Security Policy v3.2",
content=open("security_policy.txt").read(),
source_url="https://internal.company.com/policies/security",
lang="en"
)
Using Ollama for local embeddings (no API key required)
import httpx
def get_embedding_ollama(text: str, model: str = "nomic-embed-text") -> list[float]:
"""Get embedding from local Ollama instance."""
response = httpx.post(
"http://localhost:11434/api/embeddings",
json={"model": model, "prompt": text}
)
return response.json()["embedding"]
Replace get_embedding() with get_embedding_ollama() throughout. Change the vector dimension to 768 in your schema if using nomic-embed-text.
Step 4 — Create indexes for fast similarity search
Without an index, pgvector performs exact nearest-neighbor search — accurate but O(n) for every query. For production, you need an approximate nearest-neighbor (ANN) index.
pgvector supports two index types:
HNSW (recommended for most cases)
-- Create HNSW index for cosine similarity
CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
HNSW parameters:
| Parameter | Default | Effect |
|---|---|---|
m |
16 | Connections per node. Higher = better recall, more memory |
ef_construction |
64 | Build-time search width. Higher = better recall, slower build |
For most RAG use cases, the defaults are fine. Increase m=32, ef_construction=128 if recall is critical (legal documents, compliance search).
IVFFlat (better for very large datasets)
-- First, you need data in the table before creating this index
CREATE INDEX ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
lists should be roughly sqrt(row_count). IVFFlat builds faster than HNSW but has lower recall by default. Tune SET ivfflat.probes = 10 at query time to trade speed for recall.
Which to choose:
| HNSW | IVFFlat | |
|---|---|---|
| Recall | Higher | Lower (tunable) |
| Build time | Slower | Faster |
| Memory | Higher | Lower |
| Best for | <2M vectors, high recall | >2M vectors, batch ingestion |
For most RAG projects, use HNSW.
Step 5 — Run similarity search queries
def semantic_search(
query: str,
top_k: int = 5,
lang_filter: str = None,
min_similarity: float = 0.7
) -> list[dict]:
"""Find the most semantically similar chunks to a query."""
query_embedding = get_embedding(query)
conn = psycopg2.connect(DB_URL)
cur = conn.cursor()
try:
# Set ef_search higher than default for better recall
cur.execute("SET hnsw.ef_search = 100")
sql = """
SELECT
dc.id,
dc.content,
dc.chunk_index,
d.title,
d.source_url,
d.lang,
1 - (dc.embedding <=> %s::vector) AS similarity
FROM document_chunks dc
JOIN documents d ON d.id = dc.document_id
WHERE 1 - (dc.embedding <=> %s::vector) >= %s
{lang_clause}
ORDER BY dc.embedding <=> %s::vector
LIMIT %s
""".format(
lang_clause="AND d.lang = %s" if lang_filter else ""
)
params = [query_embedding, query_embedding, min_similarity]
if lang_filter:
params.append(lang_filter)
params.extend([query_embedding, top_k])
cur.execute(sql, params)
rows = cur.fetchall()
return [
{
"id": row[0],
"content": row[1],
"chunk_index": row[2],
"title": row[3],
"source_url": row[4],
"lang": row[5],
"similarity": float(row[6])
}
for row in rows
]
finally:
cur.close()
conn.close()
# Usage
results = semantic_search(
query="What is the password policy for privileged accounts?",
top_k=5,
lang_filter="en",
min_similarity=0.75
)
for r in results:
print(f"[{r['similarity']:.3f}] {r['title']} — chunk {r['chunk_index']}")
print(f" {r['content'][:200]}...")
print()
Step 6 — Build the full RAG pipeline
Combine retrieval with an LLM to answer questions grounded in your documents:
def rag_query(
question: str,
lang: str = "en",
top_k: int = 5,
model: str = "gpt-4o-mini"
) -> dict:
"""Full RAG pipeline: retrieve relevant chunks and generate a grounded answer."""
# Step 1: retrieve relevant chunks
chunks = semantic_search(question, top_k=top_k, lang_filter=lang)
if not chunks:
return {
"answer": "No relevant documents found for this question.",
"sources": [],
"chunks_used": 0
}
# Step 2: build context string with source citations
context_parts = []
sources = []
for i, chunk in enumerate(chunks):
context_parts.append(f"[{i+1}] {chunk['content']}")
sources.append({
"title": chunk["title"],
"url": chunk["source_url"],
"similarity": chunk["similarity"]
})
context = "\n\n".join(context_parts)
# Step 3: generate answer with LLM
system_prompt = """You are a helpful assistant. Answer the question using ONLY the provided context.
If the context does not contain the answer, say "I cannot find this information in the available documents."
Always cite the source number [1], [2], etc. when referencing information."""
user_prompt = f"""Context:
{context}
Question: {question}
Answer:"""
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.1 # low temperature for factual grounding
)
return {
"answer": response.choices[0].message.content,
"sources": sources,
"chunks_used": len(chunks)
}
# Usage
result = rag_query(
question="What are the password requirements for admin accounts?",
lang="en"
)
print(result["answer"])
print("\nSources:")
for s in result["sources"]:
print(f" - {s['title']} (similarity: {s['similarity']:.3f})")
Step 7 — Hybrid search (vector + full-text)
Pure vector search sometimes misses exact keyword matches. Hybrid search combines vector similarity with PostgreSQL full-text search for better recall:
-- Add a full-text search column
ALTER TABLE document_chunks ADD COLUMN content_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX ON document_chunks USING gin(content_tsv);
def hybrid_search(query: str, top_k: int = 10, rrf_k: int = 60) -> list[dict]:
"""
Reciprocal Rank Fusion: combines vector search and full-text search rankings.
RRF score = 1/(k + rank_vector) + 1/(k + rank_fts)
"""
query_embedding = get_embedding(query)
conn = psycopg2.connect(DB_URL)
cur = conn.cursor()
try:
cur.execute("SET hnsw.ef_search = 100")
cur.execute("""
WITH vector_ranked AS (
SELECT id, content, document_id,
ROW_NUMBER() OVER (ORDER BY embedding <=> %s::vector) AS rank
FROM document_chunks
ORDER BY embedding <=> %s::vector
LIMIT 20
),
fts_ranked AS (
SELECT id, content, document_id,
ROW_NUMBER() OVER (ORDER BY ts_rank(content_tsv, query) DESC) AS rank
FROM document_chunks,
plainto_tsquery('english', %s) query
WHERE content_tsv @@ query
LIMIT 20
),
rrf AS (
SELECT
COALESCE(v.id, f.id) AS id,
COALESCE(v.content, f.content) AS content,
COALESCE(v.document_id, f.document_id) AS document_id,
COALESCE(1.0 / (%s + v.rank), 0) +
COALESCE(1.0 / (%s + f.rank), 0) AS rrf_score
FROM vector_ranked v
FULL OUTER JOIN fts_ranked f ON v.id = f.id
)
SELECT r.id, r.content, d.title, d.source_url, r.rrf_score
FROM rrf r
JOIN documents d ON d.id = r.document_id
ORDER BY rrf_score DESC
LIMIT %s
""", [query_embedding, query_embedding, query, rrf_k, rrf_k, top_k])
return [
{"id": row[0], "content": row[1], "title": row[2],
"source_url": row[3], "rrf_score": float(row[4])}
for row in cur.fetchall()
]
finally:
cur.close()
conn.close()
Hybrid search improves recall by 15–30% on domain-specific corpora where users mix keyword and semantic queries. Use it when your documents contain product codes, regulation numbers, or proper nouns that embeddings handle poorly.
Performance tuning
Index build memory
-- Increase memory for faster HNSW index builds
SET maintenance_work_mem = '1GB';
CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops);
Query-time recall tuning
-- Higher ef_search = better recall, slower queries (default 40)
SET hnsw.ef_search = 100;
-- For IVFFlat: higher probes = better recall (default 1)
SET ivfflat.probes = 10;
Partial indexes for multi-tenant or multi-language setups
-- Separate index per language for faster filtered queries
CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops)
WHERE document_id IN (SELECT id FROM documents WHERE lang = 'ja');
Monitor index health
-- Check index size
SELECT pg_size_pretty(pg_relation_size('document_chunks_embedding_idx'));
-- Check query plan is using the index
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content, embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM document_chunks
ORDER BY distance LIMIT 5;
If EXPLAIN shows Seq Scan instead of Index Scan, the planner thinks a sequential scan is cheaper (common with small tables). Add SET enable_seqscan = off to force index usage during testing.
FAQ
How many vectors can pgvector handle?
In practice, pgvector handles 1–5 million vectors well on standard hardware (16 GB RAM, SSD). Beyond that, query latency starts climbing. At 10M+ vectors, consider Qdrant or Weaviate for the vector layer while keeping your relational data in Postgres.
Should I use cosine similarity or L2 distance?
Cosine similarity (<=>) is better for text embeddings — it measures angle between vectors, ignoring magnitude. L2 distance (<->) measures absolute distance, which matters more for image embeddings. For RAG with text, always use cosine.
Can I store multilingual embeddings in the same table?
Yes, as long as the embedding model is multilingual. BGE-M3 and multilingual-e5-large work well for Thai, Japanese, and Chinese. OpenAI’s text-embedding-3-small also handles multilingual content well. Store language in a lang column and filter at query time.
What chunk size should I use?
500 tokens with 50-token overlap is a good starting point. Too small (< 100 tokens) loses context; too large (> 1,000 tokens) dilutes relevance and hits LLM context limits. For structured documents (legal contracts, technical specs), chunk at section boundaries rather than fixed token counts.
How do I keep embeddings in sync when documents change?
Delete the old chunks and re-embed the updated document. A simple approach:
DELETE FROM document_chunks WHERE document_id = %s;
-- Then re-ingest with fresh embeddings
For high-frequency updates, consider a queue (Celery, BullMQ) that processes document changes asynchronously.
pgvector vs Pinecone — which should I choose?
If you’re already on Postgres: pgvector. You get SQL JOINs, transactions, one fewer infrastructure component, and zero per-query pricing. If you’re starting fresh and expect 10M+ vectors or need managed SLAs: Pinecone or Qdrant. For most enterprise RAG projects in the <5M vector range, pgvector is the pragmatic choice.
Production checklist
Before going live:
- [ ] HNSW or IVFFlat index created on the embedding column
- [ ]
maintenance_work_memset to at least 512MB for index builds - [ ]
hnsw.ef_searchtuned (start at 100, benchmark recall vs latency) - [ ] Partial indexes per language or tenant if filtering heavily
- [ ] Connection pooling (PgBouncer) in front of Postgres
- [ ] Embedding generation async (queue-based, not inline with user requests)
- [ ]
min_similaritythreshold set (0.7–0.8) to filter low-quality matches - [ ] Monitoring on
pg_stat_activityfor slow vector queries - [ ] Regular
VACUUM ANALYZEondocument_chunksafter bulk ingestion
Next steps
This tutorial covers the retrieval layer. To complete your RAG stack:
- For Thai and Japanese document ingestion with multilingual embeddings — see our guide: LlamaIndex + pgvector: Production RAG for Thai and Japanese Business Documents
- For choosing hardware to run local embedding models (no OpenAI dependency) — see: Choosing Hardware for Local LLMs in 2026
- For understanding how RAG fits the broader private AI architecture — see: Private AI vs ChatGPT: What’s the Difference and Which Does Your Business Need?
Building a RAG system for your internal documents, compliance workflows, or customer support? Contact us at hello@simplico.net — we deploy production RAG stacks for enterprise clients across Thailand and Japan.
Latest Posts
- Your Staff Have 24 Passwords. Your Business Has 24 Attack Surfaces. June 11, 2026
- The Security Risk Sitting Quietly in Your Engineering Org June 8, 2026
- SOAR and Alert Fatigue: Why Your SOC Is Drowning in Alerts (and How Automation Actually Helps) June 7, 2026
- MES vs ERP: What’s the Difference and Which Does Your Factory Actually Need? June 7, 2026
- React Native vs Flutter in 2026: How to Actually Choose June 4, 2026
- React Native in 2026: Is It Still Worth Building With? June 3, 2026
