AI

pgvector Tutorial: Add Vector Search to PostgreSQL for RAG and Semantic Search

If you’re building a RAG pipeline or any application that needs semantic search, you’ll eventually need to decide where to store your embeddings. Dedicated vector databases (Pinecone, Qdrant, Weaviate) are one option. But for most teams — especially those already running PostgreSQL — pgvector is the faster, cheaper, and operationally simpler path.

pgvector is an open-source PostgreSQL extension that adds a vector data type, similarity search operators, and HNSW/IVFFlat indexes to your existing Postgres database. Your embeddings live in the same database as your application data, queryable with standard SQL, backed by Postgres ACID guarantees.

This tutorial covers everything from installation to production-ready RAG queries — with real code throughout.

What you’ll build: A working semantic search system that stores document embeddings in PostgreSQL and retrieves the most relevant chunks for a given query — the retrieval layer of a RAG pipeline.

Stack:

  • PostgreSQL 15+ with pgvector 0.7+
  • Python 3.11
  • openai or ollama for embeddings
  • psycopg2 / asyncpg for database access

What is pgvector and why use it?

A vector embedding is a list of floating-point numbers — typically 768 to 3,072 dimensions — that encodes the semantic meaning of a piece of text, image, or other data. Two pieces of text with similar meaning produce vectors that are close together in that high-dimensional space.

pgvector adds the ability to store these vectors in Postgres and run distance calculations across them efficiently:

Operation SQL
Cosine similarity embedding <=> query_vector
L2 distance embedding <-> query_vector
Inner product embedding <#> query_vector

When pgvector is the right choice:

  • You’re already running PostgreSQL
  • Your vector count is under ~5 million
  • You want to JOIN embeddings with relational data (user IDs, document metadata, access control)
  • You need ACID transactions across your app data and embeddings
  • You want to avoid managing another infrastructure component

When to use a dedicated vector DB instead:

  • 10M+ vectors with sub-10ms latency requirements
  • You need advanced filtering at massive scale (Qdrant handles this better)
  • Your team has no existing Postgres infrastructure

For most ASEAN and Japan enterprise RAG projects — internal document search, contract retrieval, SOC alert enrichment — pgvector is sufficient and significantly simpler to operate.


Architecture overview

flowchart TD
    A["Documents\n(PDF, text, DB records)"] --> B["Chunking\n(split into ~500 token segments)"]
    B --> C["Embedding model\n(OpenAI / Ollama / local)"]
    C --> D["pgvector\n(PostgreSQL)"]
    E["User query"] --> F["Embed query"]
    F --> G["Similarity search\n(cosine / L2)"]
    D --> G
    G --> H["Top-k chunks"]
    H --> I["LLM\n(GPT / Claude / local)"]
    I --> J["Answer with citations"]

Step 1 — Install pgvector

Option A: Docker (recommended for development)

docker run -d \
  --name pgvector-dev \
  -e POSTGRES_PASSWORD=secret \
  -e POSTGRES_DB=ragdb \
  -p 5432:5432 \
  pgvector/pgvector:pg16

This image ships with pgvector pre-installed. No extension build needed.

Option B: Install on existing PostgreSQL

Ubuntu / Debian:

sudo apt install postgresql-16-pgvector

macOS (Homebrew):

brew install pgvector

From source (any platform):

cd /tmp
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install

Enable the extension in your database

-- Connect to your target database
CREATE EXTENSION IF NOT EXISTS vector;

-- Verify installation
SELECT * FROM pg_extension WHERE extname = 'vector';

Step 2 — Create the schema

-- Documents table: stores source text and metadata
CREATE TABLE documents (
    id          BIGSERIAL PRIMARY KEY,
    title       TEXT NOT NULL,
    source_url  TEXT,
    lang        VARCHAR(5) DEFAULT 'en',
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- Chunks table: stores text segments and their embeddings
CREATE TABLE document_chunks (
    id          BIGSERIAL PRIMARY KEY,
    document_id BIGINT REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    content     TEXT NOT NULL,
    token_count INTEGER,
    embedding   vector(1536),  -- dimension must match your embedding model
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

Embedding dimensions by model:

Model Dimensions
OpenAI text-embedding-3-small 1536
OpenAI text-embedding-3-large 3072
OpenAI text-embedding-ada-002 1536
Ollama nomic-embed-text 768
Ollama mxbai-embed-large 1024
BGE-M3 (multilingual) 1024

The vector(1536) column type must match exactly. You cannot mix dimensions in the same column.


Step 3 — Ingest documents and generate embeddings

import os
import psycopg2
from openai import OpenAI
import tiktoken

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

DB_URL = "postgresql://postgres:secret@localhost:5432/ragdb"

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by token count."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        start += max_tokens - overlap  # overlap for context continuity
    return chunks

def get_embedding(text: str) -> list[float]:
    """Get embedding vector from OpenAI."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text.replace("\n", " ")  # newlines reduce embedding quality
    )
    return response.data[0].embedding

def ingest_document(title: str, content: str, source_url: str = None, lang: str = "en"):
    """Chunk a document, embed each chunk, and store in PostgreSQL."""
    conn = psycopg2.connect(DB_URL)
    cur = conn.cursor()

    try:
        # Insert document record
        cur.execute(
            "INSERT INTO documents (title, source_url, lang) VALUES (%s, %s, %s) RETURNING id",
            (title, source_url, lang)
        )
        doc_id = cur.fetchone()[0]

        # Chunk and embed
        chunks = chunk_text(content)
        print(f"Ingesting '{title}': {len(chunks)} chunks")

        for i, chunk in enumerate(chunks):
            embedding = get_embedding(chunk)
            cur.execute(
                """
                INSERT INTO document_chunks (document_id, chunk_index, content, token_count, embedding)
                VALUES (%s, %s, %s, %s, %s)
                """,
                (doc_id, i, chunk, len(chunk.split()), embedding)
            )

        conn.commit()
        print(f"Done: document_id={doc_id}")
        return doc_id

    except Exception as e:
        conn.rollback()
        raise e
    finally:
        cur.close()
        conn.close()

# Usage
ingest_document(
    title="Company Security Policy v3.2",
    content=open("security_policy.txt").read(),
    source_url="https://internal.company.com/policies/security",
    lang="en"
)

Using Ollama for local embeddings (no API key required)

import httpx

def get_embedding_ollama(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Get embedding from local Ollama instance."""
    response = httpx.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text}
    )
    return response.json()["embedding"]

Replace get_embedding() with get_embedding_ollama() throughout. Change the vector dimension to 768 in your schema if using nomic-embed-text.


Step 4 — Create indexes for fast similarity search

Without an index, pgvector performs exact nearest-neighbor search — accurate but O(n) for every query. For production, you need an approximate nearest-neighbor (ANN) index.

pgvector supports two index types:

HNSW (recommended for most cases)

-- Create HNSW index for cosine similarity
CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

HNSW parameters:

Parameter Default Effect
m 16 Connections per node. Higher = better recall, more memory
ef_construction 64 Build-time search width. Higher = better recall, slower build

For most RAG use cases, the defaults are fine. Increase m=32, ef_construction=128 if recall is critical (legal documents, compliance search).

IVFFlat (better for very large datasets)

-- First, you need data in the table before creating this index
CREATE INDEX ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

lists should be roughly sqrt(row_count). IVFFlat builds faster than HNSW but has lower recall by default. Tune SET ivfflat.probes = 10 at query time to trade speed for recall.

Which to choose:

HNSW IVFFlat
Recall Higher Lower (tunable)
Build time Slower Faster
Memory Higher Lower
Best for <2M vectors, high recall >2M vectors, batch ingestion

For most RAG projects, use HNSW.


Step 5 — Run similarity search queries

def semantic_search(
    query: str,
    top_k: int = 5,
    lang_filter: str = None,
    min_similarity: float = 0.7
) -> list[dict]:
    """Find the most semantically similar chunks to a query."""

    query_embedding = get_embedding(query)

    conn = psycopg2.connect(DB_URL)
    cur = conn.cursor()

    try:
        # Set ef_search higher than default for better recall
        cur.execute("SET hnsw.ef_search = 100")

        sql = """
            SELECT
                dc.id,
                dc.content,
                dc.chunk_index,
                d.title,
                d.source_url,
                d.lang,
                1 - (dc.embedding <=> %s::vector) AS similarity
            FROM document_chunks dc
            JOIN documents d ON d.id = dc.document_id
            WHERE 1 - (dc.embedding <=> %s::vector) >= %s
            {lang_clause}
            ORDER BY dc.embedding <=> %s::vector
            LIMIT %s
        """.format(
            lang_clause="AND d.lang = %s" if lang_filter else ""
        )

        params = [query_embedding, query_embedding, min_similarity]
        if lang_filter:
            params.append(lang_filter)
        params.extend([query_embedding, top_k])

        cur.execute(sql, params)
        rows = cur.fetchall()

        return [
            {
                "id": row[0],
                "content": row[1],
                "chunk_index": row[2],
                "title": row[3],
                "source_url": row[4],
                "lang": row[5],
                "similarity": float(row[6])
            }
            for row in rows
        ]
    finally:
        cur.close()
        conn.close()

# Usage
results = semantic_search(
    query="What is the password policy for privileged accounts?",
    top_k=5,
    lang_filter="en",
    min_similarity=0.75
)

for r in results:
    print(f"[{r['similarity']:.3f}] {r['title']} — chunk {r['chunk_index']}")
    print(f"  {r['content'][:200]}...")
    print()

Step 6 — Build the full RAG pipeline

Combine retrieval with an LLM to answer questions grounded in your documents:

def rag_query(
    question: str,
    lang: str = "en",
    top_k: int = 5,
    model: str = "gpt-4o-mini"
) -> dict:
    """Full RAG pipeline: retrieve relevant chunks and generate a grounded answer."""

    # Step 1: retrieve relevant chunks
    chunks = semantic_search(question, top_k=top_k, lang_filter=lang)

    if not chunks:
        return {
            "answer": "No relevant documents found for this question.",
            "sources": [],
            "chunks_used": 0
        }

    # Step 2: build context string with source citations
    context_parts = []
    sources = []
    for i, chunk in enumerate(chunks):
        context_parts.append(f"[{i+1}] {chunk['content']}")
        sources.append({
            "title": chunk["title"],
            "url": chunk["source_url"],
            "similarity": chunk["similarity"]
        })

    context = "\n\n".join(context_parts)

    # Step 3: generate answer with LLM
    system_prompt = """You are a helpful assistant. Answer the question using ONLY the provided context.
If the context does not contain the answer, say "I cannot find this information in the available documents."
Always cite the source number [1], [2], etc. when referencing information."""

    user_prompt = f"""Context:
{context}

Question: {question}

Answer:"""

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.1  # low temperature for factual grounding
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": sources,
        "chunks_used": len(chunks)
    }

# Usage
result = rag_query(
    question="What are the password requirements for admin accounts?",
    lang="en"
)

print(result["answer"])
print("\nSources:")
for s in result["sources"]:
    print(f"  - {s['title']} (similarity: {s['similarity']:.3f})")

Step 7 — Hybrid search (vector + full-text)

Pure vector search sometimes misses exact keyword matches. Hybrid search combines vector similarity with PostgreSQL full-text search for better recall:

-- Add a full-text search column
ALTER TABLE document_chunks ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

CREATE INDEX ON document_chunks USING gin(content_tsv);
def hybrid_search(query: str, top_k: int = 10, rrf_k: int = 60) -> list[dict]:
    """
    Reciprocal Rank Fusion: combines vector search and full-text search rankings.
    RRF score = 1/(k + rank_vector) + 1/(k + rank_fts)
    """
    query_embedding = get_embedding(query)

    conn = psycopg2.connect(DB_URL)
    cur = conn.cursor()

    try:
        cur.execute("SET hnsw.ef_search = 100")
        cur.execute("""
            WITH vector_ranked AS (
                SELECT id, content, document_id,
                       ROW_NUMBER() OVER (ORDER BY embedding <=> %s::vector) AS rank
                FROM document_chunks
                ORDER BY embedding <=> %s::vector
                LIMIT 20
            ),
            fts_ranked AS (
                SELECT id, content, document_id,
                       ROW_NUMBER() OVER (ORDER BY ts_rank(content_tsv, query) DESC) AS rank
                FROM document_chunks,
                     plainto_tsquery('english', %s) query
                WHERE content_tsv @@ query
                LIMIT 20
            ),
            rrf AS (
                SELECT
                    COALESCE(v.id, f.id) AS id,
                    COALESCE(v.content, f.content) AS content,
                    COALESCE(v.document_id, f.document_id) AS document_id,
                    COALESCE(1.0 / (%s + v.rank), 0) +
                    COALESCE(1.0 / (%s + f.rank), 0) AS rrf_score
                FROM vector_ranked v
                FULL OUTER JOIN fts_ranked f ON v.id = f.id
            )
            SELECT r.id, r.content, d.title, d.source_url, r.rrf_score
            FROM rrf r
            JOIN documents d ON d.id = r.document_id
            ORDER BY rrf_score DESC
            LIMIT %s
        """, [query_embedding, query_embedding, query, rrf_k, rrf_k, top_k])

        return [
            {"id": row[0], "content": row[1], "title": row[2],
             "source_url": row[3], "rrf_score": float(row[4])}
            for row in cur.fetchall()
        ]
    finally:
        cur.close()
        conn.close()

Hybrid search improves recall by 15–30% on domain-specific corpora where users mix keyword and semantic queries. Use it when your documents contain product codes, regulation numbers, or proper nouns that embeddings handle poorly.


Performance tuning

Index build memory

-- Increase memory for faster HNSW index builds
SET maintenance_work_mem = '1GB';
CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops);

Query-time recall tuning

-- Higher ef_search = better recall, slower queries (default 40)
SET hnsw.ef_search = 100;

-- For IVFFlat: higher probes = better recall (default 1)
SET ivfflat.probes = 10;

Partial indexes for multi-tenant or multi-language setups

-- Separate index per language for faster filtered queries
CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops)
WHERE document_id IN (SELECT id FROM documents WHERE lang = 'ja');

Monitor index health

-- Check index size
SELECT pg_size_pretty(pg_relation_size('document_chunks_embedding_idx'));

-- Check query plan is using the index
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content, embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM document_chunks
ORDER BY distance LIMIT 5;

If EXPLAIN shows Seq Scan instead of Index Scan, the planner thinks a sequential scan is cheaper (common with small tables). Add SET enable_seqscan = off to force index usage during testing.


FAQ

How many vectors can pgvector handle?

In practice, pgvector handles 1–5 million vectors well on standard hardware (16 GB RAM, SSD). Beyond that, query latency starts climbing. At 10M+ vectors, consider Qdrant or Weaviate for the vector layer while keeping your relational data in Postgres.

Should I use cosine similarity or L2 distance?

Cosine similarity (<=>) is better for text embeddings — it measures angle between vectors, ignoring magnitude. L2 distance (<->) measures absolute distance, which matters more for image embeddings. For RAG with text, always use cosine.

Can I store multilingual embeddings in the same table?

Yes, as long as the embedding model is multilingual. BGE-M3 and multilingual-e5-large work well for Thai, Japanese, and Chinese. OpenAI’s text-embedding-3-small also handles multilingual content well. Store language in a lang column and filter at query time.

What chunk size should I use?

500 tokens with 50-token overlap is a good starting point. Too small (< 100 tokens) loses context; too large (> 1,000 tokens) dilutes relevance and hits LLM context limits. For structured documents (legal contracts, technical specs), chunk at section boundaries rather than fixed token counts.

How do I keep embeddings in sync when documents change?

Delete the old chunks and re-embed the updated document. A simple approach:

DELETE FROM document_chunks WHERE document_id = %s;
-- Then re-ingest with fresh embeddings

For high-frequency updates, consider a queue (Celery, BullMQ) that processes document changes asynchronously.

pgvector vs Pinecone — which should I choose?

If you’re already on Postgres: pgvector. You get SQL JOINs, transactions, one fewer infrastructure component, and zero per-query pricing. If you’re starting fresh and expect 10M+ vectors or need managed SLAs: Pinecone or Qdrant. For most enterprise RAG projects in the <5M vector range, pgvector is the pragmatic choice.


Production checklist

Before going live:

  • [ ] HNSW or IVFFlat index created on the embedding column
  • [ ] maintenance_work_mem set to at least 512MB for index builds
  • [ ] hnsw.ef_search tuned (start at 100, benchmark recall vs latency)
  • [ ] Partial indexes per language or tenant if filtering heavily
  • [ ] Connection pooling (PgBouncer) in front of Postgres
  • [ ] Embedding generation async (queue-based, not inline with user requests)
  • [ ] min_similarity threshold set (0.7–0.8) to filter low-quality matches
  • [ ] Monitoring on pg_stat_activity for slow vector queries
  • [ ] Regular VACUUM ANALYZE on document_chunks after bulk ingestion

Next steps

This tutorial covers the retrieval layer. To complete your RAG stack:

  • For Thai and Japanese document ingestion with multilingual embeddings — see our guide: LlamaIndex + pgvector: Production RAG for Thai and Japanese Business Documents
  • For choosing hardware to run local embedding models (no OpenAI dependency) — see: Choosing Hardware for Local LLMs in 2026
  • For understanding how RAG fits the broader private AI architecture — see: Private AI vs ChatGPT: What’s the Difference and Which Does Your Business Need?

Building a RAG system for your internal documents, compliance workflows, or customer support? Contact us at hello@simplico.net — we deploy production RAG stacks for enterprise clients across Thailand and Japan.