LlamaIndex + pgvector: Production RAG for Thai and Japanese Business Documents
Most RAG demos work. Most RAG production deployments fail — quietly, expensively, and in ways that are hard to debug.
After building simpliDoc, Simplico’s multilingual AI document intelligence platform, we learned how wide the gap is between a working prototype and a production system that handles Thai, Japanese, and English enterprise documents simultaneously. This post shares what we actually deployed: the stack, the config values we landed on after testing, the failure modes we hit, and the fixes.
Why this stack
At Simplico we use LlamaIndex as the orchestration layer and pgvector (PostgreSQL extension) as the vector store. Here’s why these two tools together, and not alternatives:
- LlamaIndex has first-class support for multilingual embeddings and handles chunking strategies that work across Thai, Japanese, and English — languages with fundamentally different tokenization behavior.
- pgvector runs inside PostgreSQL, which means your vector data lives in the same database as your business data. No additional infrastructure, no synchronization complexity, no separate service to operate. For Thai and Japanese enterprise clients with strict data residency requirements (PDPA, 個人情報保護法), keeping everything in one Postgres instance on local infrastructure is a significant compliance advantage.
System architecture
```mermaid
flowchart TD
    A["Document upload\n(PDF / DOCX / TXT)"] --> B["LlamaIndex\nDocument parser"]
    B --> C["Language detector\n(langdetect)"]
    C --> D["multilingual-e5-large\nEmbedding model"]
    D --> E["pgvector\n(PostgreSQL)"]
    F["User query"] --> G["FastAPI\nRAG endpoint"]
    G --> D
    G --> E
    E --> H["Top-k retrieval\n(cosine similarity)"]
    H --> I["Claude API\nAnswer generation"]
    I --> J["Streaming response\n(SSE)"]
```
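The language-detector box above is `langdetect` in production. As an illustration of the routing step only (a hedged sketch, not our production detector), a dependency-free Unicode-range heuristic can separate Thai, Japanese, and English. Note that it would misroute Chinese, since CJK ideographs are counted as Japanese here; that is one reason a real detector earns its place.

```python
def detect_script(text: str) -> str:
    """Rough language routing by Unicode block — a simplified stand-in for
    langdetect, good enough to pick a chunking/extraction path."""
    counts = {"th": 0, "ja": 0, "en": 0}
    for ch in text:
        code = ord(ch)
        if 0x0E00 <= code <= 0x0E7F:
            counts["th"] += 1          # Thai block
        elif 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
            counts["ja"] += 1          # kana + CJK ideographs
        elif ch.isascii() and ch.isalpha():
            counts["en"] += 1          # Latin letters
    return max(counts, key=counts.get)

print(detect_script("อัตราดอกเบี้ยร้อยละสิบห้า"))   # -> th
print(detect_script("金利は年15パーセントです"))        # -> ja
print(detect_script("Interest rate is 15%"))        # -> en
```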
Step 1: PostgreSQL + pgvector setup
```sql
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Document chunks table
CREATE TABLE document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    doc_id UUID NOT NULL,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    language VARCHAR(10),    -- 'th', 'ja', 'zh', 'en'
    embedding VECTOR(1024),  -- multilingual-e5-large output dimension
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for fast approximate nearest neighbour search
-- Critical at >50k chunks: without this, queries go from ~8ms to 4+ seconds
CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```
Production tip: Add the HNSW index before you go live, not after. We made the mistake of adding it post-launch at 80k chunks. The index build took 4 hours and caused elevated query latency during construction.
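For reference, this is the shape of the query the HNSW index serves. `build_knn_query` below is a hypothetical helper for illustration (LlamaIndex issues this SQL for you); `<=>` is pgvector's cosine-distance operator, and `hnsw.ef_search` is the query-time recall/latency knob (pgvector's default is 40).

```python
def build_knn_query(top_k: int = 5) -> str:
    """Parameterized top-k cosine search over document_chunks.
    pgvector's `<=>` returns cosine distance, so similarity = 1 - distance."""
    return (
        "SELECT id, content, 1 - (embedding <=> %(query_vec)s) AS similarity "
        "FROM document_chunks "
        "ORDER BY embedding <=> %(query_vec)s "
        f"LIMIT {int(top_k)}"
    )

# Per-session recall tuning (run as plain SQL before querying):
#   SET hnsw.ef_search = 100;  -- higher recall, slightly higher latency
print(build_knn_query(5))
```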
Step 2: Embedding model
We use multilingual-e5-large (1024 dimensions) from HuggingFace. It handles Thai, Japanese, Simplified Chinese, and English with a single model — no language-specific models to manage.
```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="intfloat/multilingual-e5-large",
    max_length=512,
    device="cpu",  # GPU if available; we run CPU on a 4-core VM
)
```
Throughput on our deployment: ~60 chunks/second on a 4-core CPU VM. For a 200-page PDF split into ~600 chunks, ingestion takes roughly 10 seconds.
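One multilingual-e5 caveat worth flagging: the model family is trained with `query: ` and `passage: ` input prefixes, and the model card reports degraded retrieval when they are omitted. LlamaIndex's `HuggingFaceEmbedding` can apply them via its `query_instruction` / `text_instruction` arguments (check your installed version); the convention itself is just:

```python
def e5_format(text: str, is_query: bool) -> str:
    """multilingual-e5 expects a role prefix on every input:
    'query: ' for search queries, 'passage: ' for indexed chunks."""
    prefix = "query: " if is_query else "passage: "
    return prefix + text.strip()

print(e5_format("อัตราดอกเบี้ยเท่าไหร่", is_query=True))
print(e5_format("The interest rate is 15% per year.", is_query=False))
```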
Step 3: Chunking strategy
This is where most RAG projects get it wrong. Thai and Japanese have no word-boundary spaces, which means character-count chunking produces very different results than it does for English.
```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=400,    # characters, not tokens
    chunk_overlap=80,  # 20% overlap
    paragraph_separator="\n\n",
)
```
What we tested and why we landed on 400/80:
| Chunk size / overlap | Verdict | Outcome |
|---|---|---|
| 256 / 50 | Too small | Thai sentences split mid-clause; retrieval missed context |
| 512 / 100 | Medium | Good for English; Thai/Japanese still fragmented |
| 400 / 80 | Our choice | Best retrieval quality across all three languages |
| 800 / 160 | Too large | Retrieval quality fine; pgvector cosine scores less discriminative |
Before (chunk size 256, Thai contract text):
Retrieved chunk: "...อัตราดอกเบี้ย ร้อยละสิบห้าต่อปี ในกรณีที่ผู้กู้ผิดนัดชำระ..."
Answer: Interest rate is 15% per year.
Missing: The default conditions were in the next chunk and not retrieved.
After (chunk size 400, same document):
Retrieved chunk: "...อัตราดอกเบี้ย ร้อยละสิบห้าต่อปี ในกรณีที่ผู้กู้ผิดนัดชำระหนี้เกินกว่าสามสิบวัน..."
Answer: Interest rate is 15% per year, applicable when payment is overdue by more than 30 days.
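To make the 400/80 window mechanics concrete, here is a toy character-window chunker. It is a deliberate simplification of what `SentenceSplitter` actually does (the real splitter also respects sentence and paragraph boundaries), but it shows how the overlap keeps adjacent clauses together.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """Slide a fixed character window, stepping by chunk_size - overlap so each
    chunk repeats the final `overlap` characters of its predecessor."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("0123456789" * 100)  # 1,000 characters of input
# Windows start at 0, 320, 640; consecutive chunks share 80 characters
```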
Step 4: Ingestion pipeline
```python
import os

from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

async def ingest_document(file_path: str, doc_id: str, language: str):
    # Parse document
    with open(file_path, "rb") as f:
        raw_text = extract_text(f)  # your PDF/DOCX parser

    doc = Document(
        text=raw_text,
        metadata={"doc_id": doc_id, "language": language},
    )

    # Vector store connection
    vector_store = PGVectorStore.from_params(
        database="simplidoc",
        host="localhost",
        port=5432,
        user="simplidoc_user",
        password=os.environ["DB_PASSWORD"],
        table_name="document_chunks",
        embed_dim=1024,
    )

    # Build index; the vector store is attached via a StorageContext
    # rather than passed directly to from_documents()
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        [doc],
        embed_model=embed_model,
        transformations=[splitter],
        storage_context=storage_context,
    )
    return index
```
Step 5: FastAPI RAG endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from anthropic import Anthropic
import asyncio
app = FastAPI()
client = Anthropic()
@app.post("/query")
async def query_documents(request: QueryRequest):
# Retrieve top-k relevant chunks
query_engine = index.as_query_engine(
similarity_top_k=5,
streaming=True,
)
# Build context from retrieved chunks
retrieval = await query_engine.aretrieve(request.query)
context = "\n\n---\n\n".join([node.text for node in retrieval])
system_prompt = """You are a document assistant for enterprise business documents.
Answer questions based only on the provided context.
If the answer is not in the context, say so clearly.
Respond in the same language as the question."""
async def stream_response():
with client.messages.stream(
model="claude-sonnet-4-20250514",
max_tokens=1000,
system=system_prompt,
messages=[{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {request.query}"
}]
) as stream:
for text in stream.text_stream:
yield f"data: {text}\n\n"
return StreamingResponse(stream_response(), media_type="text/event-stream")
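For completeness, here is a hedged sketch of how a client reassembles the SSE stream. The endpoint emits plain `data:` lines, so a token that itself contains a newline would break this naive framing; production clients typically JSON-encode each event to avoid that.

```python
def parse_sse(raw: str) -> str:
    """Reassemble answer text from a raw SSE stream: keep the payload of
    every 'data: ' line, skip the blank lines that terminate each event."""
    parts = []
    for line in raw.splitlines():
        if line.startswith("data: "):
            parts.append(line[len("data: "):])
    return "".join(parts)

stream = "data: Interest rate \n\ndata: is 15% per year.\n\n"
print(parse_sse(stream))  # -> Interest rate is 15% per year.
```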
Production failure modes we hit
1. Embedding model cold start (12-second delay on first query)
The multilingual-e5-large model loads into RAM on first use. On a cold VM, this caused a 12-second delay on the first query of each session.
Fix: Warm the model at startup.
```python
@app.on_event("startup")
async def warm_embedding_model():
    _ = embed_model.get_text_embedding("warmup")
```
2. Thai PDF extraction producing garbled text
Some Thai PDFs use non-standard font encoding, and PyPDF2 extracted garbage characters from them. We switched to pdfplumber for Thai documents.
```python
import pdfplumber

def extract_thai_pdf(path: str) -> str:
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```
3. pgvector cosine similarity returning irrelevant chunks at low thresholds
We initially returned any chunk above 0.5 cosine similarity. Some irrelevant chunks were scoring 0.55–0.65 for vague questions.
Fix: Raise the threshold and add a reranking step.
```python
retrieval = await retriever.aretrieve(request.query)

# Drop weak matches before the context reaches the LLM;
# node scores can be None depending on store configuration
filtered = [n for n in retrieval if n.score is not None and n.score >= 0.72]
```
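The reranking step itself is a cross-encoder in our deployment. As a dependency-free stand-in (a toy sketch that only suits space-delimited languages like English, precisely because Thai and Japanese lack word boundaries), a lexical-overlap rerank over the filtered chunks looks like this:

```python
def rerank_by_overlap(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Order chunks by how many query terms they contain. A real deployment
    would use a multilingual cross-encoder; this toy scorer only illustrates
    the rerank stage of the pipeline."""
    terms = set(query.lower().split())

    def score(chunk: str) -> int:
        return sum(1 for t in terms if t in chunk.lower())

    return sorted(chunks, key=score, reverse=True)[:top_k]

docs = ["interest rate is 15%", "payment schedule", "rate of default interest"]
print(rerank_by_overlap("interest rate", docs, top_k=2))
```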
Cost and performance
Running on a single 4-core / 8GB RAM VM (approximately ฿1,200/month on a Thai cloud provider):
| Metric | Value |
|---|---|
| Average query latency (warm) | 1.8 seconds end-to-end |
| Embedding throughput | ~60 chunks/second |
| pgvector search (HNSW, 200k chunks) | ~8ms |
| Claude API cost per query | ~฿0.004 at our usage level |
| Documents indexed | 12,000+ across Thai/Japanese/English |
Data residency notes
For Thai enterprise deployments under PDPA: all embeddings and document content remain in your PostgreSQL instance. No document content is sent to external embedding APIs — multilingual-e5-large runs locally. Only the retrieved context (not the full document) is sent to the Claude API for answer generation.
For Japanese enterprise deployments under 個人情報保護法 (APPI): the same architecture applies. If stricter data residency is required, the Claude API can be replaced with a locally hosted model for the answer-generation step.
What to read next
- Building a Modern Cybersecurity Monitoring & Response System
- Understanding Wazuh: Architecture, Use Cases, and Real-World Applications
Need a multilingual RAG system for your enterprise documents? Contact Simplico — we’ve built this for Thai, Japanese, and global clients.