LlamaIndex + pgvector: Production RAG for Thai and Japanese Business Documents
Most RAG demos work. Most RAG production deployments fail — quietly, expensively, and in ways that are hard to debug.
After building simpliDoc, Simplico’s multilingual AI document intelligence platform, we learned how wide the gap is between a working prototype and a production system that handles Thai, Japanese, and English enterprise documents simultaneously. This post shares what we actually deployed: the stack, the config values we landed on after testing, the failure modes we hit, and the fixes.
Why this stack
At Simplico we use LlamaIndex as the orchestration layer and pgvector (PostgreSQL extension) as the vector store. Here’s why these two tools together, and not alternatives:
- LlamaIndex has first-class support for multilingual embeddings and handles chunking strategies that work across Thai, Japanese, and English — languages with fundamentally different tokenization behavior.
- pgvector runs inside PostgreSQL, which means your vector data lives in the same database as your business data. No additional infrastructure, no synchronization complexity, no separate service to operate. For Thai and Japanese enterprise clients with strict data residency requirements (PDPA, 個人情報保護法), keeping everything in one Postgres instance on local infrastructure is a significant compliance advantage.
System architecture
```mermaid
flowchart TD
    A["Document upload\n(PDF / DOCX / TXT)"] --> B["LlamaIndex\nDocument parser"]
    B --> C["Language detector\n(langdetect)"]
    C --> D["multilingual-e5-large\nEmbedding model"]
    D --> E["pgvector\n(PostgreSQL)"]
    F["User query"] --> G["FastAPI\nRAG endpoint"]
    G --> D
    G --> E
    E --> H["Top-k retrieval\n(cosine similarity)"]
    H --> I["Claude API\nAnswer generation"]
    I --> J["Streaming response\n(SSE)"]
```
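The language-detector box above is `langdetect` in production. As an illustration of the routing step only (a hedged sketch, not our production detector), a dependency-free Unicode-range heuristic can separate Thai, Japanese, and English. Note that it would misroute Chinese, since CJK ideographs are counted as Japanese here; that is one reason a real detector earns its place.

```python
def detect_script(text: str) -> str:
    """Rough language routing by Unicode block — a simplified stand-in for
    langdetect, good enough to pick a chunking/extraction path."""
    counts = {"th": 0, "ja": 0, "en": 0}
    for ch in text:
        code = ord(ch)
        if 0x0E00 <= code <= 0x0E7F:
            counts["th"] += 1          # Thai block
        elif 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
            counts["ja"] += 1          # kana + CJK ideographs
        elif ch.isascii() and ch.isalpha():
            counts["en"] += 1          # Latin letters
    return max(counts, key=counts.get)

print(detect_script("อัตราดอกเบี้ยร้อยละสิบห้า"))   # -> th
print(detect_script("金利は年15パーセントです"))        # -> ja
print(detect_script("Interest rate is 15%"))        # -> en
```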
Step 1: PostgreSQL + pgvector setup
```sql
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Document chunks table
CREATE TABLE document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    doc_id UUID NOT NULL,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    language VARCHAR(10),    -- 'th', 'ja', 'zh', 'en'
    embedding VECTOR(1024),  -- multilingual-e5-large output dimension
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for fast approximate nearest neighbour search
-- Critical at >50k chunks: without this, queries go from ~8ms to 4+ seconds
CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```
Production tip: Add the HNSW index before you go live, not after. We made the mistake of adding it post-launch at 80k chunks. The index build took 4 hours and caused elevated query latency during construction.
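For reference, this is the shape of the query the HNSW index serves. `build_knn_query` below is a hypothetical helper for illustration (LlamaIndex issues this SQL for you); `<=>` is pgvector's cosine-distance operator, and `hnsw.ef_search` is the query-time recall/latency knob (pgvector's default is 40).

```python
def build_knn_query(top_k: int = 5) -> str:
    """Parameterized top-k cosine search over document_chunks.
    pgvector's `<=>` returns cosine distance, so similarity = 1 - distance."""
    return (
        "SELECT id, content, 1 - (embedding <=> %(query_vec)s) AS similarity "
        "FROM document_chunks "
        "ORDER BY embedding <=> %(query_vec)s "
        f"LIMIT {int(top_k)}"
    )

# Per-session recall tuning (run as plain SQL before querying):
#   SET hnsw.ef_search = 100;  -- higher recall, slightly higher latency
print(build_knn_query(5))
```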
Step 2: Embedding model
We use multilingual-e5-large (1024 dimensions) from HuggingFace. It handles Thai, Japanese, Simplified Chinese, and English with a single model — no language-specific models to manage.
```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="intfloat/multilingual-e5-large",
    max_length=512,
    device="cpu",  # GPU if available; we run CPU on a 4-core VM
)
```
Throughput on our deployment: ~60 chunks/second on a 4-core CPU VM. For a 200-page PDF split into ~600 chunks, ingestion takes roughly 10 seconds.
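One multilingual-e5 caveat worth flagging: the model family is trained with `query: ` and `passage: ` input prefixes, and the model card reports degraded retrieval when they are omitted. LlamaIndex's `HuggingFaceEmbedding` can apply them via its `query_instruction` / `text_instruction` arguments (check your installed version); the convention itself is just:

```python
def e5_format(text: str, is_query: bool) -> str:
    """multilingual-e5 expects a role prefix on every input:
    'query: ' for search queries, 'passage: ' for indexed chunks."""
    prefix = "query: " if is_query else "passage: "
    return prefix + text.strip()

print(e5_format("อัตราดอกเบี้ยเท่าไหร่", is_query=True))
print(e5_format("The interest rate is 15% per year.", is_query=False))
```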
Step 3: Chunking strategy
This is where most RAG projects get it wrong. Thai and Japanese have no word-boundary spaces, which means character-count chunking produces very different results than it does for English.
```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=400,    # characters, not tokens
    chunk_overlap=80,  # 20% overlap
    paragraph_separator="\n\n",
)
```
What we tested and why we landed on 400/80:
| Chunk size / overlap | Verdict | Outcome |
|---|---|---|
| 256 / 50 | Too small | Thai sentences split mid-clause; retrieval missed context |
| 512 / 100 | Medium | Good for English; Thai/Japanese still fragmented |
| 400 / 80 | Our choice | Best retrieval quality across all three languages |
| 800 / 160 | Too large | Retrieval quality fine; pgvector cosine scores less discriminative |
Before (chunk size 256, Thai contract text):
Retrieved chunk: "...อัตราดอกเบี้ย ร้อยละสิบห้าต่อปี ในกรณีที่ผู้กู้ผิดนัดชำระ..."
Answer: Interest rate is 15% per year.
Missing: The default conditions were in the next chunk and not retrieved.
After (chunk size 400, same document):
Retrieved chunk: "...อัตราดอกเบี้ย ร้อยละสิบห้าต่อปี ในกรณีที่ผู้กู้ผิดนัดชำระหนี้เกินกว่าสามสิบวัน..."
Answer: Interest rate is 15% per year, applicable when payment is overdue by more than 30 days.
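To make the 400/80 window mechanics concrete, here is a toy character-window chunker. It is a deliberate simplification of what `SentenceSplitter` actually does (the real splitter also respects sentence and paragraph boundaries), but it shows how the overlap keeps adjacent clauses together.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """Slide a fixed character window, stepping by chunk_size - overlap so each
    chunk repeats the final `overlap` characters of its predecessor."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("0123456789" * 100)  # 1,000 characters of input
# Windows start at 0, 320, 640; consecutive chunks share 80 characters
```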
Step 4: Ingestion pipeline
```python
import os

from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

async def ingest_document(file_path: str, doc_id: str, language: str):
    # Parse document
    with open(file_path, "rb") as f:
        raw_text = extract_text(f)  # your PDF/DOCX parser

    doc = Document(
        text=raw_text,
        metadata={"doc_id": doc_id, "language": language},
    )

    # Vector store connection
    vector_store = PGVectorStore.from_params(
        database="simplidoc",
        host="localhost",
        port=5432,
        user="simplidoc_user",
        password=os.environ["DB_PASSWORD"],
        table_name="document_chunks",
        embed_dim=1024,
    )

    # Build index; the vector store is attached via a StorageContext
    # rather than passed directly to from_documents()
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        [doc],
        embed_model=embed_model,
        transformations=[splitter],
        storage_context=storage_context,
    )
    return index
```
Step 5: FastAPI RAG endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from anthropic import Anthropic
import asyncio
app = FastAPI()
client = Anthropic()
@app.post("/query")
async def query_documents(request: QueryRequest):
# Retrieve top-k relevant chunks
query_engine = index.as_query_engine(
similarity_top_k=5,
streaming=True,
)
# Build context from retrieved chunks
retrieval = await query_engine.aretrieve(request.query)
context = "\n\n---\n\n".join([node.text for node in retrieval])
system_prompt = """You are a document assistant for enterprise business documents.
Answer questions based only on the provided context.
If the answer is not in the context, say so clearly.
Respond in the same language as the question."""
async def stream_response():
with client.messages.stream(
model="claude-sonnet-4-20250514",
max_tokens=1000,
system=system_prompt,
messages=[{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {request.query}"
}]
) as stream:
for text in stream.text_stream:
yield f"data: {text}\n\n"
return StreamingResponse(stream_response(), media_type="text/event-stream")
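For completeness, here is a hedged sketch of how a client reassembles the SSE stream. The endpoint emits plain `data:` lines, so a token that itself contains a newline would break this naive framing; production clients typically JSON-encode each event to avoid that.

```python
def parse_sse(raw: str) -> str:
    """Reassemble answer text from a raw SSE stream: keep the payload of
    every 'data: ' line, skip the blank lines that terminate each event."""
    parts = []
    for line in raw.splitlines():
        if line.startswith("data: "):
            parts.append(line[len("data: "):])
    return "".join(parts)

stream = "data: Interest rate \n\ndata: is 15% per year.\n\n"
print(parse_sse(stream))  # -> Interest rate is 15% per year.
```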
Production failure modes we hit
1. Embedding model cold start (12-second delay on first query)
The multilingual-e5-large model loads into RAM on first use. On a cold VM, this caused a 12-second delay on the first query of each session.
Fix: Warm the model at startup.
```python
@app.on_event("startup")
async def warm_embedding_model():
    _ = embed_model.get_text_embedding("warmup")
```
2. Thai PDF extraction producing garbled text
Some Thai PDFs use non-standard font encoding, and PyPDF2 extracted garbage characters from them. We switched to pdfplumber for Thai documents.
```python
import pdfplumber

def extract_thai_pdf(path: str) -> str:
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```
3. pgvector cosine similarity returning irrelevant chunks at low thresholds
We initially returned any chunk above 0.5 cosine similarity. Some irrelevant chunks were scoring 0.55–0.65 for vague questions.
Fix: Raise the threshold and add a reranking step.
```python
retrieval = await retriever.aretrieve(request.query)

# Drop weak matches before the context reaches the LLM;
# node scores can be None depending on store configuration
filtered = [n for n in retrieval if n.score is not None and n.score >= 0.72]
```
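The reranking step itself is a cross-encoder in our deployment. As a dependency-free stand-in (a toy sketch that only suits space-delimited languages like English, precisely because Thai and Japanese lack word boundaries), a lexical-overlap rerank over the filtered chunks looks like this:

```python
def rerank_by_overlap(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Order chunks by how many query terms they contain. A real deployment
    would use a multilingual cross-encoder; this toy scorer only illustrates
    the rerank stage of the pipeline."""
    terms = set(query.lower().split())

    def score(chunk: str) -> int:
        return sum(1 for t in terms if t in chunk.lower())

    return sorted(chunks, key=score, reverse=True)[:top_k]

docs = ["interest rate is 15%", "payment schedule", "rate of default interest"]
print(rerank_by_overlap("interest rate", docs, top_k=2))
```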
Cost and performance
Running on a single 4-core / 8GB RAM VM (approximately ฿1,200/month on a Thai cloud provider):
| Metric | Value |
|---|---|
| Average query latency (warm) | 1.8 seconds end-to-end |
| Embedding throughput | ~60 chunks/second |
| pgvector search (HNSW, 200k chunks) | ~8ms |
| Claude API cost per query | ~฿0.004 at our usage level |
| Documents indexed | 12,000+ across Thai/Japanese/English |
Data residency notes
For Thai enterprise deployments under PDPA: all embeddings and document content remain in your PostgreSQL instance. No document content is sent to external embedding APIs — multilingual-e5-large runs locally. Only the retrieved context (not the full document) is sent to the Claude API for answer generation.
For Japanese enterprise deployments under 個人情報保護法 (APPI): the same architecture applies. If stricter data residency is required, the Claude API can be replaced with a locally hosted model for the answer-generation step.
What to read next
- Building a Modern Cybersecurity Monitoring & Response System
- Understanding Wazuh: Architecture, Use Cases, and Real-World Applications
Need a multilingual RAG system for your enterprise documents? Contact Simplico — we’ve built this for Thai, Japanese, and global clients.