LlamaIndex + pgvector: Production RAG for Thai and Japanese Business Documents
Most RAG demos work. Most RAG production deployments fail — quietly, expensively, and in ways that are hard to debug.
After building simpliDoc, Simplico’s multilingual AI document intelligence platform, we learned how wide the gap is between a working prototype and a production system that handles Thai, Japanese, and English enterprise documents simultaneously. This post shares what we actually deployed: the stack, the config values we landed on after testing, the failure modes we hit, and the fixes.
Why this stack
At Simplico we use LlamaIndex as the orchestration layer and pgvector (PostgreSQL extension) as the vector store. Here’s why we chose these two tools together rather than the alternatives:
- LlamaIndex has first-class support for multilingual embeddings and handles chunking strategies that work across Thai, Japanese, and English — languages with fundamentally different tokenization behavior.
- pgvector runs inside PostgreSQL, which means your vector data lives in the same database as your business data. No additional infrastructure, no synchronization complexity, no separate service to operate. For Thai and Japanese enterprise clients with strict data residency requirements (Thailand’s PDPA, Japan’s 個人情報保護法), keeping everything in one Postgres instance on local infrastructure is a significant compliance advantage.
System architecture
```mermaid
flowchart TD
    A["Document upload\n(PDF / DOCX / TXT)"] --> B["LlamaIndex\nDocument parser"]
    B --> C["Language detector\n(langdetect)"]
    C --> D["multilingual-e5-large\nEmbedding model"]
    D --> E["pgvector\n(PostgreSQL)"]
    F["User query"] --> G["FastAPI\nRAG endpoint"]
    G --> D
    G --> E
    E --> H["Top-k retrieval\n(cosine similarity)"]
    H --> I["Claude API\nAnswer generation"]
    I --> J["Streaming response\n(SSE)"]
```
Step 1: PostgreSQL + pgvector setup
```sql
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Document chunks table
CREATE TABLE document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    doc_id UUID NOT NULL,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    language VARCHAR(10),    -- 'th', 'ja', 'zh', 'en'
    embedding VECTOR(1024),  -- multilingual-e5-large output dimension
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for fast approximate nearest-neighbour search.
-- Critical at >50k chunks: without it, queries go from ~8 ms to 4+ seconds.
CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```
Production tip: Add the HNSW index before you go live, not after. We made the mistake of adding it post-launch at 80k chunks. The index build took 4 hours and caused elevated query latency during construction.
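For reference, this is what a raw top-k query against that table looks like (a sketch; `$1` stands for the query embedding parameter, and `hnsw.ef_search` is pgvector’s query-time recall/speed knob, default 40):

```sql
-- Widen the HNSW candidate list for better recall (default 40)
SET hnsw.ef_search = 100;

-- Top-5 chunks by cosine similarity; <=> is pgvector's cosine-distance operator
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM document_chunks
ORDER BY embedding <=> $1
LIMIT 5;
```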
Step 2: Embedding model
We use multilingual-e5-large (1024 dimensions) from HuggingFace. It handles Thai, Japanese, Simplified Chinese, and English with a single model — no language-specific models to manage.
```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="intfloat/multilingual-e5-large",
    max_length=512,
    device="cpu",  # GPU if available; we run CPU on a 4-core VM
)
```
Throughput on our deployment: ~60 chunks/second on a 4-core CPU VM. For a 200-page PDF split into ~600 chunks, ingestion takes roughly 10 seconds.
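One caveat with the e5 family: per the model card, multilingual-e5-large is trained with instruction prefixes — queries should be embedded as `query: ...` and indexed chunks as `passage: ...`, or retrieval quality degrades. Two trivial helpers make the convention explicit (recent llama-index versions also expose equivalent instruction parameters on `HuggingFaceEmbedding`):

```python
# Prefixes from the intfloat/multilingual-e5-large model card:
# search queries get "query: ", indexed chunks get "passage: ".

def as_query(text: str) -> str:
    return f"query: {text}"

def as_passage(text: str) -> str:
    return f"passage: {text}"

# as_query("อัตราดอกเบี้ยเท่าไร") == "query: อัตราดอกเบี้ยเท่าไร"
```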
Step 3: Chunking strategy
This is where most RAG projects get it wrong. Thai and Japanese have no word-boundary spaces, which means character-count chunking produces very different results than it does for English.
```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=400,    # characters, not tokens
    chunk_overlap=80,  # 20% overlap
    paragraph_separator="\n\n",
)
```
What we tested and why we landed on 400/80:
| Chunk size / overlap | Verdict | Result |
|---|---|---|
| 256 / 50 | Too small | Thai sentences split mid-clause; retrieval missed context |
| 512 / 100 | Medium | Good for English; Thai/Japanese still fragmented |
| 400 / 80 | Our choice | Best retrieval quality across all three languages |
| 800 / 160 | Too large | Retrieval quality fine; pgvector cosine scores less discriminative |
Before (chunk size 256, Thai contract text):
Retrieved chunk: "...อัตราดอกเบี้ย ร้อยละสิบห้าต่อปี ในกรณีที่ผู้กู้ผิดนัดชำระ..."
Answer: Interest rate is 15% per year.
Missing: The default conditions were in the next chunk and not retrieved.
After (chunk size 400, same document):
Retrieved chunk: "...อัตราดอกเบี้ย ร้อยละสิบห้าต่อปี ในกรณีที่ผู้กู้ผิดนัดชำระหนี้เกินกว่าสามสิบวัน..."
Answer: Interest rate is 15% per year, applicable when payment is overdue by more than 30 days.
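For capacity planning, the 400/80 setting implies a stride of 320 characters per chunk. A quick back-of-the-envelope helper (a hypothetical utility based on character counts alone, ignoring sentence boundaries):

```python
import math

def chunk_count(n_chars: int, chunk_size: int = 400, overlap: int = 80) -> int:
    """Approximate number of chunks a document of n_chars yields.
    Each chunk after the first advances by (chunk_size - overlap) characters."""
    if n_chars <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return math.ceil((n_chars - chunk_size) / stride) + 1

# e.g. a 200-page PDF at ~1,000 characters per page:
print(chunk_count(200_000))  # -> 625
```

That lines up with the ~600 chunks per 200-page PDF we see in practice.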
Step 4: Ingestion pipeline
```python
import os

from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore


async def ingest_document(file_path: str, doc_id: str, language: str):
    # Parse document
    with open(file_path, "rb") as f:
        raw_text = extract_text(f)  # your PDF/DOCX parser

    doc = Document(
        text=raw_text,
        metadata={"doc_id": doc_id, "language": language},
    )

    # Vector store connection
    vector_store = PGVectorStore.from_params(
        database="simplidoc",
        host="localhost",
        port=5432,
        user="simplidoc_user",
        password=os.environ["DB_PASSWORD"],
        table_name="document_chunks",
        embed_dim=1024,
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Build index: chunks, embeds, and writes straight into Postgres
    index = VectorStoreIndex.from_documents(
        [doc],
        storage_context=storage_context,
        embed_model=embed_model,
        transformations=[splitter],
    )
    return index
```
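The `language` tag passed into `ingest_document` comes from langdetect in our pipeline. For illustration, here is a dependency-free approximation based on Unicode script ranges — a crude stand-in for langdetect, not what we run in production:

```python
def detect_language(text: str) -> str:
    """Crude script-based language guess over the four scripts we index.
    Counts characters per Unicode block; kana presence implies Japanese
    even when CJK ideographs (kanji) dominate the count."""
    counts = {"th": 0, "ja": 0, "zh": 0, "en": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0E00 <= cp <= 0x0E7F:        # Thai block
            counts["th"] += 1
        elif 0x3040 <= cp <= 0x30FF:      # hiragana + katakana
            counts["ja"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:      # CJK unified ideographs
            counts["zh"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["en"] += 1
    if counts["ja"] > 0:
        return "ja"
    return max(counts, key=counts.get)
```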
Step 5: FastAPI RAG endpoint
```python
from anthropic import Anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
client = Anthropic()


class QueryRequest(BaseModel):
    query: str


@app.post("/query")
async def query_documents(request: QueryRequest):
    # Retrieve top-k relevant chunks (index built in the ingestion step;
    # module-level here for brevity)
    retriever = index.as_retriever(similarity_top_k=5)
    nodes = await retriever.aretrieve(request.query)

    # Build context from retrieved chunks
    context = "\n\n---\n\n".join(node.text for node in nodes)

    system_prompt = """You are a document assistant for enterprise business documents.
Answer questions based only on the provided context.
If the answer is not in the context, say so clearly.
Respond in the same language as the question."""

    async def stream_response():
        with client.messages.stream(
            model="claude-sonnet-4-20250514",
            max_tokens=1000,
            system=system_prompt,
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {request.query}",
            }],
        ) as stream:
            for text in stream.text_stream:
                yield f"data: {text}\n\n"

    return StreamingResponse(stream_response(), media_type="text/event-stream")
```
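One detail worth handling in the streaming loop: the SSE spec requires every line of a multi-line payload to carry its own `data: ` prefix, and model output can contain newlines, which naive `f"data: {text}\n\n"` framing would break. A small framing helper (hypothetical name) under that assumption:

```python
def sse_frame(payload: str) -> str:
    """Format a payload as one Server-Sent Events frame.
    Each line of the payload gets its own 'data: ' prefix, per the SSE spec;
    the trailing blank line terminates the event."""
    return "".join(f"data: {line}\n" for line in payload.split("\n")) + "\n"

# sse_frame("Interest rate\nis 15%") == "data: Interest rate\ndata: is 15%\n\n"
```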
Production failure modes we hit
1. Embedding model cold start (12-second delay on first query)
The multilingual-e5-large model loads into RAM on first use. On a cold VM, this caused a 12-second delay on the first query of each session.
Fix: Warm the model at startup.
```python
@app.on_event("startup")
async def warm_embedding_model():
    _ = embed_model.get_text_embedding("warmup")
```
2. Thai PDF extraction producing garbled text
Some Thai PDFs use non-standard font encoding, and PyPDF2 extracted garbage characters from them. We switched to pdfplumber for Thai documents.
```python
import pdfplumber

def extract_thai_pdf(path: str) -> str:
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```
3. pgvector cosine similarity returning irrelevant chunks at low thresholds
We initially returned any chunk above 0.5 cosine similarity. Some irrelevant chunks were scoring 0.55–0.65 for vague questions.
Fix: raise the threshold and add a reranking step.

```python
retriever = index.as_retriever(similarity_top_k=5)
nodes = await retriever.aretrieve(request.query)

# Drop anything below 0.72 cosine similarity
filtered = [n for n in nodes if n.score >= 0.72]
```
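The reranking step itself is a minimal lexical-overlap pass in spirit; as an illustration, here is a dependency-free version over `(chunk_text, cosine_score)` pairs. This is a simplified stand-in with illustrative weights — production systems typically use a cross-encoder reranker instead:

```python
def rerank(query: str, candidates: list[tuple[str, float]], top_n: int = 3):
    """Blend the vector score with simple query-term overlap and re-sort.
    candidates are (chunk_text, cosine_score) pairs."""
    q_terms = set(query.lower().split())

    def blended(pair: tuple[str, float]) -> float:
        text, vec_score = pair
        overlap = len(q_terms & set(text.lower().split())) / max(len(q_terms), 1)
        return 0.7 * vec_score + 0.3 * overlap

    return sorted(candidates, key=blended, reverse=True)[:top_n]
```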
Cost and performance
Running on a single 4-core / 8GB RAM VM (approximately ฿1,200/month on a Thai cloud provider):
| Metric | Value |
|---|---|
| Average query latency (warm) | 1.8 seconds end-to-end |
| Embedding throughput | ~60 chunks/second |
| pgvector search (HNSW, 200k chunks) | ~8ms |
| Claude API cost per query | ~฿0.004 at our usage level |
| Documents indexed | 12,000+ across Thai/Japanese/English |
Data residency notes
For Thai enterprise deployments under PDPA: all embeddings and document content remain in your PostgreSQL instance. No document content is sent to external embedding APIs — multilingual-e5-large runs locally. Only the retrieved context (not the full document) is sent to the Claude API for answer generation.
For Japanese enterprise deployments under 個人情報保護法 (the Act on the Protection of Personal Information): the same architecture applies. If stricter data residency is required, the Claude API can be replaced with a locally hosted model for the answer-generation step.
What to read next
- Building a Modern Cybersecurity Monitoring & Response System
- Understanding Wazuh: Architecture, Use Cases, and Real-World Applications
Need a multilingual RAG system for your enterprise documents? Contact Simplico — we’ve built this for Thai, Japanese, and global clients.