Reading Asian Utility Bills at Audit Quality: How simpliDoc Solves the PDF Problem in CSRD

The data source nobody wants to talk about

Walk into the admin office of a factory in Thailand, China, or Japan and ask where last month's electricity consumption data lives. Nobody will point you to a system. They will point you to a folder, physical and digital, full of PDFs from the local utility. Some are computer-generated. Some are scans of paper bills with a factory code handwritten on them by admin staff. Some carry a stamp. Some are double-page foldouts whose second page was scanned crooked.

This is the operational reality across most of Asian manufacturing, and it is the same reality that ESRS E1-6 requires you to turn into an audit-defensible Scope 2 emissions disclosure, broken down by location and by activity, with a documented methodology and a refresh every reporting period.

The connectors of CSRD reporting platforms do not read these files. The integration line item in a Big 4 proposal quietly assumes either "client team enters data manually" or "OCR tool to be specified later". Most teams discover in month 3 of the implementation that the admin office cannot keep up with monthly transcription, that the OCR vendor does not understand utility bill structure well enough to be reliable, and that what was sold as a "data integration challenge" is really a document understanding challenge with audit-grade requirements.

This article describes how we solve the problem in production using simpliDoc, Simplico's multilingual document AI platform, as the ingestion layer feeding ESG Data Bridge. It is the technical satellite article covering what the flagship deferred for later. It is aimed at IT architects, sustainability tech leads, and anyone currently evaluating "just use OCR" as a strategy.

Why generic OCR fails on Asian utility bills

The first instinct on a project like this is "isn't this a job for OCR?" Modern cloud OCR services (AWS Textract, Google Document AI, Azure Form Recognizer) handle structured documents reasonably well and have templates for utility bills. So why is the problem not solved?

There are 3 reasons, and each shows up most clearly on different documents within the same factory's folder.

Reason 1: Utility bill layouts are not standardized, either within or across providers in Asia. In Thailand, a PEA (Provincial Electricity Authority) bill differs from an MEA (Metropolitan Electricity Authority) bill, and both differ from bills issued by industrial estate operators such as Amata or WHA, which act as intermediate electricity retailers. In Japan, a TEPCO bill differs from a KEPCO Japan bill, which differs from a Chubu Electric bill, which differs from a Hokuriku Electric bill. In China, State Grid bills vary between provinces and have changed format 3 times in the past 5 years. A generic OCR template handles one or two of these and fails on the rest.

Reason 2: The data point you actually need is often not a labeled field. Audit-grade Scope 2 reporting needs the kWh consumed during the billing period for this specific facility. What appears on the bill is the meter reading at the start of the period, the meter reading at the end of the period, a multiplier (because some industrial meters report in units of 10 or 100 kWh per displayed unit), a demand charge section, and a separate "energy charge" line. The kWh you need is a derived value: sometimes the utility computes and prints it, and sometimes it leaves the customer to derive it. Field-extraction OCR has no way of knowing which case it is looking at.

Reason 3: Language and script. A Thai factory may receive bills in Thai, with key fields in Thai script and account numbers in Arabic numerals. Chinese bills may mix Simplified and Traditional Chinese in some older formats. Japanese bills mix Kanji, Hiragana, and Katakana with full-width alphanumeric characters. Hand-stamped corrections are often in a different language from the printed form. An OCR service that handles one script well handles the others poorly, and switching engines per document is not operationally manageable.

The result of attacking this problem with generic OCR is 70–85% extraction accuracy, which looks good in a demo and is a disaster in an audit. A 15% error rate across the monthly bills of 40 facilities, at one bill per facility, means 6 wrong bills every month, and the auditor will find them.

What simpliDoc actually does

simpliDoc is Simplico's multilingual document AI platform. The architecture for the utility bill ingestion use case looks like this:

flowchart TD
    A["Document arrival<br/>(PDF, scan, email attachment)"] --> B["Format normalization"]
    B --> C["Document classification<br/>(utility, type, region, language)"]
    C --> D["Layout-aware extraction"]
    D --> E["Field validation<br/>+ derivation rules"]
    E --> F["Confidence scoring"]
    F --> G{"Confidence<br/>threshold?"}
    G -->|"high"| H["Auto-publish<br/>to ESG Data Bridge"]
    G -->|"medium"| I["Human-in-the-loop<br/>review queue"]
    G -->|"low"| J["Reject<br/>+ escalate"]
    I --> H
    H --> K["Audit log<br/>(immutable, with source PDF)"]

5 components matter, listed in rough order of the value they create.

Component 1: Document classification

Before any extraction happens, the document is classified. Which utility issued it? Which region? What type of bill (commercial, industrial, time-of-use, demand-charge tariff)? What language? Which template version? (Utilities update their bill formats periodically, and old formats stay in circulation for facilities still on legacy plans.)

Classification matters because it routes the document to the right extraction pipeline. A PEA Thailand industrial bill goes through one set of extraction logic, a TEPCO commercial bill through a second, and a State Grid China bill through a third. Trying to extract everything with a single pipeline is the failure mode that caps generic OCR at 70–85% accuracy.

simpliDoc handles classification with a multimodal LLM (Claude in our deployments) that sees the document image plus the extracted text and produces a structured classification with a confidence score. The classification step takes 1–2 seconds and is essentially free at the volumes involved.
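As a rough illustration of what a structured classification with a confidence score might look like, and how it could drive routing, here is a minimal sketch. The `Classification` fields, the `PIPELINES` table, the pipeline names, and the `min_confidence` default are all hypothetical and are not simpliDoc's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    """Hypothetical classification result for one document."""
    utility: str           # e.g. "PEA", "TEPCO"
    region: str            # e.g. "TH", "JP", "CN"
    bill_type: str         # e.g. "industrial", "commercial"
    language: str          # e.g. "th", "ja", "zh"
    template_version: str
    confidence: float      # 0.0 to 1.0


# Hypothetical routing table: (utility, bill_type) -> pipeline name.
PIPELINES = {
    ("PEA", "industrial"): "pea_industrial",
    ("TEPCO", "commercial"): "tepco_commercial",
}


def route(c: Classification, min_confidence: float = 0.8) -> str:
    """Pick a tuned pipeline; uncertain or unknown documents use the LLM fallback."""
    if c.confidence < min_confidence:
        return "llm_fallback"
    return PIPELINES.get((c.utility, c.bill_type), "llm_fallback")
```

Sending low-confidence classifications to the fallback, rather than guessing a tuned pipeline, keeps a misclassified bill from being parsed with the wrong layout assumptions.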

Component 2: Layout-aware extraction

Once a document is classified, extraction runs through a pipeline tuned for that specific document family. For high-volume formats (TEPCO, KEPCO, PEA, MEA, State Grid provincial variants), this combines region-of-interest detection with structured-field extraction. For long-tail formats (Japanese regional utilities, industrial estate operators, scanned legacy bills), it falls back to multimodal LLM extraction with the document image as input and a strict output schema.

The extraction output is not free text. It is a structured object with named fields, units, and source-coordinate annotations. For each field, the system records where in the document the value was found, which matters for audit defensibility. When the auditor asks "where did this 47,283 kWh come from?", the answer includes the PDF, the page, and the bounding box.
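A field that carries its own source coordinates could be modeled roughly as follows. This is a sketch under assumptions: the `ExtractedField` name, its attributes, the bounding-box convention, and the example file name are illustrative and not taken from simpliDoc.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    """One extracted value together with where it was found (hypothetical shape)."""
    name: str
    value: float
    unit: str
    source_file: str
    page: int                                 # 1-based page number
    bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1), assumed page points

    def citation(self) -> str:
        """Render the answer to 'where did this number come from?'."""
        x0, y0, x1, y1 = self.bbox
        return (
            f"{self.value:,.0f} {self.unit} from {self.source_file}, "
            f"page {self.page}, bbox ({x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f})"
        )


field = ExtractedField("kwh_consumed", 47283.0, "kWh", "bill.pdf", 1,
                       (310.0, 412.0, 388.0, 428.0))
print(field.citation())
```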

Component 3: Field validation and derivation rules

Extracted fields are validated against a rule library specific to the document type. Example rules for an industrial electricity bill:

kWh consumed must equal (end meter reading − start meter reading) × multiplier
The total must equal demand charge + energy charge + tax + adjustment
Billing period dates must form a valid calendar period (no missing days, no overlap with the previous bill)
The facility code on the bill must match a known facility in the asset register
The kWh figure must fall within the expected range for this facility (catches meter glitches and decimal-point errors)

When a rule fails, the document is flagged for review. Some failures are recoverable: the system can re-extract the specific field with a higher-effort prompt. Others escalate to human review.

The validation layer is where most of the audit-defensibility value lives. Generic OCR returns "47283" with no context. simpliDoc returns "47,283 kWh, derived from meter readings 1,234,567 → 1,281,850, multiplier 1, billing period 2027-09-01 to 2027-09-30, validated against an expected range of 35,000–60,000 kWh for this facility". That difference is the difference between a number you can defend and a number that gets flagged in an audit.
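Several of the example rules can be written as small pure functions. The sketch below is illustrative only: the function names, the tolerances, and the convention that a billing period must start the day after the previous bill ends are assumptions, not the actual simpliDoc rule library.

```python
from datetime import date
from typing import Optional


def check_kwh(start: float, end: float, multiplier: float,
              printed_kwh: float, tol: float = 0.5) -> bool:
    """kWh consumed must equal (end reading - start reading) x multiplier."""
    return abs((end - start) * multiplier - printed_kwh) <= tol


def check_total(total: float, demand: float, energy: float,
                tax: float, adjustment: float, tol: float = 0.01) -> bool:
    """Bill total must equal the sum of its components."""
    return abs(demand + energy + tax + adjustment - total) <= tol


def check_period(start: date, end: date, previous_end: Optional[date]) -> bool:
    """Period must be ordered and begin the day after the previous bill ended."""
    if end < start:
        return False
    return previous_end is None or (start - previous_end).days == 1


def check_range(kwh: float, low: float, high: float) -> bool:
    """Catches meter glitches and decimal-point errors."""
    return low <= kwh <= high


# The worked example from the text: 1,234,567 -> 1,281,850 with multiplier 1.
assert check_kwh(1_234_567, 1_281_850, 1, 47_283)
assert check_range(47_283, 35_000, 60_000)
```

Each rule returns a plain boolean here. A production rule engine would more likely return a named issue so that the review queue can show the reviewer which check failed.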

Component 4: Confidence scoring and human-in-the-loop

Every extraction carries a calibrated confidence score. High-confidence extractions are published automatically to ESG Data Bridge. Medium-confidence extractions go to a review queue where a human confirms or corrects them; typically this is the admin staff member who already handles the bills, now reviewing instead of transcribing. Low-confidence extractions are rejected with an explicit reason, and the document is escalated.

Thresholds matter and are configurable per document type. For standardized, high-volume bills (TEPCO commercial, PEA industrial), the confidence threshold is tight: the cost of a false positive is real, and review capacity is limited. For low-volume long-tail formats, the threshold loosens, with a proportionally higher share of human review.
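Per-document-type thresholds can be expressed as a small lookup table. In the sketch below every number is a placeholder, the document-type keys are hypothetical, and widening the review band for long-tail formats is only one possible reading of how the thresholds loosen.

```python
# Hypothetical thresholds per document type: (auto_publish_min, review_min).
THRESHOLDS = {
    "tepco_commercial": (0.97, 0.85),
    "pea_industrial": (0.97, 0.85),
    "long_tail": (0.97, 0.60),   # wider review band, so more human review
}


def route_by_confidence(doc_type: str, confidence: float) -> str:
    """Return 'auto_publish', 'review', or 'reject' for one extraction."""
    auto_min, review_min = THRESHOLDS.get(doc_type, THRESHOLDS["long_tail"])
    if confidence >= auto_min:
        return "auto_publish"
    if confidence >= review_min:
        return "review"
    return "reject"
```

Unknown document types fall back to the long-tail entry in this sketch, on the assumption that an unrecognized format deserves the most human attention.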

In production deployments we see 80–90% of bills auto-publish at high confidence, 8–15% need a light human review, and 2–5% are rejected. Compared with manual transcription, this represents an 80–90% reduction in the admin team's workload while improving accuracy and audit defensibility at the same time.

Component 5: Immutable audit log with source attachment

Every published value carries its provenance: the source PDF (stored immutably), the extraction method used, the confidence score, the validation rules passed, the reviewer (if any), and a timestamp for each step. When the auditor asks for traceability, the audit log can produce, within a few clicks, the original document, the extracted values, the validation chain, and the final published figure. No forensic work is required.
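One common way to make a log tamper-evident is to chain entries by hash, so that editing any earlier entry invalidates everything after it. The article does not say how simpliDoc implements immutability; the sketch below illustrates the hash-chain technique only, with hypothetical function and key names, and with timestamps omitted for brevity.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """Stable SHA-256 over a JSON-serializable entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], record: dict, pdf_bytes: bytes) -> dict:
    """Append a record, linking it to the previous entry and to the source PDF."""
    body = {
        "record": record,
        "pdf_sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        "prev_hash": log[-1]["entry_hash"] if log else "0" * 64,
    }
    entry = {**body, "entry_hash": _digest(body)}
    log.append(entry)
    return entry


def verify(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = "0" * 64
    for entry in log:
        body = {k: entry[k] for k in ("record", "pdf_sha256", "prev_hash")}
        if entry["prev_hash"] != prev or entry["entry_hash"] != _digest(body):
            return False
        prev = entry["entry_hash"]
    return True
```

Storing the PDF digest in each entry ties the published number to the exact source file, so a later replacement of the PDF would also be detectable.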

This audit log architecture follows the same pattern used in the SOC analyst agent post for security-event triage. The principles transfer directly: structured output, calibrated confidence, immutable lineage, no silent failures.

The hard parts the demo does not show

A few things that look fine in a pilot and break at scale, learned the hard way.

Bill format drift. Utilities update their bill formats without notice and without documentation. A pipeline that has worked well for 18 months starts producing extraction errors on bills issued after a format change. Detecting this requires automated drift monitoring: compare field extraction patterns over time and alert when the pattern shifts. Without it, you find out about format drift when the auditor finds errors in a report you have already submitted.
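A very simple form of drift monitoring compares how often each field is successfully extracted in a recent window against a baseline window. The sketch below assumes extractions are plain dicts with `None` for missing fields; the function names and the `max_drop` threshold are hypothetical, and a real monitor would likely also track value distributions and confidence scores.

```python
def field_fill_rates(extractions: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of extractions in which each field has a non-None value."""
    n = len(extractions)
    return {
        f: (sum(1 for e in extractions if e.get(f) is not None) / n if n else 0.0)
        for f in fields
    }


def drift_alerts(baseline: list[dict], recent: list[dict],
                 fields: list[str], max_drop: float = 0.10) -> list[str]:
    """Fields whose fill rate dropped by more than max_drop since the baseline."""
    base = field_fill_rates(baseline, fields)
    now = field_fill_rates(recent, fields)
    return [f for f in fields if base[f] - now[f] > max_drop]
```

Running this per utility and template version, rather than across all bills, keeps a format change at one provider from being averaged away by the rest of the volume.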

Multi-page and stapled bills. Industrial customers often receive multi-page bills covering several sub-meters at one site, or a stapled bundle whose cover page summarizes several billing accounts. Document classification has to handle these as a document bundle rather than a single document, with sub-document extraction per page. Generic OCR services do not do this.
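Treating a file as a bundle could start with grouping pages into sub-documents. The sketch below assumes an earlier step has already detected a meter or account ID per page (or `None` when no ID is visible), and that an ID-less page continues the preceding sub-document. Both the function and that heuristic are hypothetical.

```python
from typing import Optional


def split_bundle(page_meter_ids: list[Optional[str]]) -> list[dict]:
    """Group consecutive pages into sub-documents keyed by detected meter ID.

    A leading page without an ID becomes its own group (for example a cover
    page that summarizes the accounts in the bundle).
    """
    groups: list[dict] = []
    for page_no, meter_id in enumerate(page_meter_ids, start=1):
        if groups and (meter_id is None or meter_id == groups[-1]["meter_id"]):
            groups[-1]["pages"].append(page_no)   # continuation page
        else:
            groups.append({"meter_id": meter_id, "pages": [page_no]})
    return groups
```

For a five-page bundle with IDs `[None, "M-1", "M-1", "M-2", None]`, this yields a cover group for page 1, one sub-document for pages 2 and 3, and one for pages 4 and 5.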

Estimated vs. actual readings. Bills marked with an equivalent of 「推定」 in Japan, "推算" in China, or "ประมาณการ" in Thailand contain values that should not feed into the Scope 2 calculation as direct measurements. The system has to recognize the estimation flag and either exclude the estimated period or annotate it downstream so that data quality is reflected in the disclosure. Most OCR pipelines miss this entirely.
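Recognizing the estimation flag can be as simple as a marker search over the bill text, followed by keeping estimated bills out of the measured total. The marker list below contains only the three terms named above; real bills may use other wordings, and the function names and dict keys are assumptions for illustration.

```python
# Markers named in the text; a production list would need per-utility review.
ESTIMATE_MARKERS = ("推定", "推算", "ประมาณการ")


def is_estimated(bill_text: str) -> bool:
    """True if the bill text contains any known estimation marker."""
    return any(marker in bill_text for marker in ESTIMATE_MARKERS)


def measured_kwh(bills: list[dict]) -> tuple[float, int]:
    """Sum kWh over measured bills only; also count the estimated bills excluded."""
    total = 0.0
    excluded = 0
    for bill in bills:
        if is_estimated(bill["text"]):
            excluded += 1
        else:
            total += bill["kwh"]
    return total, excluded
```

Returning the excluded count alongside the total makes it possible to annotate data quality downstream instead of dropping estimated periods silently.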

Currency and tax line items mixed with energy. A single bill combines the energy charge (the part you want), the demand charge (potentially relevant for Scope 2 disaggregation), various surcharges, taxes, and rebates. Extracting only the energy line for the emissions calculation, and pulling cost separately for financial line items if your sustainability report cross-references them, takes a field-level discipline that generic templates do not provide.

Scanned bills with handwritten corrections. A surprising fraction of bills carry a factory code, an account number correction, or a "billed to wrong cost center" annotation handwritten by admin staff. These corrections are operationally meaningful and have to be captured. simpliDoc handles them through the multimodal LLM extraction step, which sees the image and reads the handwriting. Pure text-extraction OCR pipelines miss these annotations.

PDFs that are really images. A non-trivial fraction of the "PDFs" from regional utilities are really a wrapper around a single scanned image with no extractable text. A pipeline that branches on "extractable text? then parse text. else fall through to OCR" handles this correctly. A pipeline that assumes every PDF contains text silently drops these documents.
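The branch described above can be sketched as a small decision function. It assumes the embedded text has already been pulled out by a PDF library (for example pypdf's `extract_text()` on a page); the `min_chars` cutoff is a hypothetical heuristic for ignoring stray characters in image-only PDFs.

```python
from typing import Optional


def choose_path(embedded_text: Optional[str], min_chars: int = 50) -> str:
    """Decide between parsing embedded text and falling through to OCR.

    Image-only PDFs typically yield None, an empty string, or a few stray
    characters, so a minimum length is required before trusting the text.
    """
    text = (embedded_text or "").strip()
    return "parse_text" if len(text) >= min_chars else "ocr"
```

The important property is that every document takes one of the two paths; there is no branch in which a text-less PDF is dropped.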

What it looks like in code

A simplified illustration of the simpliDoc ingestion endpoint, with the boring infrastructure parts removed:

from fastapi import FastAPI, UploadFile
from pydantic import BaseModel
import anthropic

# The collaborators used below (pdf_normalizer, classifier, structured_extractor,
# llm_extractor, rule_engine, esg_data_bridge, audit_log, review_queue,
# escalation_queue) and the thresholds AUTO_PUBLISH_THRESHOLD and
# REVIEW_THRESHOLD are defined elsewhere and omitted from this illustration.

app = FastAPI()
client = anthropic.Anthropic()  # Anthropic API client


class UtilityBillExtraction(BaseModel):
    """Structured result of extracting one utility bill."""
    utility: str
    facility_id: str
    billing_period_start: str
    billing_period_end: str
    kwh_consumed: float
    meter_reading_start: float
    meter_reading_end: float
    meter_multiplier: float
    is_estimated: bool
    confidence: float
    source_page: int
    source_bbox: list[float]


@app.post("/ingest_utility_bill")
async def ingest(pdf: UploadFile, facility_hint: str | None = None):
    # 1. Normalize: rasterize the PDF pages and extract any embedded text
    pages = await pdf_normalizer.process(pdf)

    # 2. Classify: which utility, region, format, and language
    classification = await classifier.classify(pages)

    # 3. Route to the appropriate extraction pipeline
    if classification.is_high_volume_format:
        extraction = await structured_extractor.extract(
            pages, classification
        )
    else:
        extraction = await llm_extractor.extract(
            pages, classification, schema=UtilityBillExtraction
        )

    # 4. Validate against the rules for this document type
    validation = await rule_engine.validate(extraction, classification)
    if validation.has_blocking_issues:
        return {"status": "rejected", "issues": validation.issues}

    # 5. Score confidence; route to auto-publish, review queue, or escalation
    if extraction.confidence >= AUTO_PUBLISH_THRESHOLD:
        await esg_data_bridge.publish(extraction)
        await audit_log.write(pdf, classification, extraction, "auto")
        return {"status": "published", "extraction": extraction}
    elif extraction.confidence >= REVIEW_THRESHOLD:
        await review_queue.enqueue(pdf, classification, extraction)
        return {"status": "queued_for_review", "extraction": extraction}
    else:
        await escalation_queue.enqueue(pdf, classification, extraction)
        return {"status": "escalated", "reason": "low_confidence"}

Not shown but required in production: format drift detection, the per-document-type rule library, the human review interface (which determines whether the admin team will actually use the system), and an immutable storage layer for source PDFs that survives system migrations and platform changes.

Running costs

Token and infrastructure costs at a meaningful volume. For an Asian conglomerate processing 2,000 utility bills a month across 40 facilities:

~2,000 documents × ~3 LLM calls per document (classify, extract, validate-on-failure) = 6,000 LLM calls per month
Average document image at a suitable resolution + extraction prompt = ~3K input tokens, ~400 output tokens per call
Total: ~18M input tokens + ~2.4M output tokens per month
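The volume arithmetic above is easy to reproduce and to rerun for a different bill count. In the sketch below the default arguments are the figures from the list, while the per-million-token prices are deliberately left as parameters because this article does not state them. The function names are hypothetical.

```python
def monthly_tokens(documents: int = 2_000, calls_per_document: int = 3,
                   input_tokens_per_call: int = 3_000,
                   output_tokens_per_call: int = 400) -> tuple[int, int, int]:
    """Return (LLM calls, input tokens, output tokens) per month."""
    calls = documents * calls_per_document
    return calls, calls * input_tokens_per_call, calls * output_tokens_per_call


def monthly_cost(input_tokens: int, output_tokens: int,
                 usd_per_million_input: float,
                 usd_per_million_output: float) -> float:
    """Token cost in USD for prices quoted per million tokens (caller-supplied)."""
    return (input_tokens / 1_000_000 * usd_per_million_input
            + output_tokens / 1_000_000 * usd_per_million_output)


calls, tokens_in, tokens_out = monthly_tokens()
print(calls, tokens_in, tokens_out)  # 6000 18000000 2400000
```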

At current Claude pricing this is significant but manageable, and substantially less than the loaded cost of 1 full-time admin employee doing manual transcription, while delivering higher accuracy and complete audit lineage. The economics improve at higher volume: cost per document stays nearly flat while manual labor cost scales linearly.

For deployments where data residency or cost sensitivity matters, simpliDoc's LLM interface is abstracted in the same way as the SOC integrator's: a locally deployed Qwen2.5-VL or a similar model handles the same extraction workload at near-zero marginal cost, with some quality trade-off on long-tail formats.

Where it fits in the wider CSRD picture

Utility bills are one of several document types that feed ESG Data Bridge. The same simpliDoc architecture handles fuel invoices for Scope 1 calculations, transport waybills for Scope 3 Category 4 (upstream transportation), supplier emission reports for Scope 3 Category 1, and waste disposal manifests for E5 circular economy disclosures.

The pattern is the same in each case. Documents arrive in varied formats, classification routes them to the right extraction pipeline, extraction produces structured data with provenance, validation catches errors before they reach the reporting layer, confidence scoring routes the right documents to human review, and an immutable audit log preserves traceability.

Without this layer, an integration vendor, whoever it is, is building pipes to operational systems that do not actually hold the required data. With it, the operational reality of Asian factories (PDFs, scans, handwritten annotations, mixed languages) becomes a manageable engineering problem instead of the silent failure point that breaks most CSRD projects in their second reporting cycle.

If you are scoping a CSRD implementation and the document-data problem is coming into focus, or your current OCR pilot is producing accuracy numbers that look acceptable in a demo and unacceptable in an audit, that is the conversation we have at Simplico. simpliDoc's PDF ingestion layer is one of the components that makes the ESG Data Bridge flagship architecture actually work in production, and it is deployed in several operational contexts across Asia. Send us a sample of your bills and we will show you the extraction on your specific formats.