Reading Asian Utility Bills at Audit Quality: How simpliDoc Handles the PDF Problem in CSRD

The data source nobody wants to talk about

Walk into the General Affairs department of a Japanese, Thai, or Chinese factory and ask where last month’s electricity consumption data is. You will not be directed to a system. You will be directed to a folder — physical or digital — containing PDFs from the local power company. Some are computer-generated. Some are scans of paper bills with handwritten plant codes added by the GA staff. Some have stamps. Some are double-page foldouts where the second page didn’t quite scan straight.

This is the operational reality across most of Asian manufacturing. It is also the reality that ESRS E1-6 requires you to turn into audit-defensible Scope 2 emission disclosures, by location, by activity, with documented methodology, refreshed every reporting period.

CSRD reporting platform connectors do not read these files. Big 4 integration line items quietly assume "manual data entry by your team" or "OCR tooling, to be specified." Most teams discover in month three of implementation that the GA staff cannot keep up with monthly transcription, OCR vendors do not understand utility bill structure well enough to be reliable, and what was sold as a "data integration challenge" is actually a document understanding challenge with audit-grade requirements.

This post is about how we solve that problem in production, using simpliDoc — Simplico’s multilingual document AI platform — as the ingestion layer feeding the ESG Data Bridge. It’s the technical satellite post the flagship deferred to. It is targeted at IT architects, sustainability tech leads, and anyone evaluating "we’ll just use OCR" as a strategy.

Why generic OCR fails on Asian utility bills

The first instinct on a project like this is "isn’t this what OCR is for?" Modern cloud OCR services — AWS Textract, Google Document AI, Azure Form Recognizer — handle structured documents reasonably well, and they have utility bill templates. So why is this not solved?

Three reasons, each of which dominates for a different subset of documents in the same factory’s folder.

Reason 1: utility bill layouts are not standardized within or across Asian utilities. TEPCO’s commercial bill format differs from KEPCO Japan’s, which differs from Chubu Electric’s, which differs from Hokuriku Electric’s. The PEA (Provincial Electricity Authority) and MEA (Metropolitan Electricity Authority) bills in Thailand differ from each other and from the bills issued by industrial estate operators like Amata or WHA, who often act as intermediate utility resellers. Chinese state grid bills differ between provinces and have changed format three times in the last five years. Generic OCR templates handle one or two of these well and fail on the rest.

Reason 2: the data point you actually need is often not a labeled field. Audit-grade Scope 2 reporting needs kWh consumed during the billing period for this specific facility. What appears on the bill is often the meter reading at start of period, meter reading at end of period, a multiplier (because some industrial meters report in tens or hundreds of kWh per displayed unit), a demand charge component, and a separate "energy charge" line. The kWh you need is a calculated value — sometimes computed by the utility and printed, sometimes left for the customer to derive. Field-extraction OCR doesn’t know which.
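The derivation itself is simple arithmetic, but it has to live explicitly in the pipeline rather than implicitly in a template. A minimal sketch, assuming the readings have already been extracted (the function name is illustrative, not simpliDoc’s API):

```python
def derive_kwh(start_reading: float, end_reading: float, multiplier: float = 1.0) -> float:
    """Derive kWh consumed from raw meter readings.

    Some industrial meters report tens or hundreds of kWh per displayed
    unit, so the reading delta must be scaled by the meter multiplier.
    """
    if end_reading < start_reading:
        # Likely a meter rollover or a misread digit; never publish silently.
        raise ValueError("end reading precedes start reading")
    return (end_reading - start_reading) * multiplier

# A meter counting in tens of kWh: a delta of 4,729 display units is 47,290 kWh.
derive_kwh(123_456, 128_185, multiplier=10)  # → 47290.0
```

Miss the multiplier and the published figure is off by an order of magnitude, which is exactly the class of error an auditor samples for.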

Reason 3: language and script. A Thai factory might receive bills in Thai, with critical fields in Thai script and account numbers in Arabic numerals. A Chinese bill might mix Simplified Chinese characters with Traditional in some legacy formats. A Japanese bill mixes Kanji, Hiragana, and Katakana with full-width Roman numerals. Hand-stamped corrections are often in a different language than the printed form. OCR services that handle one script handle the others poorly, and switching engines per document is operationally untenable.

The result of attempting to solve this with generic OCR is a 70–85% extraction accuracy rate that looks fine in a demo and is catastrophic in audit. A 15% error rate on monthly bills across 40 facilities is six wrong bills per month — and the auditor will find them.

What simpliDoc actually does

simpliDoc is Simplico’s multilingual document AI platform. The architecture for the utility-bill-ingestion use case looks like this:

flowchart TD
    A["Document arrival<br/>(PDF, scan, email attachment)"] --> B["Format normalization"]
    B --> C["Document classification<br/>(utility, type, region, language)"]
    C --> D["Layout-aware extraction"]
    D --> E["Field validation<br/>+ derivation rules"]
    E --> F["Confidence scoring"]
    F --> G{"Confidence<br/>threshold?"}
    G -->|"high"| H["Auto-publish<br/>to ESG Data Bridge"]
    G -->|"medium"| I["Human-in-the-loop<br/>review queue"]
    G -->|"low"| J["Reject<br/>+ escalate"]
    I --> H
    H --> K["Audit log<br/>(immutable, with source PDF)"]

Five components matter, in roughly this order of value contribution.

Component 1: Document classification

Before any extraction happens, the document is classified. Which utility issued it? Which region? Which billing type (commercial, industrial, time-of-use, demand-charge tariff)? Which language? Which template version (utilities update their bill formats periodically and old formats remain in circulation for facilities still on legacy plans)?

Classification matters because it routes the document to the right extraction pipeline. A TEPCO commercial bill goes through one extraction logic, a PEA Thailand industrial bill through another, a Chinese state grid bill through a third. Trying to extract everything with one universal pipeline is the failure mode that limits generic OCR to 70–85% accuracy.

simpliDoc handles classification with a multimodal LLM (Claude, in our deployments) that sees the document image plus any extractable text and produces a structured classification with a confidence score. The classification step takes 1–2 seconds and is essentially free at the volumes involved.
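The useful part of the contract is that the reply is schema-validated, not free text. A sketch of what that might look like, assuming a JSON reply from the model; the field names are illustrative, not simpliDoc’s actual schema:

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class BillClassification:
    utility: str            # e.g. "TEPCO", "PEA", "State Grid"
    region: str
    bill_type: str          # e.g. "commercial", "industrial", "time_of_use"
    language: str           # ISO 639-1 code of the dominant script
    template_version: str   # formats change; old versions stay in circulation
    confidence: float

REQUIRED = {"utility", "region", "bill_type", "language", "template_version", "confidence"}

def parse_classification(llm_reply: str) -> BillClassification:
    """Validate the model's JSON reply; a malformed reply raises instead of
    silently routing the document to the wrong extraction pipeline."""
    data = json.loads(llm_reply)
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"classification reply missing fields: {sorted(missing)}")
    return BillClassification(**{k: data[k] for k in REQUIRED})
```

Failing loudly here is the point: a document routed through the wrong extraction pipeline produces plausible-looking wrong numbers, which is worse than a rejection.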

Component 2: Layout-aware extraction

Once a document is classified, extraction is done with a pipeline tuned for that specific document family. For high-volume formats (TEPCO, KEPCO, PEA, MEA, State Grid provincial variants), this is a combination of region-of-interest detection plus structured-field extraction. For long-tail formats (regional Japanese utilities, industrial estate intermediaries, scanned legacy bills), this falls back to a multimodal LLM extraction with the document image as input and a strict output schema.

The output of extraction is not free text — it’s a structured object with named fields, units, and source-coordinate annotations. For each field, the system records where on the document the value was found, which matters for audit defensibility. When the auditor asks "where did this 47,283 kWh figure come from," the answer includes the PDF, the page, and the bounding box.

Component 3: Field validation and derivation rules

The extracted fields are then validated against a rule library specific to the document type. Examples of rules for an industrial electricity bill:

  • The kWh consumed must equal (end meter reading − start meter reading) × multiplier
  • The total amount due must equal demand charge + energy charge + tax + adjustments
  • The billing period dates must form a valid calendar period (no missing days, no overlap with the previous bill)
  • The facility code on the bill must match a known facility in the asset register
  • The kWh figure must fall within an expected range for this facility (catching meter glitches and decimal-point errors)

When a rule fails, the document is flagged for review. Some failures are recoverable — the system can re-extract specific fields with higher-effort prompts. Others escalate to human review.

The validation layer is where most of the audit-defensibility value lives. Generic OCR returns "47283" with no context. simpliDoc returns "47,283 kWh, derived from meter readings 1,234,567 → 1,281,850, multiplier 1, billing period 2027-09-01 to 2027-09-30, validated against expected range 35,000–60,000 kWh for this facility." That difference is the difference between a number you can defend and a number that gets flagged in audit.

Component 4: Confidence scoring and human-in-the-loop

Every extraction has a calibrated confidence score. High-confidence extractions auto-publish to the ESG Data Bridge. Medium-confidence extractions go to a review queue where a human (typically the GA staff who already handles the bills, now reviewing instead of transcribing) confirms or corrects. Low-confidence extractions are rejected with an explicit reason and the document is escalated.

The thresholds matter and are configurable per document type. For high-volume standardized bills (TEPCO commercial, PEA industrial), confidence thresholds are tight — the cost of a false positive is real, and review capacity is finite. For low-volume long-tail formats, thresholds are looser, with proportionally higher human review.

In production deployments, we typically see 80–90% of bills auto-publishing at high confidence, 8–15% requiring light human review, and 2–5% rejected. Compared to manual transcription, this represents an 80–90% reduction in GA team workload while improving accuracy and audit defensibility simultaneously.

Component 5: Immutable audit log with source attachment

Every published value carries provenance: the source PDF (stored immutably), the extraction method used, the confidence score, the validation rules that passed, the human reviewer (if any), and timestamps for each step. When the auditor asks for traceability, the audit log produces — within a few clicks — the original document, the extracted value, the validation chain, and the final published number. No forensic exercise required.

This audit-log architecture is the same pattern used in the SOC analyst agent post for security-event triage. The principles transfer directly: structured outputs, calibrated confidence, immutable lineage, no silent failures.

The hard parts the demo doesn’t show

A few things that look fine in pilot and break at scale, learned painfully.

Bill format drift. Utilities update bill formats with no notice and no documentation. A pipeline that worked fine for 18 months will start producing extraction errors on bills issued after a format change. Detection requires automated drift monitoring — comparing field-extraction patterns over time and alerting when the pattern shifts. Without this, you find out about format drift when an auditor finds errors in your filed report.
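One cheap proxy for drift is the rolling validation-failure rate per template. A sketch, with illustrative window and thresholds (not simpliDoc’s actual monitor):

```python
from collections import deque

class DriftMonitor:
    """Track the recent validation-failure rate per bill template and
    flag a shift against a baseline rate."""

    def __init__(self, window: int = 200, baseline: float = 0.05, factor: float = 3.0):
        self.window, self.baseline, self.factor = window, baseline, factor
        self.history: dict[str, deque] = {}

    def record(self, template: str, failed: bool) -> bool:
        """Record one extraction outcome; return True if drift is suspected."""
        h = self.history.setdefault(template, deque(maxlen=self.window))
        h.append(failed)
        if len(h) < 50:          # not enough samples to judge yet
            return False
        rate = sum(h) / len(h)
        return rate > self.baseline * self.factor
```

The per-template keying matters: a format change at one utility should not be diluted by healthy volume from the others.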

Multi-page and stapled bills. Industrial customers often receive multi-page bills covering multiple sub-meters at the same site, or stapled bundles where the cover page summarizes multiple billing accounts. Document classification needs to handle these as document bundles, not single documents, with sub-document extraction per page. Generic OCR services typically don’t.

Estimated readings vs actual readings. A bill marked with the equivalent of 「推定」(estimated) in Japanese, or "推算" in Chinese, or "ประมาณการ" in Thai, contains a value that should not feed into Scope 2 calculations as a direct measurement. The system needs to recognize estimation flags and either exclude estimated periods or annotate them downstream so the data quality is reflected in the disclosure. Most OCR pipelines miss this entirely.
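The detection itself can be as simple as a marker scan over the extracted text; the markers below are taken from the examples above, and a production list would be maintained per utility:

```python
# Markers from the examples above; a real deployment maintains these per utility.
ESTIMATION_MARKERS = ("推定", "推算", "ประมาณการ", "estimated", "est.")

def is_estimated_reading(bill_text: str) -> bool:
    """True if the bill text carries an estimated-reading marker
    in any of the supported scripts."""
    text = bill_text.lower()
    return any(marker.lower() in text for marker in ESTIMATION_MARKERS)
```

The flag then travels with the extraction so the downstream disclosure can distinguish measured from estimated periods.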

Currency and tax line items mixed with energy. The same bill includes energy charge (the part you want), demand charge (potentially relevant for Scope 2 disaggregation), various surcharges, taxes, and rebates. Extracting only the energy line for emission calculations — and pulling cost separately for the financial line item if your sustainability report cross-references it — requires field-level discipline that generic templates don’t provide.

Scanned bills with handwritten corrections. A surprising fraction of bills have handwritten plant codes, account number corrections, or "billed to wrong cost center" annotations added by GA staff. These corrections are operationally meaningful and must be captured. simpliDoc handles them via the multimodal LLM extraction step, which sees the image and reads the handwriting. Pure text-extraction OCR pipelines miss these annotations.

PDFs that are actually images. A non-trivial fraction of "PDFs" from regional utilities are actually wrappers around a single scanned image with no extractable text. Pipelines that branch on "extractable text? then parse text. else fall through to OCR" handle this correctly. Pipelines that assume PDFs have text drop these documents silently.

What this looks like in code

A simplified illustration of the simpliDoc ingestion endpoint, with the boring infrastructure removed:

from fastapi import FastAPI, UploadFile
from pydantic import BaseModel

app = FastAPI()

# Collaborators (pdf_normalizer, classifier, structured_extractor, llm_extractor,
# rule_engine, review_queue, escalation_queue, esg_data_bridge, audit_log) and the
# AUTO_PUBLISH_THRESHOLD / REVIEW_THRESHOLD constants are module-level services
# and configuration, elided along with the rest of the infrastructure.

class UtilityBillExtraction(BaseModel):
    utility: str
    facility_id: str
    billing_period_start: str
    billing_period_end: str
    kwh_consumed: float
    meter_reading_start: float
    meter_reading_end: float
    meter_multiplier: float
    is_estimated: bool
    confidence: float
    source_page: int
    source_bbox: list[float]

@app.post("/ingest_utility_bill")
async def ingest(pdf: UploadFile, facility_hint: str | None = None):
    # 1. Normalize: rasterize PDF pages, extract any embedded text
    pages = await pdf_normalizer.process(pdf)

    # 2. Classify: which utility, region, format, language
    classification = await classifier.classify(pages, facility_hint=facility_hint)

    # 3. Route to appropriate extraction pipeline
    if classification.is_high_volume_format:
        extraction = await structured_extractor.extract(
            pages, classification
        )
    else:
        extraction = await llm_extractor.extract(
            pages, classification, schema=UtilityBillExtraction
        )

    # 4. Validate against rules for this document type
    validation = await rule_engine.validate(extraction, classification)
    if validation.has_blocking_issues:
        return {"status": "rejected", "issues": validation.issues}

    # 5. Score confidence; route to auto-publish or review queue
    if extraction.confidence >= AUTO_PUBLISH_THRESHOLD:
        await esg_data_bridge.publish(extraction)
        await audit_log.write(pdf, classification, extraction, "auto")
        return {"status": "published", "extraction": extraction}
    elif extraction.confidence >= REVIEW_THRESHOLD:
        await review_queue.enqueue(pdf, classification, extraction)
        return {"status": "queued_for_review", "extraction": extraction}
    else:
        await escalation_queue.enqueue(pdf, classification, extraction)
        return {"status": "escalated", "reason": "low_confidence"}

The pieces not shown but essential in production: format drift detection, the rule library per document type, the human review interface (which determines whether the GA staff actually use it), and the immutable storage layer for source PDFs that survives system migrations and platform changes.

What it costs to run

Token costs and infrastructure at meaningful volume. For an Asian conglomerate processing 2,000 utility bills per month across 40 facilities:

  • ~2,000 documents × ~3 LLM calls per document (classify, extract, validate-on-failure) = 6,000 LLM calls per month
  • Average document image at appropriate resolution + extraction prompt = ~3K tokens input, ~400 tokens output per call
  • Total: ~18M input tokens + ~2.4M output tokens per month
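The arithmetic above, with per-million-token prices that are purely illustrative placeholders (check current rates before budgeting):

```python
docs_per_month = 2_000
calls_per_doc = 3                  # classify, extract, validate-on-failure
tokens_in_per_call = 3_000
tokens_out_per_call = 400

calls = docs_per_month * calls_per_doc
tokens_in = calls * tokens_in_per_call        # 18,000,000 input tokens
tokens_out = calls * tokens_out_per_call      # 2,400,000 output tokens

# Assumed prices in USD per million tokens -- illustrative only.
price_in, price_out = 3.0, 15.0
monthly_usd = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
print(f"{calls} calls, {tokens_in:,} in / {tokens_out:,} out tokens per month")
```

Note the structure of the result: the bill volume, not the facility count, drives cost, which is why per-document economics stay flat as the deployment grows.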

At current Claude pricing, this is meaningful but tractable — and substantially less than the loaded cost of a single full-time GA staff member doing manual transcription, while delivering higher accuracy and full audit lineage. The economics get more favorable at higher volumes; the per-document cost is roughly flat while manual labor costs scale linearly.

For deployments where data residency or cost sensitivity matters, simpliDoc’s LLM interface is abstracted in the same way the SOC integrator’s is — local-deployed Qwen2.5-VL or similar handles the same extraction workload at near-zero marginal cost, with some quality trade-off on long-tail formats.

Where this fits in the broader CSRD picture

Utility bills are one of several document types feeding the ESG Data Bridge. The same simpliDoc architecture handles fuel invoices for Scope 1 calculations, transport waybills for Scope 3 Category 4 (upstream transportation), supplier emissions reports for Scope 3 Category 1, and waste disposal manifests for the E5 circular economy disclosures.

The pattern is the same in each case. Document arrives in a heterogeneous format, classification routes it to the right extraction pipeline, extraction produces structured data with provenance, validation catches errors before they reach the reporting layer, confidence scoring routes the right documents to human review, and an immutable audit log preserves traceability.

Without this layer, the integration vendor — whoever they are — is building pipes that connect operational systems that don’t actually contain the data needed. With it, the operational reality of Asian factories (PDFs, scans, handwritten annotations, mixed languages) becomes a tractable engineering problem rather than the silent failure point that breaks most CSRD projects in their second reporting cycle.


If you’re scoping CSRD implementation and the document-data problem is starting to come into focus — or if your current OCR pilot is producing accuracy numbers that look acceptable in demo and unacceptable in audit — that’s the conversation we have at Simplico. simpliDoc’s PDF ingestion layer is one of the components that makes the ESG Data Bridge flagship architecture actually work in production, and it’s deployed across multiple Asian operational contexts. Send us a sample of your bills and we’ll show what extraction looks like on your specific formats.


