On-Device AI in React Native: Running LLMs Directly on the Device with ExecuTorch

In the previous article, we built an AI chatbot on a FastAPI backend. That pattern covers most enterprise use cases well. But some requirements call for a different answer: processing personal data without sending it off the device under the PDPA and the Cybersecurity Act, working offline, and controlling cost at high volume. For those, the right answer is to run the model directly on the device.

In 2026 this is no longer just a proof of concept: inference latency under 20 ms on the Snapdragon 8 Elite, models of 1–5 GB that deliver quality comparable to GPT-3.5, and a React Native library that wraps everything in familiar React hooks. On-device AI has crossed the line into real production.

Why On-Device? Four Main Reasons

1. Privacy by architecture under the PDPA. If inference runs on the device, user data never leaves it. There is no API call that can be logged and no third-party server that can be breached. For apps that process personal data under the PDPA, this removes an entire category of risk from the architecture.

2. Latency. A cloud round trip adds 200–800 ms before the first token appears. On-device first-token latency on the Snapdragon 8 Elite is under 100 ms. In a conversational UX, users clearly feel the difference.

3. Cost at high volume. When message volume is high, per-token API fees add up quickly. A 1B-parameter model on the device has zero marginal cost after the initial download.

4. Offline. Factories, infirmaries, remote areas, aircraft cabins: wherever the network is unreliable, on-device AI keeps working.

On-Device Tools for React Native in 2026

Library	Backend	Best for
react-native-executorch	Meta ExecuTorch	LLMs, speech-to-text, image classification; full hooks API
react-native-fast-tflite	Google LiteRT 2.x	Classical ML, vision models, existing TFLite models
MediaPipe via native module	Google MediaPipe	Pose estimation, object detection, text classification

For LLM inference in React Native in 2026, react-native-executorch (by Software Mansion) is the clear choice. It provides a useLLM hook that handles model loading, token streaming, and memory management, useWhisper for on-device speech-to-text, and a full set of computer vision hooks.

Note: react-native-executorch requires the New Architecture. RN 0.84 (March 2026) removed legacy bridge support from iOS builds.

Part 1: Setup

yarn add react-native-executorch
yarn add react-native-executorch-expo-resource-fetcher
yarn add expo-file-system expo-asset

Initialize in the app entry point:

import { initExecutorch } from 'react-native-executorch';
import { ExpoResourceFetcher } from 'react-native-executorch-expo-resource-fetcher';

initExecutorch({
  resourceFetcher: ExpoResourceFetcher,
});

Part 2: Choosing a Model for the ASEAN Market

Model	Size	RAM required	First-token latency	Notes
LFM2.5 1.2B Instruct	~900 MB	~1.4 GB	~80 ms	Best quality for its size
Llama 3.2 1B	~850 MB	~1.3 GB	~90 ms	Good multilingual support
Qwen2.5 0.5B	~400 MB	~700 MB	~45 ms	For mid-range devices with 4–6 GB of RAM
Llama 3.2 3B	~2.0 GB	~3.0 GB	~210 ms	High quality, flagship devices only

For the Thai and ASEAN markets, where mid-range devices (4–6 GB of RAM) are still common, Qwen2.5 0.5B is the better value: it covers most conversational tasks, and its smaller footprint reduces OOM crashes across a diverse device fleet.
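The selection guidance above can be sketched as a small helper. This is only an illustration under stated assumptions: `pickModel`, `ModelChoice`, and the exact thresholds are hypothetical names and values, not part of react-native-executorch. The thresholds follow this article's guidance (about 6 GB for LFM2.5 1.2B, about 4 GB for Qwen2.5 0.5B) with some slack, because the RAM a device reports is typically a little below its marketed figure.

```typescript
// Hypothetical helper: names and thresholds are illustrative assumptions,
// not part of any library API.
type ModelChoice = 'lfm2.5-1.2b-instruct' | 'qwen2.5-0.5b' | 'cloud';

const GIB = 1024 ** 3;

// Pick a model tier from the device's total RAM in bytes.
function pickModel(totalRamBytes: number): ModelChoice {
  const ramGiB = totalRamBytes / GIB;
  // Slack of 0.5 GiB below the marketed 6 GB and 4 GB tiers (assumed value).
  if (ramGiB >= 5.5) return 'lfm2.5-1.2b-instruct';
  if (ramGiB >= 3.5) return 'qwen2.5-0.5b';
  // Below that, skip on-device inference and use the cloud path.
  return 'cloud';
}
```

The total RAM figure has to come from a device-info module of your choice. Where it comes from does not change the logic.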

Part 3: On-Device Chat UI

import { useLLM, models, type Message } from 'react-native-executorch';

const llm = useLLM({
  model: models.llm.lfm2_5_1_2b_instruct(),
});

// Show download progress on first run
if (llm.downloadProgress < 1) {
  return (
    <View style={styles.center}>
      <Text>Downloading AI model…</Text>
      <Text>{Math.round(llm.downloadProgress * 100)}%</Text>
      <Text>One-time download, stored on the device</Text>
    </View>
  );
}

// Generate with token streaming
await llm.generate(messages, {
  onToken: (token: string) => {
    // Copy the last message instead of mutating it in place, so React state stays immutable
    setMessages((prev) => {
      const last = prev[prev.length - 1];
      return [...prev.slice(0, -1), { ...last, content: last.content + token }];
    });
  },
});

Part 4: Hybrid Architecture — On-Device + Cloud Fallback

For production apps, combining on-device and cloud inference is the smartest pattern:

On-device for queries that involve personal data, offline scenarios, and high volume
Cloud for complex reasoning, long documents, and users on older devices

import { useState } from 'react';
import { useLLM, models, type Message } from 'react-native-executorch';

export function useHybridAI() {
  const llm = useLLM({ model: models.llm.lfm2_5_1_2b_instruct() });
  const [isCloudMode, setIsCloudMode] = useState(false);

  const generate = async (messages: Message[], onToken: (token: string) => void) => {
    if (isCloudMode || !llm.isReady) {
      // Cloud path: FastAPI endpoint from R-03
      await cloudGenerate(messages, onToken);
    } else {
      await llm.generate(messages, { onToken });
    }
  };

  return { generate, isOnDevice: !isCloudMode && llm.isReady };
}

The UI shows a badge, "On-device · no data sent" or "Cloud AI", so users know which mode is active.
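The split between on-device and cloud described in this part can also be written as an explicit routing function. The sketch below is one possible policy, not the article's implementation: `chooseBackend`, `RouteInput`, and the rule that personal data is never sent to the cloud are assumptions made for illustration. It is stricter than the hook above, which falls back to the cloud whenever the model is not ready.

```typescript
// Hypothetical routing policy: all names here are illustrative assumptions.
type Backend = 'on-device' | 'cloud' | 'unavailable';

interface RouteInput {
  modelReady: boolean;        // on-device model is downloaded and loaded
  online: boolean;            // network is reachable
  hasPersonalData: boolean;   // query contains personal data
  promptTokens: number;       // estimated prompt size in tokens
  deviceContextLimit: number; // context window of the on-device model
}

function chooseBackend(q: RouteInput): Backend {
  const fitsOnDevice = q.modelReady && q.promptTokens <= q.deviceContextLimit;

  // Assumed rule: personal data stays on the device and is never routed to the cloud.
  if (q.hasPersonalData) return fitsOnDevice ? 'on-device' : 'unavailable';

  // Prefer the device when the prompt fits (no per-token cost, works offline).
  if (fitsOnDevice) return 'on-device';

  // Long prompts or devices without a loaded model go to the cloud, if reachable.
  return q.online ? 'cloud' : 'unavailable';
}
```

Returning 'unavailable' instead of silently switching to the cloud lets the UI explain why a request cannot run, which keeps the "no data sent" badge honest.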

Part 5: Production Pitfalls

The model download UX is make-or-break. Do not trigger the download at app launch. Let users opt in and show the file size up front, for example "Enable AI assistant (900 MB download)". Show progress and allow cancellation.
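The opt-in flow described here (ask first, show the size, report progress, allow cancel) can be modeled as a small state machine. This is a minimal sketch: `DownloadState`, `DownloadEvent`, and `downloadReducer` are hypothetical names, and the calls that actually start or cancel the download depend on the library and are not shown.

```typescript
// Hypothetical state machine for an opt-in model download.
type DownloadState =
  | { phase: 'idle' }
  | { phase: 'confirming'; sizeMB: number }
  | { phase: 'downloading'; progress: number }
  | { phase: 'ready' };

type DownloadEvent =
  | { type: 'REQUEST'; sizeMB: number }    // user taps "Enable AI assistant"
  | { type: 'ACCEPT' }                     // user confirms after seeing the size
  | { type: 'PROGRESS'; progress: number } // 0..1 reported by the downloader
  | { type: 'CANCEL' };                    // user backs out at any point

function downloadReducer(state: DownloadState, event: DownloadEvent): DownloadState {
  switch (event.type) {
    case 'REQUEST':
      // Only an explicit user action moves out of idle, never app launch.
      return state.phase === 'idle' ? { phase: 'confirming', sizeMB: event.sizeMB } : state;
    case 'ACCEPT':
      return state.phase === 'confirming' ? { phase: 'downloading', progress: 0 } : state;
    case 'PROGRESS':
      if (state.phase !== 'downloading') return state;
      return event.progress >= 1
        ? { phase: 'ready' }
        : { phase: 'downloading', progress: event.progress };
    case 'CANCEL':
      // Cancelling is allowed until the model is fully downloaded.
      return state.phase === 'ready' ? state : { phase: 'idle' };
  }
}
```

A reducer of this shape can be passed to React's useReducer with `{ phase: 'idle' }` as the initial state, so the confirmation dialog, progress bar, and cancel button all render from one value.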

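Manage memory. The model should be unloaded when the user leaves the screen. At a minimum, interrupt any in-flight generation on unmount:

useEffect(() => {
  // Cleanup runs on unmount: stop any generation that is still in progress
  return () => { llm.interrupt(); };
}, []);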

Context windows are smaller than in the cloud. A 1B model on the device supports only 2K–8K tokens. Use a sliding window to trim the conversation history before sending it to the model.
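A sliding-window trim can be sketched as follows. This is an illustration under assumptions, not a library feature: `ChatMessage`, `estimateTokens`, and `trimHistory` are hypothetical names. The 4-characters-per-token estimate is a rough heuristic that may be inaccurate for Thai text, so a real tokenizer count should replace it where one is available.

```typescript
// Hypothetical message shape for this sketch.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Rough heuristic (assumption): about 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep all system messages plus as many of the most recent turns as fit in maxTokens.
function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === 'system');
  const rest = messages.filter((m) => m.role !== 'system');

  // System prompts are always kept, so they are charged against the budget first.
  let budget = maxTokens - system.reduce((n, m) => n + estimateTokens(m.content), 0);
  const kept: ChatMessage[] = [];

  // Walk backwards so the newest turns are kept and the oldest are dropped.
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (cost > budget) break;
    budget -= cost;
    kept.unshift(rest[i]);
  }
  return [...system, ...kept];
}
```

Calling `trimHistory(history, limit)` right before generation keeps the prompt inside the model's window. Leaving headroom below the real context limit for the generated reply is advisable.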

Device fragmentation in Thailand. Mid-range devices can hit OOM crashes. Catch the error and fall back to the cloud automatically:

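try {
  await llm.generate(messages, { onToken });
} catch (err: unknown) {
  if (err instanceof Error && err.message.includes('out of memory')) {
    // Switch to the cloud path for this request and later ones
    setIsCloudMode(true);
    await cloudGenerate(messages, onToken);
  } else {
    // Do not swallow unrelated errors
    throw err;
  }
}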

FAQ

Can react-native-executorch be used with the Expo managed workflow?

Yes, with expo-dev-client. It does not work with Expo Go because it requires a native module.

Which devices can run an on-device LLM?

Devices with 6 GB or more of RAM can run LFM2.5 1.2B, and devices with 4 GB can use Qwen2.5 0.5B. For hardware acceleration, use a Snapdragon 8 Gen 3 or 8 Elite (QNN NPU) or an Apple A17 Pro or later (ANE).

Can I use my own fine-tuned model?

Yes. Export the PyTorch model to the .pte format with torch.export, then pass its URL to useLLM.

Is it suitable for enterprise RAG applications?

The best pattern is hybrid: run embedding and retrieval on the device, then use a cloud model for synthesis that needs a long context window. This is the same architecture Simplico uses to connect simpliDoc to its mobile frontend.
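The on-device retrieval step of this hybrid pattern can be sketched as a cosine-similarity top-k search over precomputed embeddings. This is an assumption about how retrieval might be done, not a description of Simplico's implementation: `Chunk`, `cosineSimilarity`, and `topK` are hypothetical names, and producing the embeddings with an on-device model is not shown.

```typescript
// Hypothetical document chunk with a precomputed embedding vector.
interface Chunk {
  id: string;
  embedding: number[];
}

// Cosine similarity of two vectors of equal length; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Return the k chunks most similar to the query embedding, best match first.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.chunk);
}
```

A linear scan like this is usually adequate for a few thousand chunks on a phone. Larger corpora would call for an approximate nearest-neighbor index instead.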

Next Steps

The next article in the R series will connect the React Native app to a private document corpus, combining the RAG pipeline from the D series with the mobile stack from R-03 and R-04.

Have a project that needs private on-device AI, or a mobile app that must search internal documents without sending data out? Contact the Simplico team.