In the previous post in this series, we built a streaming AI chatbot backed by a FastAPI server. That covers the majority of enterprise use cases. But for a growing set of requirements — strict data residency, offline functionality, zero per-token cost at scale — the right answer is to run the model directly on the device.
In 2026, this is no longer a research curiosity. Sub-20ms inference on Snapdragon 8 Elite devices, models in the 1–5 GB range reaching GPT-3.5 equivalent quality, and a React Native library that wraps the whole thing in familiar React hooks — on-device AI has crossed into production viability.
This guide covers the practical path: choosing the right tooling, integrating react-native-executorch, selecting a model that fits on a phone, and building the hybrid architecture that lets you ship cloud AI today and move to on-device later without rewriting your UI.
Why On-Device? The Four-Factor Case
1. Privacy by architecture. If inference runs on the device, user data never leaves it. No API call to log, no third-party server to breach. For healthcare, finance, HR, and any app handling personal data under PDPA, APPI, or PIPL, this removes an entire category of compliance risk.
2. Latency. Cloud round-trips add 200–800ms before the first token appears. On-device first-token latency on a Snapdragon 8 Elite is under 100ms. For conversational UX, that difference is felt.
3. Cost at scale. At high message volume, per-token API costs compound quickly. A 1B-parameter on-device model has zero marginal cost per conversation after the initial download.
4. Offline. Factory floors, hospital wards, rural field workers, aircraft — anywhere the network is unreliable, on-device AI keeps working.
The 2026 On-Device Tooling Landscape for React Native
Three approaches exist for running AI models in React Native:
| Library | Backend | Best for |
|---|---|---|
| react-native-executorch | Meta ExecuTorch | LLMs, speech-to-text, image classification — full hooks API |
| react-native-fast-tflite | Google LiteRT 2.x | Classical ML, vision models, existing TFLite models |
| MediaPipe via native module | Google MediaPipe | Pose estimation, object detection, text classification |
For LLM inference in React Native in 2026, react-native-executorch (by Software Mansion) is the clear choice. It provides:
- A useLLM hook that handles model loading, token streaming, and memory management
- A useWhisper hook for on-device speech-to-text (Whisper via ExecuTorch)
- Computer vision hooks: useImageClassification, useObjectDetection, useOCR
- Pre-optimized models on HuggingFace (no model conversion required)
- Full compatibility with React Native 0.84’s New Architecture (mandatory as of March 2026)
Note: react-native-executorch requires the New Architecture. If your project still uses the legacy bridge, this is the moment to migrate — RN 0.84 removed legacy bridge support from iOS builds entirely.
Part 1: Setup
yarn add react-native-executorch
yarn add react-native-executorch-expo-resource-fetcher
yarn add expo-file-system expo-asset
Initialize ExecuTorch in your app entry point:
// app/_layout.tsx (or App.tsx)
import { initExecutorch } from 'react-native-executorch';
import { ExpoResourceFetcher } from 'react-native-executorch-expo-resource-fetcher';

initExecutorch({
  resourceFetcher: ExpoResourceFetcher,
});
Part 2: Choosing a Model
On-device model selection involves three hard constraints: RAM, storage, and inference speed. As of mid-2026:
| Model | Size | RAM Required | First token latency | Notes |
|---|---|---|---|---|
| LFM2.5 1.2B Instruct | ~900 MB | ~1.4 GB | ~80ms (S8 Elite) | Best quality-to-size ratio; default recommendation |
| Llama 3.2 1B Instruct | ~850 MB | ~1.3 GB | ~90ms (S8 Elite) | Strong multilingual support |
| Qwen2.5 0.5B | ~400 MB | ~700 MB | ~45ms (S8 Elite) | Ultra-fast; good for mid-range devices |
| Llama 3.2 3B Instruct | ~2.0 GB | ~3.0 GB | ~210ms (S8 Elite) | Higher quality; flagship-only |
For most use cases, start with LFM2.5 1.2B. It’s the default model in react-native-executorch’s HuggingFace collection, well-optimized for ExecuTorch, and runs at interactive speed on any 2024+ flagship.
For ASEAN deployments where mid-range devices (4–6 GB RAM) are common, Qwen2.5 0.5B is worth considering: it covers most conversational tasks, and its smaller footprint means fewer out-of-memory (OOM) crashes across a fragmented device base.
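The RAM column in the table above can also drive model selection at runtime. The following is a minimal sketch, not part of react-native-executorch: the `ModelOption` type, the `pickModel` helper, and the 25% RAM budget are illustrative assumptions, and the RAM figures are copied from the table. Total device memory would have to come from a device-info library of your choice.

```typescript
// Hypothetical helper: not part of react-native-executorch.
interface ModelOption {
  name: string;
  ramRequiredGb: number; // "RAM Required" column from the table above
}

// Ordered by preference, with the default recommendation first.
// The 3B model is left out because the table marks it flagship-only.
const CANDIDATES: ModelOption[] = [
  { name: 'LFM2.5 1.2B Instruct', ramRequiredGb: 1.4 },
  { name: 'Llama 3.2 1B Instruct', ramRequiredGb: 1.3 },
  { name: 'Qwen2.5 0.5B', ramRequiredGb: 0.7 },
];

// Assumption: the model may use at most a quarter of total device RAM.
const RAM_BUDGET_FRACTION = 0.25;

// Returns the first preferred model that fits the budget,
// or null when nothing fits and the app should use the cloud path.
function pickModel(totalRamGb: number): ModelOption | null {
  const budgetGb = totalRamGb * RAM_BUDGET_FRACTION;
  return CANDIDATES.find((m) => m.ramRequiredGb <= budgetGb) ?? null;
}
```

With these assumptions, an 8 GB device gets LFM2.5 1.2B, a 4 GB device gets Qwen2.5 0.5B, and a 2 GB device gets `null`. Tune the budget fraction against real crash data rather than treating 25% as a rule.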
Part 3: Building the On-Device Chat UI
// components/OnDeviceChat.tsx
import { useState } from 'react';
import {
  View,
  Text,
  TextInput,
  TouchableOpacity,
  FlatList,
  ActivityIndicator,
  StyleSheet,
} from 'react-native';
import { useLLM, models, type Message } from 'react-native-executorch';

export function OnDeviceChat() {
  const [input, setInput] = useState('');
  const [messages, setMessages] = useState<Message[]>([
    { role: 'system', content: 'You are a helpful assistant.' },
  ]);
  const llm = useLLM({
    model: models.llm.lfm2_5_1_2b_instruct(),
  });

  // Show download progress on first load
  if (llm.downloadProgress < 1) {
    return (
      <View style={styles.center}>
        <Text style={styles.label}>Downloading AI model…</Text>
        <Text style={styles.progress}>
          {Math.round(llm.downloadProgress * 100)}%
        </Text>
        <Text style={styles.hint}>
          This only happens once. The model is cached on your device.
        </Text>
      </View>
    );
  }

  if (!llm.isReady) {
    return (
      <View style={styles.center}>
        <ActivityIndicator size="large" />
        <Text style={styles.label}>Loading model into memory…</Text>
      </View>
    );
  }

  const handleSend = async () => {
    const text = input.trim();
    if (!text || llm.isGenerating) return;

    const userMsg: Message = { role: 'user', content: text };
    const newMessages = [...messages, userMsg];
    setMessages(newMessages);
    setInput('');

    // Placeholder for the streaming reply
    setMessages((prev) => [
      ...prev,
      { role: 'assistant', content: '' },
    ]);

    await llm.generate(newMessages, {
      // Append each streamed token to the trailing assistant message
      onToken: (token: string) => {
        setMessages((prev) => {
          const updated = [...prev];
          updated[updated.length - 1] = {
            role: 'assistant',
            content: updated[updated.length - 1].content + token,
          };
          return updated;
        });
      },
    });
  };

  // The system prompt is sent to the model but never rendered
  const displayMessages = messages.filter((m) => m.role !== 'system');

  return (
    <View style={styles.container}>
      <View style={styles.badge}>
        <Text style={styles.badgeText}>On-device · No data sent</Text>
      </View>
      <FlatList
        data={displayMessages}
        keyExtractor={(_, i) => i.toString()}
        renderItem={({ item }) => (
          <View
            style={[
              styles.bubble,
              item.role === 'user' ? styles.userBubble : styles.aiBubble,
            ]}
          >
            <Text style={styles.bubbleText}>{item.content}</Text>
          </View>
        )}
      />
      <View style={styles.inputRow}>
        <TextInput
          style={styles.input}
          value={input}
          onChangeText={setInput}
          placeholder="Ask anything…"
          multiline
        />
        <TouchableOpacity
          style={[styles.sendBtn, llm.isGenerating && styles.disabled]}
          onPress={handleSend}
          disabled={llm.isGenerating}
        >
          {llm.isGenerating ? (
            <ActivityIndicator color="#fff" size="small" />
          ) : (
            <Text style={styles.sendText}>Send</Text>
          )}
        </TouchableOpacity>
      </View>
    </View>
  );
}
const styles = StyleSheet.create({
  container: { flex: 1, backgroundColor: '#f5f5f5' },
  center: { flex: 1, justifyContent: 'center', alignItems: 'center', padding: 24 },
  badge: {
    backgroundColor: '#e8f5e9',
    paddingHorizontal: 12,
    paddingVertical: 4,
    margin: 8,
    borderRadius: 12,
    alignSelf: 'center',
  },
  badgeText: { color: '#2e7d32', fontSize: 12, fontWeight: '600' },
  label: { fontSize: 16, fontWeight: '600', marginBottom: 8 },
  progress: { fontSize: 32, fontWeight: '700', marginBottom: 8 },
  hint: { fontSize: 13, color: '#666', textAlign: 'center' },
  bubble: { margin: 8, padding: 12, borderRadius: 16, maxWidth: '80%' },
  userBubble: { alignSelf: 'flex-end', backgroundColor: '#0066ff' },
  aiBubble: { alignSelf: 'flex-start', backgroundColor: '#ffffff' },
  bubbleText: { fontSize: 15, lineHeight: 22 },
  inputRow: {
    flexDirection: 'row',
    padding: 8,
    backgroundColor: '#fff',
    borderTopWidth: 1,
    borderColor: '#e0e0e0',
  },
  input: { flex: 1, fontSize: 15, paddingHorizontal: 12, maxHeight: 100 },
  sendBtn: {
    backgroundColor: '#0066ff',
    borderRadius: 20,
    paddingHorizontal: 18,
    justifyContent: 'center',
  },
  disabled: { backgroundColor: '#aaa' },
  sendText: { color: '#fff', fontWeight: '600' },
});
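The token-append logic inside the onToken callback above can be pulled out into a pure function, which makes it easy to unit test without rendering anything. This is a sketch under assumptions: `ChatMessage` is a local stand-in for the library's Message type, and `appendToken` is a hypothetical helper name.

```typescript
// Local stand-in for the message shape used in the component above.
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Returns a new array with the token appended to the trailing assistant message.
// If the last message is not from the assistant, a new assistant message is started.
// The input array and its messages are never mutated.
function appendToken(messages: ChatMessage[], token: string): ChatMessage[] {
  const last = messages[messages.length - 1];
  if (!last || last.role !== 'assistant') {
    return [...messages, { role: 'assistant', content: token }];
  }
  return [...messages.slice(0, -1), { ...last, content: last.content + token }];
}
```

Inside the component, the callback would then reduce to `setMessages((prev) => appendToken(prev, token))`, and the empty placeholder message becomes optional.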
Part 4: The Hybrid Architecture
Fully on-device is not always the right answer. A production app often needs both:
- On-device for privacy-sensitive queries, offline scenarios, and high-volume chat
- Cloud for complex reasoning tasks, long document synthesis, and users on older or low-end devices
The cleanest pattern is to implement both paths behind a single interface and route at runtime:
// hooks/useHybridAI.ts
import { useLLM, models } from 'react-native-executorch';
import { useState } from 'react';

const API_URL = process.env.EXPO_PUBLIC_API_URL ?? '';

export function useHybridAI() {
  const llm = useLLM({ model: models.llm.lfm2_5_1_2b_instruct() });
  const [isCloudMode, setIsCloudMode] = useState(false);

  // Use the cloud if cloud mode has been selected or the model isn't ready yet
  const shouldUseCloud = isCloudMode || !llm.isReady;

  const generate = async (
    messages: Array<{ role: string; content: string }>,
    onToken: (t: string) => void
  ) => {
    if (shouldUseCloud) {
      // Cloud path: same FastAPI endpoint from R-03
      const response = await fetch(`${API_URL}/chat`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ messages }),
        reactNative: { textStreaming: true },
      } as RequestInit);
      if (!response.ok) {
        throw new Error(`Cloud request failed with status ${response.status}`);
      }

      const reader = response.body?.getReader();
      const decoder = new TextDecoder();
      let buffer = ''; // holds a partial line that spans two network chunks
      while (reader) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? ''; // the last element may be an incomplete line
        for (const line of lines) {
          if (line.startsWith('data: ') && line.slice(6) !== '[DONE]') {
            onToken(line.slice(6));
          }
        }
      }
    } else {
      // On-device path
      await llm.generate(messages, { onToken });
    }
  };

  return {
    generate,
    isOnDevice: !shouldUseCloud,
    isReady: true, // the cloud path is always available as a fallback
    downloadProgress: llm.downloadProgress,
    setCloudMode: setIsCloudMode,
  };
}
The UI shows a simple indicator of which mode is active. Users can toggle to cloud if they need more capable responses — keeping control visible rather than invisible.
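The routing decision itself can be kept as a small pure function so the policy is testable and visible in one place. The sketch below goes slightly beyond the hook above by adding an `online` flag, which is an assumption: the hook does not track connectivity, and you would need a network-status library to supply it. `chooseRoute`, `Route`, and `RouteInput` are hypothetical names.

```typescript
type Route = 'device' | 'cloud' | 'unavailable';

interface RouteInput {
  modelReady: boolean;       // on-device model is downloaded and loaded
  userPrefersCloud: boolean; // user flipped the toggle to cloud
  online: boolean;           // network is reachable (assumed to be tracked elsewhere)
}

// Policy: honor the user's cloud preference when possible, otherwise prefer
// the device, and only report 'unavailable' when neither path can answer.
function chooseRoute({ modelReady, userPrefersCloud, online }: RouteInput): Route {
  if (userPrefersCloud && online) return 'cloud';
  if (modelReady) return 'device'; // also covers "prefers cloud but offline"
  return online ? 'cloud' : 'unavailable';
}
```

Returning an explicit `'unavailable'` state lets the UI explain why nothing is happening (model still downloading and no network) instead of failing on a fetch error.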
Part 5: Production Gotchas
Model download UX is critical. A 900 MB download on first launch will cause users to abandon the feature. Never trigger it at app startup. Gate it behind an explicit opt-in: "Enable AI assistant (downloads 900 MB)". Show progress, allow cancellation, and make clear the download only happens once.
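One way to keep the opt-in, progress, and cancellation rules honest is to model them as a small state machine. This is a sketch, not the library's download API: the state and event names are hypothetical, and wiring the events to the real download is left to your app.

```typescript
type DownloadState =
  | { status: 'idle' }
  | { status: 'downloading'; progress: number } // fraction from 0 to 1
  | { status: 'ready' }
  | { status: 'cancelled' }
  | { status: 'failed'; reason: string };

type DownloadEvent =
  | { type: 'OPT_IN' }
  | { type: 'PROGRESS'; fraction: number }
  | { type: 'CANCEL' }
  | { type: 'COMPLETE' }
  | { type: 'FAIL'; reason: string };

function downloadReducer(state: DownloadState, event: DownloadEvent): DownloadState {
  const downloading = state.status === 'downloading';
  switch (event.type) {
    case 'OPT_IN':
      // Only an explicit opt-in starts (or restarts) a download.
      return downloading || state.status === 'ready'
        ? state
        : { status: 'downloading', progress: 0 };
    case 'PROGRESS':
      return downloading
        ? { status: 'downloading', progress: Math.min(1, Math.max(0, event.fraction)) }
        : state;
    case 'CANCEL':
      return downloading ? { status: 'cancelled' } : state;
    case 'COMPLETE':
      return downloading ? { status: 'ready' } : state;
    case 'FAIL':
      return downloading ? { status: 'failed', reason: event.reason } : state;
    default:
      return state;
  }
}
```

Because progress events are ignored unless the state is `downloading`, nothing can move the UI out of `idle` except the user's opt-in, which is the property the paragraph above asks for. The reducer plugs directly into React's `useReducer`.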
Memory management. LLM inference holds a large allocation in RAM for the full session. Always unload the model when leaving the screen or when the app backgrounds:
useEffect(() => {
  return () => {
    llm.interrupt(); // Stop any ongoing generation
    // Model memory is released when the hook unmounts
  };
}, []);
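The cleanup above covers leaving the screen but not the app moving to the background. A minimal sketch of the second case follows, assuming only the `interrupt()` method used above and React Native's `AppState.addEventListener('change', …)` subscription API. The `AppStateLike` interface and `interruptOnBackground` helper are hypothetical; the interface is a structural subset of `AppState` so the helper can be tested without React Native. Note that this stops generation only and does not unload the model.

```typescript
// Structural subset of React Native's AppState (an assumption for testability).
type AppStatus = 'active' | 'background' | 'inactive' | 'unknown' | 'extension';

interface AppStateLike {
  addEventListener(
    type: 'change',
    listener: (status: AppStatus) => void
  ): { remove(): void };
}

// Calls interrupt() whenever the app leaves the foreground.
// Returns a cleanup function that removes the subscription.
function interruptOnBackground(
  appState: AppStateLike,
  interrupt: () => void
): () => void {
  const subscription = appState.addEventListener('change', (status) => {
    if (status !== 'active') interrupt();
  });
  return () => subscription.remove();
}

// Intended usage inside a component (requires `AppState` from 'react-native'):
//   useEffect(() => interruptOnBackground(AppState, () => llm.interrupt()), []);
```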
Context window on-device is smaller. LFM2.5 1.2B uses a sliding window context strategy — on-device models typically support 2K–8K tokens, versus the 200K window of cloud models. Implement a sliding window in your message history:
const MAX_CONTEXT_TOKENS = 1500; // conservative

function trimMessages(messages: Message[]): Message[] {
  const system = messages.filter((m) => m.role === 'system');
  const rest = messages.filter((m) => m.role !== 'system');
  // Keep most recent turns; crude token estimate: 1 token ≈ 4 chars
  let tokens = 0;
  const kept: Message[] = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    tokens += Math.ceil(rest[i].content.length / 4);
    // Always keep the newest message, even if it alone exceeds the budget
    if (tokens > MAX_CONTEXT_TOKENS && kept.length > 0) break;
    kept.unshift(rest[i]);
  }
  return [...system, ...kept];
}
Battery and thermals. Extended generation heats the device and drains battery. Cap max_tokens at 256 for conversational use, show a visible "Thinking…" indicator so users understand the model is working, and avoid running inference in the background.
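If you need to enforce the cap yourself, one option is to count tokens in the callback and stop generation once the limit is reached. The sketch below assumes only the onToken callback and `interrupt()` method used elsewhere in this post; `capTokens` is a hypothetical helper, and a native generation-length setting, if the library offers one, would be preferable.

```typescript
// Wraps an onToken callback so that `interrupt` is called once `maxTokens`
// tokens have been delivered. Tokens that arrive after the cap are dropped.
function capTokens(
  onToken: (token: string) => void,
  maxTokens: number,
  interrupt: () => void
): (token: string) => void {
  let delivered = 0;
  let stopped = false;
  return (token) => {
    if (stopped) return;
    onToken(token);
    delivered += 1;
    if (delivered >= maxTokens) {
      stopped = true;
      interrupt(); // called exactly once
    }
  };
}
```

A call site might look like `llm.generate(messages, { onToken: capTokens(onToken, 256, () => llm.interrupt()) })`. Create a fresh wrapper per generation, since the counter lives in the closure.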
Device fragmentation. On-device AI works best on 2024+ flagships. For mid-range devices (4–6 GB RAM), use Qwen2.5 0.5B or fall back to cloud automatically by catching OOM errors:
try {
  await llm.generate(messages, { onToken });
} catch (err: unknown) {
  if (err instanceof Error && err.message.includes('out of memory')) {
    setIsCloudMode(true); // Permanent fallback for this session
    await cloudGenerate(messages, onToken);
  } else {
    throw err; // Don't swallow unrelated failures
  }
}
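The same fallback can be written as a reusable helper that takes both generation paths as arguments, which also makes it testable without a device. This is a sketch under assumptions: `generateWithFallback` and `isOutOfMemory` are hypothetical names, and matching on the text "out of memory" mirrors the snippet above rather than any documented error contract.

```typescript
type Generate = (onToken: (t: string) => void) => Promise<void>;

// Assumption: OOM failures surface as an Error whose message mentions
// "out of memory". Adjust the predicate to what your runtime reports.
function isOutOfMemory(err: unknown): boolean {
  return err instanceof Error && /out of memory/i.test(err.message);
}

// Tries the on-device path first; on an OOM error, notifies the caller and
// retries on the cloud path. Any other error is rethrown unchanged.
async function generateWithFallback(
  onDevice: Generate,
  cloud: Generate,
  onToken: (t: string) => void,
  onFallback: () => void
): Promise<'device' | 'cloud'> {
  try {
    await onDevice(onToken);
    return 'device';
  } catch (err) {
    if (!isOutOfMemory(err)) throw err;
    onFallback(); // e.g. switch the session to cloud mode
    await cloud(onToken);
    return 'cloud';
  }
}
```

If the on-device path emitted some tokens before failing, the cloud reply will be appended after them, so `onFallback` is also a reasonable place to clear the partial reply.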
When to Choose On-Device vs Cloud
| Requirement | On-Device | Cloud |
|---|---|---|
| Data privacy mandate (PDPA, APPI, PIPL) | ✅ Strong | ⚠️ Depends on region |
| Offline operation | ✅ | ✗ |
| Long document synthesis | ⚠️ Limited context | ✅ 200K+ tokens |
| Complex multi-step reasoning | ⚠️ 1B model limits | ✅ Frontier models |
| Zero per-message cost | ✅ | ✗ |
| Consistent across all device tiers | ⚠️ RAM gating required | ✅ |
| First-time setup friction | ⚠️ Model download | ✅ Instant |
The hybrid architecture from Part 4 lets you start with cloud and progressively shift traffic on-device as your model delivery pipeline matures.
FAQ
Does react-native-executorch work with Expo managed workflow?
Yes, with expo-dev-client. The ExpoResourceFetcher handles model downloads via expo-file-system and expo-asset. You cannot use it with Expo Go (native modules are required).
Which devices support on-device LLM inference?
Any device with 6+ GB RAM can run LFM2.5 1.2B or Llama 3.2 1B. Qwen2.5 0.5B works on 4 GB devices. For hardware-accelerated inference, Snapdragon 8 Gen 3 and 8 Elite use the QNN NPU delegate; Apple A17 Pro and later use the ANE (Apple Neural Engine) delegate. Older chipsets fall back to CPU inference, which is slower but functional.
Can I use my own fine-tuned model?
Yes. Export your PyTorch model to .pte format using torch.export and the ExecuTorch edge compilation pipeline, then host it and pass the URL to useLLM. See the ExecuTorch documentation for the export pipeline.
How do I update the model OTA without an app store release?
Host the .pte file on your CDN and version it. On app launch, check for a new version and download it to the device’s file system via expo-file-system. The model is loaded from disk, so no code change is needed to switch model versions.
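The version check at launch can be a few lines of pure logic. The sketch below assumes a hypothetical manifest shape served from your CDN (it is not an ExecuTorch format) and purely numeric dotted versions; the download itself is left out.

```typescript
// Hypothetical manifest published next to the model on your CDN.
interface ModelManifest {
  version: string; // dotted numeric version, e.g. "1.4.0"
  url: string;     // location of the .pte file
}

// Compares dotted numeric versions.
// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: string, b: string): number {
  const partsA = a.split('.').map(Number);
  const partsB = b.split('.').map(Number);
  const length = Math.max(partsA.length, partsB.length);
  for (let i = 0; i < length; i++) {
    const diff = (partsA[i] ?? 0) - (partsB[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Returns the URL to download, or null when the installed model is current.
function modelUpdateUrl(
  installedVersion: string | null,
  remote: ModelManifest
): string | null {
  if (installedVersion === null) return remote.url; // nothing on disk yet
  return compareVersions(remote.version, installedVersion) > 0 ? remote.url : null;
}
```

Comparing numerically rather than as strings matters once a component reaches two digits: "1.10.0" is newer than "1.9.3", even though it sorts earlier alphabetically.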
Is on-device AI suitable for enterprise RAG applications?
For document QA on a private corpus, the hybrid approach works best: run embeddings and retrieval on-device (the context is small), then use a cloud model for the final synthesis step where context length matters most. This is the architecture we use when connecting simpliDoc’s document layer to mobile frontends.
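The on-device retrieval step can be as simple as a cosine-similarity ranking over precomputed chunk embeddings. This is a minimal sketch, not simpliDoc's implementation: the `Chunk` shape and the `topK` helper are hypothetical, and the embeddings are assumed to come from whatever on-device embedding model you use.

```typescript
interface Chunk {
  text: string;
  embedding: number[]; // assumed to be precomputed on the device
}

// Cosine similarity of two vectors; returns 0 if either has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns the k chunks most similar to the query embedding, best first.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((p, q) => q.score - p.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

Only the text of the selected chunks needs to be sent to the cloud model for synthesis. A linear scan like this is fine for a few thousand chunks; larger corpora would need an index.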
What’s Next
On-device inference covers the chat and offline use cases. The next step in the R-series connects the React Native app to a private document corpus — combining the RAG pipeline from the D-series with the mobile stack from R-03 and R-04.
Have a project that requires private on-device AI, or a mobile app that needs to query internal documents without sending data to the cloud? Contact the Simplico team.
