Most React Native tutorials stop at the UI layer. They show you how to render chat bubbles and handle keyboard offsets—then hand-wave the backend with a vague "call the OpenAI API from your app."
That approach has two problems. First, you’re putting your API key inside a mobile binary anyone can extract. Second, you have no server-side control: no rate limiting, no user context injection, no logging, no ability to swap models without pushing an app update.
This guide takes a production-minded path. We build a FastAPI backend that handles the LLM connection with streaming Server-Sent Events (SSE), then wire it to an Expo (React Native) front-end that renders responses token by token. The same backend pattern integrates cleanly with private LLM deployments—so if your client eventually wants to run inference on-premise, you change one URL in the config, not the entire codebase.
The Architecture
flowchart TD
A["Mobile App Expo SDK 54"] --> B["FastAPI Backend"]
B --> C["LLM Provider"]
C --> D["SSE Stream"]
D --> E["Chunked Fetch RN 0.81"]
E --> F["Chat UI renders tokens"]
Key design decisions:
FastAPI over a serverless route. FastAPI gives you WebSocket support, dependency injection for auth middleware, and easy integration with Python-based private LLMs (Ollama, vLLM, LiteLLM). If you’re building for enterprise clients who may later need on-premise AI, a Python backend is the right long-term choice.
SSE over WebSockets for streaming. SSE is one-directional and HTTP/1.1-compatible, which makes it simpler to proxy and load-balance than a WebSocket connection. React Native’s fetch API (as of RN 0.81) supports reading chunked responses incrementally, which gives the same result without needing a WebSocket library.
LLM-agnostic backend. The provider is an env-var swap. Your mobile app never needs to know whether it’s hitting Claude, GPT-5, or a private Llama deployment.
Choosing a Model for Mobile Chatbots
Before writing code, pick your model. In June 2026 the price/performance landscape looks like this:
| Model | Input / 1M tokens | Output / 1M tokens | Best for |
|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 | High-volume mobile chatbots, FAQ bots |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Complex reasoning, multi-turn sales assistants |
| GPT-5.4 | $2.50 | $15.00 | OpenAI-native toolchains |
| DeepSeek V4 Flash | $0.14 | $0.28 | Cost-sensitive ASEAN deployments |
For most mobile chatbots (support bots, onboarding assistants, FAQ handlers), Claude Haiku 4.5 hits the right balance. Its 200K context window comfortably holds long conversation histories, and at $1.00/M input tokens and $5.00/M output tokens it costs roughly 3× less per conversation than Sonnet 4.6 ($3.00 and $15.00). If your app needs multi-step reasoning or nuanced synthesis (a document assistant built on top of simpliDoc, for example), step up to Sonnet 4.6.
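To make the price gap concrete, here is a rough per-conversation estimate using the prices from the table. The token counts (8,000 input and 1,500 output tokens per conversation) are illustrative assumptions, not measurements, so substitute numbers from your own traffic.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "Claude Haiku 4.5": {"input": 1.00, "output": 5.00},
    "Claude Sonnet 4.6": {"input": 3.00, "output": 15.00},
}


def conversation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one conversation for the given token counts."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Assumed workload: 8,000 input tokens and 1,500 output tokens per conversation.
haiku = conversation_cost("Claude Haiku 4.5", 8_000, 1_500)
sonnet = conversation_cost("Claude Sonnet 4.6", 8_000, 1_500)
print(f"Haiku:  ${haiku:.4f}")
print(f"Sonnet: ${sonnet:.4f}")
print(f"Ratio:  {sonnet / haiku:.1f}x")
```

Because both the input and output prices differ by the same factor, the ratio between the two models does not depend on the token mix you assume.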
Part 1: FastAPI Backend
Project setup
mkdir chatbot-api && cd chatbot-api
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn anthropic python-dotenv
Create a .env file:
ANTHROPIC_API_KEY=your_key_here
MODEL_ID=claude-haiku-4-5
SYSTEM_PROMPT="You are a helpful assistant for Acme Corp."
Streaming chat endpoint
# main.py
import os
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import List
import anthropic
from dotenv import load_dotenv

load_dotenv()

app = FastAPI()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
MODEL = os.getenv("MODEL_ID", "claude-haiku-4-5")
SYSTEM = os.getenv("SYSTEM_PROMPT", "You are a helpful assistant.")


class Message(BaseModel):
    role: str  # "user" or "assistant"
    content: str


class ChatRequest(BaseModel):
    messages: List[Message]


def stream_response(messages: List[Message]):
    """Generator that yields SSE-formatted tokens."""
    with client.messages.stream(
        model=MODEL,
        max_tokens=1024,
        system=SYSTEM,
        messages=[m.model_dump() for m in messages],
    ) as stream:
        for text in stream.text_stream:
            # SSE format: every line of a chunk gets its own "data: " prefix,
            # so a newline inside a token cannot break the event framing.
            payload = "".join(f"data: {line}\n" for line in text.split("\n"))
            yield payload + "\n"
    yield "data: [DONE]\n\n"


@app.post("/chat")
async def chat(request: ChatRequest):
    if not request.messages:
        raise HTTPException(status_code=400, detail="messages cannot be empty")
    return StreamingResponse(
        stream_response(request.messages),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )


@app.get("/health")
async def health():
    return {"status": "ok", "model": MODEL}
Run locally:
uvicorn main:app --reload --port 8000
Test the stream with curl (the -N flag turns off curl's output buffering so tokens print as they arrive):
curl -N -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello, who are you?"}]}'
You should see tokens arriving one by one, each prefixed with data: .
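If you would rather check the stream from a script than by eye, the framing can be parsed in a few lines. The sketch below is an illustration rather than part of the backend: `iter_sse_data` is a hypothetical helper that accepts any iterable of text chunks (for example, pieces read from an HTTP response) and yields each event's payload until it sees `[DONE]`. It buffers partial lines, because a network chunk can end in the middle of a `data:` line.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event until "[DONE]" arrives.

    Chunks may split an event anywhere, so incomplete lines are buffered
    until the rest of the line shows up.
    """
    buffer = ""
    data_lines: List[str] = []
    for chunk in chunks:
        buffer += chunk
        # Everything before the last "\n" is complete; keep the remainder.
        *lines, buffer = buffer.split("\n")
        for line in lines:
            if line.startswith("data: "):
                data_lines.append(line[len("data: "):])
            elif line == "" and data_lines:
                # A blank line ends the event; data lines join with "\n".
                payload = "\n".join(data_lines)
                data_lines = []
                if payload == "[DONE]":
                    return
                yield payload
```

For example, `list(iter_sse_data(["data: Hel", "lo\n\n", "data: [DONE]\n\n"]))` evaluates to `["Hello"]` even though the first event arrives split across two chunks. The same buffering concern applies to any client that reads the stream in chunks, including the mobile app.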
Adding authentication
Before deploying, add an API key check so only your mobile app can call the endpoint:
from fastapi import Header

@app.post("/chat")
async def chat(request: ChatRequest, x_api_key: str = Header(...)):
    # Header(...) makes the x-api-key header required
    if x_api_key != os.getenv("APP_API_KEY"):
        raise HTTPException(status_code=401, detail="Unauthorized")
    # ... rest of handler
Generate a random key with openssl rand -hex 32, add it to the backend's .env as APP_API_KEY, and supply it to your Expo app via expo-constants or an environment config (the hook in Part 2 reads EXPO_PUBLIC_APP_API_KEY), never hardcoded in source.
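One detail worth considering is that comparing secrets with `!=` can leak timing information. A minimal sketch of a constant-time check using only the standard library follows. `generate_app_key` and `is_authorized` are hypothetical helper names, and `secrets.token_hex(32)` produces the same kind of 64-character hex key as the openssl command.

```python
import hmac
import secrets
from typing import Optional


def generate_app_key() -> str:
    """Return a random 64-character hex key (32 random bytes)."""
    return secrets.token_hex(32)


def is_authorized(presented: Optional[str], expected: Optional[str]) -> bool:
    """Constant-time comparison of the presented key against the expected one.

    Rejects the request when either value is missing, so an unset
    APP_API_KEY never lets an empty header through.
    """
    if not presented or not expected:
        return False
    return hmac.compare_digest(presented.encode(), expected.encode())
```

Inside the handler, the check would then read `if not is_authorized(x_api_key, os.getenv("APP_API_KEY")):` followed by the same 401 response.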
Part 2: React Native Chat UI (Expo SDK 54)
Project setup
npx create-expo-app ChatbotApp --template blank-typescript
cd ChatbotApp
npx expo install expo-constants
Reading SSE streams in React Native
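React Native does not ship a native EventSource, but fetch can read the response body incrementally as chunks arrive. The trick is to read the ReadableStream from response.body with getReader() and decode each chunk with a TextDecoder. If response.body comes back undefined in your build, the fetch exported from expo/fetch (available since Expo SDK 52) is a streaming-capable alternative worth trying.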
// hooks/useChat.ts
import { useState, useCallback } from "react";

export interface Message {
  id: string;
  role: "user" | "assistant";
  content: string;
}

const API_URL = process.env.EXPO_PUBLIC_API_URL ?? "http://localhost:8000";
const API_KEY = process.env.EXPO_PUBLIC_APP_API_KEY ?? "";

export function useChat() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [isStreaming, setIsStreaming] = useState(false);

  const sendMessage = useCallback(async (text: string) => {
    const userMessage: Message = {
      id: Date.now().toString(),
      role: "user",
      content: text,
    };
    const updatedMessages = [...messages, userMessage];
    setMessages(updatedMessages);
    setIsStreaming(true);

    // Placeholder for the assistant's streaming reply
    const assistantId = (Date.now() + 1).toString();
    setMessages((prev) => [
      ...prev,
      { id: assistantId, role: "assistant", content: "" },
    ]);

    try {
      const response = await fetch(`${API_URL}/chat`, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "x-api-key": API_KEY,
        },
        body: JSON.stringify({
          messages: updatedMessages.map(({ role, content }) => ({
            role,
            content,
          })),
        }),
        // React Native 0.79+ supports body streaming
        reactNative: { textStreaming: true },
      } as RequestInit);
      if (!response.ok) throw new Error(`HTTP ${response.status}`);

      const reader = response.body?.getReader();
      const decoder = new TextDecoder();
      let buffer = ""; // holds a partial line until the rest arrives
      let dataLines: string[] = []; // data: lines of the event being read
      let finished = false;

      while (reader && !finished) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        // Parse only complete SSE lines; keep the trailing partial line
        const lines = buffer.split("\n");
        buffer = lines.pop() ?? "";
        for (const line of lines) {
          if (line.startsWith("data: ")) {
            dataLines.push(line.slice(6));
            continue;
          }
          if (line !== "" || dataLines.length === 0) continue;
          // A blank line ends the event; multi-line data joins with "\n"
          const token = dataLines.join("\n");
          dataLines = [];
          if (token === "[DONE]") {
            finished = true; // stop the outer read loop as well
            break;
          }
          setMessages((prev) =>
            prev.map((m) =>
              m.id === assistantId
                ? { ...m, content: m.content + token }
                : m
            )
          );
        }
      }
    } catch (err) {
      console.error("Stream error:", err);
    } finally {
      setIsStreaming(false);
    }
  }, [messages]);

  return { messages, sendMessage, isStreaming };
}
Chat screen
// app/index.tsx
import { useState, useRef } from "react";
import {
  View,
  Text,
  TextInput,
  TouchableOpacity,
  FlatList,
  KeyboardAvoidingView,
  Platform,
  StyleSheet,
  ActivityIndicator,
} from "react-native";
import { useChat } from "../hooks/useChat";

export default function ChatScreen() {
  const { messages, sendMessage, isStreaming } = useChat();
  const [input, setInput] = useState("");
  const listRef = useRef<FlatList>(null);

  const handleSend = async () => {
    const text = input.trim();
    if (!text || isStreaming) return;
    setInput("");
    await sendMessage(text);
  };

  return (
    <KeyboardAvoidingView
      style={styles.container}
      behavior={Platform.OS === "ios" ? "padding" : "height"}
      keyboardVerticalOffset={90}
    >
      <FlatList
        ref={listRef}
        data={messages}
        keyExtractor={(m) => m.id}
        onContentSizeChange={() => listRef.current?.scrollToEnd()}
        renderItem={({ item }) => (
          <View
            style={[
              styles.bubble,
              item.role === "user" ? styles.userBubble : styles.aiBubble,
            ]}
          >
            <Text style={styles.bubbleText}>{item.content}</Text>
          </View>
        )}
      />
      <View style={styles.inputRow}>
        <TextInput
          style={styles.input}
          value={input}
          onChangeText={setInput}
          placeholder="Type a message..."
          multiline
          onSubmitEditing={handleSend}
        />
        <TouchableOpacity
          style={[styles.sendBtn, isStreaming && styles.sendBtnDisabled]}
          onPress={handleSend}
          disabled={isStreaming}
        >
          {isStreaming ? (
            <ActivityIndicator color="#fff" size="small" />
          ) : (
            <Text style={styles.sendText}>Send</Text>
          )}
        </TouchableOpacity>
      </View>
    </KeyboardAvoidingView>
  );
}

const styles = StyleSheet.create({
  container: { flex: 1, backgroundColor: "#f5f5f5" },
  bubble: { margin: 8, padding: 12, borderRadius: 16, maxWidth: "80%" },
  userBubble: { alignSelf: "flex-end", backgroundColor: "#0066ff" },
  aiBubble: { alignSelf: "flex-start", backgroundColor: "#ffffff" },
  bubbleText: { fontSize: 15, lineHeight: 22 },
  inputRow: {
    flexDirection: "row",
    padding: 8,
    backgroundColor: "#fff",
    borderTopWidth: 1,
    borderColor: "#e0e0e0",
  },
  input: { flex: 1, fontSize: 15, paddingHorizontal: 12, maxHeight: 100 },
  sendBtn: {
    backgroundColor: "#0066ff",
    borderRadius: 20,
    paddingHorizontal: 18,
    justifyContent: "center",
  },
  sendBtnDisabled: { backgroundColor: "#aaa" },
  sendText: { color: "#fff", fontWeight: "600" },
});
Part 3: Handling the Mobile-Specific Gotchas
Network drops mid-stream. Mobile connections drop. Wrap your reader.read() loop in a try/catch and show a "Tap to retry" button if the stream dies before [DONE]. Store the partial reply so the user doesn’t lose what was already rendered.
FlatList re-render performance. Every token appends to message content, triggering a re-render. Keep renderItem memoized with useCallback and set removeClippedSubviews on the FlatList. For conversations over 50 messages, consider windowing.
Background / foreground transitions. On iOS, apps suspended in the background will have their network connections dropped. Detect app state changes with AppState and resume or restart the request if needed.
API key exposure. Even with the x-api-key header pattern, your key lives inside the app bundle. For higher-security apps, implement short-lived tokens: the mobile app authenticates with your backend through your normal auth system (JWT, Supabase, Firebase), and the backend issues a 15-minute token for the chat endpoint.
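The short-lived token idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not a recommended production scheme: `issue_token`, `verify_token`, and the token layout are hypothetical, and a maintained JWT library is the more usual choice in a real deployment. The `now` parameter exists only so the expiry logic can be exercised without waiting.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

TOKEN_TTL_SECONDS = 15 * 60  # the 15-minute lifetime mentioned above


def _sign(payload: bytes, secret: bytes) -> str:
    """Hex HMAC-SHA256 signature of the payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def issue_token(user_id: str, secret: bytes, now: Optional[float] = None) -> str:
    """Return "<base64 payload>.<signature>", valid for TOKEN_TTL_SECONDS."""
    issued_at = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": issued_at + TOKEN_TTL_SECONDS})
    encoded = base64.urlsafe_b64encode(payload.encode())
    return f"{encoded.decode()}.{_sign(encoded, secret)}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is authentic and unexpired, else None."""
    encoded, _, signature = token.partition(".")
    expected = _sign(encoded.encode(), secret)
    # Check the signature before trusting anything inside the payload.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    claims = json.loads(base64.urlsafe_b64decode(encoded.encode()))
    current = time.time() if now is None else now
    return claims["sub"] if current < claims["exp"] else None
```

The backend would call `issue_token` after the user logs in through the normal auth system, and the chat endpoint would call `verify_token` on each request instead of comparing a static key.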
Deploying the FastAPI Backend
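For production, deploy to any container host. A minimal setup with Docker follows; it assumes a requirements.txt listing the packages installed in Part 1, which you can generate with pip freeze > requirements.txt: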
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Recommended platforms for ASEAN-region latency: AWS Singapore (ap-southeast-1), Google Cloud asia-southeast1, or Railway (fastest cold starts for early-stage projects). If you need data residency for Thai PDPA or Japanese APPI compliance, pin your deployment region and enable VPC private endpoints so LLM traffic never traverses the public internet.
Swapping to a Private LLM
One of the reasons to build a FastAPI layer is how easy the provider swap becomes. If a client wants to run a private LLM on their own infrastructure:
# Replace the Anthropic client with an OpenAI-compatible client
# pointing at Ollama, vLLM, or LiteLLM running on-premise
from openai import OpenAI

client = OpenAI(
    base_url="http://your-private-llm-server:11434/v1",
    api_key="not-needed",  # Ollama ignores this
)
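The React Native app does not change at all. As long as stream_response is adapted to the new client and keeps emitting the same data: events, the streaming protocol the app sees stays identical.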
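One way to keep that promise is to isolate the provider-specific part in a small adapter that turns the provider's stream into plain text deltas, and to keep the SSE framing separate. The sketch below assumes the OpenAI-compatible streaming shape, where each chunk exposes its text at `chunk.choices[0].delta.content`. `openai_text_deltas`, `to_sse`, and `fake_chunk` are hypothetical names, and the fake chunks exist only so the example runs without a server.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def openai_text_deltas(stream: Iterable) -> Iterator[str]:
    """Extract text deltas from OpenAI-style streaming chunks."""
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices at all
        text = chunk.choices[0].delta.content
        if text:  # role-only or final chunks have no content
            yield text


def to_sse(deltas: Iterable[str]) -> Iterator[str]:
    """Frame text deltas as SSE events and finish with the [DONE] sentinel."""
    for text in deltas:
        # Prefix every line so a newline inside a delta keeps the framing valid.
        yield "".join(f"data: {line}\n" for line in text.split("\n")) + "\n"
    yield "data: [DONE]\n\n"


def fake_chunk(content):
    """Stand-in for a provider chunk, used only for this demonstration."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_stream = [fake_chunk("Hi"), fake_chunk(None), fake_chunk(" there")]
frames = list(to_sse(openai_text_deltas(demo_stream)))
```

With a real OpenAI-compatible client, the stream would come from `client.chat.completions.create(...)` called with `stream=True`, with the system prompt sent as the first message rather than as a separate argument. Anthropic's `stream.text_stream` already yields plain strings, so it could be passed to `to_sse` directly.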
This is the architecture we use at Simplico when connecting our simpliDoc RAG layer to mobile applications—users get a chatbot that answers questions against private documents without any data leaving the client’s own infrastructure.
FAQ
Do I need a FastAPI backend, or can I call the LLM API directly from React Native?
You can call it directly, but you shouldn’t in production. API keys embedded in a mobile app can be extracted by anyone who decompiles the binary. A backend also lets you enforce rate limits, inject system context (user roles, company data), and swap models without an app store release.
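As an illustration of the rate-limiting point, here is a minimal in-memory limiter. It is a sketch under assumptions: `RateLimiter` is a hypothetical class, the limit and window are whatever you choose, and the state lives in one process, so a deployment with several workers would need a shared store instead.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class RateLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Record the request and return True, or return False if over the limit."""
        current = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and current - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(current)
        return True
```

In the chat handler, a call such as `limiter.allow(user_id)` returning False would translate into an HTTP 429 response before any tokens are requested from the model.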
React Native’s fetch doesn’t support EventSource—how does streaming work?
On React Native 0.79 and later, response.body.getReader() works for incremental reads when you pass reactNative: { textStreaming: true } in the fetch options. The SSE data arrives as chunked text, and you parse the data: prefix yourself. This is exactly what the code in Part 2 does.
Which model should I use for a support chatbot with high message volume?
Start with Claude Haiku 4.5. At $1.00/M input tokens it is designed for this use case and the 200K context window comfortably handles long chat histories. Only move up to Sonnet 4.6 if your conversations require complex multi-step reasoning or document synthesis.
How do I add conversation memory without sending the full history every time?
Use a sliding window: send only the last N messages (typically 10–20 turns). For longer-term memory, embed key facts from earlier turns into the system prompt using a summarization step. This pattern is covered in the simpliDoc RAG series.
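A sliding window takes only a few lines on the backend. The sketch below assumes messages are dicts with role and content keys, matching the request body used earlier; `sliding_window` is a hypothetical helper. It also drops leading assistant turns after trimming, on the assumption that the provider expects the history to open with a user message.

```python
from typing import Dict, List


def sliding_window(
    messages: List[Dict[str, str]], max_messages: int = 20
) -> List[Dict[str, str]]:
    """Keep the most recent messages, starting from a user turn."""
    recent = messages[-max_messages:] if max_messages > 0 else []
    # Drop leading assistant turns so the trimmed history opens with the user.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent
```

Applied just before the model call, this caps the tokens sent per request without changing anything in the mobile app.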
Can I use this pattern with streaming on Android and iOS equally?
Yes. The chunked fetch approach works on both platforms with React Native 0.81 and Expo SDK 54. The textStreaming: true option is a React Native-specific hint to the JSI fetch implementation—it has no effect on web, where SSE streaming is native.
What’s Next
This post built the foundation: a streaming FastAPI backend, a production-ready Expo chat UI, and the mobile-specific handling that tutorials usually skip.
The natural next steps in the R-series:
- On-device AI — running a quantized model directly on the device with no backend required, using Expo’s ML integration and TensorFlow Lite
- Connecting the chatbot to your data — integrating the FastAPI backend with a RAG pipeline (pgvector + private documents) so the chatbot answers questions about your company’s content
Have a React Native project that needs an AI layer? Contact the Simplico team — we build production mobile AI features for clients across Southeast Asia and Japan.
