How to Add an AI Chatbot to Your React Native App (with FastAPI Backend)

Most React Native tutorials stop at the UI layer. They show you how to render chat bubbles and handle keyboard offsets—then hand-wave the backend with a vague "call the OpenAI API from your app."

That approach has two problems. First, you’re putting your API key inside a mobile binary anyone can extract. Second, you have no server-side control: no rate limiting, no user context injection, no logging, no ability to swap models without pushing an app update.

This guide takes a production-minded path. We build a FastAPI backend that handles the LLM connection with streaming Server-Sent Events (SSE), then wire it to an Expo (React Native) front-end that renders responses token by token. The same backend pattern integrates cleanly with private LLM deployments—so if your client eventually wants to run inference on-premise, you change one URL in the config, not the entire codebase.


The Architecture

flowchart TD
  A["Mobile App Expo SDK 54"] --> B["FastAPI Backend"]
  B --> C["LLM Provider"]
  C --> D["SSE Stream"]
  D --> E["Chunked Fetch RN 0.81"]
  E --> F["Chat UI renders tokens"]

Key design decisions:

FastAPI over a serverless route. FastAPI gives you WebSocket support, dependency injection for auth middleware, and easy integration with self-hosted LLM tooling that is usually driven from Python (Ollama, vLLM, LiteLLM). If you’re building for enterprise clients who may later need on-premise AI, a Python backend is the right long-term choice.

SSE over WebSockets for streaming. SSE is one-directional and runs over plain HTTP/1.1, which makes it simpler to proxy and load-balance than a WebSocket connection. On the client, a streaming-capable fetch (Expo’s expo/fetch, or the react-native-fetch-api polyfill on bare React Native) can read the chunked response incrementally, which gives the same result without a WebSocket library.

LLM-agnostic backend. The provider is an env-var swap. Your mobile app never needs to know whether it’s hitting Claude, GPT-5, or a private Llama deployment.
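
One way to keep that swap in configuration is to read the provider, model, and endpoint from the environment once at startup. The sketch below is illustrative only: LLM_PROVIDER and LLM_BASE_URL are hypothetical variable names (only MODEL_ID appears in this guide), and the two provider labels are assumptions rather than anything the SDKs define.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical provider labels; the guide itself only uses the Anthropic SDK.
SUPPORTED_PROVIDERS = ("anthropic", "openai_compatible")


@dataclass(frozen=True)
class LLMSettings:
    provider: str
    model: str
    base_url: Optional[str]  # only needed for self-hosted, OpenAI-compatible servers


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read provider settings from the environment and validate them once."""
    provider = env.get("LLM_PROVIDER", "anthropic")  # assumed variable name
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported LLM_PROVIDER: {provider!r}")
    base_url = env.get("LLM_BASE_URL")  # assumed variable name
    if provider == "openai_compatible" and not base_url:
        raise ValueError("LLM_BASE_URL is required for openai_compatible")
    return LLMSettings(
        provider=provider,
        model=env.get("MODEL_ID", "claude-haiku-4-5"),
        base_url=base_url,
    )
```

The backend would branch on settings.provider when it builds its client. The mobile app never sees any of these values.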


Choosing a Model for Mobile Chatbots

Before writing code, pick your model. In June 2026 the price/performance landscape looks like this:

Model | Input / 1M tokens | Output / 1M tokens | Best for
Claude Haiku 4.5 | $1.00 | $5.00 | High-volume mobile chatbots, FAQ bots
Claude Sonnet 4.6 | $3.00 | $15.00 | Complex reasoning, multi-turn sales assistants
GPT-5.4 | $2.50 | $15.00 | OpenAI-native toolchains
DeepSeek V4 Flash | $0.14 | $0.28 | Cost-sensitive ASEAN deployments

For most mobile chatbots (support bots, onboarding assistants, FAQ handlers), Claude Haiku 4.5 hits the right balance. Its 200K context window comfortably holds long conversation histories, and at $1.00/M input and $5.00/M output tokens it costs roughly 3× less per conversation than Sonnet 4.6. If your app needs multi-step reasoning or nuanced synthesis (a document assistant built on top of simpliDoc, for example), step up to Sonnet 4.6.
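
To compare models for your own traffic, the per-conversation cost can be worked out directly from the table. This is a minimal sketch: the prices are copied from the table above, while the workload of 10,000 input and 2,000 output tokens per conversation is an assumption to replace with your own measurements.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "Claude Haiku 4.5": (1.00, 5.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-5.4": (2.50, 15.00),
    "DeepSeek V4 Flash": (0.14, 0.28),
}


def conversation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one conversation for the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed workload: 10,000 input and 2,000 output tokens per conversation.
for name in PRICES:
    print(f"{name}: ${conversation_cost(name, 10_000, 2_000):.4f}")
```

With these assumed counts, Haiku comes to about $0.02 and Sonnet to about $0.06 per conversation. Because the full history is resent on every turn, input tokens usually dominate in long chats.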


Part 1: FastAPI Backend

Project setup

mkdir chatbot-api && cd chatbot-api
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn anthropic python-dotenv

Create a .env file:

ANTHROPIC_API_KEY=your_key_here
MODEL_ID=claude-haiku-4-5
SYSTEM_PROMPT="You are a helpful assistant for Acme Corp."

Streaming chat endpoint

# main.py
import os
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import List
import anthropic
from dotenv import load_dotenv

load_dotenv()

app = FastAPI()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
MODEL = os.getenv("MODEL_ID", "claude-haiku-4-5")
SYSTEM = os.getenv("SYSTEM_PROMPT", "You are a helpful assistant.")

class Message(BaseModel):
    role: str   # "user" or "assistant"
    content: str

class ChatRequest(BaseModel):
    messages: List[Message]

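def stream_response(messages: List[Message]):
    """Generator that yields SSE-formatted tokens."""
    with client.messages.stream(
        model=MODEL,
        max_tokens=1024,
        system=SYSTEM,
        messages=[m.model_dump() for m in messages],
    ) as stream:
        for text in stream.text_stream:
            # SSE format: every line of a chunk gets its own "data: " prefix,
            # so a token that contains "\n" cannot break the event framing.
            yield "".join(f"data: {line}\n" for line in text.split("\n")) + "\n"
    # Sentinel event that tells the client the reply is complete
    yield "data: [DONE]\n\n"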

@app.post("/chat")
async def chat(request: ChatRequest):
    if not request.messages:
        raise HTTPException(status_code=400, detail="messages cannot be empty")
    return StreamingResponse(
        stream_response(request.messages),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )

@app.get("/health")
async def health():
    return {"status": "ok", "model": MODEL}

Run locally:

uvicorn main:app --reload --port 8000

Test the stream with curl:

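# -N turns off curl's output buffering so tokens print as they arrive
curl -N -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello, who are you?"}]}'

You should see tokens arriving one by one, each prefixed with data: and followed by a blank line, with a final data: [DONE] event at the end.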
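
Raw text after data: is fragile in two ways: a token may itself contain newlines, and the network may split an event across chunks. A common alternative, sketched here and not what main.py above does, is to send each token as a one-line JSON payload and have the reader buffer until it sees the blank line that ends an event. The "text" key and the names sse_event and SSEParser are assumptions made for illustration, and the client would need to JSON-parse each payload.

```python
import json
from typing import List


def sse_event(payload: dict) -> str:
    """Frame one JSON payload as a single SSE event.

    json.dumps escapes newlines, so the payload always fits on one data line.
    """
    return f"data: {json.dumps(payload)}\n\n"


class SSEParser:
    """Incrementally decode events produced by sse_event()."""

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, chunk: str) -> List[dict]:
        """Accept any slice of the stream and return the payloads it completes."""
        self._buffer += chunk
        # A blank line ends an event; whatever follows the last one is incomplete.
        *complete, self._buffer = self._buffer.split("\n\n")
        payloads = []
        for event in complete:
            for line in event.split("\n"):
                if line.startswith("data: "):
                    payloads.append(json.loads(line[len("data: "):]))
        return payloads


# Round trip: tokens with embedded newlines, delivered in awkward 5-character chunks.
tokens = ["Hello", ",\n", "world", "!\n\n", "Bye"]
wire = "".join(sse_event({"text": t}) for t in tokens)

parser = SSEParser()
received: List[str] = []
for start in range(0, len(wire), 5):
    for payload in parser.feed(wire[start:start + 5]):
        received.append(payload["text"])

assert received == tokens
```

The parser works on already-decoded text. When reading raw bytes from a socket, decode incrementally first so that a multi-byte character split across chunks is not corrupted.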

Adding authentication

Before deploying, add an API key check so only your mobile app can call the endpoint:

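import secrets
from fastapi import Header

@app.post("/chat")
async def chat(request: ChatRequest, x_api_key: str = Header(...)):
    # FastAPI maps the x_api_key parameter to the "x-api-key" request header.
    expected = os.getenv("APP_API_KEY", "")
    # Constant-time comparison; an unset APP_API_KEY rejects every request.
    if not expected or not secrets.compare_digest(x_api_key.encode(), expected.encode()):
        raise HTTPException(status_code=401, detail="Unauthorized")
    # ... rest of handler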

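Generate a random key with openssl rand -hex 32 and supply it to your Expo app through build-time configuration (an EXPO_PUBLIC_ variable or expo-constants) rather than hardcoding it in source. Either way the value is compiled into the app bundle, so treat it as a speed bump rather than a secret; Part 3 describes a stronger pattern.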


Part 2: React Native Chat UI (Expo SDK 54)

Project setup

npx create-expo-app ChatbotApp --template blank-typescript
cd ChatbotApp
npx expo install expo-constants

Reading SSE streams in React Native

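React Native has no native EventSource interface, and its built-in fetch does not expose the response body as a stream. Incremental reads need a streaming-capable fetch: Expo’s expo/fetch module, or the react-native-fetch-api polyfill, which is where the reactNative: { textStreaming: true } option used below comes from. With either in place, the trick is to read the ReadableStream from response.body and decode each chunk with a TextDecoder.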

// hooks/useChat.ts
import { useState, useCallback } from "react";

export interface Message {
  id: string;
  role: "user" | "assistant";
  content: string;
}

const API_URL = process.env.EXPO_PUBLIC_API_URL ?? "http://localhost:8000";
const API_KEY = process.env.EXPO_PUBLIC_APP_API_KEY ?? "";

export function useChat() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [isStreaming, setIsStreaming] = useState(false);

  const sendMessage = useCallback(async (text: string) => {
    const userMessage: Message = {
      id: Date.now().toString(),
      role: "user",
      content: text,
    };

    const updatedMessages = [...messages, userMessage];
    setMessages(updatedMessages);
    setIsStreaming(true);

    // Placeholder for the assistant's streaming reply
    const assistantId = (Date.now() + 1).toString();
    setMessages((prev) => [
      ...prev,
      { id: assistantId, role: "assistant", content: "" },
    ]);

    try {
      const response = await fetch(`${API_URL}/chat`, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "x-api-key": API_KEY,
        },
        body: JSON.stringify({
          messages: updatedMessages.map(({ role, content }) => ({
            role,
            content,
          })),
        }),
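        // Read by the react-native-fetch-api polyfill to enable body streaming.
        // Expo's fetch from "expo/fetch" streams without this option.
        reactNative: { textStreaming: true },
      } as RequestInit);

      if (!response.ok || !response.body) {
        throw new Error(`Chat request failed with status ${response.status}`);
      }

      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let buffer = ""; // holds a partial SSE event between network chunks
      let finished = false;

      while (!finished) {
        const { done, value } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });
        // SSE events end with a blank line; keep the unfinished tail for later
        const events = buffer.split("\n\n");
        buffer = events.pop() ?? "";

        for (const event of events) {
          // Join the event's data lines so newlines inside a token survive
          const token = event
            .split("\n")
            .filter((line) => line.startsWith("data: "))
            .map((line) => line.slice(6))
            .join("\n");
          if (token === "[DONE]") {
            finished = true; // also stops the outer read loop
            break;
          }
          setMessages((prev) =>
            prev.map((m) =>
              m.id === assistantId
                ? { ...m, content: m.content + token }
                : m
            )
          );
        }
      }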
    } catch (err) {
      console.error("Stream error:", err);
    } finally {
      setIsStreaming(false);
    }
  }, [messages]);

  return { messages, sendMessage, isStreaming };
}

Chat screen

// app/index.tsx
import { useState, useRef } from "react";
import {
  View,
  Text,
  TextInput,
  TouchableOpacity,
  FlatList,
  KeyboardAvoidingView,
  Platform,
  StyleSheet,
  ActivityIndicator,
} from "react-native";
import { useChat } from "../hooks/useChat";

export default function ChatScreen() {
  const { messages, sendMessage, isStreaming } = useChat();
  const [input, setInput] = useState("");
  const listRef = useRef<FlatList>(null);

  const handleSend = async () => {
    const text = input.trim();
    if (!text || isStreaming) return;
    setInput("");
    await sendMessage(text);
  };

  return (
    <KeyboardAvoidingView
      style={styles.container}
      behavior={Platform.OS === "ios" ? "padding" : "height"}
      keyboardVerticalOffset={90}
    >
      <FlatList
        ref={listRef}
        data={messages}
        keyExtractor={(m) => m.id}
        onContentSizeChange={() => listRef.current?.scrollToEnd()}
        renderItem={({ item }) => (
          <View
            style={[
              styles.bubble,
              item.role === "user" ? styles.userBubble : styles.aiBubble,
            ]}
          >
            <Text style={styles.bubbleText}>{item.content}</Text>
          </View>
        )}
      />
      <View style={styles.inputRow}>
        <TextInput
          style={styles.input}
          value={input}
          onChangeText={setInput}
          placeholder="Type a message..."
          multiline
          onSubmitEditing={handleSend}
        />
        <TouchableOpacity
          style={[styles.sendBtn, isStreaming && styles.sendBtnDisabled]}
          onPress={handleSend}
          disabled={isStreaming}
        >
          {isStreaming ? (
            <ActivityIndicator color="#fff" size="small" />
          ) : (
            <Text style={styles.sendText}>Send</Text>
          )}
        </TouchableOpacity>
      </View>
    </KeyboardAvoidingView>
  );
}

const styles = StyleSheet.create({
  container: { flex: 1, backgroundColor: "#f5f5f5" },
  bubble: { margin: 8, padding: 12, borderRadius: 16, maxWidth: "80%" },
  userBubble: { alignSelf: "flex-end", backgroundColor: "#0066ff" },
  aiBubble: { alignSelf: "flex-start", backgroundColor: "#ffffff" },
  bubbleText: { fontSize: 15, lineHeight: 22 },
  inputRow: {
    flexDirection: "row",
    padding: 8,
    backgroundColor: "#fff",
    borderTopWidth: 1,
    borderColor: "#e0e0e0",
  },
  input: { flex: 1, fontSize: 15, paddingHorizontal: 12, maxHeight: 100 },
  sendBtn: {
    backgroundColor: "#0066ff",
    borderRadius: 20,
    paddingHorizontal: 18,
    justifyContent: "center",
  },
  sendBtnDisabled: { backgroundColor: "#aaa" },
  sendText: { color: "#fff", fontWeight: "600" },
});

Part 3: Handling the Mobile-Specific Gotchas

Network drops mid-stream. Mobile connections drop. Wrap your reader.read() loop in a try/catch and show a "Tap to retry" button if the stream dies before [DONE]. Store the partial reply so the user doesn’t lose what was already rendered.

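FlatList re-render performance. Every token appends to message content, triggering a re-render. Memoize renderItem with useCallback, wrap the bubble in React.memo so that only the message being streamed re-renders, and set removeClippedSubviews on the FlatList. For conversations over 50 messages, tune the list's windowing props (windowSize, initialNumToRender).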

Background / foreground transitions. On iOS, apps suspended in the background will have their network connections dropped. Detect app state changes with AppState and resume or restart the request if needed.

API key exposure. Even with the x-api-key header pattern, your key lives inside the app bundle. For higher-security apps, implement short-lived tokens: the mobile app authenticates with your backend through your normal auth system (JWT, Supabase, Firebase), and the backend issues a 15-minute token for the chat endpoint.
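
The token step can be sketched with nothing but the standard library. This is an illustration under stated assumptions, not a replacement for a vetted JWT library: the function names issue_chat_token and verify_chat_token are hypothetical, the token format is a simple HMAC-signed payload, and the secret is assumed to live only on the backend.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

TOKEN_TTL_SECONDS = 15 * 60  # the 15-minute lifetime mentioned above


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def issue_chat_token(user_id: str, secret: bytes, now: Optional[float] = None) -> str:
    """Return "<payload>.<signature>" where the payload carries an expiry time."""
    issued_at = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": issued_at + TOKEN_TTL_SECONDS})
    body = _b64(payload.encode("utf-8"))
    signature = hmac.new(secret, body.encode("ascii"), hashlib.sha256).digest()
    return f"{body}.{_b64(signature)}"


def verify_chat_token(token: str, secret: bytes, now: Optional[float] = None) -> Optional[str]:
    """Return the user id for a valid, unexpired token, otherwise None."""
    current = time.time() if now is None else now
    try:
        body, signature = token.split(".")
        expected = _b64(hmac.new(secret, body.encode("ascii"), hashlib.sha256).digest())
        # Constant-time comparison of the two signatures as bytes.
        if not hmac.compare_digest(expected.encode("ascii"), signature.encode("utf-8")):
            return None
        claims = json.loads(base64.urlsafe_b64decode(body))
    except (ValueError, UnicodeError):
        return None  # malformed token
    if claims["exp"] <= current:
        return None  # expired
    return claims["sub"]
```

The /chat handler would call verify_chat_token on the header value in place of the static key comparison, and the app would request a fresh token after the user signs in.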


Deploying the FastAPI Backend

For production, deploy to any container host. A minimal setup with Docker, assuming you have first captured the dependencies with pip freeze > requirements.txt:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Recommended platforms for ASEAN-region latency: AWS Singapore (ap-southeast-1), Google Cloud asia-southeast1, or Railway (fastest cold starts for early-stage projects). If you need data residency for Thai PDPA or Japanese APPI compliance, pin your deployment region and, where your LLM provider or self-hosted model supports it, use VPC private endpoints so LLM traffic never traverses the public internet.


Swapping to a Private LLM

One of the reasons to build a FastAPI layer is how easy the provider swap becomes. If a client wants to run a private LLM on their own infrastructure:

# Replace the Anthropic client with an OpenAI-compatible client
# pointing at Ollama, vLLM, or LiteLLM running on-premise
from openai import OpenAI

client = OpenAI(
    base_url="http://your-private-llm-server:11434/v1",
    api_key="not-needed",  # Ollama ignores this
)

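The React Native app does not change at all. As long as the backend keeps emitting the same data: events, the streaming protocol the app sees stays identical; only the generator that talks to the provider has to be rewritten for the new client.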
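
The provider-facing generator for an OpenAI-compatible server might look like the sketch below. It assumes the server implements the chat completions streaming interface (Ollama and vLLM both offer one) and emits data: lines with a final [DONE] event, as the /chat endpoint does. The function name is hypothetical, and the client is passed in rather than imported so the example does not depend on a particular SDK version.

```python
from typing import Any, Dict, Iterator, List


def stream_openai_compatible(
    client: Any, model: str, system: str, messages: List[Dict[str, str]]
) -> Iterator[str]:
    """Yield SSE-framed tokens from an OpenAI-compatible chat completions stream."""
    completion = client.chat.completions.create(
        model=model,
        # This API takes the system prompt as the first message, not a separate argument.
        messages=[{"role": "system", "content": system}, *messages],
        max_tokens=1024,
        stream=True,
    )
    for chunk in completion:
        # Some chunks carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            # Prefix every line so a newline inside a token cannot break the framing.
            yield "".join(f"data: {line}\n" for line in text.split("\n")) + "\n"
    yield "data: [DONE]\n\n"
```

In main.py this generator would be handed to StreamingResponse in place of stream_response, with the OpenAI client shown above as the client argument.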

This is the architecture we use at Simplico when connecting our simpliDoc RAG layer to mobile applications—users get a chatbot that answers questions against private documents without any data leaving the client’s own infrastructure.


FAQ

Do I need a FastAPI backend, or can I call the LLM API directly from React Native?

You can call it directly, but you shouldn’t in production. API keys embedded in a mobile app can be extracted by anyone who decompiles the binary. A backend also lets you enforce rate limits, inject system context (user roles, company data), and swap models without an app store release.
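
As an illustration of the rate-limiting point, here is a minimal in-memory limiter keyed by user or API key. It is a sketch under simple assumptions: a single backend process, a hypothetical class name, and limits chosen by you. A multi-worker deployment would need a shared store instead of a Python dict.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each key."""

    def __init__(
        self,
        limit: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record a request for `key` and report whether it is within the limit."""
        now = self._clock()
        hits = self._hits[key]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

The /chat handler would call allow() with the caller's identity and return HTTP 429 when it comes back False.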

React Native’s fetch doesn’t support EventSource—how does streaming work?

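React Native’s built-in fetch does not expose response.body as a stream, so you need a streaming-capable fetch. With the react-native-fetch-api polyfill, response.body.getReader() works for incremental reads when you pass reactNative: { textStreaming: true } in the fetch options; with Expo’s expo/fetch it works without that option. Either way, the SSE data arrives as chunked text and you parse the data: prefix yourself. This is exactly what the code in Part 2 does.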

Which model should I use for a support chatbot with high message volume?

Start with Claude Haiku 4.5. At $1.00/M input tokens it is designed for this use case and the 200K context window comfortably handles long chat histories. Only move up to Sonnet 4.6 if your conversations require complex multi-step reasoning or document synthesis.

How do I add conversation memory without sending the full history every time?

Use a sliding window: send only the last N messages (typically 10–20 turns). For longer-term memory, embed key facts from earlier turns into the system prompt using a summarization step. This pattern is covered in the simpliDoc RAG series.
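
The sliding window itself is only a few lines. The sketch below trims on the backend before the provider call; the function name and the default of 20 messages are assumptions, and the summarization step is left out.

```python
from typing import Dict, List


def trim_history(messages: List[Dict[str, str]], max_messages: int = 20) -> List[Dict[str, str]]:
    """Keep the most recent messages, starting the window on a user turn."""
    if max_messages <= 0:
        raise ValueError("max_messages must be positive")
    window = messages[-max_messages:]
    # Drop leading assistant turns so the window opens with a user message,
    # which chat APIs generally expect.
    while window and window[0]["role"] != "user":
        window = window[1:]
    return window
```

For example, an 11-message history that alternates user and assistant turns, trimmed with max_messages=4, comes back as the last three messages, because the fourth-from-last is an assistant turn and is dropped.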

Can I use this pattern with streaming on Android and iOS equally?

Yes. The chunked fetch approach behaves the same on both platforms with React Native 0.81 and Expo SDK 54, provided a streaming-capable fetch is in place. The textStreaming: true option is specific to the react-native-fetch-api polyfill; a browser fetch ignores it, because response streaming is built in on the web.


What’s Next

This post built the foundation: a streaming FastAPI backend, a production-ready Expo chat UI, and the mobile-specific handling that tutorials usually skip.

The natural next steps in the R-series:

  • On-device AI — running a quantized model directly on the device with no backend required, using Expo’s ML integration and TensorFlow Lite
  • Connecting the chatbot to your data — integrating the FastAPI backend with a RAG pipeline (pgvector + private documents) so the chatbot answers questions about your company’s content

Have a React Native project that needs an AI layer? Contact the Simplico team — we build production mobile AI features for clients across Southeast Asia and Japan.