On-device LLM inference on phones went from research to platform-supported in 2024-2025. Apple's Foundation Models framework and Google's Gemini Nano APIs made small-model inference a first-class platform capability. Knowing what is available, and what is not, shapes mobile app design in 2026.
Apple Foundation Models
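A roughly 3B-parameter on-device model, with per-app fine-tuning through LoRA-style adapters, accessible through a Swift API. Throughput is roughly 30-50 tokens/sec on Apple Intelligence-capable hardware (iPhone 15 Pro and later; the base iPhone 15 is not supported). Inference is private: data never leaves the device. Capabilities: text summarization, classification, generation, and tool use. Limits: a 4K-token context window and no long-form generation.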
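The 4K context and the tokens/sec figure above translate into simple planning arithmetic. The sketch below is illustrative only: the 4-characters-per-token heuristic, the constant names, and the function names are assumptions, not part of any Apple API, and a real app should use the platform's own token accounting where it is available.

```python
# Rough planning helpers for a small on-device model.
# Assumptions (not from any platform API): about 4 characters per token for
# English text, and a 4096-token window shared by prompt and output.

CONTEXT_TOKENS = 4096   # "4K context" from the text, taken here as 4096
CHARS_PER_TOKEN = 4     # crude heuristic; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(prompt: str, max_output_tokens: int) -> bool:
    """True if the prompt plus the reserved output budget fits the window."""
    return estimate_tokens(prompt) + max_output_tokens <= CONTEXT_TOKENS


def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Decode time only; ignores prompt processing and model load time."""
    return output_tokens / tokens_per_sec
```

Under these assumptions, a 200-token summary takes about 4 seconds at 50 tokens/sec and just under 7 seconds at 30 tokens/sec, which is why short outputs suit this tier better than long ones.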
Android Gemini Nano
Similar tier: small models embedded in the OS. Google's APIs expose them on Pixel and, increasingly, on other Android devices, and the AICore system service runs them. Availability is not yet universal across the Android device range, so fragmentation is a real constraint and apps need a fallback for devices without Gemini Nano.
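Because availability varies by device, one common pattern is to pick a backend at runtime. The following is a minimal, platform-neutral sketch of that decision. `Device`, `Backend`, and `choose_backend` are hypothetical names, and the capability flags stand in for whatever checks the real platform SDK provides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Backend(Enum):
    PLATFORM = "platform"   # OS-provided model, e.g. one served by AICore
    BUNDLED = "bundled"     # runtime and weights shipped with the app
    SERVER = "server"       # remote LLM


@dataclass
class Device:
    # Hypothetical inputs; the app would gather these from real platform checks.
    has_platform_model: bool
    free_storage_mb: int
    online: bool


def choose_backend(device: Device, bundled_size_mb: int = 200) -> Optional[Backend]:
    """Prefer the platform model, then a bundled runtime, then a server."""
    if device.has_platform_model:
        return Backend.PLATFORM
    if device.free_storage_mb >= bundled_size_mb:
        return Backend.BUNDLED
    if device.online:
        return Backend.SERVER
    return None  # no usable backend; the feature should be disabled


# Example: a device without the platform model but with spare storage.
print(choose_backend(Device(has_platform_model=False, free_storage_mb=500, online=False)))
```

The ordering here is a design choice, not a rule: an app that values quality over privacy might try the server before a bundled model.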
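Cross-platform via MLC / llama.cpp
When platform APIs don't fit (you need a specific model, or broader Android support), bundle a runtime. MLC LLM offers iOS and Android builds, and llama.cpp runs on both platforms. Ollama is often mentioned alongside them, but it targets desktops and servers and has no official mobile build. The trade-off is app size: bundling adds roughly 50-200MB, depending on the model and its quantization.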
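To see how bundle size scales, the arithmetic below estimates weight-file size from parameter count and quantization width. It is a back-of-the-envelope sketch that assumes weights dominate the download and ignores the runtime binary, tokenizer files, and quantization metadata. The function names are illustrative.

```python
def weights_size_mb(params_millions: float, bits_per_weight: float) -> float:
    """Approximate weight size in MB (1 MB = 10**6 bytes).

    One million parameters at 8 bits per weight is one megabyte, so the
    size is params (in millions) * bits / 8.
    """
    return params_millions * bits_per_weight / 8


def max_params_millions(budget_mb: float, bits_per_weight: float) -> float:
    """Largest model (in millions of parameters) that fits a size budget."""
    return budget_mb * 8 / bits_per_weight


print(weights_size_mb(3000, 4))      # 3B parameters at 4-bit
print(max_params_millions(200, 4))   # what a 200 MB budget allows at 4-bit
```

By this estimate a 200 MB budget at 4-bit quantization holds a model of around 400M parameters, while a 3B model at 4-bit is about 1.5 GB, so larger models may need to be downloaded after install rather than shipped in the app bundle.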
Use cases that fit
Smart compose, intent-aware autocorrect, brief summaries, structured extraction from user input, voice agents with local STT, and simple classification (spam, sentiment). These tasks are short and bounded, and they benefit from being fast, private, and available offline.
Use cases that don't
Long-form writing assistance, complex reasoning, and knowledge-base Q&A that depends on a large body of knowledge. Real-time chat is borderline: latency is fine, but quality won't match a server model. Fall back to a server-side LLM for these.
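The split between the two lists can be expressed as a small routing function. This is a sketch under stated assumptions: the task labels, the 4096-token window, and the 512-token output reserve are illustrative values chosen for the example, not platform constants.

```python
from enum import Enum


class Route(Enum):
    ON_DEVICE = "on_device"
    SERVER = "server"
    UNAVAILABLE = "unavailable"


# Hypothetical task labels mirroring the two lists above.
ON_DEVICE_TASKS = {
    "smart_compose", "autocorrect", "brief_summary",
    "extraction", "classification",
}

CONTEXT_TOKENS = 4096   # assumed on-device window
OUTPUT_RESERVE = 512    # assumed room kept free for the response


def route(task: str, prompt_tokens: int, online: bool) -> Route:
    """Send short, bounded tasks on-device; everything else to the server."""
    fits = prompt_tokens + OUTPUT_RESERVE <= CONTEXT_TOKENS
    if task in ON_DEVICE_TASKS and fits:
        return Route.ON_DEVICE
    if online:
        return Route.SERVER
    return Route.UNAVAILABLE  # offline and not suitable for the local model


print(route("classification", 200, online=False))
print(route("complex_reasoning", 100, online=True))
```

Returning an explicit `UNAVAILABLE` value lets the UI disable or defer a feature when the device is offline, instead of sending a task the small model handles poorly.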