On-device LLM inference on phones went from research to platform-supported in 2024-2025. Apple's Foundation Models framework and Google's Gemini Nano API made small-model inference a first-class platform capability. Knowing what's available — and what's not — shapes mobile app design in 2026.

Apple Foundation Models

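A roughly 3B-parameter on-device model, with per-app fine-tuning through LoRA-style adapters. Accessible through a Swift API. About 30-50 tokens/sec on supported iPhones (Apple Intelligence requires iPhone 15 Pro or later, not the base iPhone 15). Private: inference runs on the device and data never leaves it. Capabilities: text summarization, classification, generation, tool use. Limits: a 4K-token context window shared by prompt and output, which rules out long-form generation.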
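To make the 4K context and 30-50 tokens/sec figures concrete, here is a minimal budgeting sketch. It is not part of any Apple API. The names `estimate_tokens` and `plan_request` are hypothetical, and the 4-characters-per-token ratio is a rough assumption for English text; a real app would use the tokenizer or measurements from the device.

```python
from dataclasses import dataclass

# Figures taken from the text above; treat them as assumptions to tune per device.
CONTEXT_TOKENS = 4096          # "4K context"
TOKENS_PER_SEC = (30.0, 50.0)  # "~30-50 tokens/sec"
CHARS_PER_TOKEN = 4.0          # rough rule of thumb for English, not exact


@dataclass
class Budget:
    prompt_tokens: int
    fits: bool            # True if prompt + output fit in the context window
    slowest_secs: float   # decode time at the low end of the throughput range
    fastest_secs: float   # decode time at the high end of the throughput range


def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count (hypothetical helper)."""
    return max(1, round(len(text) / CHARS_PER_TOKEN))


def plan_request(prompt: str, max_output_tokens: int) -> Budget:
    """Check that prompt plus output fits the window and estimate decode time."""
    prompt_tokens = estimate_tokens(prompt)
    fits = prompt_tokens + max_output_tokens <= CONTEXT_TOKENS
    low, high = TOKENS_PER_SEC
    return Budget(
        prompt_tokens=prompt_tokens,
        fits=fits,
        slowest_secs=max_output_tokens / low,
        fastest_secs=max_output_tokens / high,
    )
```

Under these assumptions, a 300-token summary takes roughly 6 to 10 seconds to decode, which is why brief outputs suit this tier better than long-form generation.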

Android Gemini Nano

A similar tier: small models embedded in the OS and run by the AICore system service. Google's APIs expose them on Pixel and, increasingly, on other Android devices. Availability is not yet universal across the Android device range, so check for support at runtime. Fragmentation is real.

Cross-platform via Ollama / MLC / llama.cpp

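When the platform APIs don't fit (you need a specific model, or broader Android support), bundle a runtime. MLC LLM provides iOS and Android builds, and llama.cpp runs on both platforms. Ollama is primarily a desktop and server runtime with no official mobile build, so on phones it usually serves as a backend on another machine rather than as an in-app runtime. The cost is a larger app (50-200MB).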
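The choice between platform API, bundled runtime, and server can be sketched as a small fallback chain. This is an illustrative sketch, not a prescribed design: `Backend` and `choose_backend` are hypothetical names, the three boolean inputs stand in for whatever availability checks the app actually performs, and the final server fallback is an assumption about how an app might degrade.

```python
from enum import Enum


class Backend(Enum):
    PLATFORM = "platform"  # Apple Foundation Models or Gemini Nano
    BUNDLED = "bundled"    # a runtime such as MLC or llama.cpp shipped in the app
    SERVER = "server"      # remote LLM, assumed here as the last resort


def choose_backend(platform_model_available: bool,
                   needs_specific_model: bool,
                   bundled_runtime_shipped: bool) -> Backend:
    """Pick an inference backend using the trade-offs described above."""
    # Prefer the OS-provided model: no extra app size, maintained by the platform.
    if platform_model_available and not needs_specific_model:
        return Backend.PLATFORM
    # Otherwise use the bundled runtime if the app ships one.
    if bundled_runtime_shipped:
        return Backend.BUNDLED
    # Nothing usable on the device: fall back to a server-side model.
    return Backend.SERVER
```

For example, `choose_backend(False, False, True)` returns `Backend.BUNDLED`, which is the case of an Android device without Gemini Nano support.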

Use cases that fit

Smart compose, autocorrect with intent, brief summaries, structured extraction from user input, voice agents with local STT, simple classification (spam, sentiment). Fast, private, offline.

Use cases that don't

Long-form writing assistance. Complex reasoning. Real-time chat (latency is acceptable, but quality won't match a server model). Knowledge-base Q&A that depends on broad knowledge. Fall back to a server-side LLM for these.
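The two lists above amount to a routing rule, which can be sketched as follows. The task labels and the `route` function are hypothetical names chosen for illustration. Treating unknown tasks as server tasks is a conservative assumption, not something the lists prescribe.

```python
# Task labels are illustrative; they mirror the two lists above.
ON_DEVICE_TASKS = {
    "smart_compose",
    "autocorrect",
    "brief_summary",
    "structured_extraction",
    "voice_agent",
    "classification",
}
SERVER_TASKS = {
    "long_form_writing",
    "complex_reasoning",
    "realtime_chat",
    "knowledge_base_qa",
}


def route(task: str, online: bool) -> str:
    """Return 'on_device', 'server', or 'unavailable' for a task label."""
    if task in ON_DEVICE_TASKS:
        # Small-model strengths: fast, private, and usable offline.
        return "on_device"
    # Known server tasks and unknown tasks both need the larger model.
    return "server" if online else "unavailable"
```

With this rule, `route("classification", online=False)` still resolves to `"on_device"`, while `route("complex_reasoning", online=False)` reports `"unavailable"` instead of silently degrading quality.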

Use Apple Foundation Models and Gemini Nano for first-class on-device inference, and bundled runtimes for cross-platform coverage. Fit use cases to small-model strengths.