Apple & Local AI: CoreML 2025 Updates
Apple has been quiet on generative AI compared with OpenAI and Google, but it has been busy with on-device AI. CoreML 2025 brings significant updates for running models locally on Apple silicon.
What’s New in CoreML 2025
Larger Model Support
| Previous Limit | 2025 Limit |
|---|---|
| ~4GB models | ~30GB models |
| LLMs impractical | 7B+ LLMs possible |
With 8GB and 16GB of unified memory standard on newer devices (and far more on high-end Macs), substantial LLMs can now run locally.
Transformer Optimizations
Native acceleration for attention mechanisms:
```swift
let config = MLModelConfiguration()
config.computeUnits = .all // Neural Engine + GPU + CPU

// Transformer models now run 2-3x faster
let model = try await MLModel.load(contentsOf: modelURL, configuration: config)
```
Token Streaming
Real-time token generation for chat interfaces:
```swift
class LLMSession: ObservableObject {
    @Published var output: String = ""

    func generate(prompt: String) async {
        do {
            // `model.generateTokens` is a stand-in for your model's streaming API
            for try await token in model.generateTokens(prompt) {
                await MainActor.run {
                    self.output += token
                }
            }
        } catch {
            // Surface the error to the UI as appropriate
        }
    }
}
```
Quantization Support
4-bit and 8-bit quantization built-in:
```python
# Compress weights with coremltools (int4 support requires coremltools 8+)
import coremltools as ct
import coremltools.optimize as cto

mlmodel = ct.convert(pytorch_model, convert_to="mlprogram")

op_config = cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric", dtype="int4")
config = cto.coreml.OptimizationConfig(global_config=op_config)
compressed = cto.coreml.linear_quantize_weights(mlmodel, config=config)
```
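As a rough back-of-envelope check, quantization shrinks weight storage roughly linearly with bit width. A minimal sketch (the byte counts are the standard precision widths; the 7B figure is illustrative, and metadata overhead is ignored):

```python
def model_size_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight storage for a model, ignoring metadata overhead."""
    return params * bits_per_weight / 8 / 1e9

params_7b = 7e9
for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{model_size_gb(params_7b, bits):.1f} GB")
# 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB
```

This is why 4-bit quantization is what makes 7B models practical on 8GB and 16GB devices.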
On-Device LLMs
What’s Possible
| Model Size | RAM Needed | Devices |
|---|---|---|
| 1.5B | 3GB | iPhone 15+, M1+ Macs |
| 3B | 6GB | iPhone 15 Pro+, M1+ |
| 7B | 12GB | M1 Pro+, 16GB devices |
| 13B+ | 24GB+ | M1 Max+, M3 Max |
Apple’s On-Device Models
Apple Intelligence includes:
- Writing assistance
- Summarization
- Image generation (Image Playground)
- Email priority
- Smart replies
All running locally on A17 Pro / M1+.
Integration Patterns
SwiftUI + CoreML
```swift
struct ChatView: View {
    @StateObject var session = LLMSession()
    @State var input = ""

    var body: some View {
        VStack {
            ScrollView {
                Text(session.output)
            }
            HStack {
                TextField("Message", text: $input)
                Button("Send") {
                    Task {
                        await session.generate(prompt: input)
                        input = ""
                    }
                }
            }
        }
    }
}
```
Background Processing
```swift
// Long-running inference in a cancellable task
let task = Task(priority: .userInitiated) {
    let result = try await model.prediction(from: inputFeatures)
    return result
}

// Cancel if the user navigates away
task.cancel()
```
Memory Management
```swift
class ModelManager {
    private var model: MLModel?
    private let modelURL: URL // location of the compiled model

    init(modelURL: URL) {
        self.modelURL = modelURL
    }

    func loadIfNeeded() async throws {
        if model == nil {
            model = try await MLModel.load(contentsOf: modelURL, configuration: MLModelConfiguration())
        }
    }

    func unload() {
        model = nil
    }
}
```
Vision + Language
Image Understanding
```swift
let image = CIImage(image: uiImage)!
let request = VNGenerateImageFeaturePrintRequest()
let handler = VNImageRequestHandler(ciImage: image)
try handler.perform([request])

// Feed the feature print to a multimodal model
// (`vlModel` is a stand-in for your own vision-language model wrapper)
if let observation = request.results?.first as? VNFeaturePrintObservation {
    let response = try await vlModel.prediction(image: observation.data, text: "Describe this image")
}
```
Document Analysis
```swift
let request = VNRecognizeTextRequest { request, error in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
    let text = observations
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n")

    // Summarize with an on-device LLM (`summarizer` is your own wrapper)
    Task {
        let summary = try await summarizer.summarize(text)
    }
}
// Run via a VNImageRequestHandler, as in the previous snippet
```
Privacy Benefits
Why Local Matters
- No network latency: Instant responses
- Privacy: Data never leaves device
- Offline: Works without connectivity
- Cost: No API fees
Apple’s Privacy Positioning
User data + On-device model = Private AI
vs.
User data → Cloud API → Response (data exposure)
Enterprises and privacy-conscious users prefer local inference.
Convert Your Models
From PyTorch
```python
import coremltools as ct
import torch

# Trace the model in eval mode, then export to CoreML
model.eval()
traced = torch.jit.trace(model, example_input)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example_input.shape)],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS18,
)
mlmodel.save("model.mlpackage")
```
From Hugging Face
Hugging Face models are PyTorch models underneath, so the same trace-and-convert path applies:

```python
import torch
import coremltools as ct
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-3-mini", torchscript=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-3-mini")

# Trace with example token IDs, then convert like any PyTorch model
example_ids = torch.randint(0, tokenizer.vocab_size, (1, 64))
traced = torch.jit.trace(model.eval(), example_ids)
mlmodel = ct.convert(traced, convert_to="mlprogram")
```
From ONNX
Note that recent coremltools releases no longer ship an ONNX frontend; the legacy converter below only exists in old versions (and only supports old opsets), so prefer converting from the source framework instead:

```python
# Legacy API, coremltools 4.x and earlier only
import coremltools as ct

mlmodel = ct.converters.onnx.convert(model="model.onnx")
```
Best Practices
Model Selection
| Use Case | Model Size | Quality |
|---|---|---|
| Simple tasks | 1-3B | Good enough |
| General chat | 7B | Good |
| Complex reasoning | 13B+ | Best |
Start small, scale only if needed.
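The guidance above can be reduced to a simple selection rule. A minimal sketch, where the RAM thresholds mirror the sizing table earlier in this post (the function itself is illustrative, not an Apple API):

```python
def pick_model_size(available_ram_gb: float) -> str:
    """Pick the largest model tier that fits, mirroring the sizing table above."""
    tiers = [(24, "13B+"), (12, "7B"), (6, "3B"), (3, "1.5B")]
    for ram_needed, size in tiers:
        if available_ram_gb >= ram_needed:
            return size
    return "too little RAM for local inference"

print(pick_model_size(16))  # an M1 Pro-class machine -> "7B"
```

In practice you would also subtract what the OS and your app already use before comparing against the thresholds.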
Thermal Management
```swift
// Monitor thermal state before heavy inference
let process = ProcessInfo.processInfo
if process.thermalState == .critical {
    // Reduce inference rate or quality
    // (`useQuantizedModel` is your own fallback)
    useQuantizedModel()
}
```
Battery Considerations
```swift
// Check power state
if ProcessInfo.processInfo.isLowPowerModeEnabled {
    // Defer non-essential inference
    skipBackgroundProcessing()
}
```
Limitations
What Doesn’t Work Well (Yet)
- Very large models (70B+)
- Training on device (inference only)
- Complex multi-model pipelines
- Some attention variants
Comparison with Cloud
| Aspect | On-Device | Cloud API |
|---|---|---|
| Latency | Low | Network dependent |
| Privacy | High | Lower |
| Model size | Limited | Unlimited |
| Cost | Free after device | Per-token |
| Updates | App update needed | Instant |
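One way to read the cost row: local inference is free per token once you own the hardware, while cloud APIs bill per token. A quick break-even sketch (the $200 hardware premium and $2-per-million-token price are hypothetical placeholders, not any vendor's actual rates):

```python
def breakeven_tokens(device_premium_usd: float, price_per_million_tokens: float) -> float:
    """Tokens to process before extra local hardware pays for itself vs. a cloud API."""
    return device_premium_usd / price_per_million_tokens * 1_000_000

# e.g. a $200 RAM upgrade vs. a $2/M-token API
print(f"{breakeven_tokens(200, 2.0):,.0f} tokens")  # 100,000,000 tokens
```

For high-volume features (keyboard suggestions, summarization on every email), that break-even arrives quickly; for occasional queries, the cloud may stay cheaper.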
Final Thoughts
Apple’s on-device AI strategy trades model size for privacy and latency. For many use cases, a local 3B model beats a cloud 70B model on user experience.
Build with CoreML. Respect privacy. Ship AI features that work offline.
The best AI is the one that doesn’t need to phone home.