Apple & Local AI: CoreML 2025 Updates
Apple has been quiet on generative AI compared with OpenAI and Google, but it has been busy with on-device AI. CoreML 2025 brings significant updates for running models locally on Apple silicon.
What’s New in CoreML 2025
Larger Model Support
| Previous Limit | 2025 Limit |
|---|---|
| ~4GB models | ~30GB models |
| LLMs impractical | 7B+ LLMs possible |
With 8GB and 16GB of unified memory standard on newer devices (and far more on high-end Macs), substantial LLMs can now run locally.
Transformer Optimizations
Native acceleration for attention mechanisms:
```swift
let config = MLModelConfiguration()
config.computeUnits = .all // Neural Engine + GPU + CPU

// Transformer models now run 2-3x faster
let model = try await MLModel.load(contentsOf: modelURL, configuration: config)
```
Token Streaming
Real-time token generation for chat interfaces:
```swift
class LLMSession: ObservableObject {
    @Published var output: String = ""

    func generate(prompt: String) async {
        do {
            // `model.generateTokens` is a stand-in for your model's streaming API
            for try await token in model.generateTokens(prompt) {
                await MainActor.run {
                    self.output += token
                }
            }
        } catch {
            // Surface the error to the UI as appropriate
        }
    }
}
```
Quantization Support
4-bit and 8-bit quantization built-in:
```python
# Compress weights with coremltools (int4 support requires coremltools 8+)
import coremltools as ct
import coremltools.optimize as cto

mlmodel = ct.convert(pytorch_model, convert_to="mlprogram")

op_config = cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric", dtype="int4")
config = cto.coreml.OptimizationConfig(global_config=op_config)
compressed = cto.coreml.linear_quantize_weights(mlmodel, config=config)
```
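As a rough back-of-envelope check, quantization shrinks weight storage roughly linearly with bit width. A minimal sketch (the byte counts are the standard precision widths; the 7B figure is illustrative, and metadata overhead is ignored):

```python
def model_size_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight storage for a model, ignoring metadata overhead."""
    return params * bits_per_weight / 8 / 1e9

params_7b = 7e9
for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{model_size_gb(params_7b, bits):.1f} GB")
# 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB
```

This is why 4-bit quantization is what makes 7B models practical on 8GB and 16GB devices.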
On-Device LLMs
What’s Possible
| Model Size | RAM Needed | Devices |
|---|---|---|
| 1.5B | 3GB | iPhone 15+, M1+ Macs |
| 3B | 6GB | iPhone 15 Pro+, M1+ |
| 7B | 12GB | M1 Pro+, 16GB devices |
| 13B+ | 24GB+ | M1 Max+, M3 Max |
Apple’s On-Device Models
Apple Intelligence includes:
- Writing assistance
- Summarization
- Image generation (Image Playground)
- Email priority
- Smart replies
All running locally on A17 Pro / M1+.
Integration Patterns
SwiftUI + CoreML
```swift
struct ChatView: View {
    @StateObject var session = LLMSession()
    @State var input = ""

    var body: some View {
        VStack {
            ScrollView {
                Text(session.output)
            }
            HStack {
                TextField("Message", text: $input)
                Button("Send") {
                    Task {
                        await session.generate(prompt: input)
                        input = ""
                    }
                }
            }
        }
    }
}
```
Background Processing
```swift
// Long-running inference in a cancellable task
let task = Task(priority: .userInitiated) {
    let result = try await model.prediction(from: inputFeatures)
    return result
}

// Cancel if the user navigates away
task.cancel()
```
Memory Management
```swift
class ModelManager {
    private var model: MLModel?
    private let modelURL: URL // location of the compiled model

    init(modelURL: URL) {
        self.modelURL = modelURL
    }

    func loadIfNeeded() async throws {
        if model == nil {
            model = try await MLModel.load(contentsOf: modelURL, configuration: MLModelConfiguration())
        }
    }

    func unload() {
        model = nil
    }
}
```
Vision + Language
Image Understanding
```swift
let image = CIImage(image: uiImage)!
let request = VNGenerateImageFeaturePrintRequest()
let handler = VNImageRequestHandler(ciImage: image)
try handler.perform([request])

// Feed the feature print to a multimodal model
// (`vlModel` is a stand-in for your own vision-language model wrapper)
if let observation = request.results?.first as? VNFeaturePrintObservation {
    let response = try await vlModel.prediction(image: observation.data, text: "Describe this image")
}
```
Document Analysis
```swift
let request = VNRecognizeTextRequest { request, error in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
    let text = observations
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n")

    // Summarize with an on-device LLM (`summarizer` is your own wrapper)
    Task {
        let summary = try await summarizer.summarize(text)
    }
}
// Run via a VNImageRequestHandler, as in the previous snippet
```
Privacy Benefits
Why Local Matters
- No network latency: Instant responses
- Privacy: Data never leaves device
- Offline: Works without connectivity
- Cost: No API fees
Apple’s Privacy Positioning
User data + On-device model = Private AI
vs.
User data → Cloud API → Response (data exposure)
Enterprises and privacy-conscious users prefer local inference.
Convert Your Models
From PyTorch
```python
import coremltools as ct
import torch

# Trace the model in eval mode, then export to CoreML
model.eval()
traced = torch.jit.trace(model, example_input)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example_input.shape)],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS18,
)
mlmodel.save("model.mlpackage")
```
From Hugging Face
Hugging Face models are PyTorch models underneath, so the same trace-and-convert path applies:

```python
import torch
import coremltools as ct
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-3-mini", torchscript=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-3-mini")

# Trace with example token IDs, then convert like any PyTorch model
example_ids = torch.randint(0, tokenizer.vocab_size, (1, 64))
traced = torch.jit.trace(model.eval(), example_ids)
mlmodel = ct.convert(traced, convert_to="mlprogram")
```
From ONNX
Note that recent coremltools releases no longer ship an ONNX frontend; the legacy converter below only exists in old versions (and only supports old opsets), so prefer converting from the source framework instead:

```python
# Legacy API, coremltools 4.x and earlier only
import coremltools as ct

mlmodel = ct.converters.onnx.convert(model="model.onnx")
```
Best Practices
Model Selection
| Use Case | Model Size | Quality |
|---|---|---|
| Simple tasks | 1-3B | Good enough |
| General chat | 7B | Good |
| Complex reasoning | 13B+ | Best |
Start small, scale only if needed.
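The guidance above can be reduced to a simple selection rule. A minimal sketch, where the RAM thresholds mirror the sizing table earlier in this post (the function itself is illustrative, not an Apple API):

```python
def pick_model_size(available_ram_gb: float) -> str:
    """Pick the largest model tier that fits, mirroring the sizing table above."""
    tiers = [(24, "13B+"), (12, "7B"), (6, "3B"), (3, "1.5B")]
    for ram_needed, size in tiers:
        if available_ram_gb >= ram_needed:
            return size
    return "too little RAM for local inference"

print(pick_model_size(16))  # an M1 Pro-class machine -> "7B"
```

In practice you would also subtract what the OS and your app already use before comparing against the thresholds.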
Thermal Management
```swift
// Monitor thermal state before heavy inference
let process = ProcessInfo.processInfo
if process.thermalState == .critical {
    // Reduce inference rate or quality
    // (`useQuantizedModel` is your own fallback)
    useQuantizedModel()
}
```
Battery Considerations
```swift
// Check power state
if ProcessInfo.processInfo.isLowPowerModeEnabled {
    // Defer non-essential inference
    skipBackgroundProcessing()
}
```
Limitations
What Doesn’t Work Well (Yet)
- Very large models (70B+)
- Training on device (inference only)
- Complex multi-model pipelines
- Some attention variants
Comparison with Cloud
| Aspect | On-Device | Cloud API |
|---|---|---|
| Latency | Low | Network dependent |
| Privacy | High | Lower |
| Model size | Limited | Unlimited |
| Cost | Free after device | Per-token |
| Updates | App update needed | Instant |
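One way to read the cost row: local inference is free per token once you own the hardware, while cloud APIs bill per token. A quick break-even sketch (the $200 hardware premium and $2-per-million-token price are hypothetical placeholders, not any vendor's actual rates):

```python
def breakeven_tokens(device_premium_usd: float, price_per_million_tokens: float) -> float:
    """Tokens to process before extra local hardware pays for itself vs. a cloud API."""
    return device_premium_usd / price_per_million_tokens * 1_000_000

# e.g. a $200 RAM upgrade vs. a $2/M-token API
print(f"{breakeven_tokens(200, 2.0):,.0f} tokens")  # 100,000,000 tokens
```

For high-volume features (keyboard suggestions, summarization on every email), that break-even arrives quickly; for occasional queries, the cloud may stay cheaper.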
Final Thoughts
Apple’s on-device AI strategy trades model size for privacy and latency. For many use cases, a local 3B model beats a cloud 70B model on user experience.
Build with CoreML. Respect privacy. Ship AI features that work offline.
The best AI is the one that doesn’t need to phone home.