Apple & Local AI: CoreML 2025 Updates

ai mobile dev apple

Apple’s been quiet on generative AI compared to OpenAI and Google. But they’ve been busy with on-device AI. CoreML 2025 brings significant updates for running AI locally on Apple silicon.

What’s New in CoreML 2025

Larger Model Support

| Previous Limit | 2025 Limit |
|---|---|
| ~4GB models | ~30GB models |
| LLMs impractical | 7B+ LLMs possible |

With 8GB and 16GB of unified memory on newer devices, substantial LLMs can now run locally.
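To see why the higher limit matters, here is a rough back-of-envelope sketch. The per-parameter byte counts are standard; the 7B figure is illustrative:

```python
def model_size_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate on-disk weight size, ignoring metadata overhead."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# A 7B model never fit under the old ~4GB ceiling at full precision,
# but sits comfortably under ~30GB even at float16.
print(model_size_gb(7, 16))  # float16: 14.0 GB
print(model_size_gb(7, 8))   # int8:     7.0 GB
print(model_size_gb(7, 4))   # 4-bit:    3.5 GB
```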

Transformer Optimizations

Native acceleration for attention mechanisms:

let config = MLModelConfiguration()
config.computeUnits = .all  // Neural Engine + GPU + CPU

// Transformer models now run 2-3x faster
let model = try await MLModel.load(contentsOf: modelURL, configuration: config)

Token Streaming

Real-time token generation for chat interfaces:

class LLMSession: ObservableObject {
    @Published var output: String = ""
    // `model` is the loaded CoreML language model
    
    func generate(prompt: String) async {
        do {
            for try await token in model.generateTokens(prompt) {
                await MainActor.run {
                    self.output += token
                }
            }
        } catch {
            // Surface generation errors to the UI as appropriate
        }
    }
}

Quantization Support

4-bit and 8-bit quantization built-in:

# Quantize with coremltools (compute_precision only covers float16/32;
# weight quantization goes through coremltools.optimize)
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.convert(pytorch_model, convert_to="mlprogram")

# 8-bit linear weight quantization; use OpPalettizerConfig(nbits=4) for 4-bit
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)
mlmodel = cto.linear_quantize_weights(mlmodel, config)

On-Device LLMs

What’s Possible

| Model Size | RAM Needed | Devices |
|---|---|---|
| 1.5B | 3GB | iPhone 15+, M1+ Macs |
| 3B | 6GB | iPhone 15 Pro+, M1+ |
| 7B | 12GB | M1 Pro+, 16GB devices |
| 13B+ | 24GB+ | M1 Max+, M3 Max |
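The RAM figures above cover more than raw weights: generation also holds a key-value cache that grows with context length. A rough sketch, assuming a hypothetical Llama-7B-like shape (32 layers, 32 heads, head dimension 128):

```python
def kv_cache_gb(layers: int, heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, fp16 by default."""
    elems = 2 * layers * heads * head_dim * seq_len
    return elems * bytes_per_elem / 1e9

# 4K tokens of context on a 7B-class model adds roughly 2GB on top of
# the weights, which is why the table leaves headroom above weight size.
print(kv_cache_gb(layers=32, heads=32, head_dim=128, seq_len=4096))
```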

Apple’s On-Device Models

Apple Intelligence includes:

All running locally on A17 Pro / M1+.

Integration Patterns

SwiftUI + CoreML

struct ChatView: View {
    @StateObject var session = LLMSession()
    @State var input = ""
    
    var body: some View {
        VStack {
            ScrollView {
                Text(session.output)
            }
            
            HStack {
                TextField("Message", text: $input)
                Button("Send") {
                    Task {
                        await session.generate(prompt: input)
                        input = ""
                    }
                }
            }
        }
    }
}

Background Processing

// Long-running inference in background
let task = Task(priority: .userInitiated) {
    let result = try await model.prediction(from: inputFeatures)
    return result
}

// Cancel if user navigates away
task.cancel()

Memory Management

class ModelManager {
    private let modelURL: URL
    private var model: MLModel?
    
    init(modelURL: URL) {
        self.modelURL = modelURL
    }
    
    func loadIfNeeded() async throws {
        if model == nil {
            model = try await MLModel.load(contentsOf: modelURL)
        }
    }
    
    func unload() {
        model = nil  // Release the model's memory when idle
    }
}

Vision + Language

Image Understanding

let image = CIImage(image: uiImage)!
let request = VNGenerateImageFeaturePrintRequest()
let handler = VNImageRequestHandler(ciImage: image)
try handler.perform([request])

// Feed the feature print to a multimodal model
let featurePrint = request.results?.first as? VNFeaturePrintObservation
let response = try await vlModel.prediction(image: featurePrint?.data, text: "Describe this image")

Document Analysis

let request = VNRecognizeTextRequest { request, error in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
    let text = observations.compactMap { $0.topCandidates(1).first?.string }.joined(separator: "\n")
    
    // Summarize with LLM
    Task {
        let summary = try await summarizer.summarize(text)
    }
}

Privacy Benefits

Why Local Matters

Apple’s Privacy Positioning

User data + On-device model = Private AI
vs.
User data → Cloud API → Response (data exposure)

Enterprise and privacy-conscious users prefer local.

Convert Your Models

From PyTorch

import coremltools as ct
import torch

# Export to CoreML
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example_input.shape)],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS18
)
mlmodel.save("model.mlpackage")

From Hugging Face

import torch
import coremltools as ct
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-3-mini", torchscript=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-3-mini")

# Trace with example token IDs, then export to CoreML
example = tokenizer("Hello", return_tensors="pt")
traced = torch.jit.trace(model, example["input_ids"])
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example["input_ids"].shape)],
    convert_to="mlprogram"
)

From ONNX

# The ONNX frontend was dropped from recent coremltools releases; the legacy
# converter targets the older neural-network format, or go via PyTorch instead.
import coremltools as ct

mlmodel = ct.converters.onnx.convert(
    model="model.onnx",
    minimum_ios_deployment_target="13"
)

Best Practices

Model Selection

| Use Case | Model Size | Quality |
|---|---|---|
| Simple tasks | 1-3B | Good enough |
| General chat | 7B | Good |
| Complex reasoning | 13B+ | Best |

Start small, scale only if needed.

Thermal Management

// Monitor thermal state
let process = ProcessInfo.processInfo
if process.thermalState == .critical {
    // Reduce inference rate or quality
    useQuantizedModel()
}

Battery Considerations

// Check power state
if ProcessInfo.processInfo.isLowPowerModeEnabled {
    // Defer non-essential inference
    skipBackgroundProcessing()
}

Limitations

What Doesn’t Work Well (Yet)

Comparison with Cloud

| Aspect | On-Device | Cloud API |
|---|---|---|
| Latency | Low | Network dependent |
| Privacy | High | Lower |
| Model size | Limited | Unlimited |
| Cost | Free after device | Per-token |
| Updates | App update needed | Instant |
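The cost row invites a quick break-even estimate. A minimal sketch, where the cloud price and the share of device cost attributed to inference are assumptions, not quotes:

```python
def breakeven_tokens(device_cost_usd: float, cloud_usd_per_million: float) -> float:
    """Tokens at which cumulative cloud spend matches a one-time device cost."""
    return device_cost_usd / cloud_usd_per_million * 1_000_000

# If $100 of a device's price is attributed to inference and a cloud API
# charges $2 per million tokens, local pays for itself after 50M tokens.
print(breakeven_tokens(100, 2))  # 50000000.0
```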

Final Thoughts

Apple’s on-device AI strategy trades model size for privacy and latency. For many use cases, a local 3B model beats a cloud 70B model on user experience.

Build with CoreML. Respect privacy. Ship AI features that work offline.


The best AI is the one that doesn’t need to phone home.
