Apple Foundation Models Is Now a Hybrid Platform

When I shipped VinylCrate’s AI features, I only had one question to answer per feature, and it was binary: can this run on the phone, or does it not exist?

Album insights, recommendations, collection summaries — the lot, four features running entirely on-device on Apple’s Foundation Models. Every one of them @Generable, no server, no API key, nothing to pay for. That constraint wasn’t a limitation I worked around. It was the design. If a feature couldn’t fit inside the small on-device model, it didn’t make the cut. I wasn’t choosing how much model a feature deserved. I was checking whether the phone could do it at all, and shipping the ones that passed.

That question just changed, and it took me a minute to notice why.

The interesting thing about Apple’s Foundation Models was never the model. It was the seam. When LanguageModelSession shipped, the on-device model was the whole story — small, fast, private, free to call as often as I wanted. Genuinely new. But the part that’s aging well isn’t the weights. It’s the API boundary. Apple drew a line and said this is what talking to a language model looks like on our platform: try await session.respond(to:), streamResponse, @Generable, the Tool protocol. One vocabulary.

What’s happening now is that the thing on the other side of that line is becoming plural. On-device is still there. Apple’s Private Cloud Compute — server-scale models in a stateless, verifiable enclave — sits behind the same session. Frontier third-party models route through it too. Three very different cost, latency, and privacy envelopes, and a call site that barely moves between them.

None of this is a roadmap slide — the hybrid session landed in the current developer beta. The capability is real today; what’s still settling is the exact API for telling a session which tier to use. That’s why the routing code further down is shaped right but isn’t something you can paste yet — and it’s worth reading the rest with “as of the beta” in the back of your head.

For VinylCrate, that reframes the question I thought was settled. I built those four features under “can the phone do it.” Now the question is which of them, if any, has earned more model than the phone can give. And for the first time the answer to “borrow something bigger” isn’t “stand up a backend and start paying.” It’s a line of policy.

The Call Site That Doesn’t Move

Let me be concrete about what stays identical, because that’s the part that makes the rest worth caring about.

You already know the on-device path. Check availability, make a session, ask.

import FoundationModels

let model = SystemLanguageModel.default

switch model.availability {
case .available:
    break
case .unavailable(.deviceNotEligible),
     .unavailable(.appleIntelligenceNotEnabled),
     .unavailable(.modelNotReady):
    // fall back to a non-AI path
    return
case .unavailable:
    return
}

let session = LanguageModelSession(
    instructions: "Summarize the passenger's note in one sentence."
)
let response = try await session.respond(to: note)

Guided generation doesn’t change. You define a @Generable type, constrain it with @Guide, and get a typed Swift value back instead of a string you have to pray-parse. In VinylCrate that’s how an album insight comes back as a struct I can render, not prose I have to scrape — same mechanism as the support summary here.

@Generable(description: "A structured summary of a support message")
struct TicketSummary: Sendable {
    @Guide(description: "One-line summary, no punctuation at the end")
    var headline: String

    @Guide(description: "Urgency from 1 (low) to 5 (drop everything)", .range(1...5))
    var urgency: Int

    @Guide(description: "Two to four routing tags", .minimumCount(2), .maximumCount(4))
    var tags: [String]
}

let summary = try await session.respond(
    to: ticketBody,
    generating: TicketSummary.self
).content

Streaming doesn’t change. Tools don’t change. The Tool protocol still hands the model a function it can decide to call, and the model still decides whether to. VinylCrate’s recommendation feature leans on exactly this — a tool that queries the user’s actual SwiftData collection so the model can answer “do I own anything by Tool?” by reading real rows instead of guessing. The illustrative flight tool below has the same shape.

struct FlightStatusTool: Tool, Sendable {
    let name = "getFlightStatus"
    let description = "Look up live status for a flight number"

    @Generable
    struct Arguments: Sendable {
        @Guide(description: "IATA flight number, e.g. F9123")
        var flightNumber: String
    }

    func call(arguments: Arguments) async throws -> String {
        try await FlightService.status(for: arguments.flightNumber).summary
    }
}

Here’s why that matters. Every one of those snippets is written against the session, not against the on-device model specifically. So when the backend behind the session becomes a Private Cloud Compute model — or a frontier one — none of that code has to know. The @Generable schema is the same schema. The Tool is the same tool. The streamResponse loop is the same loop. I wrote VinylCrate’s recommendation feature once; whether the model behind it is the local one or something bigger is now a separate, movable decision.

What Changes Underneath: Availability, Errors, Guardrails

Don’t mistake “the call site is stable” for “nothing changed.” The call site is the only thing that’s stable. Everything you reason about around it moves tier to tier, and pretending otherwise is how you ship something that works on your desk and falls over in the field.

Three things shift the moment a request can leave the device.

Availability stops being a one-time check. On-device, availability is mostly a fact about the hardware and the user’s settings — eligible device, Apple Intelligence on, model downloaded. You check it once at the top and you’re basically done. The instant a request can go to PCC, availability becomes a per-call concern, because now it depends on the network and on capacity that isn’t yours. A session that resolved fine ten seconds ago can fail now. So that availability switch I wrote as a one-time guard becomes something I have to be willing to hit on every request, with a fallback that actually does something useful instead of an empty state.
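
A minimal sketch of that per-call check, under loud assumptions: Readiness and headline(for:readiness:generate:) are names I made up, and the two closures stand in for the real availability check and the real session call so the logic runs without FoundationModels. It only shows the shape: re-check on every request, and fall back to something useful.

```swift
// Hypothetical stand-in for the framework's availability value, so this
// sketch compiles without FoundationModels.
enum Readiness {
    case ready
    case notReady(reason: String)
}

/// Returns a one-line headline for `note`, checking readiness on every call.
/// `readiness` and `generate` are placeholders for the real check and session call.
func headline(
    for note: String,
    readiness: () -> Readiness,
    generate: (String) throws -> String
) -> String {
    // Non-AI fallback: the first line of the note, capped at 60 characters.
    let firstLine = note.split(separator: "\n", maxSplits: 1).first.map { String($0) } ?? ""
    let fallback = String(firstLine.prefix(60))

    // Re-check each time: a tier that was reachable a moment ago may not be now.
    guard case .ready = readiness() else { return fallback }

    do {
        return try generate(note)
    } catch {
        // Any failure degrades to the fallback instead of an empty state.
        return fallback
    }
}
```

Whether every failure should collapse to the same fallback is a product call. Here it does, purely to keep the sketch short.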

The error surface widens, and the catch blocks stop being boilerplate. On-device I mostly reasoned about guardrailViolation, refusal, and exceededContextWindowSize. Off-device, rateLimited and assetsUnavailable go from theoretical to load-bearing. The same prompt can succeed on-device, get rate-limited at PCC, and cost money at the frontier tier. At that point the catch block isn’t error handling tacked on at the end — it’s the routing logic.

do {
    let result = try await session.respond(to: prompt, generating: TicketSummary.self)
    return .ok(result.content)
} catch LanguageModelSession.GenerationError.rateLimited {
    // a tier-level signal: back off, or drop to a cheaper tier
    return .retryLower
} catch LanguageModelSession.GenerationError.exceededContextWindowSize {
    // a smaller model can't hold this — escalate or truncate
    return .escalate
} catch LanguageModelSession.GenerationError.guardrailViolation {
    return .blocked
} catch LanguageModelSession.GenerationError.refusal(let refusal, _) {
    // `explanation` resolves to a response wrapper; read its text via `.content`.
    let why = (try? await refusal.explanation)?.content
    return .refused(why ?? "")
} catch LanguageModelSession.GenerationError.assetsUnavailable {
    return .unavailable
}

And the guardrail and refusal behavior isn’t identical across tiers. A larger model refuses differently, hallucinates differently, follows instructions more tightly. The shape of the error is the same — you still catch refusal — but what trips it shifts under you. I spent real time tuning VinylCrate’s insight prompts to thread the on-device guardrails, and I wouldn’t assume that tuning transfers cleanly to a bigger model. So test the prompt against the tier you’ll actually run it on, not the one that was convenient to develop against.

So: identical syntax, different envelope. The discipline is to treat the session as a stable interface over an unstable substrate — which is just dependency injection wearing a new hat, except the dependency is “how much does this thought cost, and where does it happen.”

What “Free” Actually Buys You

I want to be careful with the word free, because it’s doing a lot of work and it’s only half true.

On-device is free and private and offline, and it’s also small and quality-bounded. It will not write your legal brief. It is, though, genuinely excellent at the thing it’s for: fast, structured, local transformations of data the user already has. Tagging, summarizing, extracting, rewriting in a tone — the stuff where shipping the data to a server always felt slightly gross anyway. Three of VinylCrate’s four features live comfortably here and have no business going anywhere else.

Private Cloud Compute is the part that reframes things. It’s free for most of us in the sense that Apple is eating the compute — a platform play, the way iCloud’s free tier is a platform play. You get server-scale quality with a privacy guarantee that’s actually verifiable rather than a pinky-swear in a policy document. But free isn’t free of consequences. You’re now reasoning about round-trips, about finite capacity, about rate limits that exist because it’s free. Apple paying the bill doesn’t repeal the laws of physics.

Frontier third-party models routed through the same session are where the meter finally starts running — your account, your key, your bill. That’s fine. Some features genuinely need a frontier model and are worth paying for. The point is that the billing boundary and the capability boundary now line up, and you can see exactly when you’ve crossed one.

The old mental model was a single scary question — “is this feature worth an API call?” — and the answer was usually “let’s not, it adds up.” The new model is a gradient. Does this feature deserve the free, private, local tier? Most do; default here. Does it deserve the free-but-networked, server-quality, still-private tier? Some do; reach for it deliberately. Does it deserve the metered frontier tier? A few do; pay for those on purpose.
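
Those three questions can be written down as a tiny decision function. This is a sketch, not anything Apple ships: CostTier, FeatureNeeds, and cheapestTier(for:) are hypothetical names, and the three booleans are my own simplification of "what has this feature earned."

```swift
// Hypothetical names throughout; this only encodes the three questions above.
enum CostTier {
    case onDevice      // free, private, offline
    case privateCloud  // free but networked
    case frontier      // metered
}

/// What a feature claims to need. The defaults describe the common case.
struct FeatureNeeds {
    var needsServerQuality = false
    var needsFrontierQuality = false
    var spendApproved = false
}

/// Picks the cheapest tier that satisfies the stated needs.
func cheapestTier(for needs: FeatureNeeds) -> CostTier {
    if needs.needsFrontierQuality && needs.spendApproved {
        return .frontier
    }
    if needs.needsServerQuality || needs.needsFrontierQuality {
        // Frontier wanted but not approved: cap at the free networked tier.
        return .privateCloud
    }
    return .onDevice
}
```

The default initializer, FeatureNeeds(), lands on the on-device tier, which is the point: a feature has to state a need before it moves up.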

You stop asking whether you can afford an LLM. You start asking how much model this particular feature has earned. That’s a healthier question, and it makes better products, because most features were never going to justify a frontier call, and now they don’t have to pretend they would.

Model Tier Routing as an Architecture Concern

If tier selection is a design decision, it deserves a place in your architecture instead of being smeared across call sites. I want a thin thing that, per feature, picks a tier, builds the right session, runs the request, and degrades gracefully when its first choice isn’t available.

One caveat before the code, because I’d rather be precise than impressive. The unified session and everything in the snippets above — LanguageModelSession, @Generable, Tool, streamResponse, the availability switch, the error cases — are the concrete public surface. The part where you select which backend a session targets is the directional piece. Treat the configuration knob below as the shape of the thing, not a symbol you can paste. The architecture is the point; the exact initializer is Apple’s to finalize.

Start with the policy as data, not as if statements scattered through your views.

/// What a feature is allowed to use, in preference order.
enum ModelTier: Sendable {
    case onDevice      // free, private, offline, small
    case privateCloud  // free-for-most, private, networked, large
    case frontier      // metered, networked, largest
}

struct TierPolicy: Sendable {
    /// Highest tier this feature is *permitted* to reach.
    let ceiling: ModelTier
    /// Whether it may fall back down the ladder on failure.
    let degrades: Bool
}

extension TierPolicy {
    /// Cheap, local, runs anywhere. Most features live here.
    static let local = TierPolicy(ceiling: .onDevice, degrades: false)
    /// Wants quality but must never silently spend money.
    static let cloudFree = TierPolicy(ceiling: .privateCloud, degrades: true)
    /// Earns the frontier, falls back rather than failing the user.
    static let premium = TierPolicy(ceiling: .frontier, degrades: true)
}

Now the router. Its only job: given a policy and a prompt, build the right session, try the preferred tier, and walk down the ladder when a tier signals it can’t serve the request. Because building sessions and talking to networked backends is background work that holds no UI state, this is an actor — its own isolation domain, Sendable by construction.

actor ModelRouter {
    /// Directional: the real knob for selecting a backend is Apple's to ship.
    /// This is the *shape* — a session bound to a chosen tier.
    private func makeSession(
        for tier: ModelTier,
        instructions: String
    ) -> LanguageModelSession {
        switch tier {
        case .onDevice:
            return LanguageModelSession(
                model: SystemLanguageModel.default,
                instructions: instructions
            )
        case .privateCloud, .frontier:
            // Shape, not a copy-pasteable symbol: a session whose backend
            // is the PCC or frontier model rather than the local one.
            return LanguageModelSession(
                model: SystemLanguageModel.backed(by: tier),
                instructions: instructions
            )
        }
    }

    private func ladder(for policy: TierPolicy) -> [ModelTier] {
        let full: [ModelTier] = [.onDevice, .privateCloud, .frontier]
        // Keep only tiers at or below the ceiling, then try the ceiling first.
        let allowed = full.filter { $0 <= policy.ceiling }
        return policy.degrades ? Array(allowed.reversed()) : [policy.ceiling]
    }

    func respond<Output: Generable & Sendable>(
        to prompt: String,
        as type: Output.Type,
        instructions: String,
        policy: TierPolicy
    ) async throws -> Output {
        var lastError: (any Error)?

        for tier in ladder(for: policy) {
            let session = makeSession(for: tier, instructions: instructions)
            do {
                return try await session.respond(to: prompt, generating: type).content
            } catch LanguageModelSession.GenerationError.rateLimited,
                    LanguageModelSession.GenerationError.assetsUnavailable {
                // Tier-level "not now" — drop to the next tier down.
                lastError = RouterError.tierUnavailable(tier)
                continue
            } catch LanguageModelSession.GenerationError.exceededContextWindowSize {
                // A smaller tier definitely can't hold this; stop walking *down*.
                throw RouterError.contextTooLarge
            }
            // guardrailViolation / refusal are content decisions, not capacity
            // decisions — let them propagate. Don't retry on a lower tier.
        }

        throw lastError ?? RouterError.exhausted
    }
}

enum RouterError: Error {
    case tierUnavailable(ModelTier)
    case contextTooLarge
    case exhausted
}

extension ModelTier: Comparable {
    private var rank: Int {
        switch self {
        case .onDevice: 0
        case .privateCloud: 1
        case .frontier: 2
        }
    }
    static func < (lhs: ModelTier, rhs: ModelTier) -> Bool { lhs.rank < rhs.rank }
}

What I actually care about in that code isn’t the session-building; it’s the decisions about what to retry. rateLimited and assetsUnavailable are capacity signals, so they degrade down the ladder. exceededContextWindowSize throws instead, because walking down makes no sense: a smaller model has less room, not more. This is a real wall for VinylCrate, by the way: a 500-record collection blows straight past the on-device context window the moment you try to stuff it into a prompt, which is exactly why the recommendation feature queries the collection through a tool instead of inlining it. And guardrailViolation and refusal are content decisions, not capacity decisions. Retrying those on a different tier is how you accidentally build a system that shops around for a model willing to do the thing your guardrails just stopped. That distinction is the product judgment. The networking is plumbing.
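
Because those retry decisions are the part worth testing, here is a sketch of the same walk-down logic with the framework removed. Everything in it is a stand-in: Rung, RungFailure, WalkError, rungs(ceiling:degrades:), and walk(ceiling:degrades:attempt:) are hypothetical names, the attempt closure plays the role of a session call, and it is synchronous only to keep the example small.

```swift
// Stand-ins for the tiers and error cases, so the walk-down logic can run
// without FoundationModels. All names here are hypothetical.
enum Rung: Int, CaseIterable {
    case onDevice, privateCloud, frontier
}

enum RungFailure: Error {
    case rateLimited, assetsUnavailable   // capacity: try the next rung down
    case exceededContextWindowSize        // size: a smaller rung cannot help
    case guardrailViolation, refusal      // content: never shop around
}

enum WalkError: Error, Equatable {
    case contextTooLarge
    case exhausted
}

/// Rungs to try, ceiling first, or only the ceiling when degrading is off.
func rungs(ceiling: Rung, degrades: Bool) -> [Rung] {
    guard degrades else { return [ceiling] }
    return Rung.allCases.filter { $0.rawValue <= ceiling.rawValue }.reversed()
}

/// Tries each rung in order; `attempt` stands in for a session call.
func walk(
    ceiling: Rung,
    degrades: Bool,
    attempt: (Rung) throws -> String
) throws -> String {
    for rung in rungs(ceiling: ceiling, degrades: degrades) {
        do {
            return try attempt(rung)
        } catch RungFailure.rateLimited, RungFailure.assetsUnavailable {
            continue  // capacity signal: drop one rung
        } catch RungFailure.exceededContextWindowSize {
            throw WalkError.contextTooLarge  // stop: lower rungs have less room
        }
        // Anything else, including content errors, propagates unchanged.
    }
    throw WalkError.exhausted
}
```

With a closure that throws RungFailure.rateLimited for .privateCloud and succeeds otherwise, a walk from a .privateCloud ceiling tries .privateCloud and then .onDevice. A closure that throws RungFailure.refusal is tried exactly once, and the refusal reaches the caller untouched.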

Then the UI side stays exactly as boring as it should be. The view model is @MainActor @Observable, owns no model logic, and just expresses intent — this feature is local, that one earns the cloud.

import Observation

@MainActor
@Observable
final class TriageViewModel {
    private let router = ModelRouter()
    var summary: TicketSummary?
    var state: LoadState = .idle

    func triage(_ body: String) async {
        state = .loading
        do {
            // This feature is worth server quality, but must never spend money.
            summary = try await router.respond(
                to: body,
                as: TicketSummary.self,
                instructions: "Summarize and route this support ticket.",
                policy: .cloudFree
            )
            state = .loaded
        } catch {
            state = .failed
        }
    }
}

enum LoadState { case idle, loading, loaded, failed }

Notice what the view model doesn’t know. It doesn’t know whether the work happened on the device or in a data center, or whether it degraded from PCC down to on-device when capacity was tight. It knows it asked for a quality summary that costs nothing and either got one or didn’t. The policy — .cloudFree, .local, .premium — is the whole decision surface, and it lives where product decisions belong: at the feature, declared, reviewable, one line.

How This Changes Your Architecture

Three things to do differently, starting on the next feature.

Stop hand-writing tier choices into call sites. The minute you have more than one backend, “which model” is a policy, and policy belongs in one place you can audit — not sprinkled across view models where nobody can answer “what are we sending to a server and what’s staying local?” If a privacy reviewer can’t read your tier policy in a single file, you don’t have a privacy posture, you have a vibe.

Default to the free, private tier and make every escalation justify itself. The whole reframe collapses if you reach for the biggest model by reflex. Most features — tagging, summarizing, extracting, rewriting — are on-device features that might occasionally borrow PCC quality, not frontier features slumming it locally. Write .local until a feature earns more, and make .premium rare enough that seeing it in a diff makes you stop and read.

Treat your catch blocks as the system, not as error handling. In a single-backend world, errors were edge cases you logged. In a tiered world, rateLimited and assetsUnavailable are control flow — they’re how the system finds the tier that can serve the request right now. If your error handling is an afterthought, your routing is an afterthought, and your feature falls over the first time PCC is busy.
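
For the first of those points, this is roughly what "one place you can audit" could look like. It is a sketch with hypothetical names (Reach, Feature, featuresLeavingDevice()), and the feature list and ceilings are illustrative assumptions rather than VinylCrate's actual code.

```swift
// Hypothetical feature list and policy table; the point is one auditable place.
enum Reach: Int, Comparable {
    case onDevice, privateCloud, frontier
    static func < (lhs: Reach, rhs: Reach) -> Bool { lhs.rawValue < rhs.rawValue }
}

enum Feature: String, CaseIterable {
    case albumInsight, collectionSummary, recommendationReasoning

    /// The highest tier each feature may reach. The switch is exhaustive,
    /// so adding a feature will not compile until it declares a ceiling.
    var ceiling: Reach {
        switch self {
        case .albumInsight, .collectionSummary:
            return .onDevice
        case .recommendationReasoning:
            return .privateCloud
        }
    }
}

/// Every feature that is allowed to send a request off the device.
func featuresLeavingDevice() -> [Feature] {
    Feature.allCases.filter { $0.ceiling > .onDevice }
}
```

A reviewer can answer "what leaves the phone?" by reading one switch or calling one function, and a change to any ceiling shows up as a one-line diff.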

So back to VinylCrate, honestly. Three of the four features stay exactly where they are — local, .local, done. They were always on-device features and they’re better for it. The one I’d actually consider for .cloudFree is recommendation reasoning: the part where the model explains why it’s surfacing an album, where a bigger model genuinely writes a better sentence and the privacy guarantee still holds. That one might have earned a step up the ladder. And nothing in VinylCrate earns the frontier tier. Not because I couldn’t wire it up — because no feature in the app needs that much model, and pretending otherwise would be paying for capability my users would never feel.

That’s the part I had to sit with. The unified session was sold as a convenience — write less code, support more models — and that undersells it. What Apple actually shipped is a world where the most capable, most private inference my app can do is the inference I’m not paying for, behind an API that doesn’t flinch when the model behind it changes. The skill that matters now isn’t calling an LLM. Everybody can call an LLM. It’s deciding, per feature, how much model the thing in front of you has earned — and being honest when the answer is less than you think.