Testing Foundation Models: Code That Won't Give the Same Answer Twice

A confession before we start

VinylCrate ships four on-device AI features built on Apple’s Foundation Models framework. Album insights, recommendations, collection summaries, the lot. I wrote a whole post about building them — running everything on-device with @Generable, no server, no API key. I wrote another one about wiring up tool calling so the model could reach into the user’s actual collection. (I’ve since written about how Apple Foundation Models became a hybrid platform — the same session routing across on-device, Private Cloud Compute, and frontier tiers.) Those were the chapters about building the thing.

This is the chapter where I admit I never tested it.

Not loosely. Precisely. The app had 59 @Test functions and a 942-line AIService, and not one of those tests touched the model. No determinism configuration. No seam to mock it. No assertion on what happens when the model refuses a prompt. I built a feature whose entire value is the output of a language model, and then I tested everything except the language model.

I know how that happened, because it’s the same way it happens to everyone. You sit down to write the first test, you type #expect(insight.summary == ...), and you stop. Because what goes on the right side of that ==? The model doesn’t return a fixed string. It returns a string. A good one, usually. A different one, often. There’s nothing to assert equality against, so you close the file and tell yourself the UI tests cover it.

They don’t. So let’s fix it properly — with Swift Testing, and a strategy built for non-deterministic output instead of against it.

Stop trying to assert equality on AI output

Here’s the reframe that unlocks all of this: a language model is not a function in the mathematical sense. The same input does not map to the same output. The moment you accept that, you stop asking “what string will it return” and start asking “what must be true about whatever it returns.”

That question has answers. Lots of them. They live at different levels of confidence — three concentric rings at which you can test non-deterministic Foundation Models output:

1. The deterministic floor: things that are structurally guaranteed no matter what words come out.
2. Behavior around a seam: your logic surrounding the model, such as caching, fallbacks, and error mapping. Deterministic, offline, fast.
3. Evals: real prompts against the live model, scored against a rubric, tracked as a rate. Not pass/fail. A quality distribution.

The trap is thinking ring 3 is the goal and the first two are warm-ups. It’s the opposite. Most of the bugs that ship in AI features aren’t bad prose — they’re a refusal you didn’t handle, a cache that never hits, an enum that drifted out of sync with the schema. Those are ring 1 and ring 2, and they’re fully deterministic. The model being unpredictable doesn’t make your code around it unpredictable. That’s the whole game.

Ring 1: the deterministic floor — test the contract, not the words

First, a tool that surprises a lot of people: you can pin the model’s output, at least for a fixed model build.

// Greedy sampling removes the dice roll: for a fixed model build,
// the same prompt yields the same tokens every run.
let deterministic = GenerationOptions(sampling: .greedy)

// If you need sampled (more natural) output but still want it pinned
// in a test, seed the RNG instead of leaving it to chance:
let seeded = GenerationOptions(sampling: .random(top: 40, seed: 42))

Greedy sampling always picks the most likely next token, so the randomness goes away. Seeded sampling keeps the more natural, varied output but makes the dice rolls reproducible. Either one gives you a repeatable run.

Here’s the thing though — repeatable is not the same as guaranteed. Greedy output is stable for this model build on this OS. Apple ships a model update, the weights shift, and your “deterministic” string changes out from under you. So determinism is a tool for reproducing a run while you debug. It is not license to assert on the prose. The minute you write #expect(rec.reasoning == "Because you own three Coltrane records..."), you’ve written a test that passes today and pages you the morning iOS 26.2 ships. Don’t.

What you assert on instead is the contract — the part of the output that’s structurally guaranteed because you guaranteed it, with @Guide and your own types. And the single best ring-1 test in VinylCrate needs no model at all.

When I built recommendations, I constrained one field at the schema level — connectionType can only be one of five strings — and then, because the UI needs to switch on it, I wrote a separate enum mirroring those exact five values. Two lists. Same values. Nothing keeping them in sync except my memory, which is not a dependency I’d ship.

import Testing
@testable import VinylCrate

// AlbumRecommendation.connectionType is constrained in the model schema:
//   @Guide(.anyOf(["artist", "genre", "era", "label", "style"]))
// ...and a separate enum mirrors those values for the UI layer.
// Nothing stops those two lists from drifting apart. A test does.
@Test("Every connectionType the model may emit maps to a known case")
func connectionTypeContractHolds() {
    let allowed = ["artist", "genre", "era", "label", "style"]
    for value in allowed {
        #expect(
            RecommendationConnectionType(rawValue: value) != nil,
            "\(value) is in the schema's allowed set but has no enum case"
        )
    }
    #expect(RecommendationConnectionType.allCases.count == allowed.count)
}

This is the one I want you to steal today. It runs in microseconds. It never launches the model. And it catches the exact bug that will otherwise reach a user: I add "mood" to the @Guide set six months from now, ship it, the model dutifully emits "mood", my UI enum has no case for it, and the recommendation silently falls through to a default — or worse, crashes. The schema is a contract between my code and the model, and like any contract between two parties that can each change independently, it needs a test sitting on the boundary watching for drift.

The lesson generalizes. Anywhere you’ve mirrored a model constraint into a Swift type — allowed values, count ranges, required fields — that mirror can crack. A pure, model-free test holds it together.
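
One way to make that kind of boundary test catch drift in both directions is to compare the two lists as sets. What follows is a minimal, self-contained sketch, not VinylCrate's code: schemaAllowedValues stands in for whatever list you hand to @Guide, ConnectionType stands in for the mirroring enum, and ContractDrift and drift(schema:enumRawValues:) are hypothetical helpers invented for the illustration.

```swift
// Hypothetical stand-in for the list handed to the schema constraint.
let schemaAllowedValues = ["artist", "genre", "era", "label", "style"]

// Hypothetical stand-in for the UI-side enum that mirrors the schema.
enum ConnectionType: String, CaseIterable {
    case artist, genre, era, label, style
}

/// Describes how a schema list and its mirroring enum have drifted apart.
struct ContractDrift: Equatable {
    var missingFromEnum: Set<String>    // schema allows it, the enum cannot represent it
    var missingFromSchema: Set<String>  // the enum has a case the model can never emit
    var isEmpty: Bool { missingFromEnum.isEmpty && missingFromSchema.isEmpty }
}

/// Compares both lists as sets so that drift in either direction is reported.
func drift(schema: [String], enumRawValues: [String]) -> ContractDrift {
    let schemaSet = Set(schema)
    let enumSet = Set(enumRawValues)
    return ContractDrift(
        missingFromEnum: schemaSet.subtracting(enumSet),
        missingFromSchema: enumSet.subtracting(schemaSet)
    )
}

let enumValues = ConnectionType.allCases.map(\.rawValue)

// Today: the two lists agree.
let current = drift(schema: schemaAllowedValues, enumRawValues: enumValues)

// Six months from now: "mood" is added to the schema but not to the enum.
let afterAddingMood = drift(schema: schemaAllowedValues + ["mood"], enumRawValues: enumValues)
```

In a real test you would assert that the drift is empty and print both sets in the failure message, so the report names the offending value instead of only a count mismatch.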

When you do want to run the real model, assert shape, never words.

@Test("A generated recommendation honors its structural contract",
      .enabled(if: AIService.isModelAvailable))
@MainActor
func recommendationStructureHolds() async throws {
    let rec = try #require(
        await aiService.firstRecommendation(from: TestFixtures.sampleCollection())
    )
    #expect((1...3).contains(rec.basedOn.count))            // @Guide min/max
    #expect(RecommendationConnectionType(rawValue: rec.connectionType) != nil)
    #expect(!rec.title.isEmpty)
    // We assert the shape. `reasoning` can say anything — we never check its words.
}

Every assertion here is on something I declared in the schema. The basedOn array has a min/max count, so I check the range. connectionType has an allowed set, so I check membership. title is required, so I check it isn’t empty. And reasoning? The model can wax poetic about hard bop or keep it terse. I don’t care. I never read its words in a test, because its words are the one thing I can’t promise.

Note the .enabled(if:) trait. That’s Swift Testing’s gate, and it’s load-bearing for everything that touches the live model — more on that when we hit the trade-offs, because it has teeth.

Ring 2: mock the model and test the logic around it

Ring 1’s live test exposed VinylCrate’s real problem. Go look at it again — it reaches the model through AIService, and the version I first shipped created its LanguageModelSession inline, concretely, the moment it needed one. Which meant that test — and any test like it — couldn’t run without live Apple Intelligence on the machine. No model, no test. That’s why I had 59 tests and zero coverage of this code: it was physically untestable offline.

The fix is the oldest move in the book — a seam. Put everything the app needs from the model behind a protocol, and translate the framework’s errors into your own domain error right at that boundary.

// The seam: everything the app needs from the on-device model, behind a protocol.
// Note the typed throw: the framework's GenerationError is translated into our
// own domain error right here, so the rest of the app never sees a raw framework error.
protocol AlbumInsightGenerating: Sendable {
    func insights(for album: ReleaseDetail) async throws(AIServiceError) -> AlbumAIInsight
}

// Production: wraps a real LanguageModelSession and maps failures at the boundary.
struct LiveInsightGenerator: AlbumInsightGenerating {
    func insights(for album: ReleaseDetail) async throws(AIServiceError) -> AlbumAIInsight {
        do {
            let session = LanguageModelSession(instructions: Self.instructions)
            return try await session.respond(
                to: Self.prompt(for: album),
                generating: AlbumAIInsight.self,
                options: GenerationOptions(sampling: .greedy)
            ).content
        } catch let error as LanguageModelSession.GenerationError {
            throw AIServiceError(error)   // guardrailViolation, refusal, rateLimited, …
        } catch {
            throw .generationFailed("Unexpected failure")
        }
    }
}

Two things are happening here and both matter. The protocol gives me a place to inject a fake. And the typed throw — throws(AIServiceError) — turns the boundary into a translation layer. LanguageModelSession.GenerationError is the framework’s vocabulary: guardrail violations, refusals, rate limiting, context overflow. The rest of my app shouldn’t know those words exist. It should know about my failures, the ones I’ve decided how to handle. So I map them once, here, and everything upstream gets a closed, exhaustive AIServiceError. Swift 6’s typed throws makes that boundary a compile-time promise instead of a hope.
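
The snippet above calls AIServiceError(error) without showing it. Here is a minimal sketch of what such a translating initializer could look like. To keep it runnable without the framework, it maps from StandInGenerationError, a hypothetical stand-in enum: the real LanguageModelSession.GenerationError differs in detail, and which framework case maps to which domain case (for example, unavailable assets to modelUnavailable) is my assumption here. The translating(_:) helper is also invented for the illustration.

```swift
// Hypothetical stand-in for the framework's error type, so the sketch compiles
// anywhere. The real GenerationError is not reproduced here.
enum StandInGenerationError: Error {
    case guardrailViolation
    case refusal
    case rateLimited
    case exceededContextWindowSize
    case assetsUnavailable
}

// The app's own closed vocabulary of failures.
enum AIServiceError: Error, Equatable {
    case guardrailViolation
    case refused
    case rateLimited
    case contextTooLarge
    case modelUnavailable
    case generationFailed(String)

    /// Translates a framework-level error into the domain error.
    /// The switch has no default, so a new stand-in case fails to compile
    /// until someone decides how to map it.
    init(_ error: StandInGenerationError) {
        switch error {
        case .guardrailViolation:        self = .guardrailViolation
        case .refusal:                   self = .refused
        case .rateLimited:               self = .rateLimited
        case .exceededContextWindowSize: self = .contextTooLarge
        case .assetsUnavailable:         self = .modelUnavailable  // assumed mapping
        }
    }
}

/// Runs `work` and guarantees that only AIServiceError escapes (typed throws).
func translating<T>(_ work: () throws -> T) throws(AIServiceError) -> T {
    do {
        return try work()
    } catch let error as StandInGenerationError {
        throw AIServiceError(error)
    } catch {
        throw .generationFailed("Unexpected failure")
    }
}

// A fake model call that always refuses, for demonstration.
func refusingModel() throws -> String {
    throw StandInGenerationError.refusal
}
```

Calling translating(refusingModel) throws AIServiceError.refused, and any error the boundary does not recognize surfaces as generationFailed. That is the property the failure tests rely on.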

With the seam in place, the fake is trivial.

// Test double: returns or throws exactly what the test dictates. Runs anywhere.
struct StubInsightGenerator: AlbumInsightGenerating {
    var result: Result<AlbumAIInsight, AIServiceError>
    func insights(for album: ReleaseDetail) async throws(AIServiceError) -> AlbumAIInsight {
        try result.get()
    }
}

“Runs anywhere” is the phrase to sit with. No Apple Intelligence. No real device. No network. The CI box, the Simulator, your laptop on a plane with the model assets evicted — all fine. We’ve decoupled testing the logic from running the model, and that’s the unlock for the most important tests in the whole feature: the failure paths.

Because here’s what actually hurts users. Not a mediocre recommendation, but a crash, or a raw “error -1” leaking into a label when the model refuses or the device runs out of context. Those paths were completely untested in VinylCrate, and they’re the ones most likely to hit a real person, because real people paste weird things into text fields and run apps on devices with Apple Intelligence disabled.

Swift Testing’s parameterized tests turn every failure mode into one tidy suite.

@MainActor
@Suite("AIService surfaces every failure safely")
struct AIServiceFailureTests {

    // One test body, every failure mode. This is what @Test(arguments:) is for.
    @Test("Failures yield no insight and a user-safe message", arguments: [
        AIServiceError.guardrailViolation,
        .refused,
        .rateLimited,
        .contextTooLarge,
        .modelUnavailable,
    ])
    func failuresFallBackCleanly(_ failure: AIServiceError) async {
        let aiService = AIService(generator: StubInsightGenerator(result: .failure(failure)))

        let insight = await aiService.albumInsights(for: TestFixtures.makeRelease())

        #expect(insight == nil)                 // we degrade, we don't crash
        #expect(aiService.lastError == failure)       // mapped to our domain
        #expect(aiService.userFacingMessage != nil)   // never leak "...error -1." to a user
    }
}

One body, five cases, five independent results in the test navigator. When Apple adds a new GenerationError case next year, I map it to a domain error, add one line to the array, and instantly find out whether my fallback logic handles it. The three assertions encode the contract I actually care about: when generation fails, the user gets nil (not a crash), my code records the right domain error, and there’s a human-readable message ready to show. None of that requires the model to run. All of it requires the model to be mockable, which is exactly what the seam bought me.
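
For readers wondering what the service side of that contract might look like, here is a minimal sketch under stated assumptions. It is synchronous for brevity (the real service is async), FallbackInsightService is a hypothetical name, albums are reduced to a string ID, and the message strings are placeholder copy I made up for the example.

```swift
enum AIServiceError: Error, Equatable {
    case guardrailViolation, refused, rateLimited, contextTooLarge, modelUnavailable
    case generationFailed(String)
}

// Simplified seam: String in, String out.
protocol InsightGenerating {
    func insight(for albumID: String) throws(AIServiceError) -> String
}

// Test double: returns or throws exactly what the test dictates.
struct StubGenerator: InsightGenerating {
    var result: Result<String, AIServiceError>
    func insight(for albumID: String) throws(AIServiceError) -> String {
        try result.get()
    }
}

final class FallbackInsightService {
    private let generator: any InsightGenerating
    private(set) var lastError: AIServiceError?

    init(generator: any InsightGenerating) { self.generator = generator }

    /// Human-readable text for the last failure, or nil if the last call succeeded.
    /// The switch has no default, so a new error case forces a copy decision.
    var userFacingMessage: String? {
        guard let lastError else { return nil }
        switch lastError {
        case .guardrailViolation, .refused:
            return "We couldn't generate insights for this album."
        case .rateLimited:
            return "Too many requests right now. Try again in a moment."
        case .contextTooLarge:
            return "This album has too much detail to summarize."
        case .modelUnavailable:
            return "Insights aren't available on this device."
        case .generationFailed:
            return "Something went wrong while generating insights."
        }
    }

    /// Degrades to nil on failure and records the domain error instead of throwing.
    func insight(for albumID: String) -> String? {
        do {
            let value = try generator.insight(for: albumID)
            lastError = nil
            return value
        } catch {
            lastError = error   // `error` is typed as AIServiceError here
            return nil
        }
    }
}
```

The design choice worth copying is that the raw error never reaches the view layer: the UI only ever sees nil plus a message the service chose.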

The same seam pays off for the unglamorous logic too — like caching, which VinylCrate does to avoid regenerating insights for an album you just looked at. How do you prove a cache works? You count.

// A counting fake proves the cache without ever touching the model.
actor CountingStubGenerator: AlbumInsightGenerating {
    private(set) var callCount = 0
    let insight: AlbumAIInsight
    init(returning insight: AlbumAIInsight) { self.insight = insight }

    func insights(for album: ReleaseDetail) async throws(AIServiceError) -> AlbumAIInsight {
        callCount += 1
        return insight
    }
}

@Test("A second request for the same album is served from cache")
@MainActor
func cacheHitAvoidsSecondGeneration() async {
    let generator = CountingStubGenerator(returning: .fixture)
    let aiService = AIService(generator: generator)

    _ = await aiService.albumInsights(for: .fixture)
    _ = await aiService.albumInsights(for: .fixture)

    #expect(await generator.callCount == 1)   // generated once; cache served the rest
}

The fake is an actor because the call count is mutable state crossing async boundaries, and I’m not going to launder a data race through @unchecked Sendable just to test a cache. Two requests for the same album, one generation. That’s a deterministic test of a behavior that has nothing to do with the model’s words and everything to do with whether my feature wastes the user’s battery. The model could be a coin flip and this test would still be exactly as valid.
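
The test above treats the cache as a black box. As a rough sketch of the kind of code it exercises, here is a tiny caching wrapper around the seam. Everything in it is hypothetical and simplified: CachingInsightService is not VinylCrate's AIService, albums are reduced to a string ID, insights to a string, and the error type has a single case.

```swift
enum InsightError: Error, Equatable { case refused }

// Simplified seam: String in, String out.
protocol InsightGenerating: Sendable {
    func insight(for albumID: String) async throws(InsightError) -> String
}

// Hypothetical service: an actor, so the cache dictionary is protected
// from concurrent mutation without any unchecked escape hatches.
actor CachingInsightService {
    private let generator: any InsightGenerating
    private var cache: [String: String] = [:]

    init(generator: any InsightGenerating) { self.generator = generator }

    /// Returns a cached insight when one exists; otherwise generates and stores it.
    func insight(for albumID: String) async -> String? {
        if let cached = cache[albumID] { return cached }
        do {
            let fresh = try await generator.insight(for: albumID)
            cache[albumID] = fresh
            return fresh
        } catch {
            return nil   // failures are deliberately not cached
        }
    }
}

// Counting fake: records how many times generation was actually requested.
actor CountingStub: InsightGenerating {
    private(set) var callCount = 0

    func insight(for albumID: String) async throws(InsightError) -> String {
        callCount += 1
        return "insight for \(albumID)"
    }
}
```

One caveat of this sketch: because actors are reentrant, two overlapping requests for the same album could both miss the cache before either stores a result. Deduplicating in-flight requests is a separate behavior, and it would deserve its own counting test.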

That’s most of ring 2: error mapping, fallbacks, caching, retries, whatever orchestration logic you’ve wrapped around the model. It’s the bulk of the code in a 942-line service, it’s where most of the real bugs live, and once the seam exists, it’s all as deterministic and fast as any other unit test you’ve written.

Ring 3: writing evals for Foundation Models output

Rings 1 and 2 tested everything except the actual quality of the model’s output. At some point you have to look at the words. But you’re not going to assert on them — you’re going to measure them.

This is the mental shift that took me longest. A unit test asks a yes-or-no question: did the code do the thing, true or false. An eval asks a how-often question: across many runs, how often does the output meet the bar. One returns a boolean. The other returns a rate. They feel similar because they both live in @Test functions, but they’re different instruments: a unit test is a smoke detector, an eval is a thermometer.

Concretely: I care that recommendations stay grounded in albums the user actually owns. The model has the collection via tool calling, but on-device models hallucinate a citation now and then: they’ll recommend something “because you love Yo Yo Honey Singh” when there’s no Yo Yo Honey Singh in the crate. I can’t assert that never happens. I can measure how often it does and refuse to ship if the rate slips.

@Test("Recommendations stay grounded in the user's actual collection",
      .enabled(if: AIService.isModelAvailable))
@MainActor
func recommendationsAreGrounded() async throws {
    let collection = TestFixtures.sampleCollection()
    let owned = Set(collection.map(\.title))

    let trials = 20
    var grounded = 0
    for _ in 0..<trials {
        let rec = try #require(await aiService.firstRecommendation(from: collection))
        if rec.basedOn.allSatisfy(owned.contains) { grounded += 1 }
    }

    // Not 100%. On-device models hallucinate a citation now and then.
    // We assert a quality FLOOR and track the rate over time like any other metric.
    let passRate = Double(grounded) / Double(trials)
    #expect(passRate >= 0.8, "Grounding rate dropped to \(passRate)")
}

Twenty trials, count the grounded ones, and the only #expect in the whole test is on the rate. Eighty percent, not a hundred. The first time you write >= 0.8 it feels like cheating — like you’re lowering the bar so the test passes. You’re not. You’re being honest about the instrument. A 100% grounding assertion against a probabilistic model isn’t a higher standard, it’s a flaky test that will eventually fail for reasons that have nothing to do with a regression you caused. The floor is the real standard, and the rate is the metric you watch drift over time, the same way you’d watch a p99 — the 99th-percentile slowest response.
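
To put a number on "flaky", here is a small worked calculation. It assumes each trial is independent and succeeds with the same probability, and it uses a made-up per-run grounding rate of 95%. That figure is purely illustrative, not something I measured. The helper names are mine.

```swift
import Foundation

/// Binomial coefficient C(n, k), computed multiplicatively as a Double.
func choose(_ n: Int, _ k: Int) -> Double {
    guard k >= 0, k <= n else { return 0 }
    var result = 1.0
    for i in 0..<min(k, n - k) {
        result = result * Double(n - i) / Double(i + 1)
    }
    return result
}

/// Probability of at least `minimum` successes in `trials` independent runs,
/// each succeeding with probability `p`. Requires minimum <= trials.
func probabilityOfAtLeast(_ minimum: Int, in trials: Int, successRate p: Double) -> Double {
    (minimum...trials).reduce(0.0) { total, k in
        total + choose(trials, k) * pow(p, Double(k)) * pow(1 - p, Double(trials - k))
    }
}

let assumedRate = 0.95  // hypothetical per-run grounding rate, NOT a measurement

// Demanding 20 out of 20 grounded runs:
let strictPass = probabilityOfAtLeast(20, in: 20, successRate: assumedRate)  // about 0.36

// Demanding at least 16 out of 20 (the 0.8 floor):
let floorPass = probabilityOfAtLeast(16, in: 20, successRate: assumedRate)   // about 0.997
```

Under those assumptions, a test that demands 100% would fail roughly two runs out of three with nothing wrong, while the 0.8 floor passes almost every time and still trips if the underlying rate drops substantially.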

The rubric is the hard part, and it’s the part that makes evals real engineering rather than vibes. “Grounded” is easy — set membership, fully objective. But “is this a good insight” needs a definition you can compute or a judge you trust, and writing that rubric forces you to actually articulate what good means. Which, it turns out, I’d never done for VinylCrate’s insights. I just knew them when I saw them. An eval doesn’t let you get away with that, and that discipline alone is worth the price of admission.
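
As a starting point for the computable kind of rubric, here is a minimal sketch: a rubric is a list of named predicates, and the harness reports a pass rate per criterion. All names (SampleInsight, Criterion, passRates, makeRubric) and the three criteria are hypothetical illustrations, not VinylCrate's actual rubric, and the samples are hard-coded stand-ins for model output.

```swift
// Hypothetical stand-in for one generated output.
struct SampleInsight {
    var summary: String
    var citedAlbums: [String]
}

// One rubric line: a name plus a computable yes/no check.
struct Criterion {
    let name: String
    let passes: (SampleInsight) -> Bool
}

/// Scores every sample against every criterion and returns a rate per criterion.
func passRates(for samples: [SampleInsight], against rubric: [Criterion]) -> [String: Double] {
    var rates: [String: Double] = [:]
    guard !samples.isEmpty else { return rates }
    for criterion in rubric {
        let passed = samples.filter(criterion.passes).count
        rates[criterion.name] = Double(passed) / Double(samples.count)
    }
    return rates
}

/// Illustrative criteria only. Deciding what belongs here is the real work.
func makeRubric(owned: Set<String>) -> [Criterion] {
    [
        Criterion(name: "grounded", passes: { insight in
            insight.citedAlbums.allSatisfy { owned.contains($0) }
        }),
        Criterion(name: "cites something", passes: { insight in
            !insight.citedAlbums.isEmpty
        }),
        Criterion(name: "concise", passes: { insight in
            insight.summary.split(separator: " ").count <= 40
        }),
    ]
}

// Made-up fixture data standing in for three model runs.
let samples = [
    SampleInsight(summary: "Short summary one.", citedAlbums: ["Blue Train"]),
    SampleInsight(summary: "Short summary two.", citedAlbums: ["Giant Steps", "Kind of Blue"]),
    SampleInsight(summary: "Short summary three.", citedAlbums: []),
]
let rates = passRates(for: samples, against: makeRubric(owned: ["Blue Train", "Giant Steps"]))
```

The third sample cites nothing, so it passes "grounded" vacuously. That is why "cites something" is its own criterion: writing the rubric down tends to expose gaps like that.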

What this actually costs

I’d be doing the same thing I did for a year — pretending this is free — if I stopped here. It isn’t.

The seam is real refactoring. That 942-line AIService doesn’t grow a protocol boundary for free. Every place that news up a session has to route through the generator instead, the domain error type has to be designed and mapped exhaustively, and you have to resist the urge to make the protocol leak framework types (the moment AlbumInsightGenerating mentions LanguageModelSession, you’ve built a seam that seams nothing). It’s a day of careful work, not an afternoon. It’s also the highest-leverage day, because everything in ring 2 is impossible without it.

Evals need a suite, and suites rot. One grounding test is a demo. A real eval harness is a curated set of prompts with rubrics, and those prompts have to evolve as the feature does. Prompt changes, model updates, and new features all move the rates around, and you’ll spend time deciding whether a drop is a regression or just the model being the model. That’s ongoing maintenance, and pretending otherwise is how eval suites end up disabled and ignored.

The live rings don’t run in CI — and isModelAvailable alone won’t save you. This is the sharp one. Foundation Models needs Apple Intelligence, so I gated rings 1-live and 3 behind .enabled(if: AIService.isModelAvailable) and assumed the Simulator and CI would just skip them. Then a Simulator on my machine reported the model available, ran the eval, and failed — it answers the availability check but can’t actually generate. Availability is necessary, not sufficient. So the real gate ANDs it with an explicit opt-in:

enum LiveAITesting {
    static var isEnabled: Bool {
        AIService.isModelAvailable
            && ProcessInfo.processInfo.environment["RUN_LIVE_AI_TESTS"] == "1"
    }
}
// ...then: @Test("...", .enabled(if: LiveAITesting.isEnabled))

Now the live rings only fire when I deliberately set RUN_LIVE_AI_TESTS=1 on real, eligible hardware. Ring 2 — the deterministic, mockable ring — is the CI gate; the live evals run on demand before a release. Be deliberate about that split. A green CI badge that silently skipped every test that touches the model is worse than no badge, because it lies to you.

Where I landed

The honest version of this story is that I shipped a feature I couldn’t test, talked myself into believing the gap was the model’s fault, and only fixed it when I sat down to write about how I’d tell someone else to do it.

The fix wasn’t a clever testing trick. It was changing the question. You can’t ask a language model the same question and get the same answer, so stop building tests that demand one. Assert the contract, which never moves. Mock the model, and test the logic around it, which is where the bugs actually live. And for the words themselves — measure, don’t assert. Track a rate, set a floor, watch for drift.

Determinism is the floor of this work, not the goal. The goal is knowing — with evidence, not vibes — that the thing still does what it’s supposed to, even when it never says it the same way twice.

Extracting that protocol and breaking up the 942-line file was the work. The tests are what made it safe — they tell me the moment I’ve broken something, long before a user would. That’s the whole point: not that the model became predictable, but that I stopped flying blind.