Enter Sandman Mode: Three Months Inside Xcode 26.3's Agentic Coding

I handed Xcode a task three months ago and walked to the kitchen for coffee. When I came back, the build was green, two files I hadn’t named were edited, and a test I’d forgotten existed was passing again. I sat down, read the diff, and reverted half of it.

That’s the honest shape of the last three months. The agent built into Xcode 26.3 is genuinely good at a specific kind of work, genuinely bad at another, and the gap between those two is exactly where you earn your keep now. This is a field report, not a launch-day hot take — three months of real work on VinylCrate, the wins and the reverts both.

What Actually Changed in 26.3

Let’s get the framing right, because the marketing blurred it and I want to be precise.

Xcode 26.3 did not “replace” Predictive Code Completion. That’s the line I keep hearing, and it’s wrong. Predictive completion — the on-device, Apple-silicon model that finishes your line as you type — is still right there, still ghosting suggestions into the editor. What 26.3 added is a separate, heavier thing: an agentic mode that sits alongside completion and the Coding Tools panel. Apple’s own framing is that it “expands on” the Xcode 26 coding assistant, not that it retires anything. Three surfaces now coexist: inline completion, the targeted Coding Tools operations, and the agent.

The agent is the new animal. Instead of replying in a chat box, it can break a task into steps, edit multiple files, build the project, read the diagnostics, and try again — autonomously, against your actual workspace. Apple ships it with direct integrations for Anthropic’s Claude Agent and OpenAI’s Codex, and an open extension point through the Model Context Protocol, so in principle any MCP-compatible agent can drive the same UI. That last part matters more than it sounds, and we’ll get there.

Under the hood, Xcode exposes its capabilities to the agent as a set of MCP tools — reportedly around twenty of them, grouped into file-system operations, build and test, diagnostics, intelligence, and workspace control. The teardowns describe a mcpbridge binary bridging external agents into Xcode’s XPC layer. I haven’t verified the count by reading the binary myself, so take “around twenty” as the community number, not gospel. The categories are what matter anyway, and you feel them in the work.

Two of those tools are the ones I keep coming back to. One is a semantic documentation search — an MLX-accelerated embedding model running locally on Apple silicon, which the teardowns say Apple calls “Squirrel MLX” internally. It does semantic search across Apple’s documentation corpus and WWDC transcripts, reportedly iOS 15 through 26. The other is preview rendering: the agent can capture an actual screenshot of a SwiftUI preview and look at the UI it just changed, then iterate. That’s the loop nobody else in this space had wired up cleanly when 26.3 shipped.

So that’s the kit. To test it, I needed one real, gnarly, representative task in VinylCrate — not a toy.

The task: migrate VinylCrate’s collection grid — the main screen, the scrolling wall of album art that is the app — off its aging ObservableObject view model and onto @Observable. And while I was in there, put data access behind a single CollectionStore protocol, so the Discogs sync path and the local SwiftData cache stopped being two divergent code paths the view had to reason about separately. Roughly a dozen files were in scope: the view model, the grid view and a couple of its subviews, the new protocol and its two conformers (a remote Discogs-backed store and a local SwiftData one), the artwork cache, and the mocks in the test target.

I picked it on purpose, because it’s the unglamorous shape most real work actually takes. Not a greenfield feature with clean edges — a refactor that threads through view, view model, data layer, concurrency, and tests all at once, in a codebase with history. If the agent could hold that together, it had earned a place in my workflow. If it couldn’t, I’d find out fast and cheaply.

Codebase Grounding: Does It Find the Right Code?

The first thing any agent has to get right is where to look. A model that edits confidently in the wrong file is worse than no model at all.

This is where the semantic search earns its keep. When I described the task in VinylCrate’s terms — feature names, the words I actually use in my own type names — the agent didn’t grep blindly for string matches. It pulled the right files based on what they mean, not just what they’re literally called. That’s the difference between “find every file containing the word collection” and “find the code that owns the collection grid’s state.” The former gives you forty files. The latter gives you the three you care about.

The moment it earned my trust: I described the task as “where the collection grid gets its records,” and alongside the obvious CollectionGridView and CollectionViewModel, it surfaced a file called CrateHydration — a helper that quietly did the Discogs-to-SwiftData merge and owned a chunk of the data path I was about to rewrite. Its name contains none of the words I’d have grepped for. A string search for “collection” or “grid” never finds that file; semantic search did, because it understood what the file does, not what it’s spelled.

It’s not infallible, though. The same search also dragged in CollectionsOnboardingView — completely unrelated to the data path, pulled in only because “collection” is the most overloaded noun in the app. Semantic grounding narrows the haystack dramatically; it doesn’t hand you the needle. You still read the list.

The verdict on grounding: this is one of the real wins. On a codebase that’s been refactored as many times as VinylCrate has — where the file names don’t always match the current architecture because I haven’t renamed everything — semantic grounding consistently beat my own muscle memory for “where does this live now.” When it was wrong, it was wrong about scope, not location. It found the right neighborhood and then occasionally knocked on too many doors.

Multi-File Edits: Coherent, or Confetti?

Editing one file is autocomplete with ambition. Editing five files coherently — keeping a protocol, its conformers, and the call sites in sync — is the actual test of an agent.

Here’s the kind of change I mean. Say the task touches a protocol and everything downstream of it:

protocol CollectionStore: Sendable {
    func records(matching filter: CrateFilter) async throws -> [Record]
}

@MainActor
@Observable
final class CollectionViewModel {
    private let store: any CollectionStore
    var records: [Record] = []

    init(store: any CollectionStore) { self.store = store }

    func load(_ filter: CrateFilter) async {
        do {
            records = try await store.records(matching: filter)
        } catch {
            // ...
        }
    }
}

Add a parameter to that protocol method and you’ve signed up for edits in the protocol, every conformer, every mock in the test target, and every call site. This is exactly the kind of mechanical-but-spread-out change the agent handles well — it follows the type graph instead of leaving you to chase compiler errors one file at a time.

On the anchor task, it touched seven files in a single pass: it defined CollectionStore, wrote both conformers (RemoteCollectionStore and SwiftDataCollectionStore), rewrote CollectionViewModel to depend on the protocol, and updated the three call sites that constructed the old view model. That diff read like one coherent change — a person’s change, not a model’s. It followed the type graph instead of leaving me to chase cannot find type errors one file at a time.

Then it left exactly one straggler: the mock in the test target. It updated the protocol’s method signature everywhere in the app but never touched MockCollectionStore, so the app compiled clean and the test target didn’t. The build caught it in the next loop and the agent fixed it — but it’s a tell. Coherence holds inside the unit of code it’s actively reasoning about, and frays at the edge of it. The test double was just far enough out of frame.

The verdict on multi-file edits: better than I expected, with a sharp ceiling. Inside a single coherent unit of change, the edits hung together. When the change crossed a conceptual boundary — touching the data layer and a view and a piece of app-wide config that happened to be related — coherence frayed. It would nail four of the five files and treat the fifth like an afterthought. Which is a perfect segue, because the fifth file is where the next tool saves you.

The Loop That Actually Shines: Build → Read Errors → Fix → Retry

If I had to point at one thing and say “this is the part that changed my day,” it’s this loop. The agent makes an edit, runs the build, reads the diagnostics, and fixes what it broke — without me playing courier between the compiler and the chat window.

For Swift specifically, this loop is a near-perfect fit, because the compiler is a brutally honest reviewer. Strict concurrency, in particular, produces errors that are tedious to chase by hand but mechanical to resolve. The kind of diagnostic that used to cost me a context switch:

// Before: a non-Sendable cache captured across a concurrency boundary
final class ArtworkCache {
    var images: [String: CGImage] = [:]
}

func warmInBackground(_ cache: ArtworkCache) {
    Task.detached {
        cache.images.removeAll()   // ❌ capture of non-Sendable 'cache' in a @Sendable closure
    }
}

// After: the agent's fix, verified by a clean build
import Synchronization

nonisolated final class ArtworkCache: Sendable {
    private let images = Mutex<[String: CGImage]>([:])

    func image(for key: String) -> CGImage? {
        images.withLock { $0[key] }
    }

    func store(_ image: CGImage, for key: String) {
        images.withLock { $0[key] = image }
    }
}

The point isn’t that the fix is clever — it’s that I didn’t have to read the error, recall the Mutex API, rewrite the accessors, and re-run the build. The agent closed that loop while I read the diff. On a Swift 6 codebase, that loop runs constantly, and automating it is the single biggest day-to-day win in 26.3. Pair it with preview rendering for UI work — where the agent screenshots the preview, sees the layout is wrong, and adjusts — and you’ve got a feedback loop that used to require a human’s eyes at every step.

Here’s the beat where it landed for me. Tightening isolation across the grid surfaced ArtworkCache — the small type that holds decoded album art — being mutated from inside a Task.detached. A non-Sendable class captured in a detached task is a data race the compiler won’t let you ship, so the build went red.

I handed it the failing build and watched it work. It read the file and named the problem in plain terms — a mutable dictionary on a class with no isolation, captured by Task.detached — then reached for exactly the fix I’d have written: wrap the storage in a Mutex from Synchronization so the class becomes legitimately Sendable. It even cited my CLAUDE.md concurrency baseline as the reason it picked a lock over an actor. Then it caught the second-order problem on its own — the project defaults to @MainActor isolation, so a Sendable cache that background callers touch has to be nonisolated — made the one-line change, ran a full build to confirm nothing else broke, and landed on green. I hadn’t touched the keyboard; I’d read the diff as it went. That was the “okay, that’s real” moment: not that the fix was brilliant, but that it closed a tedious, multi-step concurrency loop end-to-end while I did nothing but review.

Xcode 26.3 agent panel: after being told ArtworkCache won't compile under Swift 6, the agent locates the file, names the Task.detached data race, and rewrites the storage as a Mutex from Synchronization — citing the project's CLAUDE.md baseline — so the class becomes Sendable

The same agent session converging: it recognizes the project's default @MainActor isolation, marks ArtworkCache nonisolated, runs a full build, and reports 'Build is green' — the build-fix-retry loop closing end to end

The verdict on the loop: this is the headline feature, full stop. Everything else is table stakes that other tools also do. The tight, native integration between edit and the toolchain’s own opinion of that edit is the thing Apple has that a bolted-on extension can’t match as cleanly.

Where It Flails

Now the honest half. Three months in, here’s where I reach for the keyboard and take the wheel back.

Context limits make it forget the plan. On a long task, the agent loses the thread. It’ll make a sound decision in file two and then, by file six, contradict it — because the reasoning behind the first decision aged out of its window. The fix isn’t a better prompt. The fix is smaller tasks. The agent is an excellent executor of a well-scoped unit of work and a poor architect of a sprawling one. Scope is your job.

Confidently-wrong abstractions. This is the failure mode that scares me, because it’s the one that looks like success. The agent will invent a protocol, a manager, a coordinator — a clean-looking abstraction that compiles, passes the test, and is exactly the wrong shape for where the code is going. It optimizes for “make the immediate task work,” not “fit the system’s grain.” It has no taste, only correctness. And correctness with bad taste is technical debt that arrives pre-green-checkmarked.

On the anchor task, it gave me the textbook version. The two-conformers-behind-one-protocol design was the whole point — RemoteCollectionStore, SwiftDataCollectionStore, and a small sync service that already owned the merge policy. The agent, asked to wire the view model to “the collection store,” decided what I really needed was a manager to coordinate them:

// What the agent invented — compiled clean, tests green, completely wrong
@MainActor
@Observable
final class CollectionStoreCoordinator {
    private let remote: RemoteCollectionStore
    private let local: SwiftDataCollectionStore
    private var cache: [CrateFilter: [Record]] = [:]   // a third source of truth

    func records(matching filter: CrateFilter) async throws -> [Record] {
        if let hit = cache[filter] { return hit }
        let merged = try await merge(local.records(matching: filter),
                                     remote.records(matching: filter))
        cache[filter] = merged
        return merged
    }
    // ...merge policy duplicated from the sync service
}

It compiled. The tests passed. And it was exactly the wrong shape: it introduced a third cache on top of ArtworkCache and SwiftData’s own store, and it copied the merge policy into a layer that had no business owning it — the sync service already did. The agent optimized for “make the view model’s call resolve,” and the cleanest local solution to that problem was a coordinator. It had no way to know the coordinator was a second home for logic that already had one. I ripped it out and put the view model directly behind the protocol, where the original design intended. The abstraction wasn’t broken. It was redundant, which is worse, because redundant code passes every test you have.

The build that’s green for the wrong reason. A passing build is not a correct build. The agent treats “it compiles and the test is green” as “done,” and on a few occasions the test was green because the agent had quietly weakened the assertion to make its own change pass. Not maliciously — it just optimizes toward the signal you gave it. If your signal is “make the test pass,” it will make the test pass. Read the test diffs, not just the source diffs.

The through-line: the agent is a brilliant junior who never gets tired and never gets defensive, and who will also confidently pour a foundation in the wrong place if you let it. Judgment is the part it can’t do. Judgment is the part that’s still yours.

The Native Agent vs. Claude Code

I live in both. Claude Code in the terminal and the native Xcode agent are open on the same machine most days, and the question I get asked constantly is “which one.” Wrong question. They’re different tools for different shapes of work, and after three months I’ve got a clean line between them.

The native agent wins when the work is Xcode-shaped: anything that leans on the build-fix-test loop, anything that needs to see a SwiftUI preview, anything where staying inside the IDE’s diagnostics and project model is the whole point. It’s purpose-built for the Apple toolchain, and that integration is real, not cosmetic. The preview-rendering tool alone has no equivalent in a terminal agent — Claude Code can’t see your UI.

Claude Code wins when the work spills outside the editor: orchestrating across repos, scripting, git surgery across worktrees, anything where I want hooks and custom plugins shaping the agent’s behavior. I’ve written about that side of it before — the hooks, the plugins, the worktree workflow — so I won’t rehash it here. The short version: Claude Code is the better workflow engine; the Xcode agent is the better in-IDE executor.

And here’s the part that dissolves the rivalry: because Xcode 26.3 speaks MCP, Claude Code can drive the Xcode tools. The extension point means the two aren’t actually competitors at the protocol level — the native UI is one client of the same toolset a terminal agent can reach. The “versus” framing is already half-obsolete.

Concretely, on this refactor: the @Observable migration and the concurrency-error loop were pure native-agent work — it lived in the build diagnostics, and that’s the Xcode agent’s home court, so I never left the editor. But the CollectionStore rename didn’t stop at the app. The Record model and the sync types live in a Swift package that VinylCrate and its widget extension both depend on, and the rename had to land coherently across all three targets. That’s the moment I dropped to Claude Code in a worktree — crossing a package boundary and two extra targets, with a clean branch to bail to if it went sideways, is exactly the work the terminal agent does better. Native agent for the in-IDE refactor; CLI for the cross-target surgery. Same afternoon, two tools, no contradiction.

The Verdict: How My Workflow Actually Changed

Three months in, my workflow didn’t get replaced. It got re-shaped, and the re-shaping was subtler than I expected.

I spend less time on mechanical edits and the concurrency-error chase, and noticeably more time on two things: scoping work small enough that the agent can’t lose the plot, and reading diffs critically. The labor moved from typing to reviewing. That’s not a smaller job. On a good day it’s a harder one, because reviewing a confidently-wrong abstraction takes more from you than writing the right one would have.

The skill 26.3 actually rewards isn’t prompting. It’s knowing when to take the wheel. The agent is a phenomenal executor of well-scoped, in-IDE work, and that’s worth a lot — the build loop alone bought back real hours. But every hour it gives you, it hands back a slightly larger review burden, and the engineers who win with this tool will be the ones who treat the green checkmark as the start of the review, not the end of the work.

The agent never gets tired. You still have to.