How Claude Code Handles Multi-File Refactoring: A Practical Deep-Dive
Handing a 47-file refactor to an AI agent and watching it work is either impressive or terrifying, depending on what it touches. This is a close look at how Claude Code's agentic loop actually manages cross-file changes — including where it works, where it quietly goes wrong, and what a session costs in tokens.
TL;DR — Key Takeaways
- Claude Code uses a read-plan-edit-verify loop: it reads affected files, writes a plan, edits in batches, then runs your tests to confirm
- Reliable up to roughly 40–60 files per session; beyond that, context drift becomes a real concern
- A 20–30 file refactor costs ~80K–150K tokens, roughly $0.50–$2.00 at API rates on Sonnet 4.5
- Strongest at: import renames, type propagation, service extraction, test-file updates
- Weakest at: tightly coupled logic where dependencies are implicit, and anything requiring business domain knowledge it cannot read from the code
- A CLAUDE.md file with explicit refactoring constraints cuts error rate noticeably
1. How the agentic loop works for refactoring
When you give Claude Code a refactoring task — "rename UserService to AccountService across the codebase, update all imports and tests" — it does not just run a find-and-replace. The agentic loop goes through several distinct phases.
Phase 1: Discovery
Claude Code first greps through the project to find every file that references the thing being changed. It uses the Glob and Grep tools internally — the same tools exposed in custom skills. For a service rename, it will find: the source file itself, import statements, type references, mock files in tests, any barrel exports, configuration files where the name appears, and sometimes documentation files.
This discovery phase is where Claude Code has a genuine edge over manual refactoring. It catches usages that are easy to miss — a jest.mock('../services/UserService') buried in a test helper three directories deep, or a string reference in an error message. A developer doing this manually would typically rely on IDE "find references," which misses string usages and some dynamic patterns.
Phase 2: Planning
Before editing anything, Claude Code writes out a plan. In the terminal you will see something like: "I found 23 files that reference UserService. I'll rename the source file first, then update imports in batches, update tests, and run the test suite to confirm." This planning step is not just cosmetic — it structures the edit sequence to avoid cascading errors. If you edit imports before renaming the source file, TypeScript will throw errors that interfere with subsequent edits.
Phase 3: Editing in batches
Edits happen file by file using the Edit tool, which performs exact string replacements rather than rewriting entire files. This is meaningful: it reduces the chance of accidentally reformatting unrelated code or losing comments. For each file, Claude Code reads the current content, identifies the specific strings to change, and applies targeted replacements.
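A toy version of an exact-string edit makes the safety property concrete. This is an illustrative sketch, not the actual Edit tool: it refuses to apply when the target string is missing or ambiguous, so an edit either lands exactly where intended or not at all.

```typescript
// Apply one exact string replacement, mirroring how a targeted edit
// avoids rewriting (and accidentally reformatting) the whole file.
function applyEdit(content: string, oldStr: string, newStr: string): string {
  const first = content.indexOf(oldStr);
  if (first === -1) throw new Error("target string not found");
  // Ambiguity check: a target that appears twice could land the edit
  // in the wrong place, so fail loudly instead of guessing.
  if (content.indexOf(oldStr, first + oldStr.length) !== -1)
    throw new Error("target string is not unique");
  return content.slice(0, first) + newStr + content.slice(first + oldStr.length);
}
```

The "not unique" failure mode is why targeted edits often include surrounding context in the target string: the longer the match, the more likely it is unique in the file.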
For larger batches — say, 40+ files — Claude Code tends to work in logical groups: source files first, then barrel exports, then application code, then tests. It maintains a running mental model of what it has and has not yet changed, though this is where context pressure starts to build.
Phase 4: Verification
After edits, Claude Code runs whatever verification you have available. For TypeScript projects it will run tsc --noEmit to surface type errors. If you have a test suite, it runs it and reads the output. Any failures send it back into an editing loop: read the error, identify the affected file, fix it, re-run. This self-correcting loop is what separates agentic refactoring from a simple macro.
The verification loop also means Claude Code will catch things it missed in discovery — if a test file breaks because it was not updated, that failure surfaces the file, and Claude Code adds it to the set of changes.
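To make the error-to-file mapping concrete, here is a hypothetical parser for tsc --noEmit output (the file(line,col): error TScode shape is tsc's standard diagnostic format) that extracts the set of files still needing another editing pass:

```typescript
// Extract the distinct file paths from tsc --noEmit output so a
// follow-up editing pass knows exactly which files still fail.
function filesWithTypeErrors(tscOutput: string): string[] {
  const files = new Set<string>();
  for (const line of tscOutput.split("\n")) {
    // Matches e.g. "src/api/users.ts(12,5): error TS2345: ..."
    const m = line.match(/^(.+?)\(\d+,\d+\): error TS\d+/);
    if (m) files.add(m[1]);
  }
  return [...files];
}
```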
2. Three real refactoring examples
Example A: Service rename across a Next.js app
A Next.js 15 application with a UserService referenced in 31 files (9 API routes, 14 component files, 8 test files). The prompt: "Rename UserService to AccountService. Update all imports, update the class name and file, update mock paths in tests."
- Discovery found 31 files correctly, plus 2 additional string references in error messages that a simple find-and-replace would have missed
- Editing took approximately 8 minutes end-to-end including the verification loop
- Token consumption: ~95K tokens. At Sonnet 4.5 API rates, roughly $0.60
- One test failed after initial edits — a mock path that used a string literal. Claude Code caught it on first test run and fixed it
- Zero manual intervention required
Example B: Extracting a shared utility from duplicated code
Five components each contained a local copy of a formatCurrency function with slight variations. The prompt: "Extract a shared formatCurrency utility into src/utils/currency.ts. Replace all five local implementations with imports from the shared module. Make sure the behavior matches the most complete version."
- Claude Code read all five implementations, identified differences, and chose the most complete one as the canonical version
- It created the new utility file, updated all five components, and added tests for the utility
- Token consumption: ~62K tokens, roughly $0.35
- One component had a locale parameter that the other four did not — Claude Code preserved it as an optional argument rather than dropping it, which was the correct call
- This kind of "figure out what the unified version should look like" judgment is where Claude Code surprises people
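A plausible shape for the unified utility, reconstructed for illustration since the original codebase's signatures were not published, with the one component's locale parameter kept as an optional argument:

```typescript
// Unified currency formatter: the locale parameter that only one of
// the five components needed survives as an optional argument with a
// sensible default, so the other four call sites stay unchanged.
export function formatCurrency(
  amount: number,
  currency: string = "USD",
  locale: string = "en-US"
): string {
  return new Intl.NumberFormat(locale, {
    style: "currency",
    currency,
  }).format(amount);
}
```

Delegating to Intl.NumberFormat rather than hand-rolling decimal and separator logic is usually the right consolidation move: it handles locale-specific grouping and symbols that the five divergent copies were each approximating.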
Example C: TypeScript strict mode migration (partial failure)
A 180-file codebase being migrated from strict: false to strict: true. The prompt: "Enable strict mode in tsconfig.json and fix all TypeScript errors."
- The first 40 files went smoothly — straightforward null checks and type annotations
- Around file 55, Claude Code began adding as any casts to silence errors rather than fixing them properly — a sign of context pressure
- Total token consumption: ~380K tokens before the session was halted, roughly $2.80 at API rates
- The correct approach here was to split the task by directory — run Claude Code against src/components/ in one session, src/api/ in another. When done this way in follow-up sessions, each 30–40 file batch worked cleanly
Example C illustrates the primary limitation — and also the workaround. Claude Code is not weaker than alternatives on large refactors; it just needs the task scoped appropriately.
3. How we tested
We ran Claude Code (Sonnet 4.5, accessed via Claude Max subscription and directly via API) against a range of refactoring tasks across three codebases: a Next.js 15 application with ~200 files, a Node.js microservice with ~80 files, and a React component library with ~60 files.
For each refactor, we tracked: files discovered vs files actually needing changes (to measure discovery accuracy), token consumption via the API usage dashboard, time from prompt submission to passing tests, and the number of manual interventions required. We ran each refactoring task type at least twice — once with a plain prompt and once with a detailed CLAUDE.md that specified constraints.
We also compared the same tasks run in Cursor's Composer mode and Aider (using the same Sonnet 4.5 model) as reference points, though that comparison is abbreviated here — see our AI coding tools guide for the full breakdown.
4. Token costs and time savings
Token cost is the practical constraint most developers hit first. Here is what we measured across the three worked examples above:

| Refactor | Files | Tokens | Approx. API cost |
|---|---|---|---|
| Service rename (Example A) | 31 | ~95K | ~$0.60 |
| Utility extraction (Example B) | 5 | ~62K | ~$0.35 |
| Strict mode migration, halted (Example C) | ~55 of 180 | ~380K | ~$2.80 |
For time savings, a 31-file rename that takes Claude Code about 8–12 minutes would take an experienced developer 45–90 minutes manually — including the inevitable missed reference that surfaces as a runtime error the next day. For test generation, Claude Code writes 80–120 test cases in the time a developer might write 15–20.
On Claude Pro or Max subscription plans these sessions count against monthly usage limits rather than billing per token. If you run multi-file refactors several times a week, Claude Max ($100/month) is worth comparing against API costs — at 5–10 heavy sessions per week, the subscription often comes out cheaper. Group plan services like GamsGo can reduce Claude Pro costs by 30–50% for teams, which helps if you're evaluating whether the economics make sense before committing.
5. Refactoring task comparison table
How Claude Code performs across different refactoring types, rated on discovery accuracy, edit quality, and reliability at scale:
| Refactoring type | File discovery | Edit quality | Notes |
|---|---|---|---|
| Import / module rename | Excellent | Excellent | Catches string refs that IDE tools miss |
| TypeScript type propagation | Excellent | Good | Occasionally uses as any under pressure |
| Utility extraction | Good | Excellent | Synthesizes unified API from variants well |
| Test file updates | Good | Good | Mock path updates reliable; assertion logic occasionally mismatches |
| Large-scale strict mode migration | Good | Mixed | Needs task chunking per directory; 180+ files in one shot causes drift |
| Business logic restructuring | Fair | Risky | Requires domain knowledge that cannot be inferred from code |
| Framework migration (e.g., React class → hooks) | Excellent | Good | Handles mechanical parts well; complex lifecycle logic needs review |
| API contract change (cross-service) | Good | Fair | Works within one repo; cross-repo requires manual setup |
6. Setting up CLAUDE.md for refactoring work
A CLAUDE.md file in the project root acts as persistent instructions that Claude Code reads at the start of every session. For refactoring work, a well-crafted CLAUDE.md cuts the error rate considerably. Here is what makes the difference:
Declare off-limits files
Explicitly list files Claude Code should not touch — generated files, vendor code, migration scripts. Without this, it may "helpfully" update files that should not be changed. Example: NEVER edit files in /generated/ or any *.generated.ts file.
Define naming conventions
If your codebase uses specific conventions that differ from defaults — PascalCase for all exports, specific barrel export patterns, kebab-case file names — state them explicitly. Claude Code will follow your project conventions only if it can read them.
Specify the test command
Tell Claude Code exactly how to run tests: npm test -- --passWithNoTests or vitest run. Ambiguity here wastes tokens as Claude Code tries multiple commands to figure out what works.
Set scope constraints
For session-level tasks, include scope in the CLAUDE.md or in the prompt itself: "This session: only work within src/services/ and src/api/. Do not touch src/components/." Scope constraints prevent the kind of well-intentioned over-reach where Claude Code refactors things you did not ask it to.
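Pulling the four points together, a minimal CLAUDE.md for refactoring work might look like this. All paths, commands, and conventions below are placeholders; substitute your own:

```markdown
# CLAUDE.md

## Off-limits
- NEVER edit files in /generated/ or any *.generated.ts file
- Do not modify migration scripts in /db/migrations/

## Conventions
- PascalCase for all exported classes and types; kebab-case file names
- Re-export public modules through src/index.ts (barrel exports)

## Testing
- Run tests with: npm test -- --passWithNoTests
- During refactors, run only the affected test subset before the full suite

## Scope
- This session: only work within src/services/ and src/api/
- Do not touch src/components/
```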
7. Honest limitations
Context drift past ~50 files
The agentic loop accumulates context across every file read and every edit made. Past roughly 50 files in a single session, Claude Code starts to lose track of decisions from earlier in the session — a convention it established, a file it already handled. The symptom is inconsistent output: some files get the update, others do not, without a clear pattern. Split large refactors into directory-scoped sessions to avoid this.
No cross-repository reach by default
Claude Code operates within a single directory tree. If your codebase spans multiple repositories — a monorepo setup aside — you need to run separate sessions and manually coordinate the changes. There is no built-in mechanism to propagate a change across repo boundaries in a single operation.
Business logic requires domain context you must supply
For structural refactors — renames, extractions, type migrations — Claude Code is strong because the correct output can be inferred from the existing code. For business logic refactors, the correct output requires knowing things the code does not tell you: why a calculation works a certain way, what edge case prompted a specific conditional. Claude Code will make a plausible choice, which may not be the right one. Always review business logic changes with extra care.
Test suites that take more than a few minutes expose a gap
Claude Code's verification loop assumes tests are fast. If your full test suite takes 10–20 minutes, the token cost of the verification loop escalates significantly. The workaround: direct Claude Code to run a scoped subset of tests (npm test -- --testPathPattern="services") rather than the full suite during the editing phase.
Diffs require careful review — it is not conservative by default
Claude Code will sometimes make "opportunistic improvements" while doing a refactor — adding a missing null check here, cleaning up an inconsistent import there. These are usually harmless and sometimes genuinely useful. But they make diff review harder because the change set is larger than you asked for. Use git diff --stat before committing to see the scope.
8. Claude Code vs alternatives for refactoring
Several tools handle multi-file refactoring. The right choice depends on what kind of refactor and what your workflow looks like:
| Tool | Refactoring approach | Strength | Weakness |
|---|---|---|---|
| Claude Code | Agentic loop, terminal-native | Discovery accuracy, self-correction | Context drift at scale, token cost |
| Cursor Composer | IDE-integrated, visual diffs | Real-time feedback, low friction | Less autonomous, needs more direction |
| Aider | CLI, git-native, model-agnostic | Cost control, model flexibility | Weaker discovery, more manual setup |
| IDE rename (VS Code) | Semantic, language-server backed | Perfect precision for symbol renames | Cannot handle logic changes or string refs |
| Claude Code Agent Teams | Multi-agent parallel orchestration | Scales past single-agent context limits | Setup overhead, higher total token cost |
For most developers, Claude Code is the strongest single-session refactoring tool when the task fits within 40–50 files. For refactors that span more than that, the Agent Teams approach — where a coordinator spawns directory-scoped sub-agents — is worth the setup overhead. For a detailed look at how Claude Code compares to Cursor across broader workflows, see our Claude Code vs Cursor comparison.
9. FAQ
How many files can Claude Code refactor in a single session?
Practically speaking, 40–60 files before context pressure becomes a problem. There is no hard cap, but past that range you will start to see inconsistencies in the output — files that were supposed to be updated but were not, or conventions applied differently across the session. For refactors spanning more than 60 files, scope each session to a directory or logical module and run them sequentially.
How much does a multi-file refactor cost in tokens?
A moderate refactor touching 20–30 files typically consumes 80K–150K tokens. At Sonnet 4.5 API rates ($3/million input, $15/million output), that is roughly $0.50–$2.00 per session. On subscription plans (Claude Pro at $20/month, Claude Max at $100/month), it counts against your monthly capacity rather than billing per token.
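The arithmetic behind those figures, sketched as a small helper. The 80/20 input/output split in the example is an assumption; real sessions vary by how much code the model writes back:

```typescript
// Estimate API cost from token counts at Sonnet 4.5 rates
// ($3 per million input tokens, $15 per million output tokens).
function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * 3 + (outputTokens / 1e6) * 15;
}

// A ~150K-token refactor, assuming roughly 80% input / 20% output:
const cost = estimateCostUSD(120_000, 30_000); // 0.36 + 0.45 ≈ $0.81
```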
Does Claude Code run tests automatically after refactoring?
Yes, if you ask it to. A prompt ending with "run the test suite after changes and fix any failures" will trigger Claude Code to run your test command, read the output, and iterate on failing tests. This self-correcting loop is one of its most useful features — it catches missed references that only surface at runtime.
What is the biggest risk of handing a large refactor to Claude Code?
Context drift, and opportunistic scope creep. Claude Code occasionally fixes things it was not asked to fix, which expands the diff and makes review harder. Review changes in logical chunks — not the entire diff at once — and pay extra attention to any file that was not in your original scope but appeared in the change set.
Can Claude Code handle TypeScript type changes across an entire codebase?
Yes, and type propagation is one of its stronger refactoring use cases. It traces type changes through imports and function signatures, runs tsc to surface remaining errors, and iterates. For large TypeScript codebases, providing a CLAUDE.md with your type conventions — whether you use strict null checks, how you handle optional vs undefined — significantly improves the output.