Deep Dive·14 min read

How Claude Code Handles Multi-File Refactoring: A Practical Deep-Dive

Handing a 47-file refactor to an AI agent and watching it work is either impressive or terrifying, depending on what it touches. This is a close look at how Claude Code's agentic loop actually manages cross-file changes — including where it works, where it quietly goes wrong, and what a session costs in tokens.

TL;DR — Key Takeaways

  • Claude Code uses a read-plan-edit-verify loop: it reads affected files, writes a plan, edits in batches, then runs your tests to confirm
  • Reliable up to roughly 40–60 files per session; beyond that, context drift becomes a real concern
  • A 20–30 file refactor costs ~80K–150K tokens, roughly $0.50–$2.00 at API rates on Sonnet 4.5
  • Strongest at: import renames, type propagation, service extraction, test-file updates
  • Weakest at: tightly coupled logic where dependencies are implicit, and anything requiring business domain knowledge it cannot read from the code
  • A CLAUDE.md file with explicit refactoring constraints cuts error rate noticeably

1. How the agentic loop works for refactoring

When you give Claude Code a refactoring task — "rename UserService to AccountService across the codebase, update all imports and tests" — it does not just run a find-and-replace. The agentic loop goes through several distinct phases.

Phase 1: Discovery

Claude Code first greps through the project to find every file that references the thing being changed. It uses the Glob and Grep tools internally — the same tools exposed in custom skills. For a service rename, it will find: the source file itself, import statements, type references, mock files in tests, any barrel exports, configuration files where the name appears, and sometimes documentation files.

This discovery phase is where Claude Code has a genuine edge over manual refactoring. It catches usages that are easy to miss — a jest.mock('../services/UserService') buried in a test helper three directories deep, or a string reference in an error message. A developer doing this manually would typically rely on IDE "find references," which misses string usages and some dynamic patterns.

Phase 2: Planning

Before editing anything, Claude Code writes out a plan. In the terminal you will see something like: "I found 23 files that reference UserService. I'll rename the source file first, then update imports in batches, update tests, and run the test suite to confirm." This planning step is not just cosmetic — it structures the edit sequence to avoid cascading errors. If you edit imports before renaming the source file, TypeScript will throw errors that interfere with subsequent edits.

Phase 3: Editing in batches

Edits happen file by file using the Edit tool, which performs exact string replacements rather than rewriting entire files. This is meaningful: it reduces the chance of accidentally reformatting unrelated code or losing comments. For each file, Claude Code reads the current content, identifies the specific strings to change, and applies targeted replacements.
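The semantics of an exact-string edit can be modeled in a few lines. This is an illustrative stand-in for the Edit tool, not its real implementation — the key property is that an ambiguous or missing match fails loudly instead of guessing:

```typescript
// Minimal model of "edit as exact string replacement": the old string must
// appear exactly once in the file, otherwise the edit is rejected.
function applyEdit(content: string, oldStr: string, newStr: string): string {
  const first = content.indexOf(oldStr);
  if (first === -1) throw new Error("old string not found");
  if (content.indexOf(oldStr, first + 1) !== -1) {
    throw new Error("old string is ambiguous (multiple matches)");
  }
  // Only the matched span changes; surrounding formatting and comments survive.
  return content.slice(0, first) + newStr + content.slice(first + oldStr.length);
}
```

The fail-on-ambiguity behavior is what distinguishes this from find-and-replace: when a string appears twice, the agent must re-read the file and supply more surrounding context rather than replacing both occurrences blindly.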

For larger batches — say, 40+ files — Claude Code tends to work in logical groups: source files first, then barrel exports, then application code, then tests. It maintains a running mental model of what it has and has not yet changed, though this is where context pressure starts to build.

Phase 4: Verification

After edits, Claude Code runs whatever verification you have available. For TypeScript projects it will run tsc --noEmit to surface type errors. If you have a test suite, it runs it and reads the output. Any failures send it back into an editing loop: read the error, identify the affected file, fix it, re-run. This self-correcting loop is what separates agentic refactoring from a simple macro.

The verification loop also means Claude Code will catch things it missed in discovery — if a test file breaks because it was not updated, that failure surfaces the file, and Claude Code adds it to the set of changes.
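The verify-and-fix cycle described above can be sketched as a bounded loop. The callbacks here are stand-ins for real tool invocations (running tsc or a test suite, then re-editing a failing file) — the structure, not the details, is the point:

```typescript
// Hedged sketch of the verification loop: run a check, and while it reports
// failures, hand each failing file back to the editing step. Bounded so a
// refactor that never converges gets surfaced to a human instead of looping.
type CheckResult = { ok: boolean; failingFiles: string[] };

function verifyLoop(
  runCheck: () => CheckResult,   // e.g. tsc --noEmit or the test suite
  fixFile: (file: string) => void, // e.g. re-enter the editing phase
  maxRounds = 5,
): boolean {
  for (let round = 0; round < maxRounds; round++) {
    const result = runCheck();
    if (result.ok) return true;           // checks green: refactor complete
    result.failingFiles.forEach(fixFile); // feed failures back into editing
  }
  return false; // did not converge within maxRounds
}
```

Note the second property this structure gives you: files missed during discovery enter the change set here, because a failing check names them.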

2. Three real refactoring examples

Example A: Service rename across a Next.js app

A Next.js 15 application with a UserService referenced in 31 files (9 API routes, 14 component files, 8 test files). The prompt: "Rename UserService to AccountService. Update all imports, update the class name and file, update mock paths in tests."

  • Discovery found 31 files correctly, plus 2 additional string references in error messages that a simple find-and-replace would have missed
  • Editing took approximately 8 minutes end-to-end including the verification loop
  • Token consumption: ~95K tokens. At Sonnet 4.5 API rates, roughly $0.60
  • One test failed after initial edits — a mock path that used a string literal. Claude Code caught it on first test run and fixed it
  • Zero manual intervention required

Example B: Extracting a shared utility from duplicated code

Five components each contained a local copy of a formatCurrency function with slight variations. The prompt: "Extract a shared formatCurrency utility into src/utils/currency.ts. Replace all five local implementations with imports from the shared module. Make sure the behavior matches the most complete version."

  • Claude Code read all five implementations, identified differences, and chose the most complete one as the canonical version
  • It created the new utility file, updated all five components, and added tests for the utility
  • Token consumption: ~62K tokens, roughly $0.35
  • One component had a locale parameter that the other four did not — Claude Code preserved it as an optional argument rather than dropping it, which was the correct call
  • This kind of "figure out what the unified version should look like" judgment is where Claude Code surprises people
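A hypothetical unified utility of the kind Example B produced might look like the following — the signature and defaults are illustrative, with the one caller's locale parameter kept as an optional argument so the other four call sites stay unchanged:

```typescript
// Illustrative unified formatCurrency: the locale parameter only one of the
// five components needed is preserved as an optional argument with a default,
// so existing two-argument (and one-argument) call sites keep working.
export function formatCurrency(
  amount: number,
  currency: string = "USD",
  locale: string = "en-US",
): string {
  return new Intl.NumberFormat(locale, {
    style: "currency",
    currency,
  }).format(amount);
}
```

Delegating to Intl.NumberFormat rather than hand-rolling separators is the kind of consolidation worth checking in review: it is usually correct, but it can change output for edge cases the old ad-hoc implementations handled differently.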

Example C: TypeScript strict mode migration (partial failure)

A 180-file codebase being migrated from strict: false to strict: true. The prompt: "Enable strict mode in tsconfig.json and fix all TypeScript errors."

  • The first 40 files went smoothly — straightforward null checks and type annotations
  • Around file 55, Claude Code began adding as any casts to silence errors rather than fixing them properly — a sign of context pressure
  • Total token consumption: ~380K tokens before the session was halted, roughly $2.80 at API rates
  • The correct approach here was to split the task by directory — run Claude Code against src/components/ in one session, src/api/ in another. When done this way in follow-up sessions, each 30–40 file batch worked cleanly

Example C illustrates the primary limitation — and also the workaround. Claude Code is not weaker than alternatives on large refactors; it just needs the task scoped appropriately.

3. How we tested

We ran Claude Code (Sonnet 4.5, accessed via Claude Max subscription and directly via API) against a range of refactoring tasks across three codebases: a Next.js 15 application with ~200 files, a Node.js microservice with ~80 files, and a React component library with ~60 files.

For each refactor, we tracked: files discovered vs files actually needing changes (to measure discovery accuracy), token consumption via the API usage dashboard, time from prompt submission to passing tests, and the number of manual interventions required. We ran each refactoring task type at least twice — once with a plain prompt and once with a detailed CLAUDE.md that specified constraints.

We also compared the same tasks run in Cursor's Composer mode and Aider (using the same Sonnet 4.5 model) as reference points, though that comparison is abbreviated here — see our AI coding tools guide for the full breakdown.

4. Token costs and time savings

Token cost is the practical constraint most developers hit first. Here is what we measured across several refactoring categories:

| Task type | Token range | API cost (Sonnet 4.5) |
| --- | --- | --- |
| Service rename (20–35 files) | 60K–120K | $0.35–$0.80 |
| Utility extraction (5–10 files) | 40K–80K | $0.20–$0.50 |
| TypeScript strict migration (30–50 files) | 150K–300K | $0.90–$2.00 |
| Test suite generation (per module) | 50K–100K | $0.30–$0.65 |
| API interface change (cross-cutting) | 120K–250K | $0.75–$1.70 |
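The dollar figures follow from the Sonnet 4.5 rates quoted in the FAQ ($3 per million input tokens, $15 per million output). The input/output split below is an assumption — refactoring sessions are heavily input-dominated because the agent reads far more than it writes:

```typescript
// Back-of-envelope session cost at the article's quoted Sonnet 4.5 rates.
function sessionCostUSD(inputTokens: number, outputTokens: number): number {
  const INPUT_PER_MILLION = 3;   // $3 / 1M input tokens
  const OUTPUT_PER_MILLION = 15; // $15 / 1M output tokens
  return (inputTokens * INPUT_PER_MILLION + outputTokens * OUTPUT_PER_MILLION) / 1_000_000;
}

// e.g. a ~95K-token session split 80K input / 15K output (assumed split):
// 80_000 * $3/1M + 15_000 * $15/1M = $0.24 + $0.225 = $0.465
```

This is why output-heavy tasks like test generation cost more per token than their file counts suggest: generated test code is billed at the 5× output rate.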

For time savings, a 31-file rename that takes Claude Code about 8–12 minutes would take an experienced developer 45–90 minutes manually — including the inevitable missed reference that surfaces as a runtime error the next day. For test generation, Claude Code writes 80–120 test cases in the time a developer might write 15–20.

On Claude Pro or Max subscription plans these sessions count against monthly usage limits rather than billing per token. If you run multi-file refactors several times a week, Claude Max ($100/month) is worth comparing against API costs — at 5–10 heavy sessions per week, the subscription often comes out cheaper. Group plan services like GamsGo can reduce Claude Pro costs by 30–50% for teams, which helps if you're evaluating whether the economics make sense before committing.

5. Refactoring task comparison table

How Claude Code performs across different refactoring types, rated on discovery accuracy, edit quality, and reliability at scale:

| Refactoring type | File discovery | Edit quality | Notes |
| --- | --- | --- | --- |
| Import / module rename | Excellent | Excellent | Catches string refs that IDE tools miss |
| TypeScript type propagation | Excellent | Good | Occasionally uses as any under pressure |
| Utility extraction | Good | Excellent | Synthesizes unified API from variants well |
| Test file updates | Good | Good | Mock path updates reliable; assertion logic occasionally mismatches |
| Large-scale strict mode migration | Good | Mixed | Needs task chunking per directory; 180+ files in one shot causes drift |
| Business logic restructuring | Fair | Risky | Requires domain knowledge that cannot be inferred from code |
| Framework migration (e.g., React class → hooks) | Excellent | Good | Handles mechanical parts well; complex lifecycle logic needs review |
| API contract change (cross-service) | Good | Fair | Works within one repo; cross-repo requires manual setup |

6. Setting up CLAUDE.md for refactoring work

A CLAUDE.md file in the project root acts as persistent instructions that Claude Code reads at the start of every session. For refactoring work, a well-crafted CLAUDE.md cuts the error rate considerably. Here is what makes the difference:

Declare off-limits files

Explicitly list files Claude Code should not touch — generated files, vendor code, migration scripts. Without this, it may "helpfully" update files that should not be changed. Example: NEVER edit files in /generated/ or any *.generated.ts file.

Define naming conventions

If your codebase uses specific conventions that differ from defaults — PascalCase for all exports, specific barrel export patterns, kebab-case file names — state them explicitly. Claude Code will follow your project conventions only if it can read them.

Specify the test command

Tell Claude Code exactly how to run tests: npm test -- --passWithNoTests or vitest run. Ambiguity here wastes tokens as Claude Code tries multiple commands to figure out what works.

Set scope constraints

For session-level tasks, include scope in the CLAUDE.md or in the prompt itself: "This session: only work within src/services/ and src/api/. Do not touch src/components/." Scope constraints prevent the kind of well-intentioned over-reach where Claude Code refactors things you did not ask it to.
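Pulling the four points together, a refactoring-oriented CLAUDE.md might look like the sketch below. Every path, convention, and command here is a placeholder to adapt to your project, not a recommended default:

```markdown
# CLAUDE.md — refactoring constraints (illustrative)

## Off-limits
- NEVER edit files in /generated/ or any *.generated.ts file
- NEVER modify scripts in /db/migrations/

## Conventions
- PascalCase for all exported symbols; kebab-case file names
- Re-export public modules through src/index.ts (barrel)

## Testing
- Run tests with: npm test -- --passWithNoTests
- During editing, prefer a scoped run: npm test -- --testPathPattern="services"

## Scope
- This session: only work within src/services/ and src/api/
```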

7. Honest limitations

Context drift past ~50 files

The agentic loop accumulates context across every file read and every edit made. Past roughly 50 files in a single session, Claude Code starts to lose track of decisions from earlier in the session — a convention it established, a file it already handled. The symptom is inconsistent output: some files get the update, others do not, without a clear pattern. Split large refactors into directory-scoped sessions to avoid this.

No cross-repository reach by default

Claude Code operates within a single directory tree. If your codebase spans multiple repositories — a monorepo setup aside — you need to run separate sessions and manually coordinate the changes. There is no built-in mechanism to propagate a change across repo boundaries in a single operation.

Business logic requires domain context you must supply

For structural refactors — renames, extractions, type migrations — Claude Code is strong because the correct output can be inferred from the existing code. For business logic refactors, the correct output requires knowing things the code does not tell you: why a calculation works a certain way, what edge case prompted a specific conditional. Claude Code will make a plausible choice, which may not be the right one. Always review business logic changes with extra care.

Test suites that take more than a few minutes expose a gap

Claude Code's verification loop assumes tests are fast. If your full test suite takes 10–20 minutes, the token cost of the verification loop escalates significantly. The workaround: direct Claude Code to run a scoped subset of tests (npm test -- --testPathPattern="services") rather than the full suite during the editing phase.

Diffs require careful review — it is not conservative by default

Claude Code will sometimes make "opportunistic improvements" while doing a refactor — adding a missing null check here, cleaning up an inconsistent import there. These are usually harmless and sometimes genuinely useful. But they make diff review harder because the change set is larger than you asked for. Use git diff --stat before committing to see the scope.

8. Claude Code vs alternatives for refactoring

Several tools handle multi-file refactoring. The right choice depends on what kind of refactor and what your workflow looks like:

| Tool | Refactoring approach | Strength | Weakness |
| --- | --- | --- | --- |
| Claude Code | Agentic loop, terminal-native | Discovery accuracy, self-correction | Context drift at scale, token cost |
| Cursor Composer | IDE-integrated, visual diffs | Real-time feedback, low friction | Less autonomous, needs more direction |
| Aider | CLI, git-native, model-agnostic | Cost control, model flexibility | Weaker discovery, more manual setup |
| IDE rename (VS Code) | Semantic, language-server backed | Perfect precision for symbol renames | Cannot handle logic changes or string refs |
| Claude Code Agent Teams | Multi-agent parallel orchestration | Scales past single-agent context limits | Setup overhead, higher total token cost |

For most developers, Claude Code is the strongest single-session refactoring tool when the task fits within 40–50 files. For refactors that span more than that, the Agent Teams approach — where a coordinator spawns directory-scoped sub-agents — is worth the setup overhead. For a detailed look at how Claude Code compares to Cursor across broader workflows, see our Claude Code vs Cursor comparison.

9. FAQ

How many files can Claude Code refactor in a single session?

Practically speaking, 40–60 files before context pressure becomes a problem. There is no hard cap, but past that range you will start to see inconsistencies in the output — files that were supposed to be updated but were not, or conventions applied differently across the session. For refactors spanning more than 60 files, scope each session to a directory or logical module and run them sequentially.

How much does a multi-file refactor cost in tokens?

A moderate refactor touching 20–30 files typically consumes 80K–150K tokens. At Sonnet 4.5 API rates ($3/million input, $15/million output), that is roughly $0.50–$2.00 per session. On subscription plans (Claude Pro at $20/month, Claude Max at $100/month), it counts against your monthly capacity rather than billing per token.

Does Claude Code run tests automatically after refactoring?

Yes, if you ask it to. A prompt ending with "run the test suite after changes and fix any failures" will trigger Claude Code to run your test command, read the output, and iterate on failing tests. This self-correcting loop is one of its most useful features — it catches missed references that only surface at runtime.

What is the biggest risk of handing a large refactor to Claude Code?

Context drift, and opportunistic scope creep. Claude Code occasionally fixes things it was not asked to fix, which expands the diff and makes review harder. Review changes in logical chunks — not the entire diff at once — and pay extra attention to any file that was not in your original scope but appeared in the change set.

Can Claude Code handle TypeScript type changes across an entire codebase?

Yes, and type propagation is one of its stronger refactoring use cases. It traces type changes through imports and function signatures, runs tsc to surface remaining errors, and iterates. For large TypeScript codebases, providing a CLAUDE.md with your type conventions — whether you use strict null checks, how you handle optional vs undefined — significantly improves the output.

Written by Jim Liu

Full-stack developer in Sydney. Hands-on AI tool reviews since 2022. Affiliate disclosure