Context Is the New Memory Leak: What 13,730 Tool Calls Taught Me About AI Coding Agents

    Bobby Johnson · March 14, 2026 · 20 min read

    I was three months into a legacy .NET migration when my AI agent forgot how to read a stored procedure result set. Not a hallucination—it had done the exact same operation successfully forty minutes earlier in the same conversation. I assumed the model was being flaky.

    But here's the thing: the model wasn't the problem. I was.

    I had been using Claude Code to migrate a .NET Framework 4.8 WebForms application to .NET 10 MVC. Hundreds of stored procedures, dozens of Razor views, a middleware pipeline with strict ordering constraints. The kind of project where AI assistance isn't a luxury—it's survival. And for the first eighty sessions, it was going great.

    Then things started getting weird. The agent would lose track of architectural decisions we'd discussed twenty minutes earlier. It would re-read files it had already read. It would make suggestions that contradicted its own rules. I spent a week blaming the model before I finally measured what was happening inside the context window.

    What I found changed how I work with AI agents entirely.


    The Thing Nobody Mentions When You Start Using AI Agents

    When you start using an AI coding agent, everyone talks about prompting. Write clear instructions. Be specific. Provide examples. That's all good advice, and it's about 20% of what actually matters.

    The other 80% is context.

    An AI agent's context window is its working memory. Every system instruction, every file it reads, every command output, every tool call result—all of it gets appended to a single, linear buffer of text. The agent reasons over this buffer to produce its next response. When the buffer is clean and focused, the agent is sharp. When the buffer is polluted with noise, the agent degrades.

    Think of it like a StringBuilder that never gets cleared. Every tool call appends. The runtime keeps running, but the signal-to-noise ratio drops with every append. Eventually the model's attention—its ability to find and use the right information—can't keep up.

    And here's the thing: context windows are smaller than they look.

    Modern models advertise 100,000 to 200,000 token windows. But research on transformer attention patterns shows a U-shaped curve: information at the beginning and end of the context receives strong attention, while content in the middle gets skimmed or missed entirely. In practice, effective attention spans about 30-60% of the advertised capacity. Your 200k window? It's more like 60-120k of reliable attention.
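    To make that concrete, here's a tiny helper that applies the 30-60% range to an advertised window size (the function and the fixed percentages are my own illustration, not a published formula):

```typescript
// Estimate the reliably-attended portion of a context window, using the
// 30-60% effective-attention range discussed above. Illustrative only;
// real attention behavior varies by model and by content layout.
function effectiveWindow(advertisedTokens: number): { low: number; high: number } {
  return {
    low: Math.floor(advertisedTokens * 0.3),  // pessimistic bound
    high: Math.floor(advertisedTokens * 0.6), // optimistic bound
  };
}

const w = effectiveWindow(200_000);
console.log(`Reliable attention: ~${w.low}-${w.high} tokens`); // ~60000-120000
```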

    Garbage in, garbage out. That's the principle that frames everything in this post. It's not just about your prompts—it's about everything that enters the context window. Your instruction files. Your rule documents. Your build output. Your search results. If you pollute the window before the agent writes its first line of code, you've already lost.

    This is where most developers stop thinking and start blaming the model. I know—I did it for weeks.

    Then I started measuring. 194 sessions. 13,730 tool calls. And the data told a very different story.


    Before You Type a Single Prompt

    The most important context management happens before you even start working. If your instruction files eat 20% of the effective attention window at session start, every task you run in that session is operating at 80% capacity.

    The Unbounded CLAUDE.md Problem

    When you first set up an AI coding agent, you create a CLAUDE.md file (or equivalent) with your project instructions. You add coding standards. Architecture notes. Deployment steps. Troubleshooting guides. Tool references. It grows. And grows. Eventually it's 500+ lines, and every session loads ALL of it into context before your first message.

    That's 5-10k tokens consumed before any work begins.

    The problem compounds. You also have rule files, settings, and supplementary docs. My project accumulated 15 detailed rule files—covering everything from C# style conventions to Azure DevOps CLI commands to database verification patterns. Without any structure, loading all of them at session start would burn 15-20k tokens. That's 10-20% of the effective attention window gone before the agent reads a single line of my code.

    This is the .NET equivalent of loading every assembly in the GAC at startup. You wouldn't do that for your application. Why do it for your agent?

    Progressive Disclosure — Dependency Injection for Documentation

    The solution borrows a principle every .NET developer already knows: don't load what you don't need.

    I restructured my rules into a two-level system. Main rule files are concise reference guides—tables, critical constraints, and See: links to detail files. The detail files contain full commands, code examples, and multi-step procedures. The agent loads the main rules at session start and resolves details on demand, only when it encounters a task that needs them.

    This is dependency injection for documentation. You register the interface (concise rule file) at startup. The implementation (detailed code examples) resolves when needed.

    Here's what the structure looks like:

    text
    rules/
    ├── token-budget.md           # Main rule: 91 lines (tables, critical rules, See: links)
    ├── tool-rtk.md               # Main rule: 74 lines
    ├── tracking-azure-devops.md  # Main rule: 148 lines
    └── code/
        ├── token-budget/
        │   └── patterns.md       # Detail: 172 lines (ChunkHound, rtk read, targeted Read patterns)
        ├── tool-rtk/
        │   └── commands.md       # Detail: 128 lines (full command reference with examples)
        └── tracking-azure-devops/
            └── workflows.md      # Detail: 178 lines (create, close, query, REST API patterns)

    The main rule files use tables for quick reference and link to the detail files at the end of each section:

    text
    ## Token Cost Reference
    
    | Tool | Avg Growth | P90 Growth | Risk Level | Mitigation |
    |------|-----------|-----------|------------|------------|
    | Read | 1,468 | 5,302 | High | ChunkHound, rtk read, or limit |
    | Edit | 457 | 1,872 | Low | Cheapest mutation tool — prefer it |
    
    See: [code/token-budget/patterns.md](code/token-budget/patterns.md)

    Vertical ordering matters too: critical rules that break things go at the top. Edge cases go at the bottom. Same principle as putting [Required] before [StringLength] on your model properties—validate the most important thing first.

    The numbers: 26 main rule files averaging 87 lines each, plus 11 detail files averaging 102 lines each. Without progressive disclosure, all 3,400 lines load at once. With it, session start loads only the main files—about 2,260 lines—and detail files load individually as needed.
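    Mechanically, progressive disclosure is just lazy loading with a cache. A minimal sketch, assuming a session-scoped cache (the file paths and the readFileStub stand-in are hypothetical):

```typescript
// Main rules load eagerly at session start; detail files resolve lazily,
// and the cache guarantees each file enters context at most once.
const loaded = new Map<string, string>();

function readFileStub(path: string): string {
  // Stand-in for a real file read.
  return `contents of ${path}`;
}

function loadRule(path: string): string {
  if (!loaded.has(path)) {
    loaded.set(path, readFileStub(path));
  }
  return loaded.get(path)!;
}

// Session start: only the concise main rule files.
["rules/token-budget.md", "rules/tool-rtk.md"].forEach(loadRule);

// Later, on demand, when a task actually needs the detail:
loadRule("rules/code/token-budget/patterns.md");

console.log(loaded.size); // 3 distinct files loaded, none twice
```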

    Indexing Your Codebase — The Difference Between a Table Scan and an Index Seek

    Your codebase exists as files on disk, but the agent can't "see" it. It discovers code through tool calls—Read, Grep, Glob. Each discovery costs tokens. For a small project, this is fine. The agent searches for a function, reads a file, moves on.

    At scale, this falls apart. Searching "authentication" in a 100,000-line project returns 80+ files and consumes 60,000+ tokens during discovery alone—potentially half your effective context window before the agent begins actual problem-solving.

    .NET developers already understand this from SQL Server. A full table scan works for 100 rows. For a million rows, you need an index.

    ChunkHound pre-indexes your codebase into semantically searchable chunks. Instead of reading an entire 15,000-token legacy .aspx view to find three data bindings, the agent queries by meaning and gets focused 1-2k token results with surrounding context.

    bash
    # Instead of reading the whole file (10-21k tokens):
    # Read "C:\legacy\Views\Member\MemberDetail.aspx"
    
    # Search by meaning (1-2k tokens):
    PYTHONIOENCODING=utf-8 chunkhound search \
      --config .chunkhound/admin-portal-v8.json \
      "employer grid data binding" \
      --page-size 5

    For my legacy .NET migration, I set up five index configurations—one for each codebase and database version. The legacy V8 application, the V7 predecessor, the new .NET 10 rewrite, and both database schemas. Index once, search forever.

    The scale guidance I've settled on:

    • Under 10k LOC: Agentic search (Grep, Glob, Read) is fine
    • 10-100k LOC: Semantic search adds real value
    • 100k+ LOC: Essential—autonomous search misses architectural connections

    "Wait. What? A File Read Costs More Than 50 Edits?"

    With input-side context managed, I turned to what happens during a session. I started logging token growth per tool call across my sessions, and the data was... not what I expected.

    .NET developers are trained to think of file reads as cheap. On disk, they are. In a context window, Read is the single most expensive common operation.

    Here's the actual data from 194 sessions and 13,730 tool calls:

    | Tool | Avg Token Growth | P90 Growth | Risk Level |
    |------|-----------------|------------|------------|
    | Read | 1,468 | 5,302 | High — 33 calls exceeded 10k |
    | Agent | 2,329 | 9,921 | Medium — varies by prompt |
    | Write | 2,211 | 6,607 | Medium — echoes full content |
    | ToolSearch | 1,796 | 4,705 | Medium — schema injection |
    | Grep | 815 | 2,433 | Low normally |
    | Edit | 457 | 1,872 | Low — cheapest mutation tool |
    | Bash | 339 | 1,736 | Low normally |

    Read that again. Edit—the tool that modifies files—averages 457 tokens. Read—the tool that just looks at files—averages 1,468 tokens with a P90 of 5,302. A single file read can cost more than 10 edits. And 33 of my reads exceeded 10,000 tokens each.

    This makes sense when you think about it. Edit sends a diff—old text and new text. Read sends the entire file contents verbatim into the context window. A 500-line controller? That's 10-12k tokens, every time the agent reads it.
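    You can ballpark this yourself with the common ~4-characters-per-token heuristic (an approximation; real tokenizers differ by model and by content):

```typescript
// Rough read-cost estimate: ~4 characters per token. Approximate only --
// the heuristic, not any model's actual tokenizer.
function estimateReadCost(fileChars: number): number {
  return Math.ceil(fileChars / 4);
}

// A 500-line controller at roughly 90 characters per line:
console.log(estimateReadCost(500 * 90)); // 11250 -- in the 10-12k range above
```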

    And some files are serial offenders:

    | File | Estimated Token Cost |
    |------|---------------------|
    | `azure-pipelines.yml` | 11-15k per read |
    | Legacy `.aspx` WebForms views | 10-21k per read |
    | Large MVC controllers | 10-12k per read |
    | Spec documents | 10-12k per read |
    | Large test files | 10-14k per read |

    I had been treating every file read as free. It isn't. Not even close.


    Six Anti-Patterns That Were Killing My Sessions

    Once I had the data, the anti-patterns became obvious. Each one was a slow context leak—easy to miss in the moment, devastating in aggregate.

    1. The Kitchen Sink CLAUDE.md

    Every rule, every edge case, every troubleshooting tip—all loaded into context at session start. My instruction files were eating 15-20% of the effective attention window before the agent wrote a single line of code. The fix: progressive disclosure. Load the summaries. Resolve the details on demand.

    2. The Goldfish Read

    Reading the same file multiple times in one session because the agent (or I) forgot it was already in context. My spec files—10-12k tokens each—got re-read 4 times in a single session. That's ~45,000 wasted tokens. My azure-pipelines.yml got read 6 times across sessions for ~75,000 wasted tokens. The fix: read once, reference from context.

    3. The MSBuild Firehose

    Running dotnet build on a 42-project solution dumps ~12,000 tokens of NuGet restore messages, MSBuild target resolution, and compilation output into context. The agent doesn't need any of it—it needs "build succeeded" or "error CS1234 on line 47." But without compression, all that noise enters the window and stays there. The fix: output compression (more on this in the next section).

    4. The Grep Avalanche

    An unbounded search across the entire solution: Grep pattern="DcpId" with no scope and no limit. Returns 1,000+ matches. Costs 5-20k tokens. The fix: always scope your searches with glob filters and head_limit. Grep pattern="DcpId" glob="*.cs" path="src/Controllers" head_limit=20 returns what you actually need.

    5. The Schema Parade

    Every time the agent needs a deferred tool, it calls ToolSearch to fetch the schema. Each call injects the full JSON schema definition into context—about 1,800 tokens. I watched the agent make 5 separate ToolSearch calls in a row: ~9,000 tokens. A single batched call (select:Read,Edit,Bash,Agent,Grep) costs ~1,800 tokens total. Same result, 80% less context consumed.
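    The arithmetic behind that 80% figure, spelled out (token costs are the approximate values from my logs):

```typescript
// Five separate ToolSearch calls vs one batched call, using the ~1,800-token
// per-schema cost observed above.
const schemaTokens = 1_800;
const separateCalls = 5 * schemaTokens; // ~9,000 tokens of repeated schemas
const batchedCall = schemaTokens;       // one batched call covers all five tools
const saved = 1 - batchedCall / separateCalls;
console.log(`${Math.round(saved * 100)}% less context consumed`); // 80%
```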

    6. The Legacy Full Read

    Reading an entire legacy .aspx WebForms view—10,000 to 21,000 tokens—when you only needed to know what three fields were data-bound to. ChunkHound returns the same answer in 1-2k tokens. The fix: semantic search first, raw read only for small files.

    | Anti-Pattern | Token Cost | Fix |
    |--------------|-----------|-----|
    | Kitchen Sink CLAUDE.md | 15-20% of effective window | Progressive disclosure |
    | Goldfish Read (spec 4x) | ~45k wasted | Read once, reference from context |
    | MSBuild Firehose | ~12k per build | RTK compression (99.1% reduction) |
    | Grep Avalanche | 5-20k per search | Scope with glob, head_limit |
    | Schema Parade (5 calls) | ~9k total | Batch into 1 call (~1.8k) |
    | Legacy Full Read | 10-21k per file | ChunkHound semantic search |

    Tools You Can Install Today

    The anti-patterns point to the solutions. Here are the practical tools I built and adopted—each one something a .NET developer can set up in an afternoon.

    RTK (Rust Token Killer) — Command Output Compression

    RTK is a Rust CLI that filters command output before it enters the context window. It has built-in filters for dotnet, git, docker, playwright, and more. Install it, prefix your commands with rtk, and watch the noise disappear.

    The .NET numbers are staggering:

    | Command | Before | After | Savings |
    |---------|--------|-------|---------|
    | `dotnet build` (42 projects) | ~12,000 tokens | ~100 tokens | 99.1% |
    | `dotnet test` (2,601 tests) | ~25,000 tokens | ~100 tokens | 99.6% |
    | `git diff` (typical) | ~5,000 tokens | ~1,000 tokens | 80% |
    | `az boards work-item show` | ~1,657 tokens | ~48 tokens | 97% |

    For .NET developers, this is the single biggest immediate win. MSBuild is absurdly verbose—NuGet restore messages, target resolution, assembly info, warnings-as-info. Your agent doesn't need any of it. RTK strips it down to what matters: success/failure and errors.

    PreToolUse Hooks — Making Compression Automatic

    RTK works great, but you have to remember to prefix every command. That's where hooks come in.

    Important caveat: RTK supports the dotnet CLI, but its default hook doesn't support Windows. I implemented my own hook as a Bun (TypeScript) script. This is actually the point—the agent ecosystem gives you hooks as extension points, and you build what your environment needs.

    A PreToolUse hook intercepts every Bash tool call before execution. My hook examines the command, pattern-matches against known command categories, and rewrites it to use RTK automatically. The agent runs dotnet build; the hook rewrites it to rtk dotnet build. No manual intervention required.

    Here's the core rewrite logic—110 lines of TypeScript that handle 35+ command patterns:

    typescript
    const GIT_SUBCMDS = new Set([
      "status", "diff", "log", "show", "add", "commit",
      "push", "pull", "branch", "fetch", "stash", "worktree",
    ]);
    const DOTNET_SUBCMDS = new Set(["build", "test", "restore", "format", "publish"]);
    
    function rewrite(cmd: string): string | null {
      // Extract leading env var assignments (e.g., VAR=val dotnet build)
      const envMatch = cmd.match(/^((?:[A-Za-z_][A-Za-z0-9_]*=[^ ]* +)+)/);
      const pre = envMatch?.[1] ?? "";
      const body = cmd.slice(pre.length);
      const word = firstWord(body);
    
      if (word === "git") {
        const sub = firstWord(stripLeadingFlags(body.slice(3).trimStart()));
        return GIT_SUBCMDS.has(sub) ? `${pre}rtk ${body}` : null;
      }
      if (word === "dotnet") {
        const sub = firstWord(body.slice(6).trimStart());
        return DOTNET_SUBCMDS.has(sub) ? `${pre}rtk ${body}` : null;
      }
    
      // pwsh.exe commands containing az boards → wrap with rtk summary
      if (word === "pwsh.exe" || word === "pwsh") {
        if (/az\s+boards/.test(body) && !/\|\s*(bun|jq|sed)\b/.test(body)) {
          return `${pre}rtk summary ${body}`;
        }
      }
    
      return null;
    }

    The design decisions matter:

    • Fail silently: The hook's catch block exits with code 0. Hooks must never break the workflow. A broken hook is worse than no hook.
    • Skip heredocs: Commands containing << can't be safely rewritten—variable substitution and multi-line content make parsing unreliable.
    • Skip already-piped commands: If the command already pipes to bun, jq, or sed, the user has custom extraction. Don't interfere.
    • Preserve env vars: PYTHONIOENCODING=utf-8 dotnet build becomes PYTHONIOENCODING=utf-8 rtk dotnet build, not rtk PYTHONIOENCODING=utf-8 dotnet build.
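    Those guard rules can be sketched as a wrapper around the rewrite function (the Rewriter type and the demo rewriter here are illustrative, not the real hook):

```typescript
// Guard checks from the design decisions above, wrapped around any rewriter.
type Rewriter = (cmd: string) => string | null;

function safeRewrite(cmd: string, rewrite: Rewriter): string | null {
  try {
    if (cmd.includes("<<")) return null;              // skip heredocs
    if (/\|\s*(bun|jq|sed)\b/.test(cmd)) return null; // user already extracts
    return rewrite(cmd);                              // null means "leave as-is"
  } catch {
    return null; // fail silently: a broken hook is worse than no hook
  }
}

// Trivial demo rewriter standing in for the real pattern matcher:
const demo: Rewriter = (c) => (c.startsWith("dotnet ") ? `rtk ${c}` : null);
console.log(safeRewrite("dotnet build", demo));       // "rtk dotnet build"
console.log(safeRewrite("cat <<EOF\nhi\nEOF", demo)); // null (heredoc skipped)
```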

    Custom Filtering — The Azure CLI Example

    RTK has no native az filter. But Azure DevOps CLI commands return massive JSON blobs—~1,657 tokens for a single work item. When you're managing tasks, querying backlogs, and updating stories throughout a migration, this adds up fast.

    I built 5 Bun extraction scripts that parse Azure CLI JSON and return only the fields that matter. Here's the simplest one—az-show.ts, 36 lines:

    typescript
    const raw = await Bun.stdin.text();
    if (!raw.trim()) process.exit(0);
    
    try {
      const data = JSON.parse(raw);
      const f = data.fields ?? {};
    
      console.log(`ID: ${data.id ?? "?"}`);
      console.log(`Title: ${f["System.Title"] ?? "?"}`);
      console.log(`State: ${f["System.State"] ?? "?"}`);
      console.log(`Type: ${f["System.WorkItemType"] ?? "?"}`);
      console.log(`Assigned: ${f["System.AssignedTo"]?.displayName ?? "unassigned"}`);
      console.log(`Iteration: ${f["System.IterationPath"] ?? "?"}`);
      if (f["System.Description"]) {
        const plain = f["System.Description"].replace(/<[^>]*>/g, "").trim();
        console.log(`Description: ${plain.length > 200 ? plain.slice(0, 200) + "..." : plain}`);
      }
    } catch {
      // Not JSON — pass through raw output (error messages, etc.)
      console.log(raw.trim());
    }

    The result: 1,657 tokens of nested JSON becomes 7 lines of key-value text. 97% reduction.

    The hook ties it together with a two-tier strategy:

    1. Automatic (lossy): The hook wraps bare az boards commands with rtk summary—~95% savings, good for confirmations
    2. Precise (lossless): When you need actual field values, pipe through the extraction script—75-97% savings with zero data loss

    This pattern generalizes. Any verbose CLI tool in your workflow can get a custom extraction script. The hook + extraction combo means compression happens automatically. You never have to remember.

    ChunkHound — Semantic Code Search

    I covered ChunkHound earlier as an indexing strategy, but it's also a practical tool you install and configure. For a legacy .NET migration, the setup looks like this: one config file per codebase, each pointing at a different directory:

    bash
    # Search legacy behavior by meaning
    chunkhound search --config .chunkhound/admin-portal-v8.json \
      "employer grid data binding" --page-size 5
    
    # Search database schema
    chunkhound search --config .chunkhound/database-v8.json \
      "GetEmployersByAgent" --page-size 3
    
    # Deep research with synthesis (uses its own LLM context, not yours)
    chunkhound research --config .chunkhound/admin-portal-v8.json \
      "how does employer management work"

    The research command is particularly powerful—it offloads multi-hop investigation to ChunkHound's own LLM context, traversing semantic relationships across the codebase and returning a synthesized answer. Your context window stays clean while ChunkHound processes thousands of tokens internally.


    The Decision Hierarchy — When to Use What

    Individual techniques are useful. A decision framework that connects them is what changes how you work.

    When you need to read code, which tool do you reach for? The answer depends on what you're reading, how big it is, and whether you need raw text or just information. The table at the end of this section summarizes the hierarchy.

    Here's what this looks like for a common .NET scenario—investigating a 500-line controller:

    bash
    # WRONG — reads the entire file (10-12k tokens)
    # Read "src/Controllers/MemberController.cs"
    
    # RIGHT — find the method, read only what you need (~1.5k tokens)
    Grep pattern="public.*IActionResult Edit" \
      path="src/Controllers/MemberController.cs" \
      output_mode="content" -n=true
    # Output: line 142
    
    Read "src/Controllers/MemberController.cs" offset=135 limit=60

    Same information. 85% less context consumed.

    | Tool | When to Use | Avg Cost | Risk |
    |------|------------|----------|------|
    | ChunkHound | Legacy/external codebases | 1-2k tokens | Low |
    | RTK read | Large files, overview needed | 10-60% savings | Low |
    | Grep → Read | Specific method in large file | ~1.5k tokens | Low |
    | Agent delegation | Full-file analysis needed | 2-5k tokens | Medium |
    | Raw Read | Files under 300 lines | ~1.5k tokens | Low |
    | Unbounded Read | Never | 10-21k tokens | High |

    GIGO applies here too. The decision hierarchy manages the input side (choose the right tool for the question). RTK and the hooks manage the output side (compress what comes back). Both sides matter. Miss either one and the context window fills with noise.


    What 194 Sessions Taught Me

    Here's what I keep coming back to: context management is systems engineering applied to AI.

    The same skills that make you good at performance profiling, memory management, and application architecture are the same skills that make you effective with AI coding agents. Measure the bottleneck. Identify the waste. Build systems to eliminate it. Verify the improvement. Iterate.

    The irony isn't lost on me. Building a context management system for an AI is itself a software architecture problem. I wrote hooks to intercept tool calls, extraction scripts to compress output, a progressive disclosure system to manage documentation, and indexing configurations to search legacy code. I'm writing code to help a machine write code. It's turtles all the way down.

    But here's the thing: .NET developers are uniquely positioned for this work. You already think in terms of resource management. You already understand dependency injection, middleware pipelines, and the difference between a table scan and an index seek. Context is just another resource—finite, degradable, and worth optimizing.

    Start small. Install RTK. Write a CLAUDE.md. When it gets too long, split it into a progressive disclosure structure. Index your legacy codebase with ChunkHound. Measure your sessions. The system grows from there.

    For a structured walkthrough of how AI coding agents actually work—execution loops, grounding, tool architecture, and context engineering—I recommend agenticoding.ai. It goes deeper than I can in a single post.

    The system I've built is 26 rule files, 6 hooks, and 5 extraction scripts. About 3,400 lines of structured knowledge. It took 194 sessions to get right. But now every session starts better than the last.

    Your AI agent's context window is the most expensive whiteboard you've ever written on. Treat it accordingly.