The Great Rate Validation Rewrite: Real-World Agentic Coding in 48 Hours

    Bobby Johnson · January 25, 2026 · 20 min read

    Prologue: What Does "Agentic Coding" Actually Look Like?

    There's a lot of hype around AI-assisted development. Demos show chatbots writing hello-world apps. Marketing slides promise 10x productivity. But what does it actually look like to build production software with an AI coding partner?

    This is that story.

    Over 48 hours, I rebuilt an insurance rate validation system from scratch—not by prompting an AI to "write me an app," but by working with Claude Code as a genuine collaborator. I used Steve Yegge's Beads for issue tracking and my own planning system to orchestrate the work.

    The result: 23 tasks completed, 3 bugs discovered and fixed, 100% test pass rate, and a system that went from timing out on large customers to validating them in 1.5 seconds.

    Here's how it actually happened.

    The Setup: Tools for Agentic Development

    Before diving into the technical story, let me explain the infrastructure that made this possible.

    Beads: Issue Tracking for AI Workflows

    Beads is Steve Yegge's issue tracking system designed specifically for agentic workflows. Unlike traditional issue trackers that assume humans are reading and updating tickets, Beads is built for AI agents to read, update, and coordinate through.

    bash
    # Create an epic
    bd create --title="Rate Validation v2 - Complete Rewrite" --type=epic --priority=1
    
    # Create research tasks under the epic
    bd create --title="Research Agent Interaction Patterns" --type=task --parent=rates-4zx
    bd create --title="Research User Feedback Loop" --type=task --parent=rates-4zx
    
    # Check what's ready to work on
    bd ready
    
    # Update status as work progresses
    bd update rates-4zx.1 --status=in_progress
    bd close rates-4zx.1

    The key insight: Beads gives Claude Code persistent memory across sessions. When I start a new session, Claude can run bd ready to see what's available, bd show <id> to understand context, and pick up exactly where we left off.

    The Planning System

    My planning system is a Claude Code plugin built on a simple premise: "Sub-agents shouldn't get to say 'done' until they're actually done."

    The problem with vanilla agentic coding is that Claude will happily report task completion even when tests are failing, builds are broken, or critical edge cases are unhandled. The planning system enforces verification gates that catch this.

    1. plan:new - I describe a feature in natural language. Claude asks clarifying questions, then decomposes it into an epic with discrete tasks stored in Beads.
    2. plan:optimize - Converts high-level plans into executable tasks with detailed prompts—each task gets specific success criteria.
    3. Pull-based execution - I work through tasks sequentially via bd commands. Claude pulls the next task, works on it, and attempts to close it.
    4. Gated completion - Here's the key: tasks can't close until verification passes. A .planconfig file defines the gates:
    yaml
    # .planconfig - verification gates
    verification:
      - build: "bun run build"
      - test: "bun test"
      - lint: "bun run lint"
      - typecheck: "bun run typecheck"

    When Claude tries to close a task, these gates run automatically. If the build fails, the task stays open. If tests fail, Claude has to fix them before moving on. No silent failures. No "I'll fix that later."
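
    For a sense of what that enforcement could look like, here's a minimal sketch of a gate runner using the commands from the .planconfig above; the function and its shape are illustrative, not the plugin's actual internals:

    typescript
    // Sketch of gated completion (illustrative, not the plugin's real code).
    // Commands mirror the .planconfig above; a task may close only if all pass.
    import { spawnSync } from "node:child_process";

    const gates: Record<string, string[]> = {
      build: ["bun", "run", "build"],
      test: ["bun", "test"],
      lint: ["bun", "run", "lint"],
      typecheck: ["bun", "run", "typecheck"],
    };

    function runGates(): boolean {
      for (const [name, cmd] of Object.entries(gates)) {
        const result = spawnSync(cmd[0], cmd.slice(1), { stdio: "inherit" });
        if (result.status !== 0) {
          console.error(`Gate "${name}" failed: task stays open`);
          return false;
        }
      }
      return true;  // all gates green: the task is allowed to close
    }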

    This changes the dynamic completely. Instead of me reviewing AI-generated code and finding broken tests, Claude catches its own mistakes in real-time and fixes them before claiming completion.

    Act I: The Problem Domain

    What Is Rate Validation?

    Insurance is a document-heavy business. When an employer offers health benefits to employees, the rates they pay are negotiated annually with carriers. These rates end up in multiple places:

    1. Carrier documents - Excel spreadsheets, Word documents, and PDFs from the insurance carrier showing the negotiated rates
    2. Administration database - The system that actually bills employers and processes claims

    Rate validation is the process of verifying that the rates in the database match the rates in the carrier documents. It sounds simple, but it's critical:

    • Financial accuracy - A single mistyped rate can mean thousands of dollars in incorrect premiums over a year
    • Regulatory compliance - Insurance is heavily regulated; rates must match filed documents
    • Audit defense - When auditors ask "why did you charge this amount?", you need proof it matches the carrier agreement

    The Manual Nightmare

    Before automation, rate validation meant a human opening Excel files, finding rate tables, cross-referencing them against database queries, and documenting any discrepancies. For a customer with 16 products across 7 documents, this could take a full day of tedious, error-prone work.

    The goal of the rate validation system: automate this entirely. Give it source documents and a database connection, and it produces an executive summary showing which rates match and which don't.

    The Technical Problem

    The original system (V1) couldn't handle large customers.

    text
    Validation Target: 27 products, 67 documents
    Result: Context exhaustion at ~120K tokens
    Symptom: Agent timeouts, incomplete validations

    V1 tried to do everything in a single AI pass:

    • Load all documents into memory
    • Parse Excel, Word, PDF, and PowerPoint inline
    • Extract rates while parsing
    • Compare with database
    • Generate reports

    For small customers (3-5 products), this worked fine. But when Moonbeam Mutual came along with 16 medical products across 7 different documents, V1 collapsed under the context window limit.

    I had a choice: patch V1 with band-aids, or burn it down and rebuild.

    I chose fire. And I chose to do it with Claude Code.

    Act II: The Research Phase (R001-R005)

    January 13, 2026, 09:00 AM - I created epic rates-4zx in Beads and started the conversation with Claude.

    My first instruction: "Don't write any code yet. Let's understand the problem space first."

    This is crucial for agentic development. The temptation is to let the AI start coding immediately—it's eager to help, and code feels like progress. But understanding comes before implementation.

    I created five research tasks and told Claude to run them in parallel:

    Epic rates-4zx created, with five research tasks feeding into the MVP phase:

    • R001: Agent Interaction Patterns
    • R002: User Feedback Loop
    • R003: Agent Communication
    • R004: Instruction Flow
    • R005: SQLite Concurrency

    Claude spun up parallel exploration agents—one for each research task—reading documentation, examining existing code patterns, and synthesizing findings.

    Discovery 1: Prompt-Based Agents

    R001 revealed how Claude Code's subagent system actually works:

    typescript
    export class SubagentSpawner {
      async spawnSubagent(config: SubagentConfig): Promise<SubagentResult> {
        // 1. Load markdown prompt file
        const template = await this.loadPrompt(config.promptFile);
    
        // 2. Interpolate variables: {{databasePath}} → actual path
        const prompt = this.interpolateVariables(template, config.variables);
    
        // 3. Spawn agent via Task tool
        const result = await this.invokeTaskTool(prompt, config.type, config.description);
    
        return result;
      }
    }

    Key Insight: Agents aren't classes or modules—they're prompt templates with variable interpolation. This meant I could design specialized agents for each document type (Excel, Word, PDF) by writing markdown files, not TypeScript classes.
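
    For a sense of how lightweight this is, here's a hedged sketch of the interpolation step; the regex body and the example database path are my own illustration, not the actual implementation behind interpolateVariables:

    typescript
    // Illustrative only: one plausible body for the interpolation step.
    // Replaces every {{name}} placeholder in a prompt template with its value.
    function interpolateVariables(template: string, variables: Record<string, string>): string {
      return template.replace(/\{\{(\w+)\}\}/g, (match: string, name: string) =>
        name in variables ? variables[name] : match  // leave unknown placeholders intact
      );
    }

    // "Validate rates in {{databasePath}}" -> "Validate rates in ./state/validation.db"
    const prompt = interpolateVariables("Validate rates in {{databasePath}}", {
      databasePath: "./state/validation.db",  // hypothetical path
    });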

    Discovery 2: SQLite as the Coordination Layer

    R003 and R005 together solved agent coordination:

    Question: How do agents talk to each other without passing giant JSON blobs?

    Answer: They don't talk to each other—they write to a shared SQLite database.

    sql
    CREATE TABLE products (
      product_id INTEGER,
      product_name TEXT,
      extraction_status TEXT  -- 'pending', 'extracted', 'failed'
    );
    
    CREATE TABLE extracted_rates (
      id INTEGER PRIMARY KEY,
      product_id INTEGER,
      tier_name TEXT,
      rate_value REAL,
      source_document TEXT,
      FOREIGN KEY (product_id) REFERENCES products(product_id)
    );

    This pattern is fundamental to agentic systems: use a database as the shared state, let each agent read what it needs and write what it produces. No complex message passing required.
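
    Here's a minimal sketch of that read/write pattern, assuming Bun's built-in bun:sqlite driver; the project's actual db-wrapper.ts isn't shown in this post, and the inserted values and database path are illustrative:

    typescript
    // Shared-state sketch assuming Bun's built-in SQLite driver.
    // Table and column names come from the schema above; the values are illustrative.
    import { Database } from "bun:sqlite";

    const db = new Database("state/validation.db");  // hypothetical path

    // An extractor agent writes what it produced...
    db.prepare(
      "INSERT INTO extracted_rates (product_id, tier_name, rate_value, source_document) VALUES (?, ?, ?, ?)"
    ).run(2, "Employee Only", 512.34, "MM Rate Table Effective 1.1.2026.xlsx");

    // ...and a downstream agent reads only what it needs.
    const pending = db
      .query("SELECT product_id, product_name FROM products WHERE extraction_status = ?")
      .all("pending");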

    Discovery 3: Quality Gates

    R002 uncovered a critical requirement: the system must never take shortcuts.

    V1 had a nasty habit: if it couldn't find a product in the documents, it would just... skip it. Generate a report saying "14 of 16 products validated" and call it done.

    This was unacceptable. Missing 2 products could mean millions of dollars in incorrect premiums.

    The solution: Quality Gates that stop execution and ask for human help rather than silently failing.

    typescript
    if (!result.passed) {
      // STOP - prompt user for help
      await gate.onFailure();  // Shows AskUserQuestion prompt
    
      // User provides hints (document path, product aliases, etc.)
      // RETRY phase with hints
    }

    In practice, a gate failure looks like this:

    text
    ⚠️  GATE FAILURE: Phase 1.5 incomplete
    
    Unable to locate rates for:
      - Voluntary STD
      - Voluntary LTD
    
    I found 14 of 16 products. Would you like me to:
    1. Continue with partial results
    2. Help me locate these products

    I could look at the source documents and respond: "Those are in the Benefits Summary PDF. They're listed as 'Short Term Disability' and 'Long Term Disability'—the database uses abbreviations but the carrier documents use full names."

    Claude would then retry the matching phase with this hint, find the products, and continue. The key is that it stopped and asked instead of silently producing an incomplete report.

    This is the key to trustworthy agentic systems: design them to fail loudly and ask for help rather than fail silently and produce wrong answers.
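
    To make that concrete, a gate of this kind can boil down to a completeness query plus an explicit stop. Here's a hedged sketch against the SQLite schema above, not the project's actual gates.ts:

    typescript
    // Hypothetical completeness gate: every product must have matched rates
    // before comparison runs. It reports the gaps instead of skipping them.
    import { Database } from "bun:sqlite";

    function checkMatchingGate(db: Database): { passed: boolean; unmatched: string[] } {
      const rows = db.query(
        `SELECT product_name FROM products
         WHERE product_id NOT IN (SELECT DISTINCT product_id FROM extracted_rates WHERE product_id IS NOT NULL)`
      ).all() as { product_name: string }[];

      return { passed: rows.length === 0, unmatched: rows.map((r) => r.product_name) };
    }

    // On failure the orchestrator stops and asks, rather than writing a partial report:
    // if (!result.passed) await gate.onFailure();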

    Act III: The MVP (M001-M006)

    January 13, 2026, 2:00 PM - Research complete. Time to build.

    I told Claude: "Let's build the smallest possible system that proves the architecture works. One product, one document type, just phases 0-2."

    This "tracer bullet" approach is essential for agentic development. You don't want the AI building a complete system based on assumptions—you want to validate the architecture with minimal investment first.

    MVP Scope:

    • Single product (Rainbow Wellness 2000)
    • Single document type (Excel)
    • Phases 0-2 only (no comparison yet)
    • Anti-shortcut enforcement with user prompt
    • Target: <50K tokens for full run

    Claude built out the project structure:

    text
    .claude/skills/rate-validation-v2/
    ├── orchestrator/
    │   ├── orchestrate-validation.ts    # Main orchestrator
    │   ├── gates.ts                     # Quality gate checks
    │   └── entry-point.ts               # CLI entry
    ├── agents/
    │   ├── excel-extractor/
    │   │   └── AGENT.md                # Agent configuration
    │   └── word-extractor/
    ├── prompts/
    │   ├── extract-excel.md            # Excel agent prompt
    │   └── extract-word.md
    ├── schema/
    │   └── validation-db.sql           # SQLite schema
    └── state/
        └── db-wrapper.ts               # Database utilities

    The interesting thing about watching Claude work: it naturally organized the code into clean separations. The orchestrator handles flow control. Agents are defined by prompt templates. The database provides shared state. This wasn't because I told it to—it emerged from understanding the problem during research.

    Result: MVP worked perfectly. 48K tokens, all gates passed, Rainbow Wellness 2000 validated.

    Act IV: The Full Build (F026-F029)

    January 13, 2026, 6:00 PM - MVP success. Full steam ahead.

    With the architecture proven, I gave Claude the green light to build out the complete system:

    • F026: Build V2 Orchestrator Core - All 5 phases
    • F027: Build V2 Orchestrator Support - Gates, error handling, logging
    • F028: V2 Command Integration - CLI wrapper /validate-rates
    • F029: V2 End-to-End Testing - Integration tests with real data

    By midnight, the full V2 orchestrator was complete. I ran the first test:

    bash
    $ /validate-rates --client=MM --year=2026 --validationId=1001 --sourceDir=validations/moonbeam-2026/sources
    
    Phase 0: Scope Definition
      ✓ Cataloged 16 products
      ✓ Inventoried 7 documents
      ✓ Mapping products to documents...
    
    ⚠️  GATE FAILURE: Phase 0 incomplete
    
    Products mapped: 0/16 (0%)
    
    Unable to map products to source documents by filename.

    Wait. What?

    Act V: The Confident Hallucination

    January 14, 2026, 8:00 AM - I stared at the error with Claude.

    This is what "vibe coders" call a hallucination—but it's more accurately described as underspecification. I told Claude that Phase 0 needed to "map products to documents." I didn't specify how. Claude filled in the gap with a reasonable-sounding approach: match product names to filenames.

    The problem? I never validated this assumption. Claude built the entire mapping system with complete confidence:

    typescript
    // Claude's confident implementation
    async mapProductsToDocuments() {
      for (const product of products) {
        for (const document of documents) {
          if (document.file_path.toLowerCase().includes(product.product_name.toLowerCase())) {
            await this.db.execute('UPDATE products SET source_document = ?', [document.file_path]);
          }
        }
      }
    }

    Clean code. Good variable names. Proper async/await. And it could never work.

    Insurance product names don't appear in filenames:

    text
    Product Name: "Accident Low Plan"
    Filename: "Unicorn United Network - Accident Plan Summary.pdf"
    
    Product Name: "Rainbow Wellness 2000"
    Filename: "MM Rate Table Effective 1.1.2026.xlsx"
    
    Match? NO

    Product names appear in the content of documents—in table headers, sheet names, PDF sections—not in filenames. Claude had no way to know this. I never told it, and it never asked.

    The Chicken-and-Egg Problem:

    • Phase 0 requires mapping products to documents
    • Mapping requires seeing document contents
    • Document contents are in Phase 1 (extraction)
    • Phase 1 can't run until Phase 0 passes

    The system was deadlocked before it even started—and Claude had built it with complete confidence.

    But wait—didn't the MVP catch this? No. And that's the insidious part.

    The MVP used a single product ("Rainbow Wellness 2000") and a single Excel file that I had specifically named to contain "Rainbow" in the filename for testing convenience. The filename-based matching worked perfectly in the MVP. All gates passed. I declared victory and moved to full implementation.

    The flaw only appeared when we hit real customer data with real filenames like "MM Rate Table Effective 1.1.2026.xlsx"—filenames that follow carrier conventions, not my test conventions.

    This is the danger of agentic coding without guardrails. Claude doesn't know what it doesn't know. It will fill gaps in your specification with plausible-sounding solutions and keep building. The MVP's verification gates all passed because my test data was accidentally designed to work with Claude's assumption.

    The lesson: be specific about the "how," not just the "what." And test with real data, not convenient data. If I had used actual carrier documents in the MVP instead of my sanitized test files, the chicken-and-egg problem would have surfaced immediately.

    Act VI: The Pivot (F030-F037)

    January 14, 2026, 9:00 AM - Emergency architecture review.

    The solution was radical: reverse the entire flow.

    Old (Broken) Architecture:

    1. Phase 0: Catalog + Map products to documents (FAILS HERE)
    2. Phase 1: Extract documents
    3. Phase 2: Compare rates

    New (Working) Architecture:

    1. Phase 0: Catalog products + Inventory documents (no mapping)
    2. Phase 1: Universal extraction (extract everything)
    3. Phase 1.5: Content-based matching (NEW)
    4. Phase 2: Compare rates
    5. Phase 3: Generate summary

    The key insight: Extract first, match later.

    I created new tasks in Beads:

    • F030: Remove Phase 0 Product Mapping Gate
    • F031: Universal Document Extraction
    • F032: Build Fuzzy Matching Utilities
    • F033: Implement Phase 1.5 Content-Based Matching

    Claude implemented fuzzy string matching to connect extracted content back to products:

    typescript
    class FuzzyMatcher {
      calculateSimilarity(str1: string, str2: string): number {
        // Normalize and tokenize, then compute Jaccard similarity as a percentage
        const tokens1 = this.tokenize(str1);
        const tokens2 = this.tokenize(str2);
        const overlap = tokens1.filter((t) => tokens2.includes(t)).length;
        const similarity = (overlap / new Set([...tokens1, ...tokens2]).size) * 100;
        // Boost confidence for numeric matches (e.g., "2000" in both strings)
        if (this.hasNumericMatch(tokens1, tokens2)) {
          return Math.min(similarity * 1.5, 100);  // 50% boost
        }
        return similarity;
      }

      private tokenize(str: string): string[] {
        return str.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
      }

      private hasNumericMatch(tokens1: string[], tokens2: string[]): boolean {
        return tokens1.some((t) => /^\d+$/.test(t) && tokens2.includes(t));
      }
    }

    Example:

    text
    Product: "Rainbow Wellness 2000"
    Extracted: "$2000 Plan"
    
    Tokens: ["rainbow", "wellness", "2000"] vs ["2000", "plan"]
    Numeric match: "2000" appears in both
    Similarity: 33% base + 50% numeric boost = 75% confidence

    Act VII: The Bugs

    January 14, 2026, 11:00 AM - Architecture complete. Time for end-to-end testing.

    Three bugs emerged—and this is where the human-AI collaboration really paid off.

    Bug #1: The Phantom Matches

    Symptom: Phase 1.5 completed successfully, but Phase 2 failed with "No matched products found."

    Claude and I investigated together. The matching logic created product_matches records but forgot to actually update the extracted_rates.product_id column.

    typescript
    // BUG: Only inserting into product_matches
    await this.db.execute(
      'INSERT INTO product_matches (product_id, ...) VALUES (?, ...)',
      [bestMatch.productId, ...]
    );
    
    // FIX: Also update extracted_rates
    await this.db.execute(
      'UPDATE extracted_rates SET product_id = ? WHERE tier_name = ?',
      [bestMatch.productId, bestMatch.extractedText]
    );

    I filed this as rates-nxp in Beads. Claude fixed it immediately.

    Bug #2: The $6000 Typo

    Symptom: Product "Rainbow Wellness 3000" not found in Excel file.

    This wasn't a code bug—it was a data bug. The Excel file had a typo: "$6000 Plan" instead of "$3000 Plan". The rates themselves were correct, just the label was wrong.

    This is something Claude couldn't have found on its own—it required human domain knowledge to recognize the typo. But once I identified it, Claude helped me trace through the system to confirm the diagnosis.

    Bug #3: The Missing Rate Type

    Symptom: All "Carrier Premium" rates showed as 0.00.

    The Excel extractor was only extracting one column type. Claude fixed it by detecting rate types from column headers dynamically:

    typescript
    // headers = the worksheet's header row; rateColumns accumulates which
    // spreadsheet columns hold which rate type (declarations omitted for brevity)
    for (let col = 1; col <= headers.cellCount; col++) {
      const header = headers.getCell(col).value?.toString().toLowerCase() || '';

      if (header.includes('firm') || header.includes('billed')) {
        rateColumns.push({ col, type: 'firm_billed' });
      } else if (header.includes('carrier') || header.includes('premium')) {
        rateColumns.push({ col, type: 'carrier_premium' });
      }
    }

    Act VIII: Victory

    January 14, 2026, 1:00 PM - All bugs fixed. Final test run.

    bash
    $ /validate-rates --client=MM --year=2026 --validationId=1001
    
    Phase 0: Scope Definition
      ✓ Cataloged 16 products
      ✓ Inventoried 7 documents
      ✓ Phase 0 complete (0.2s)
    
    Phase 1: Universal Document Extraction
      ✓ Extracted MM Rate Table.xlsx (264 rates)
      ✓ Extracted 6 other documents
      ✓ Phase 1 complete (0.4s)
    
    Phase 1.5: Content-Based Product Matching
      ✓ Matched 3/3 products in Excel file (100%)
      ✓ Phase 1.5 complete (0.1s)
    
    Phase 2: Database Rate Extraction
      ✓ Retrieved 264 database rates
      ✓ Phase 2 complete (0.3s)
    
    Phase 3: Rate Comparison
      ✓ Match rate: 100%
      ✓ Discrepancies found: 0
      ✓ Phase 3 complete (0.3s)
    
    ✅ VALIDATION COMPLETE
       Total time: 1.5s
       Rate accuracy: 100%

    I ran the full test suite against all customer data:

    text
    ✓ Moonbeam Mutual 2026: 100% match rate, 1.5s
    ✓ Starlight Services 2025: 87% match rate, 1.2s
    ✓ Pancake Pete's 2025: 91% match rate, 1.8s
    ✓ Wizard's Alliance 2025: 100% match rate, 0.9s
    
    All tests passing ✅

    Epilogue: What I Learned About Agentic Development

    What Worked

    1. Research Before Code - The 5 parallel research tasks saved massive rework. Understanding the problem space before implementation is even more important with AI—the AI is eager to code, and it's the human's job to pump the brakes until understanding is solid.
    2. Beads for Persistent Context - Issue tracking gave Claude memory across sessions. When I came back after a break, Claude could run bd show rates-4zx and immediately understand where we were.
    3. Tracer Bullet MVP - Building the smallest end-to-end system first validated the architecture cheaply before full implementation, even though, as Act V showed, my convenient test data let the mapping flaw slip through.
    4. Verification Gates - The planning system's gated completion meant Claude couldn't claim "done" with failing tests. Every task had to pass build, test, and lint checks before closing. This caught the phantom matches bug immediately—the tests failed, so the task stayed open until Claude fixed it.
    5. SQLite as Coordination Layer - Using a database instead of JSON files eliminated entire classes of bugs.
    6. Willingness to Pivot - When the Phase 0 deadlock emerged, we didn't patch it—we redesigned the flow. Claude was able to rapidly implement the new architecture because the code was well-structured.

    What Didn't Work

    1. Filename-Based Mapping - Assuming product names would appear in filenames was naive. Real-world documents have arbitrary naming conventions. This assumption was mine, not Claude's—a reminder that the human brings domain assumptions that need validation.
    2. Assuming the Schema Was Complete - The "carrier premium" bug happened because I didn't think to mention there were multiple rate types. Claude built exactly what I asked for, which wasn't what I needed.

    The Real Lesson

    Agentic coding isn't about replacing human developers with AI. It's about creating a collaboration where:

    • The human provides domain knowledge, sets priorities, catches real-world edge cases, and makes judgment calls
    • The AI provides implementation speed, pattern recognition, parallel exploration, and tireless debugging
    • The tools enforce accountability—Beads provides persistent memory across sessions, and the planning system's verification gates ensure Claude can't cut corners

    That last point is crucial. Without verification gates, agentic coding becomes a game of "the AI says it's done, but is it really?" With gates, the answer is objective: if tests pass and the build succeeds, it's done. If not, Claude keeps working.

    The rate validation system wasn't built by Claude Code—it was built with Claude Code. That distinction matters.

    The Bigger Picture

    Consider what this epic would have looked like in a traditional development shop.

    A team of developers—let's say three engineers, a tech lead, and a QA person—tackling a complete rewrite of a document processing system. How long would that take? In my experience: 2-3 months minimum. Requirements gathering, architecture design, sprint planning, code reviews, QA cycles, bug fixes, more QA cycles.

    Now consider the pivot. When the chicken-and-egg problem emerged, we completely redesigned the phase architecture and rebuilt the matching system. In a traditional team, that's a difficult conversation. "We need to throw away two weeks of work and start over." There's organizational inertia, sunk cost fallacy, pressure to ship something. The temptation is to patch around the fundamental flaw rather than fix it.

    With agentic coding, the pivot took 4 hours. There was no ego invested in the wrong solution. No difficult conversations about whose design was flawed. Just: "This doesn't work, let's try a different approach." Claude rebuilt the entire phase architecture without complaint, and we moved on.

    Yes, agentic coding has flaws. Claude confidently built a system that could never work. It filled gaps in my specification with plausible-sounding nonsense. It needed constant verification and human oversight.

    But these are a new set of problems—and they're problems you can iterate on quickly. When Claude gets something wrong, you fix it and move on. When a traditional team gets something wrong, you have meetings about it.

    And here's the thing: the models get better every few months. The Claude I'm working with today is dramatically more capable than the Claude from a year ago. Apply something like Moore's Law to this trajectory, and it's not hard to imagine a future where agentic coding is simply how software gets built.

    We're not there yet. The human in the loop is still essential—for domain knowledge, for catching hallucinations, for knowing when to push back on the AI's confident wrongness. But the gap is closing.

    Forty-eight hours. Twenty-three tasks. One developer. One AI.

    That's not the future of software development. That's now.

    By The Numbers:

    | Metric | V1 (Old) | V2 (New) |
    |--------|----------|----------|
    | Context Usage | 120K+ tokens | <60K tokens |
    | Large Customer Support | Fails | Works |
    | Runtime | N/A (timeout) | 1.5s |
    | Match Accuracy | Manual aliases | 87%+ automatic |
    | Development Time | Weeks | 48 hours |

    Epic Completion:

    text
    Epic: rates-4zx - Insurance Rate Validation v2 - Complete Rewrite
    Status: CLOSED
    Duration: 48 hours (Jan 13-14, 2026)
    
    Tasks Completed: 23/23
    ├── Research Phase (R001-R005): 5/5
    ├── MVP Phase (M001-M006): 6/6
    ├── Build Phase (F026-F029): 4/4
    ├── Pivot Phase (F030-F037): 7/7
    └── Polish Phase: 1/1
    
    Bugs Found & Fixed: 3
    Test Coverage: 100% (4/4 customers passing)

    This is a real project built over 48 hours. Customer names have been changed to protect confidentiality, but the technical challenges, bugs, and solutions are exactly as they happened.

    The hero image shows the Beads interface tracking all 23 tasks through completion—proof that agentic development isn't just demos and hype, but real production software.