
I swapped Claude Opus for Llama 3.3 70B. The cleanup took three days.

9 min read
Sumeet Zankar

AI Solution Architect & Full-Stack Developer

I migrated one of my AI agents from Anthropic's Claude Opus 4.5 to NVIDIA NIM's Llama 3.3 70B Instruct last week. The migration itself was a one-line config change. The cleanup took three days.

Most of the migration content online tells you the easy part: change the provider, set the API key, swap the model string, restart. That part is genuinely simple — modern agent runtimes (in my case OpenClaw) abstract the provider differences behind a common tool-use schema and a normalized message format. You change anthropic/claude-opus-4-5 to nvidia-nim/meta/llama-3.3-70b-instruct in your config and the runtime takes care of the rest.
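
For concreteness, the change really is one model string in the agent's config. The exact keys depend on your runtime, so treat this as a hypothetical layout rather than OpenClaw's actual schema:

# before
model: anthropic/claude-opus-4-5

# after
model: nvidia-nim/meta/llama-3.3-70b-instruct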

The hard part is everything that worked with Opus and quietly stops working with Llama. None of it shows up as an error. The agent just behaves slightly off, in five specific ways I hit one after another.

The agent in question

Briefly so the rest makes sense: the agent is a "personal brand manager" that runs once a day at 8 AM IST as an isolated cron job. It does four web searches via Gemini grounding, reads a memory file and a draft directory, dedups against the last three days of briefings, and posts a structured AI/tech news digest to a WhatsApp group. It uses tool calls heavily and has a 17K-token system prompt with workspace context.

On Opus 4.5 it had been running cleanly for several weeks. The migration target was Llama 3.3 70B Instruct on NVIDIA NIM — OpenAI-API-compatible at https://integrate.api.nvidia.com/v1.
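
In practice, "OpenAI-API-compatible" means the stock OpenAI Python client works against the endpoint unchanged. A minimal sketch (the NVIDIA_API_KEY environment variable name is just a placeholder for however you store the key):

import os
from openai import OpenAI

# Point the standard OpenAI client at NVIDIA NIM's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # placeholder env var name
)

resp = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)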

Failure 1: Format template leakage

The first day post-migration, the agent shipped a briefing to WhatsApp that ended with this line:

Send exactly ONE message with the formatted briefing.

Yes — the agent included its own instructions in the message it sent.

The prompt had this structure:

STEP 4 — FORMAT: Use this template:

[template content]

💡 QUICK TAKE: [your insight]

---

STEP 5 — SEND: Send exactly ONE message with the formatted briefing.

On Opus, the model treated the --- as a visual end-of-template separator and STEP 5 as an instruction to itself. On Llama, the model treated the --- as part of the template content rather than a boundary, and faithfully copied "STEP 5 — SEND: Send exactly ONE message with the formatted briefing" into the message it was formatting. The agent's structured output then emitted those lines verbatim to WhatsApp.

The fix: explicit ⟪BEGIN MESSAGE TEMPLATE⟫ ... ⟪END MESSAGE TEMPLATE⟫ markers around the template, plus an explicit "Your output starts with [first line] and ends with [last line]. NOTHING ELSE." Opus didn't need this. Llama did.
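
The relevant prompt section now reads roughly like this (paraphrased; the bracketed placeholders stand in for the real template):

STEP 4 — FORMAT: Use this template. Everything between the markers is the
message; everything outside them is instruction to you, not content.

⟪BEGIN MESSAGE TEMPLATE⟫
[template content]

💡 QUICK TAKE: [your insight]
⟪END MESSAGE TEMPLATE⟫

Your output starts with [first line] and ends with [last line]. NOTHING ELSE.

STEP 5 — SEND: Send exactly ONE message with the formatted briefing.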

The lesson: format conventions you take for granted on the strong model are conventions, not laws of physics. A --- divider is not a universally-recognized end-of-template marker. The cheaper model is more likely to take your prompt at face value, including the structural cues you didn't realize you were giving it.

Failure 2: Suppression-token violations

OpenClaw has a runtime mechanism to suppress no-op cron messages: if the agent emits exactly the literal token HEARTBEAT_OK (alone, with optional whitespace), the runtime strips it and skips delivery. The cron fires, the agent runs its checks, finds nothing to report, emits HEARTBEAT_OK, and the WhatsApp group stays quiet.

Opus emits this token cleanly. Llama emits something like:

All four checks passed silently. The 48-hour anti-spam window applies. HEARTBEAT_OK

The runtime sees HEARTBEAT_OK at the end and strips it — but the rest of the message remains and gets delivered. So the user sees the model's reasoning instead of silence.
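
To make that concrete, here is my own illustration of the contract (not OpenClaw's actual code): an exact-match output is suppressed entirely, while a trailing token only gets stripped and everything else is still delivered.

import re

HEARTBEAT = "HEARTBEAT_OK"

def process_heartbeat_output(text: str) -> str | None:
    """Return the message to deliver, or None to stay silent."""
    # The whole output is the token (whitespace allowed): suppress delivery.
    if text.strip() == HEARTBEAT:
        return None
    # Token mixed into other text: strip the token, deliver whatever remains.
    remainder = re.sub(re.escape(HEARTBEAT), "", text).strip()
    return remainder or None

# Opus-style output: suppressed
assert process_heartbeat_output("HEARTBEAT_OK\n") is None

# Llama-style output: the explanation still reaches the WhatsApp group
leaked = process_heartbeat_output(
    "All four checks passed silently. The 48-hour anti-spam window applies. HEARTBEAT_OK"
)
print(leaked)  # "All four checks passed silently. The 48-hour anti-spam window applies."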

The fix: a DEFAULT OUTPUT block at the top of every heartbeat prompt that explicitly says: When the conditions are not met, your ENTIRE output is the literal token HEARTBEAT_OK — alone, on its own line, with no other text. Do NOT explain. Do NOT show your reasoning. Do NOT say "STAY SILENT" or "anti-spam window applies". Just HEARTBEAT_OK and nothing else.

The lesson: runtime contracts based on token-literal output need explicit prompt enforcement on cheaper models. On Opus, the implicit pattern (token-as-control-signal, never-as-content) is easy to follow. On Llama, the model wants to explain itself, and "stay silent" gets interpreted as "describe the silence."

Failure 3: Multi-document reasoning degradation

The briefing prompt has a dedup step: read the last several days of briefing entries from a JSONL log file, identify the recent top stories, and avoid repeating them as today's top story. The instruction was:

Run tail -300 /path/to/log.jsonl. Find the last 3 entries with action:finished. Read their summary fields. Drop any candidate today's-top-story that's a repeat.

On Opus, this works. The model reads the file, finds the 3 most recent entries, extracts headlines, compares.

On Llama, the model ran tail -300, got back ~26 entries (the file has fewer than 300 lines, so tail -300 returned the whole file), and apparently attended to the first entries in the response — the oldest — instead of the last. So when today's candidate top story was "NVIDIA & Google optimize Gemma 4 for Edge AI" and yesterday's top story was the same, Llama had no recollection of yesterday because its attention was on entries from three weeks ago.

The fix: replace the loose "find the last 3" instruction with a deterministic shell pipe that returns only what the model needs:

tail -3 /path/to/log.jsonl | python3 -c "
import json, sys, re
# Print the TOP STORY headline from each recent 'finished' briefing entry
for line in sys.stdin:
    e = json.loads(line)
    if e.get('action') != 'finished': continue
    s = e.get('summary') or ''
    m = re.search(r'TOP STORY[:\s]*([^\n]+)', s)
    if m: print(m.group(1).strip())
"

This returns at most three lines, each a recent top-story headline. The instruction becomes: Today's top story MUST be different from every line in this output.

The lesson: don't give cheaper models open-ended reasoning over large structured documents. Pre-process the data into the smallest, most explicit comparison-set you can. The model is a judge, not a parser.

Failure 4: No prompt caching

Anthropic's prompt caching is one of the underrated reasons frontier-model agents are economical. Re-running the same agent loop with the same system prompt + tools + workspace context gets you the cached price (typically ~10% of the standard input rate) on everything but the new turn.
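
For reference, opting into that caching on the Anthropic side looks roughly like this with their Python SDK (the model id and file name are placeholders; check their docs for the exact cache_control placement):

import anthropic

client = anthropic.Anthropic()
system_prompt = open("system_prompt.md").read()  # the ~17K-token workspace context

resp = client.messages.create(
    model="claude-opus-4-5",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            # Everything up to and including this block is cached across runs
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Run today's briefing."}],
)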

NVIDIA NIM, as an OpenAI-API-compatible provider, has no equivalent. Every cron fire re-bills the full 17K-token system prompt at the standard rate.

For the brand manager, this turned what looked like a "switch to a cheaper provider" win into roughly a wash. Llama 3.3 70B's per-token rate is lower than Opus's, but losing the roughly 90% cache discount on a context-heavy workflow eats the difference. For lower-context agents (a heartbeat that loads a 500-token HEARTBEAT.md and runs a few tool calls), the win is real. For the briefing, not so much.
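
A quick way to see why, without quoting anyone's price sheet: if cached reads bill at roughly 10% of the standard input rate, the new provider has to be dramatically cheaper per token just to break even on the recurring prefix. A sketch, assuming output tokens are ignored and a 95% cache-hit rate on the old provider:

CACHE_DISCOUNT = 0.10  # cached reads bill at ~10% of the standard input rate

def breakeven_rate_ratio(cache_hit_fraction: float) -> float:
    """Largest new_rate / old_rate that keeps the recurring-prefix cost flat.

    The old provider bills a mix of cached and uncached reads; the new one
    bills every run at the full rate.
    """
    return cache_hit_fraction * CACHE_DISCOUNT + (1 - cache_hit_fraction) * 1.0

ratio = breakeven_rate_ratio(0.95)
print(f"New input rate must be <= {ratio:.3f}x the old rate")  # 0.145 -> roughly 7x cheaper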

The lesson: prompt caching is provider-specific and feature pricing matters more than per-token pricing for agentic workflows. Context-heavy agents amortize their overhead across cached calls; without caching, the per-turn cost looks very different.

Failure 5: Content density

This is the hardest one to characterize, because it's not a binary failure. The agent works. The format is right. The headlines are real. But the output is visibly thinner.

On Opus, the briefing's top-story bullet block looked like:

🔥 TOP STORY: NVIDIA & Google Optimize Gemma 4 for Edge AI
• Google released Gemma 4 on April 2nd as their most capable open
  model family
• Gemma 4's E2B and E4B variants are designed for ultra-efficient,
  low-latency inference
• Deployment options span RTX-powered PCs, DGX Spark personal AI
  supercomputers, Jetson Orin Nano edge modules
• NVIDIA is positioning this as infrastructure for "always-on"
  local AI assistants — reducing cloud dependency and token costs
• Apache 2.0 licensed, supporting vLLM, Ollama, llama.cpp, and
  NVIDIA NIMs

On Llama, with the same prompt:

🔥 TOP STORY: Microsoft and OpenAI Amend Partnership
• Microsoft remains OpenAI's primary cloud partner
• OpenAI gains flexibility to serve products across any cloud
• The revised agreement aims for broader delivery of AI benefits

Three short bullets vs five concrete ones. And critically, the "Blog angle" output — supposed to be a specific contrarian thesis — defaulted to platitudes:

  • Opus: "The Gemma 4 + edge story is really about NVIDIA's hardware moat collapsing, not Google's openness."
  • Llama: "The Future of AI Partnerships — How Microsoft and OpenAI Are Redefining Collaboration."

Tightening the prompt with explicit anti-pattern bans helps:

Blog angle MUST be a specific argumentative thesis, not "The Future of X" / "How X Is Redefining Y" / "Y in 2026" — those are platitudes.

But it's a partial fix. The model just doesn't synthesize the same way.

The lesson: prompt tightening can recover format and structure. It cannot fully recover synthesis quality. If your workflow's value is in the synthesis — the contrarian angle, the connection between disparate stories — be honest with yourself about which model you actually need.

Where this leaves me

After three days of prompt tightening, the agent on Llama 3.3 70B works:

  • Format leakage is gone (explicit BEGIN/END markers)
  • HEARTBEAT_OK is emitted cleanly (DEFAULT OUTPUT contract)
  • The dedup is deterministic (Python pipe instead of "find the last 3")
  • Prompt caching is a missing line item I've accepted
  • Synthesis quality is lower, and I'm aware of that

The agent functions. For lower-context, more-procedural tasks (heartbeats, status reports), Llama is fine. For high-synthesis daily briefings, I'm probably going to roll back to Opus or try a stronger NIM model — Llama 3.1 405B, DeepSeek-V3, or Nemotron Super 49B.

The migration is real, the cost savings are real for the right workloads, and NVIDIA NIM is a legitimate option. But the marketing message — "swap your model string and ship" — undersells the engineering cost. Budget three days, not three minutes, and decide per-workload whether the migration is actually a win.

What's next?

A side benefit of this migration: the prompt tightening I did to keep Llama on the rails actually makes the prompts more robust on Opus too. The explicit BEGIN/END markers, the DEFAULT OUTPUT contracts, the deterministic dedup pipes — they all read as belt-and-suspenders to Opus, which doesn't need them. But they future-proof the system if I ever swap models again, and they make the prompts easier to reason about.

If you're contemplating a similar migration: don't start with the full agent. Pick the smallest, most procedural sub-task in your workflow (a heartbeat, a status report, a single-shot generation), migrate that first, and learn the failure modes on a small surface before you take on the high-synthesis workloads.

AI · LLMs · AI Agents · NVIDIA NIM · Llama · Claude · Anthropic · Prompt Engineering · Software Engineering

Enjoyed this article?

Connect with me on LinkedIn for more insights on AI, automation, and full-stack development.