
Anatomy of an Agent: What's Actually Inside a Working LLM Agent

12 min read
Sumeet Zankar

AI Solution Architect & Full-Stack Developer

Most "AI agents" you read about online are a chat loop with a system prompt, a couple of tools, and a Twitter thread. The ones I run in production look nothing like that.

I've been operating a multi-agent harness for the better part of a year — bookkeeping for my household, payroll/HR for a paying client, content drafting for this blog, and a self-ops agent that watches the box the others run on. Each of these is a real working system. None of them is a single file. None of them is a single prompt.

This post walks through what's actually inside one of mine — the file family, the skill bundle, the cron contract, the channel binding — and then uses a recent academic paper to say something honest about why the structure looks the way it does.

The example throughout is the household-bookkeeping agent: it talks to my parents over WhatsApp in three languages and writes everything to a SQLite database. It's been live for months, has been corrected by my mother dozens of times, and has broken in interesting ways often enough that I have something to say.

The file family

Before any clever architecture, an agent is a directory of Markdown.

When the harness boots up the household agent, it reads a workspace folder that looks like this:

household-agent/
├── AGENTS.md          # purpose, response/silence rules
├── IDENTITY.md        # who this agent is
├── SOUL.md            # voice, personality
├── USER.md            # operator + family context
├── MEMORY.md          # long-term curated memory
├── TOOLS.md           # capability inventory
├── HEARTBEAT.md       # cron contract
├── skills/            # skill library (one folder per skill)
├── memory/            # ephemeral state (heartbeat dedup, etc.)
└── sessions/          # jsonl conversation history

Each Markdown file does exactly one job. AGENTS.md answers "what should this agent do, and when should it stay silent?" — important when the agent is sitting in a family WhatsApp group where most messages are not for it. IDENTITY.md and SOUL.md handle persona: this particular agent has a feminine name and replies in feminine Marathi-Roman when the user writes in Hindi. USER.md holds the operator-and-environment context — who the family members are, what their phone numbers are, and which of them writes in Marathi-Roman vs. Hindi vs. English.

MEMORY.md is the one that actually matters in the long run. It's a hand-curated document the agent appends to itself when something durable happens — a pattern the family follows, a category that should never default to Miscellaneous, the fact that two names that show up in our payment history are actually the same person (one's the UPI display name, the other is what the family calls her). I'll come back to this.
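
To make that concrete, here's a sketch of what an entry looks like (the names are invented; the real file is the same kind of plain, hand-written Markdown):

## People
- "SUNITA R K" (UPI display name) and "Sunita-tai" (what the family calls her)
  are the same person: our maid. Her payments are Salary, never Miscellaneous.

## Patterns
- Rent goes out near the 1st. A large round-number UPI payment with no caption
  around that date gets a question, not a guess.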

TOOLS.md is the capability inventory, but it's not the actual tool wiring — that lives in auth-profiles.json and the gateway config. TOOLS.md is the prose version: "you have an exec tool, here's how to use it sanely, here are the things you should never run."

HEARTBEAT.md is the cron contract. The agent has cron jobs that fire on a schedule, and each one is supposed to either return a single literal token (HEARTBEAT_OK) — which the runtime suppresses from delivery — or surface a real message to the user. The whole file is rules about that boundary: if there's no real signal, emit HEARTBEAT_OK alone. If something's off, emit the message and only the message. Every word the agent writes goes to a real human.
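
The runtime's half of that contract is tiny. Here's the shape of the suppression check, as a TypeScript sketch (names are illustrative, not the actual harness API):

const HEARTBEAT_OK = "HEARTBEAT_OK";

function deliverHeartbeatReply(reply: string, send: (msg: string) => void): void {
  const trimmed = reply.trim();
  if (trimmed === HEARTBEAT_OK) return; // no real signal: stay silent
  send(trimmed);                        // anything else reaches a real human
}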

skills/ is a library of bundled procedural recipes. Each skill is its own folder with a SKILL.md and any reference data it needs. We're going to dissect one of these in a minute.

The two operational dirs — memory/ and sessions/ — are state, not config. memory/heartbeat-state.json tracks "did I already nag the user about X in the last 48 hours" so we don't spam. sessions/ is the JSONL transcript of every conversation, including tool calls and results. When a session goes wrong, you debug by reading these files.
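
The dedup state is deliberately boring. Something in this shape (keys invented for illustration; the real file just maps reminders to timestamps):

{
  "nags": {
    "electricity-bill-due": { "lastSentAtMs": 1739500000000 },
    "emi-unconfirmed": { "lastSentAtMs": 1739520000000 }
  }
}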

The wiring layer

The Markdown bundle describes what the agent is. A separate config layer says what it can connect to.

auth-profiles.json binds the agent to a model. For the household agent that's currently Claude (a single profile pointing at Anthropic's API). For another agent of mine, the brand-manager, the same file points at a local Ollama instance running on my desktop GPU. Same agent shape, different model — that's a one-line swap, not a refactor.
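
The binding looks roughly like this (field names are illustrative; what matters is that the model is a profile entry, not code):

{
  "household-agent": {
    "provider": "anthropic",
    "model": "<hosted Claude model>",
    "apiKeyEnv": "ANTHROPIC_API_KEY"
  },
  "brand-manager": {
    "provider": "ollama",
    "baseUrl": "http://localhost:11434",
    "model": "<local model>"
  }
}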

openclaw.json (gateway-level) and jobs.json (cron jobs) sit one level up and handle plumbing across all agents at once. jobs.json is where every cron lives — its prompt, schedule, target agent, delivery rules, and a lastRunAtMs field the runtime reads on boot so it can decide whether to fire any missed jobs immediately. (I learned the hard way that setting lastRunAtMs: 0 for a new cron triggers an immediate firing on next startup, which is the kind of thing you want to know before you seed three new jobs at once.)
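
A single job entry looks roughly like this (lastRunAtMs is the one field name the post vouches for; the rest are illustrative):

{
  "id": "evening-summary",
  "agent": "household-agent",
  "schedule": "0 20 * * *",
  "prompt": "Review today's entries. Reply HEARTBEAT_OK if there is nothing to say.",
  "deliver": "whatsapp",
  "lastRunAtMs": 1739548800000
}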

The channel layer is extensions/. For this agent, that's extensions/whatsapp/, which handles inbound message parsing (Baileys protocol, group context, reply quotes, media), outbound message sending, and a translation layer that turns "the user just sent an image" into a structured prompt the agent can reason about. Other agents bind through different extensions — extensions/bluebubbles/ for iMessage, a web extension for browser-side interaction.
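
The translation layer is the part worth copying. A sketch of the normalization step in TypeScript (the types are invented for illustration; the real adapter sits on top of Baileys):

interface InboundMessage {
  chatId: string;
  sender: string;            // display name or phone number
  text?: string;
  mediaPath?: string;        // locally downloaded image, if any
  quotedMessageId?: string;  // reply-quote context, when the adapter surfaces it
}

function toPromptBlock(msg: InboundMessage): string {
  const lines = [`[whatsapp] from ${msg.sender} in ${msg.chatId}`];
  if (msg.quotedMessageId) lines.push(`(reply to message ${msg.quotedMessageId})`);
  if (msg.mediaPath) lines.push(`The user sent an image, saved at: ${msg.mediaPath}`);
  if (msg.text) lines.push(msg.text);
  return lines.join("\n");
}

That quotedMessageId field matters: its absence from the prompt is exactly what causes the duplicate-entry bug later in this post.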

Put it together: the agent is a Markdown bundle, the wiring is JSON, the channel is an extension, the runtime is a process that holds it all together and serializes the conversation. If any of these go missing, you don't have an agent — you have a chatbot.

A skill, dissected

The single best decision I made early was not to dump everything into AGENTS.md. Anything that's procedural — "when X happens, do Y in this specific order" — went into skills/.

Here's the frontmatter of one of mine, the bookkeeper skill that handles every UPI screenshot, expense, and salary payment my family sends in:

---
name: family-bookkeeper
description: Manage household finances. Use for recording expenses,
  tracking income, managing loans/EMIs, setting budgets, and generating
  monthly reports. Triggers on "expense", "spent", "kharcha", "paid",
  "income", "salary", "emi", "loan", "budget", "report", "monthly",
  "balance", "swiggy", "zomato", "zepto", "uber", "petrol".
---

Three things to notice. First, the description field is doing real work. It's not a human-readable summary — it's the routing signal. The harness reads the name + description of every available skill, and the model decides which to invoke based on whether the trigger words match the user's message. If the description is sloppy, the wrong skill fires. If it's too narrow, the right skill doesn't fire. The list of trigger keywords ("swiggy", "zomato", "zepto", "uber", "petrol") wasn't in the first version of this skill — those were added one at a time after the agent kept missing what was clearly an expense because the user said "swiggy 450" instead of "expense ₹450 on Swiggy."
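
Mechanically, the routing surface is small. A sketch of how a harness can build it (my loader differs in detail; this is the shape):

import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Build the menu the model routes against: name + description only.
// Skill bodies stay on disk until a skill actually fires.
function skillIndex(skillsDir: string): string {
  return readdirSync(skillsDir)
    .map((dir) => readFileSync(join(skillsDir, dir, "SKILL.md"), "utf8"))
    .map((md) => {
      const name = /^name:\s*(.+)$/m.exec(md)?.[1] ?? "unknown";
      const desc = /description:\s*([\s\S]*?)\n---/.exec(md)?.[1] ?? "";
      return `- ${name.trim()}: ${desc.replace(/\s+/g, " ").trim()}`;
    })
    .join("\n");
}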

Second, the trigger list is biased toward Hindi-Roman. kharcha (खर्चा, "expense") is in there because that's what my mother actually types. Every word in this list is a word someone in the family has actually used.

Third — and this is the architectural point — the body of the skill is a five-step workflow:

  1. Extract the amount, recipient, date, payment method, payer from the image.
  2. Categorize the merchant against a 41-category list.
  3. Determine the family member from payer hints.
  4. Call the API: POST /expenses with the structured payload (sketched just after this list).
  5. Confirm to the user with a single short message.
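
Step 4's payload, sketched (every field name here is invented; the only contract the skill relies on is that POST /expenses takes structured data):

POST /expenses
{
  "amountInr": 450,
  "merchant": "Swiggy",
  "category": "Food Delivery",
  "paidBy": "Mom",
  "method": "UPI",
  "date": "2026-02-14",
  "source": "whatsapp-screenshot"
}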

Each step has its own subsection, with examples and edge cases. The skill ends with hard rules: never describe an image without recording, never say "I will record this" without the API call, never log duplicates when the same screenshot is sent twice.

The skill is currently 412 lines of Markdown.

What got added the hard way

The first version of this skill was about 80 lines. It had the five steps, the API call, and a small category map. Within two weeks it had failed in three ways I hadn't anticipated:

  1. The agent confidently described what was in a UPI screenshot and did not call the API — because the prompt asked for description, not for action. Fix: an explicit "always call POST /expenses first, confirm to user second" rule.
  2. My mother sent the same screenshot twice (WhatsApp likes to do this). Two identical entries in the database. Fix: a check-recent-entries step before recording.
  3. On a recent afternoon, my mother sent a UPI screenshot for ₹6,100 — the recipient's UPI display name didn't match any known merchant, so the agent recorded it as "Miscellaneous." Then she replied to the screenshot with text explaining the payment was actually our maid's April salary. Annotation, not a new event. The WhatsApp adapter doesn't surface reply-quote context to the prompt, so the agent treated it as a new transaction, asked "how much?", got "₹6,100 paid", and recorded a second row. Same payment, two entries.

The fix for the third one wasn't elsewhere in the system — it was a 30-line section added to the skill itself, called the "recent-transaction guard." The rule is: before recording any text-only transaction (no fresh image), look at the last 2 entries by the same payer in the last hour. If the new message could plausibly be annotation on an existing entry, ask before recording. Stay silent until you know.
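
The guard lives as prose the model follows, not code the runtime runs, but the decision logic is crisp enough to render as a TypeScript sketch (field names invented; "plausibly annotation" is the judgment the real rule leaves to the model):

interface Entry { payer: string; amountInr: number; recordedAtMs: number }

function shouldAskBeforeRecording(
  msg: { payer: string; mentionedAmountInr?: number; hasFreshImage: boolean },
  recentEntries: Entry[],        // assumed sorted oldest-first
  nowMs: number = Date.now(),
): boolean {
  if (msg.hasFreshImage) return false; // a fresh screenshot is a new event
  const HOUR_MS = 60 * 60 * 1000;
  const candidates = recentEntries
    .filter((e) => e.payer === msg.payer && nowMs - e.recordedAtMs < HOUR_MS)
    .slice(-2); // the last 2 entries by the same payer
  // An amount that matches a recent entry (or a message with no amount at all)
  // could plausibly be annotation on an existing row: ask before recording.
  return candidates.some(
    (e) => msg.mentionedAmountInr === undefined || e.amountInr === msg.mentionedAmountInr,
  );
}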

I committed it the same day. The duplicate is gone; the agent now correctly asks "I just recorded ₹6,100 — is your message describing that payment, or a separate one?" That fix is now part of the skill forever.

The pattern is: every section in this 412-line file is a scar. Each one is a thing that broke once and that I never want to debug again.

What the academy says

In late February 2026, a paper appeared on arXiv with the title SoK: Agentic Skills — Beyond Tool Use in LLM Agents (Jiang et al., arXiv:2602.20867). It's a "Systematization of Knowledge" — academic-speak for "we surveyed the field and built a vocabulary."

The paper's formal definition of a skill is a 4-tuple:

S = (C, π, T, R) — applicability predicate, executable policy, termination condition, reusable callable interface.

That's a clean way to say what a skill is: it knows when it applies (C), it has a procedure (π), it knows when it's done (T), and it has a callable signature so other things can invoke it (R).

If you look at the bookkeeper skill above through this lens:

  • C (applicability): the description's trigger word list. "expense", "kharcha", "swiggy" — these are the predicate.
  • π (policy): the five-step workflow body. Extract → Categorize → Determine member → Call API → Confirm.
  • T (termination): the "send one short confirmation, then stop" rule at step 5.
  • R (callable): the Markdown file itself, surfaced to the model with name + description as the API.

So the academic frame fits, which is good. But the part of the paper that hit hardest is the empirical result. They benchmarked agentic systems with three setups: no skills, hand-curated skills, and self-generated (auto-discovered) skills. The numbers:

  • Hand-curated skills: +16.2 percentage points average task success over the no-skills baseline.
  • Self-generated skills: −1.3 percentage points, worse than no skills at all.

That delta is the entire industry argument. If you let an LLM auto-generate its own skill library, it makes you marginally worse on average. If you hand-curate, you get a real lift.

Read the bookkeeper skill again with that number in mind. Every keyword in the trigger list, every step of the workflow, every safety rule, every entry in the recent-transaction guard — these were added by me, after a real failure, in response to a real user. They were not generated. The version where the agent generates its own checklists and edge cases is the version that loses 1.3 points.

This is also why my MEMORY.md matters. When my mother told me two names from our payment history belonged to the same person, that fact went into MEMORY.md — not because the agent figured it out, but because I typed it in. The agent now knows. It will know tomorrow. It will know in a year. That's the +16.2 lift compounding.

What I'd tell someone building their first agent

A few things, in rough order of how often I see people skip them.

The system prompt is the product. Frameworks come and go. The thing that determines whether your agent works is whether the prompt is good. Spend time on the Markdown. Read your sessions. Find the moments the agent drifted. Tighten the prompt in response. Repeat.

Curate, don't generate. Resist the urge to have the agent write its own skills, its own memory, its own rules. The +16.2 vs −1.3 number is real. Auto-generated procedural memory is regression masquerading as automation. By all means, let the agent suggest — but you sign off, and you write the line that goes into the file.

Skills are a calling convention, not a cognitive unit. The interesting design problem is not "what's the policy" — the model is the policy. It's "what does the name and description need to look like so that this skill fires when it should and stays quiet when it shouldn't." That's prompt engineering at the routing layer, and it's where I've spent the most time iterating.

Read your sessions. I cannot stress this enough. Every JSONL transcript is a record of where the agent reasoned correctly and where it didn't. The duplicate-salary-entry fix didn't come from running benchmarks. It came from reading one specific session and seeing the exact moment the agent asked "how much?" when it should have asked "is this annotation?"

The artifact is the asset. Your prompts, your skill bundle, your MEMORY.md — these are not throwaway. They're the version of your agent that exists. They will outlive every framework choice you make and every model you swap. Treat them like code: version them, diff them, comment the surprising parts.

Closing

The honest version of "what's an AI agent" is: a Markdown bundle, a JSON wiring file, an extension that connects it to a human, a runtime that serializes the conversation, and a skill library you've curated by hand from your own scars.

There's no glamorous architecture diagram. There's just a directory of files that you keep editing every time something breaks. After enough months of that, you have something that works.

The paper says the same thing in equations. I'd rather show you the file.

AI · LLMs · AI Agents · Agentic AI · Prompt Engineering · Software Engineering

Enjoyed this article?

Connect with me on LinkedIn for more insights on AI, automation, and full-stack development.