jan 2026 · 11 min read

building developer tools for llms

llms · mcp · developer tools

eighteen months ago, "llm tooling" meant an api wrapper with a retry loop and maybe a prompt template string. you'd call openai's completions endpoint, parse the freeform text response with a regex, and pray. that was the stack. the entire thing fit in one file.

today there are dedicated protocols for tool interoperability, structured output guarantees baked into the api layer, agent frameworks with their own state machines and memory systems, and an entire observability ecosystem just for tracing what an llm agent did and why. the tooling around the models has become more interesting than the models themselves. and i think that's where the real leverage point is now, not in the next parameter bump from 400b to 800b, but in the infrastructure that turns a raw model into something you can actually ship.

let me trace how we got here, because the timeline is compressed enough that it's easy to lose track.

early 2023 was the prompt engineering era. the state of the art was "write a better system prompt." people were sharing jailbreaks on twitter and the most sophisticated technique was chain-of-thought: just tell the model to think step by step. the tools were chatgpt and a text box. if you wanted to build something production-grade, you were mostly on your own.

mid 2023 brought the first wave of frameworks. langchain and llamaindex appeared and tried to be the react.js of llm applications. langchain in particular became the default, which was both good (it existed) and bad (it was deeply over-abstracted for what most people needed). the core idea was right though: you need a layer between your application and the raw model api.

june 2023 brought function calling, and this was genuinely the most important api change in the whole timeline. instead of the model outputting freeform text that you had to parse, it could output a structured json object describing a function it wanted you to invoke. i'll come back to why this distinction matters so much.

2024 was agents and structured outputs. openai shipped guaranteed json schema conformance in august 2024. anthropic and google followed. agent frameworks matured from "cool demo" to "thing you could maybe put in production if you squinted." the eval ecosystem exploded. and quietly, anthropic published the model context protocol spec.

2025 brought mcp adoption, a cambrian explosion of agent frameworks, and the beginning of real observability tooling. 2026 is shaping up to be about production-grade agent infrastructure, the boring plumbing that actually makes this stuff work reliably.

12,000+ public mcp servers indexed
6 major agent frameworks in production use
~$0.00 cost per 1k input tokens (gemini flash, cached)
2M tokens max context window (gemini)

the model doesn't call anything

this is the single most misunderstood concept in llm tooling, and getting it wrong leads to bad architecture decisions. when people say a model "calls a function" or "uses a tool," here is what actually happens: the model outputs a json object. that's it. the json object contains a function name and arguments. your runtime code, not the model, reads that json, executes the actual function, and feeds the result back into the model's context for the next turn.

the model never runs code. it never touches your database. it never makes an http request. it produces text, specifically structured text that describes what it wants to happen. the distinction matters because it means tools are not capabilities the model has. they're descriptions the model reasons about. the model reads a tool's name, its description, and its parameter schema, then decides whether and how to invoke it based on that text. a poorly described tool is a tool the model will misuse, regardless of how well-implemented the actual function is.

tool-call.ts
// what the model actually outputs (it's just text):
const response = {
  tool_calls: [{
    id: "call_abc123",
    function: {
      name: "query_database",
      // note: arguments arrive as a json-encoded string, not a parsed object
      arguments: "{\"sql\": \"SELECT * FROM users WHERE active = true LIMIT 10\"}"
    }
  }]
};

// your runtime does the actual work (parse the arguments first):
const toolCall = response.tool_calls[0];
const args = JSON.parse(toolCall.function.arguments);
const result = await executeTool(toolCall.function.name, args);

// then you feed the result back as a new message:
messages.push({
  role: "tool",
  tool_call_id: toolCall.id,
  content: JSON.stringify(result)
});

// and call the model again with the updated context
const finalResponse = await openai.chat.completions.create({ model: "gpt-4o", messages });

this architecture has a consequence that most tutorials gloss over: the tool execution loop is your code. you control what happens between the model's request and the model's next turn. you can validate the arguments, rate-limit calls, inject additional context, log everything, or reject the call entirely. the model is making suggestions. you're the one with the execution runtime. the best llm applications i've seen treat this boundary as the primary control surface, not the system prompt.
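here's a minimal sketch of that control surface. the registry shape, tool names, and validators are all invented for illustration; no real sdk is being quoted. the point is that validation and rejection happen in your runtime, before anything executes:

```typescript
// hypothetical tool registry: each tool carries its own validator and handler
type ToolDef = {
  validate: (args: Record<string, unknown>) => string | null; // error message, or null if ok
  run: (args: Record<string, unknown>) => unknown;
};

const tools: Record<string, ToolDef> = {
  query_database: {
    // reject anything that isn't a read-only select before it ever executes
    validate: (args) =>
      typeof args.sql === "string" && /^\s*select\b/i.test(args.sql)
        ? null
        : "only SELECT statements are allowed",
    run: (_args) => [{ id: 1, active: true }], // stubbed result for the sketch
  },
};

// the boundary between "model suggested a call" and "runtime executed it"
function executeToolCall(name: string, rawArgs: string): { ok: boolean; content: unknown } {
  const tool = tools[name];
  if (!tool) return { ok: false, content: `unknown tool: ${name}` };

  let args: Record<string, unknown>;
  try {
    args = JSON.parse(rawArgs);
  } catch {
    return { ok: false, content: "arguments were not valid json" };
  }

  const error = tool.validate(args);
  if (error) return { ok: false, content: error }; // feed the rejection back to the model

  return { ok: true, content: tool.run(args) };
}
```

the rejection messages go back into the model's context as tool results, which lets the model self-correct on the next turn instead of crashing the loop.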

mcp is a protocol, not a product

the model context protocol is the most consequential thing anthropic has shipped for the developer ecosystem, and most explanations of it are either too vague ("it's like usb for ai!") or too deep in the spec to be useful. here's the version that would have saved me a few hours.

mcp is a json-rpc 2.0 protocol. it defines a communication contract between three things: a host (the application running the llm, like claude desktop or your custom agent), a client (managed by the host, maintains a 1:1 connection to a server), and a server (a process that exposes tools, resources, and prompts). the server is where your actual functionality lives. it could be a filesystem accessor, a database connector, a github integration, or anything else you want an llm to interact with.
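concretely, a tool invocation on the wire is plain json-rpc. this is a sketch of the `tools/call` message shape as i understand it from the spec; the ids, tool name, and payloads are invented for illustration:

```typescript
// what the client sends to the server (json-rpc 2.0 request)
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "read_file",
    arguments: { path: "/tmp/notes.txt" },
  },
};

// what the server sends back (json-rpc 2.0 response)
const response = {
  jsonrpc: "2.0",
  id: 1, // matches the request id
  result: {
    content: [{ type: "text", text: "file contents here" }],
  },
};
```

nothing model-specific appears anywhere in the exchange, which is the whole trick: the same server works for any host that speaks the protocol.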

diagram 01 — mcp architecture
[diagram: the host application runs the llm and manages one mcp client per server. each client speaks json-rpc to exactly one mcp server. server A exposes filesystem tools (read_file, write_file) backed by local disk or a remote api; server B exposes database tools (query, insert) backed by postgres, sqlite, or any data store. each client ↔ server connection is 1:1; the host manages multiple clients.]
fig 01 — the host (your app) manages mcp clients, each connected to one server. servers expose tools via json-rpc. the llm never talks to servers directly; it reasons about tool descriptions and the runtime handles execution.

why does this matter? because before mcp, every integration was bespoke. if you wanted claude to query your database, you wrote a custom tool for claude. if you wanted gpt-4 to do the same thing, you wrote a different custom tool shaped for openai's function calling format. if you wanted to switch models, you rewrote your tools. mcp says: build the server once, describe your tools once, and any compliant host can connect to it. the "usb" analogy is actually pretty accurate. usb didn't make peripherals possible, it made them interchangeable.

the practical impact is already visible. cursor, windsurf, claude code, and other coding tools all support mcp servers. you can install a postgres mcp server and immediately give any of these tools the ability to query your database without writing a single line of integration code. we went from "two weeks to build a custom integration" to "npm install and a config file."

structured outputs changed everything quietly

there's a thing that happened in 2024 that didn't get the attention it deserved. openai, anthropic, and google all shipped some form of guaranteed structured output. meaning: you give the model a json schema, and the model's output is guaranteed to conform to that schema. not "usually conforms" or "conforms if you ask nicely in the prompt." guaranteed, enforced at the decoding level.

before this, the standard approach was: ask the model to output json in the system prompt, hope it does, and wrap the whole thing in a try-catch. production systems had elaborate retry logic for malformed json. i've seen codebases with 200+ lines dedicated to "parse the llm output and handle the seventeen ways it might be broken." all of that evaporated overnight.

but the bigger impact was second-order. guaranteed structured output is what makes reliable function calling possible. if the model can't guarantee valid json, it can't guarantee valid function arguments, which means your tool execution layer needs to handle arbitrary malformed input. once outputs are guaranteed, you can trust the contract between the model and your tools. you can build real systems on top of that contract instead of defensive workarounds.
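with openai's api, that contract is expressed through the `response_format` field. the schema and field names below are illustrative, and the exact request shape may drift, so check the current api reference:

```typescript
// a json schema the output must conform to, enforced at decode time
const extractionSchema = {
  name: "support_ticket",
  strict: true,
  schema: {
    type: "object",
    properties: {
      severity: { type: "string", enum: ["low", "medium", "high"] },
      summary: { type: "string" },
    },
    required: ["severity", "summary"],
    additionalProperties: false,
  },
};

// the request body: response_format switches on constrained decoding,
// so the reply is guaranteed to parse and to match the schema
const requestBody = {
  model: "gpt-4o",
  messages: [{ role: "user", content: "the login page is down for everyone" }],
  response_format: { type: "json_schema", json_schema: extractionSchema },
};
```

the try-catch-and-retry scaffolding disappears because `JSON.parse` on the reply can no longer fail.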

↳ the real shift

the move from "parse freeform text" to "guaranteed schema conformance" is the single biggest reliability improvement in llm applications. it doesn't get talked about much because it's boring infrastructure. but it's the foundation everything else is built on. function calling, tool use, agent loops, mcp — all of it relies on the model producing valid structured output.

the technique behind this is called constrained decoding. at each token generation step, the model's output distribution is masked so that only tokens which would keep the output valid according to the schema are considered. it's not the model "choosing" to output valid json. it's the runtime making invalid json impossible. the model's probability distribution is filtered before sampling. this means you can give a model a union type schema with five variants and it will always produce exactly one valid variant. you can enforce enum values, required fields, numeric ranges. the schema becomes a contract, not a suggestion.
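here's a toy illustration of that masking step, shrunk down to single-character tokens and an enum constraint. real implementations work over the model's full token vocabulary with a grammar compiled from the schema, but the mechanism is the same:

```typescript
// the enum values the "schema" permits
const allowed = ["red", "green", "blue"];

// which characters keep the output a valid prefix of some allowed value
function validNextChars(prefix: string): string[] {
  const chars = new Set<string>();
  for (const value of allowed) {
    if (value.startsWith(prefix) && value.length > prefix.length) {
      chars.add(value[prefix.length]);
    }
  }
  return [...chars];
}

// greedy constrained decoding: the "model" scores characters, but the mask
// filters out anything that would make the output invalid before sampling
function constrainedDecode(score: (c: string) => number): string {
  let out = "";
  while (!allowed.includes(out)) {
    const candidates = validNextChars(out);
    candidates.sort((a, b) => score(b) - score(a));
    out += candidates[0]; // highest-scoring character that stays valid
  }
  return out;
}
```

a scorer that only loves the letter "g" can still never produce anything but "green": after the first character the mask forces every remaining step. that's the sense in which the schema is a contract rather than a suggestion.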

this matters for agent architectures because it's what makes the tool-call loop deterministic enough to build on. without it, every agent framework would need to be a fuzzy text-parsing mess. with it, you get clean typed interfaces between the model and your code.

agent frameworks: actual architectural differences

the landscape of agent frameworks looks chaotic from the outside. langgraph, crewai, autogen, vercel ai sdk, openai agents sdk, google adk — it feels like a new one appears every month. but they're actually making different architectural bets, and understanding the differences matters for choosing the right one.

langgraph (from the langchain team) models agents as state machines. you define nodes (functions that transform state), edges (transitions between nodes), and a state object that persists across steps. the graph is explicit and inspectable. you can draw the exact flow your agent will follow. this is good when you need deterministic control flow: "first search, then analyze, then draft, then review." the tradeoff is verbosity. a simple agent in langgraph requires defining a typed state class, node functions, edge logic, and a graph compilation step. it's a lot of ceremony for "call the model in a loop."
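the shape of the pattern, stripped of langgraph's actual api (this is a generic sketch of a node/edge/state graph, not langgraph code; the node names and state fields are made up):

```typescript
// typed state that persists across steps
type State = { query: string; findings: string[]; draft: string };
type NodeFn = (s: State) => State;

// nodes: functions that transform state
const nodes: Record<string, NodeFn> = {
  search: (s) => ({ ...s, findings: [...s.findings, `result for: ${s.query}`] }),
  analyze: (s) => ({ ...s, findings: s.findings.map((f) => f.toUpperCase()) }),
  draft: (s) => ({ ...s, draft: s.findings.join("; ") }),
};

// edges: explicit transitions; "done" is the terminal node
const edges: Record<string, string> = { search: "analyze", analyze: "draft", draft: "done" };

// the graph is inspectable: you can read the exact flow off the edges map
function runGraph(start: string, initial: State): State {
  let current = start;
  let state = initial;
  while (current !== "done") {
    state = nodes[current](state); // each node transforms the state
    current = edges[current]; // each edge names the next node
  }
  return state;
}
```

even this stripped-down version shows the tradeoff: the control flow is fully explicit and fully deterministic, at the cost of ceremony that a plain loop wouldn't need.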

crewai takes a different approach: role-based multi-agent collaboration. you define agents with roles ("researcher," "writer," "critic"), give each a backstory and goal, then define tasks and assign them. the agents coordinate through a manager agent or sequentially. the abstraction is designed for workflows where you want simulated specialization. the bet is that giving the model a persona produces better output for specific sub-tasks. the evidence for this is mixed but the framework is clean.

autogen (microsoft) models agents as participants in a conversation. the core primitive is a group chat where multiple agents take turns. the routing logic decides who speaks next based on context. it's designed for problems where back-and-forth refinement matters: code generation where one agent writes and another reviews, or research where one agent gathers evidence and another critiques it. the architecture is fundamentally conversational rather than procedural.

vercel ai sdk is doing something different from all of these. it's not really an "agent framework" in the traditional sense; it's a streaming ui toolkit that happens to support tool calling and multi-step generation. the core abstraction is streamUI: the model can output react components as part of its response. when a model calls a tool, instead of returning json, it can return a loading spinner, then replace it with a chart when the data arrives. the bet here is that agents are fundamentally ui products, and the framework should be optimized for that.

diagram 02 — agent architecture patterns compared
[diagram: three patterns side by side. single-agent loop (vercel ai sdk, openai sdk): user input feeds an llm with a tool execution loop that produces the response. state machine (langgraph): explicit nodes such as search, analyze, draft, escalate, output, connected by typed edges with conditional branching. multi-agent (crewai, autogen): researcher, writer, and critic agents share context and negotiate via conversation or task hand-off.]
fig 02 — the single-agent loop (left) is simple and works for most use cases. state machines (center) add explicit control flow. multi-agent (right) adds specialization through role-based agents that communicate. pick the simplest pattern that works.

here's my actual opinion on this: most applications should use the single-agent loop pattern. one model, a set of tools, looping until it's done. the multi-agent and state machine patterns are genuinely useful for specific problems, but they're dramatically over-applied. i've watched teams build elaborate multi-agent systems with a "planner" agent and an "executor" agent and a "reviewer" agent when a single model with good tools and a clear system prompt would have done the same job with a tenth of the complexity and half the token cost. the coordination overhead of multiple agents is real and usually not worth it unless you're dealing with fundamentally adversarial sub-tasks or need to enforce separation of concerns at the architectural level.

the exception is code generation workflows where having a separate "reviewer" agent that only sees the diff and checks for bugs genuinely catches errors the writing agent misses. that's a real multi-agent win. but it's the exception.
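the single-agent loop is small enough to sketch in full. the message shapes below follow the openai style from earlier in the post, with the actual api call and tool executor passed in as functions so the sketch stays self-contained:

```typescript
type Message = { role: string; content: string; tool_call_id?: string };
type ToolCall = { id: string; function: { name: string; arguments: string } };
type ModelTurn = { content: string; tool_calls: ToolCall[] };

// the whole agent: call the model, execute any requested tools, repeat
async function runAgent(
  callModel: (messages: Message[]) => Promise<ModelTurn>,
  executeTool: (name: string, args: string) => Promise<unknown>,
  messages: Message[],
  maxTurns = 10
): Promise<string> {
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await callModel(messages);
    if (reply.tool_calls.length === 0) return reply.content; // no tools requested: done

    messages.push({ role: "assistant", content: reply.content });
    for (const call of reply.tool_calls) {
      const result = await executeTool(call.function.name, call.function.arguments);
      messages.push({
        role: "tool",
        tool_call_id: call.id,
        content: JSON.stringify(result),
      });
    }
  }
  throw new Error("agent exceeded max turns"); // always bound the loop
}
```

twenty-odd lines, and it covers a surprising share of what the heavier frameworks do. the `maxTurns` bound is not optional in production: an unbounded tool loop is an unbounded bill.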

rag, retrieval augmented generation, deserves a mention here because it's one of the most deployed patterns and also one of the most poorly implemented. the basic idea is simple: instead of relying on the model's training data, you retrieve relevant documents from your own corpus and inject them into the context window alongside the user's question. the model answers based on your data, not its memorized data.

the naive implementation (embed everything into vectors, do a cosine similarity search, stuff the top 5 results into the prompt) fails in predictable ways. the embedding model might not capture the right notion of "relevance" for your domain. the chunks might be too big (wasting context) or too small (losing coherence). the top 5 most similar results might all say the same thing, giving you redundancy instead of coverage. and the model might ignore the retrieved context entirely if it conflicts with its pretraining.

what people call "advanced rag" is really just the set of fixes for these failures. re-ranking: use a cross-encoder model to re-score the retrieved results based on actual query-document relevance, not just embedding similarity. hybrid search: combine vector similarity with keyword matching (bm25) because sometimes the user is searching for a specific term and cosine similarity misses it. chunk optimization: experiment with chunk sizes, add overlap between chunks, include metadata like document titles and section headers. query decomposition: break complex questions into sub-queries and retrieve for each one separately. these aren't exotic techniques. they're table stakes for rag that actually works in production.
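hybrid search needs a way to merge the two ranked lists, and reciprocal rank fusion is the usual trick. a minimal sketch over document ids, with the conventional smoothing constant:

```typescript
// reciprocal rank fusion: merge rankings from vector search and bm25.
// a document scores 1/(k + rank) per list it appears in, so documents
// that rank well in either list float to the top; k dampens the
// influence of any single list (60 is the conventional default).
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```

a document ranked #3 by the embedding search and #1 by bm25 beats one ranked #1 in a single list and absent from the other, which is usually the behavior you want: agreement between retrievers is evidence.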

↳ the observability gap

you can't debug what you can't see. an agent makes 14 tool calls across 6 turns, spends $0.47 in api costs, takes 23 seconds, and returns a wrong answer. which step went wrong? was it the retrieval? the tool arguments? a hallucinated intermediate result? without agent-level observability, meaning traces that show every step, every tool call, every context window state, you're debugging by reading logs and guessing. tools like langsmith, helicone, braintrust, and arize phoenix exist because this problem is real and getting worse as agents get more autonomous. the trend here is clear: agent observability is becoming as important as application performance monitoring was for web services in the 2010s.

there's a vocabulary shift happening that reflects how the field is maturing. people don't say "prompt engineering" much anymore; the preferred term is "prompt programming," which is more honest about what you're doing: writing instructions for a non-deterministic system with control flow. "context window" is no longer just a spec number; it's a resource to manage, like memory in a systems programming context. you budget it. you decide what goes in and what gets evicted. "inference-time compute" is a first-class concept now, meaning you can spend more compute per query (via chain-of-thought, multiple passes, or tool loops) to get better answers, trading cost for quality at serving time rather than training time.
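budgeting the context window can start as simply as eviction by age. this is a deliberately naive sketch: real systems summarize evicted turns instead of dropping them, and count tokens with the actual tokenizer rather than a character heuristic:

```typescript
type Msg = { role: string; content: string };

// crude token estimate: roughly 4 characters per token for english text
const approxTokens = (m: Msg) => Math.ceil(m.content.length / 4);

// keep the system prompt pinned, evict the oldest turns until we fit
function fitToBudget(messages: Msg[], budget: number): Msg[] {
  const [system, ...turns] = messages;
  const kept = [...turns];
  let total = approxTokens(system) + kept.reduce((sum, m) => sum + approxTokens(m), 0);
  while (kept.length > 1 && total > budget) {
    total -= approxTokens(kept.shift()!); // oldest turn goes first
  }
  return [system, ...kept];
}
```

even this crude version makes the "context as a resource" framing concrete: something has to decide what stays and what gets evicted, and that decision is yours, not the model's.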

the terminology that matters most, though, is "tool schemas." the description and parameter schema you write for a tool is not documentation. it's the interface the model reasons about. i've seen tool call accuracy jump from 60% to 95% by rewriting a tool's description from "queries the database" to "executes a read-only sql select query against the users table. returns rows as json arrays. does not support mutations. use the filter parameter to add where clauses." the description is the api contract. treat it like one.
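in openai's function-calling format, the improved description from that example lives directly in the tool schema. the tool itself is hypothetical; the structure is the standard `type: "function"` shape:

```typescript
// the schema the model actually reasons about. every string in here is
// interface, not documentation: vague wording produces vague calls.
const queryUsersTool = {
  type: "function",
  function: {
    name: "query_users",
    description:
      "executes a read-only sql select query against the users table. " +
      "returns rows as json arrays. does not support mutations. " +
      "use the filter parameter to add where clauses.",
    parameters: {
      type: "object",
      properties: {
        filter: {
          type: "string",
          description: 'sql where clause body, e.g. "active = true"',
        },
        limit: {
          type: "integer",
          description: "maximum rows to return",
          minimum: 1,
          maximum: 100,
        },
      },
      required: ["filter"],
    },
  },
};
```

note that the parameter descriptions and the numeric bounds do double duty: the model reads them when deciding how to call the tool, and with strict structured outputs the runtime can enforce them at decode time.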

if you're building in this space right now, my honest read is that most of the value is in three layers: the tool execution runtime (how you manage the loop between model and tools), the context management layer (what goes into the model's context window and when), and observability (how you know what happened and why). the model itself is increasingly commoditized. gpt-4o, claude sonnet, gemini pro — they're all good enough for most applications. the differentiation is in what you build around them.

there's an irony here that i think about a lot. the companies that will capture the most value from llms are probably not the ones training the models. they're the ones building the infrastructure that makes the models useful: the tool runtimes, the agent frameworks, the observability platforms, the context management systems. most "ai companies" are actually infrastructure companies. the model is the engine, but nobody buys a car for the engine alone. they buy it for the steering, the brakes, the navigation, and the fact that it starts reliably every morning.

we're maybe 30% of the way through figuring out what good llm infrastructure looks like. the tools we're building today will look as primitive to the 2028 version of this ecosystem as curl-based api calls look to us now. but the problems we're solving, how to give models reliable access to the outside world, how to guarantee structured communication between systems, how to observe and debug non-deterministic multi-step processes, those problems are permanent. the specific tools will change. the categories won't.

the question i keep coming back to: if the model is the commoditized layer and the infrastructure is the differentiated layer, why is everyone still obsessed with benchmarks?