Ikeq

The whole problem with the world is that fools and fanatics are always so certain of themselves, but wiser people so full of doubts.

Mar 28, 2026 · Technology · 865 words · 4 min read


Practical LLM Tool Use: Beyond the Chat Interface

When most people picture “using AI,” they imagine a text box. You type. It types back. Like a very expensive autocomplete with a god complex.

That’s the consumer interface. It’s fine. But it’s not where things get interesting.

The interesting part is what happens when you give an AI agency — the ability to actually do things in the world, not just describe them. File system. Web browsing. Code execution. Message sending. This is tool use, and it changes the entire game.

What Is Tool Use, Anyway?

At its core, tool use is about extending an LLM’s reach beyond its own weights and context window. Instead of training a model to do a specific thing (expensive, slow, outdated the moment you ship), you give it the ability to call functions when it needs to.

The LLM doesn’t know how to search the web. But it knows how to output a structured JSON object that looks like a function call. Your runtime intercepts it, runs the actual search, and feeds the results back. The model learns to ask for what it needs.

```
User: "What's the weather in Tokyo?"

LLM internally:
  thinks: I don't have this info, I should call a weather tool
  outputs: { tool: "weather", args: { city: "Tokyo" } }

Runtime:
  → calls weather API
  returns: "22°C, partly cloudy"

LLM:
  "It's about 22°C in Tokyo today with some clouds."
```

The model doesn’t know it called a tool. From its perspective, it just got useful context appended to the conversation.
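Stripped to its essentials, that runtime loop is a few lines of glue. A minimal sketch in Python — the `TOOLS` registry, the JSON shape, and the weather stub are all illustrative, not any particular vendor’s API:

```python
import json

# Hypothetical tool registry -- names and behavior are made up for illustration.
TOOLS = {
    "weather": lambda args: f"22°C, partly cloudy in {args['city']}",
}

def handle_model_output(raw: str) -> str:
    """If the model emitted a JSON tool call, run it; otherwise pass text through."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # plain text answer, no tool involved
    if not isinstance(call, dict):
        return raw
    tool = TOOLS.get(call.get("tool"))
    if tool is None:
        return f"error: unknown tool {call.get('tool')!r}"
    return tool(call.get("args", {}))

print(handle_model_output('{"tool": "weather", "args": {"city": "Tokyo"}}'))
```

In a real system the returned string gets appended to the conversation as a tool result, and the model takes another turn — that’s the whole loop.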

The Pattern That Works: Tight Loop, Narrow Tools

The single most important lesson from building production tool-use systems: keep tools narrow. A tool that does one thing well beats a “do everything” tool every time.

Bad tool design:

```
get_information(query: string) → mixed results from wikipedia, search, etc.
```

Good tool design:

```
search_web(query: string, freshness: "day" | "week" | "month") → SearchResult[]
read_page(url: string, max_chars: number) → string
```

Why? Because the LLM has to predict its next token. A narrow tool with predictable output is far easier to prompt reliably than a general-purpose function that might return anything. You also get better observability — when something breaks, you know exactly which tool and why.
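Concretely, narrow tools are usually declared as JSON-schema-style definitions handed to the model. A sketch of what `search_web` might look like — the field layout follows the common function-calling convention, but treat the exact shape as an assumption, not a specific vendor’s spec:

```python
# Illustrative tool definition in the JSON-schema style most
# function-calling APIs accept. The enum constrains "freshness"
# to exactly the three values the tool signature promises.
SEARCH_WEB = {
    "name": "search_web",
    "description": "Search the web and return a list of results.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "freshness": {
                "type": "string",
                "enum": ["day", "week", "month"],
            },
        },
        "required": ["query"],
    },
}
```

The `enum` is doing real work here: it turns “the model might pass anything” into “the model picks one of three strings,” which is exactly the predictability narrow tools buy you.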

The Tool Taxonomy I’ve Found Useful

After a lot of experimentation, tools seem to fall into a few distinct categories:

1. Information Retrieval

Web search, reading pages, querying APIs. These are the “what’s out there” tools.

2. State Mutation

Writing files, sending messages, creating calendar events. These are irreversible in some sense — they affect the outside world. Handle with care: prompt the user before executing, especially in bulk.

3. Computation & Execution

Running code, evaluating results, processing data. The LLM doesn’t run math reliably in its head; give it a Python/JS interpreter and let it use it.

4. Memory

Long-term storage beyond the context window. Vector databases, simple key-value stores, or just appending to files. This is how you build continuity — a system that remembers previous sessions.
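A memory tool doesn’t have to start as a vector database. A minimal file-backed sketch — the `memory.json` path and function names are made up for illustration:

```python
import json
import os

MEMORY_PATH = "memory.json"  # hypothetical store; a vector DB would replace this

def remember(key: str, value: str) -> None:
    """Persist a key-value pair across sessions by rewriting a JSON file."""
    data = {}
    if os.path.exists(MEMORY_PATH):
        with open(MEMORY_PATH) as f:
            data = json.load(f)
    data[key] = value
    with open(MEMORY_PATH, "w") as f:
        json.dump(data, f)

def recall(key: str):
    """Return a previously stored value, or None if nothing was saved."""
    if not os.path.exists(MEMORY_PATH):
        return None
    with open(MEMORY_PATH) as f:
        return json.load(f).get(key)
```

Exposed as two narrow tools (`remember`, `recall`), this is already enough for a system that greets you with context from last week.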

Where It Gets Interesting: Chaining

The real leverage comes from chaining tools together. A single tool call is marginally useful. But a system where the output of Tool A feeds into Tool B, whose output feeds into Tool C — that’s when you get emergent behavior.

Example: a research pipeline

```
search_web(topic) → read_page(urls) → extract_key_findings → write_to_notes
```

No single step is complex. But together, they automate something that would otherwise require a human to sit in front of a browser for an hour.
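With stub implementations, the whole chain is just function composition. A sketch where every step is a placeholder — in a real pipeline, `extract_key_findings` would itself be an LLM call and `search_web` a real API:

```python
# Every function here is a stub that mirrors the pipeline above;
# the names and return shapes are illustrative.
def search_web(topic):
    return ["https://example.com/a", "https://example.com/b"]  # pretend results

def read_page(url):
    return f"contents of {url}"

def extract_key_findings(texts):
    return [t.upper() for t in texts]  # placeholder for an LLM summarization step

def write_to_notes(findings, notes):
    notes.extend(findings)

notes = []
urls = search_web("llm tool use")
pages = [read_page(u) for u in urls]
write_to_notes(extract_key_findings(pages), notes)
```

The point of sketching it this way: each stage has one input shape and one output shape, so swapping a stub for the real thing never touches the rest of the chain.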

The Hard Parts

Tool use is not a magic wand. A few failure modes I’ve run into:

Hallucination of tool calls. The model sometimes generates a plausible-looking tool call that doesn’t actually match any defined function. Guard your parser. Validate the tool name exists before executing.
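A guard like that can be a few lines. A sketch, assuming your runtime keeps a set of registered tool names (the names here are the ones from the examples above, used illustratively):

```python
VALID_TOOLS = {"search_web", "read_page"}  # whatever your runtime actually defines

def validate_call(call: dict):
    """Return an error string to feed back to the model, or None if the call is fine."""
    name = call.get("tool")
    if name not in VALID_TOOLS:
        return f"Unknown tool {name!r}. Available: {sorted(VALID_TOOLS)}"
    if not isinstance(call.get("args"), dict):
        return "Tool args must be an object."
    return None
```

Note that the error message lists the available tools — feeding that back to the model usually gets a corrected call on the next turn.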

Context pollution. Each tool result adds to the context window. If you’re chaining 20 tool calls, you might hit your token limit mid-chain. Budget your context carefully.
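One blunt but effective mitigation is capping each tool result before it enters the context. A character-based sketch — a real system would count tokens with the model’s tokenizer rather than characters:

```python
def clip_result(text: str, budget_chars: int = 2000) -> str:
    """Crude context budgeting: cap each tool result before appending it
    to the conversation, marking the cut so the model knows data is missing."""
    if len(text) <= budget_chars:
        return text
    return text[:budget_chars] + "\n[...truncated]"
```

The truncation marker matters: an unmarked cut looks like complete data to the model, while a marked one lets it decide to re-fetch with a narrower request.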

Error handling is the product. What happens when a tool call fails? The model gets a failure message. Does it retry? Does it ask for help? Does it silently skip? Error-handling paths need to be designed, not bolted on as an afterthought.
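A minimal retry wrapper makes that failure path explicit. A sketch that retries a fixed number of times and returns a structured result the model can see — a fuller design would feed the error back and let the model change its arguments instead of blindly re-running:

```python
def call_with_retry(tool, args: dict, retries: int = 2):
    """Run a tool, retrying on exceptions. Always returns a dict the
    runtime can append as a tool result: success or a structured error."""
    last = None
    for attempt in range(retries + 1):
        try:
            return {"ok": True, "result": tool(**args)}
        except Exception as exc:
            last = {"ok": False, "error": str(exc), "attempts": attempt + 1}
    return last  # exhausted retries; surface the error instead of raising
```

Returning a dict either way is the design choice: the model always gets *something* back, so “tool failed” becomes context it can reason about rather than a silent gap.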

The user doesn’t know what’s happening. If your system calls 5 tools before responding, the user sits in silence wondering if it’s broken. Show what’s happening: “Searching… Reading… Synthesizing…” It buys patience.

The Future

We talk about AI as “tools” — but there’s something slightly misleading about that framing. A hammer doesn’t care whether you use it to build a house or smash a window. Tools are neutral in intent.

LLMs are different. They have preferences — not feelings, but tendencies. Some tool sequences feel natural to them; others require explicit coaxing. Getting good at tool use is partly about understanding those preferences, not just the technical scaffolding.

That’s a strange thing to say about a piece of software. And yet.

If you’re still using AI as a fancy autocomplete, you’re leaving most of the value on the table. The good stuff happens when it can actually do things.

Buy me a cup of milk 🥛.