Not long ago I wrote about different approaches to semantic search. In this post I want to highlight one of the most common—and surprisingly easy—ways to burn money in a Retrieval-Augmented Generation (RAG) pipeline: sending too much data to the model.
A minimal Agents SDK setup
Here is a simplified Python snippet that wires a semantic search agent to a retrieval function. There’s nothing
magical about it—the key is the tools list. In this example, graph_rag_search is the tool
that “fetches” relevant rows/chunks from your database, and the LLM then turns that context into the final answer.
from agents import Agent, ModelSettings
from openai.types.shared import Reasoning

search_agent = Agent(
    name="Semantic search agent",
    instructions=_load_system_prompt(),   # system prompt loaded from disk/config
    input_guardrails=[main_guardrail],    # moderation / policy checks on the user input
    model=AI_MODEL_TYPE,                  # model name kept in config
    tools=[
        graph_rag_search,                 # the retrieval tool the LLM can call
    ],
    model_settings=ModelSettings(
        store=True,
        reasoning=Reasoning(
            effort=AI_EFFORT,
            summary="auto",
        ),
    ),
)
This is exactly where developers often get burned: the retrieval tool may return a payload that is far too large. Even if the model doesn’t end up “using” every token, those tokens still count toward your bill.
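Before tuning anything, it helps to see how big that payload actually is. Here is a rough sketch, not taken from the original pipeline, that estimates the token count of a tool result before it is handed to the model; tiktoken and the o200k_base encoding are assumptions, and any tokenizer-based estimate will do.

import json

import tiktoken  # assumption: tiktoken is available for a quick token estimate


def estimate_payload_tokens(rows: list[dict], encoding_name: str = "o200k_base") -> int:
    """Rough token count of the JSON-serialized tool payload."""
    enc = tiktoken.get_encoding(encoding_name)
    payload = json.dumps(rows, default=str)
    return len(enc.encode(payload))


# Example: warn when a retrieval result is suspiciously large.
# tokens = estimate_payload_tokens(retrieved_rows)
# if tokens > 8_000:
#     logging.warning("Retrieval payload is %d tokens before pruning", tokens)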
Prune the payload before it reaches the model
A simple and effective pattern is to reduce the retrieved records before you pass them to the LLM. Remove fields the model does not need (embeddings, full-text vectors, large metadata blobs) and hard-cap the chunk length.
from typing import Any


def _prune_chunk_payload(row: dict[str, Any], *, max_chunk_chars: int) -> dict[str, Any]:
    """Drop heavy fields and truncate the chunk text before it reaches the LLM."""
    cleaned = dict(row)
    cleaned.pop("embedding", None)   # the model never needs raw vectors
    cleaned.pop("fts_vector", None)  # nor full-text search vectors
    chunk_value = cleaned.get("chunk")
    if (
        isinstance(chunk_value, str)
        and max_chunk_chars > 0
        and len(chunk_value) > max_chunk_chars
    ):
        cleaned["chunk"] = f"{chunk_value[:max_chunk_chars]}…"
    return cleaned
This single helper can save you a lot of money—especially when your retrieval layer returns verbose chunk payloads, wide rows, or unbounded text fields.
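For instance, inside the retrieval tool you would apply it to every row before serializing the result. In this sketch, fetch_rows_for and retrieved_rows are placeholders for your own retrieval call, and the 2,000-character cap is only an example.

import json

# Hypothetical usage inside the retrieval tool, before the rows become the
# tool's return value.
retrieved_rows = fetch_rows_for(query)  # placeholder for your database query
pruned_rows = [
    _prune_chunk_payload(row, max_chunk_chars=2_000)
    for row in retrieved_rows
]
tool_result = json.dumps(pruned_rows, ensure_ascii=False, default=str)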
Log token usage per request (especially in streaming)
Pruning is step one. Step two is visibility. If you don’t measure token usage, you will eventually ship a change that multiplies your spend and you won’t notice until the invoice arrives.
If you use streaming responses, capture usage from the response.completed event:
import json
import logging

final_usage = None  # filled in when the stream reports its final usage

# `streamed` is the streaming run result produced elsewhere in the pipeline.
async for event in streamed.stream_events():
    ...
    if getattr(event, "type", None) == "response.completed":
        resp = getattr(event, "response", None)
        usage = getattr(resp, "usage", None)
        if usage is not None:
            if hasattr(usage, "model_dump"):
                final_usage = usage.model_dump()  # type: ignore[attr-defined]
            elif isinstance(usage, dict):
                final_usage = usage
            else:
                final_usage = {"usage": str(usage)}

if final_usage:
    # Typical keys: input_tokens, output_tokens, total_tokens
    logging.info("Total usage: %s", json.dumps(final_usage, default=str))
else:
    logging.warning("No usage info captured from streamed events.")
A typical log line looks like this:
Total usage: {
  "input_tokens": 10480,
  "input_tokens_details": {"cached_tokens": 5504},
  "output_tokens": 1202,
  "output_tokens_details": {"reasoning_tokens": 283},
  "total_tokens": 11682
}
From tokens to money: compute the request cost
Once you have per-request usage, the cost calculation is straightforward—assuming you know the current pricing for your model.
FYI: in this example I used gpt-5.2 with the following pricing per 1M tokens (USD):
Input 1.75, Cached input 0.175, Output 14.00.
For the GraphRAG run above, note that cached_tokens is a subset of input_tokens, so the cached portion is billed at the discounted rate and only the remainder at the full input rate:

((10480 - 5504) * 1.75 + 5504 * 0.175 + 1202 * 14) / 1000000 = 0.0264992

In other words, that single query cost roughly $0.026 (under 3 cents) using a GraphRAG-style retrieval method.
For an RRT run, the usage looked like this:
Total usage: {
  "input_tokens": 10625,
  "input_tokens_details": {"cached_tokens": 0},
  "output_tokens": 1552,
  "output_tokens_details": {"reasoning_tokens": 637},
  "total_tokens": 12177
}
And the cost:
(10625 * 1.75 + 0 * 0.175 + 1552 * 14) / 1000000 = 0.04032175
That comes to about $0.040, roughly 4 cents. The raw token counts are not surprising: both retrieval functions end up using a comparable delimiter/data packaging strategy, and the output token volume is in the same ballpark. The GraphRAG run only comes out cheaper because more than half of its input tokens were served from the prompt cache.
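To avoid doing this arithmetic by hand, a small helper can compute the figure straight from the usage dict. This is a sketch, not part of the original pipeline; the prices are passed in explicitly so you can keep them in config and update them when pricing changes.

def estimate_cost_usd(
    usage: dict,
    *,
    input_per_1m: float,
    cached_input_per_1m: float,
    output_per_1m: float,
) -> float:
    """Rough request cost in USD from a Responses-style usage payload."""
    input_tokens = usage.get("input_tokens", 0)
    cached = usage.get("input_tokens_details", {}).get("cached_tokens", 0)
    output_tokens = usage.get("output_tokens", 0)
    return (
        (input_tokens - cached) * input_per_1m
        + cached * cached_input_per_1m
        + output_tokens * output_per_1m
    ) / 1_000_000


# The GraphRAG usage from above, with the example prices:
# estimate_cost_usd(final_usage, input_per_1m=1.75,
#                   cached_input_per_1m=0.175, output_per_1m=14.0)
# -> ~0.0265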
Don’t forget the “hidden” costs
The total request cost is not just input + output. In most real systems you should also account for:
- Guardrail processing (input moderation, safety filters, policy checks)
- Query vectorization (embedding the user query)
In practice, these are typically an order of magnitude smaller than the main completion cost—but they are not zero, and at scale they matter.
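As a rough illustration of scale, here is what the query-embedding step might add, assuming a hypothetical embedding model priced at $0.02 per 1M tokens and a ~40-token user question; both numbers are placeholders, not measurements from this pipeline.

# Hypothetical side cost for one request; prices and token counts are
# placeholders, not measurements from this pipeline.
EMBEDDING_PRICE_PER_1M = 0.02  # assumed embedding price, USD per 1M tokens
query_tokens = 40              # a short user question

embedding_cost = query_tokens * EMBEDDING_PRICE_PER_1M / 1_000_000
# -> 0.0000008 USD, a tiny fraction of the ~0.03-0.04 USD completion cost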
One more lever: use a cheaper model
The most direct way to reduce cost is to switch to a cheaper model—for example gpt-5-mini.
In many workloads this reduces per-query cost by more than 10×.
The tradeoff is predictable: output quality may drop somewhat, and you may need better prompting, tighter pruning,
or more aggressive retrieval constraints to keep accuracy acceptable.
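If the model name lives in config, as AI_MODEL_TYPE does in the snippet at the top, the switch itself is a one-line change, and the hypothetical estimate_cost_usd helper above makes the before/after comparison easy. The mini prices below are placeholders to be replaced with the current published rates.

import os

# Assumption: the model name is read from config, so switching to a cheaper
# model such as gpt-5-mini is a configuration change, not a code change.
AI_MODEL_TYPE = os.getenv("AI_MODEL_TYPE", "gpt-5-mini")

# Re-price the same usage with the cheaper model's rates (placeholders here;
# look up the current prices before relying on the result):
MINI_INPUT_PER_1M = 0.25          # placeholder
MINI_CACHED_INPUT_PER_1M = 0.025  # placeholder
MINI_OUTPUT_PER_1M = 2.00         # placeholder

cheaper_cost = estimate_cost_usd(
    final_usage,
    input_per_1m=MINI_INPUT_PER_1M,
    cached_input_per_1m=MINI_CACHED_INPUT_PER_1M,
    output_per_1m=MINI_OUTPUT_PER_1M,
)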
