RAG: How to (Accidentally) Vacuum Your OpenAI Account

29.01.2026

Not long ago I wrote about different approaches to semantic search. In this post I want to highlight one of the most common—and surprisingly easy—ways to burn money in a Retrieval-Augmented Generation (RAG) pipeline: sending too much data to the model.

A minimal Agents SDK setup

Here is a simplified Python snippet that wires a semantic search agent to a retrieval function. There’s nothing magical about it—the key is the tools list. In this example, graph_rag_search is the tool that “fetches” relevant rows/chunks from your database, and the LLM then turns that context into the final answer.

from agents import Agent, ModelSettings
from openai.types.shared import Reasoning

search_agent = Agent(
    name="Semantic search agent",
    instructions=_load_system_prompt(),
    input_guardrails=[main_guardrail],
    model=AI_MODEL_TYPE,
    tools=[
        graph_rag_search,  # retrieval tool that fetches rows/chunks from the database
    ],
    model_settings=ModelSettings(
        store=True,
        reasoning=Reasoning(
            effort=AI_EFFORT,
            summary="auto",
        )
    )
)
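
The retrieval tool itself isn't shown here; for orientation, below is a minimal sketch of how graph_rag_search could be exposed as a tool with the SDK's function_tool decorator. The in-memory store is a placeholder for whatever graph or vector database you actually query:

from typing import Any

from agents import function_tool

# Hypothetical stand-in for the real graph/vector store.
_FAKE_STORE: list[dict[str, Any]] = [
    {"chunk": "Example chunk text…", "source": "doc-1"},
]


@function_tool
async def graph_rag_search(query: str) -> list[dict[str, Any]]:
    """Fetch the rows/chunks most relevant to the user query."""
    # A real implementation would run a graph/vector search here.
    return _FAKE_STORE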

This is exactly where developers often get burned: the retrieval tool may return a payload that is far too large. Even if the model doesn’t end up “using” every token, those tokens still count toward your bill.

Prune the payload before it reaches the model

A simple and effective pattern is to reduce the retrieved records before you pass them to the LLM. Remove fields the model does not need (embeddings, full-text vectors, large metadata blobs) and hard-cap the chunk length.

from typing import Any


def _prune_chunk_payload(row: dict[str, Any], *, max_chunk_chars: int) -> dict[str, Any]:
    """Drop heavy fields and cap chunk length before the row reaches the model."""
    cleaned = dict(row)
    cleaned.pop("embedding", None)   # vector data the LLM cannot use anyway
    cleaned.pop("fts_vector", None)  # full-text search artifacts

    chunk_value = cleaned.get("chunk")
    if (
        isinstance(chunk_value, str)
        and max_chunk_chars > 0
        and len(chunk_value) > max_chunk_chars
    ):
        cleaned["chunk"] = f"{chunk_value[:max_chunk_chars]}…"
    return cleaned

This single helper can save you a lot of money—especially when your retrieval layer returns verbose chunk payloads, wide rows, or unbounded text fields.
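
For example, inside the retrieval tool you could run every retrieved row through the helper before serializing the context for the model. This is a sketch: the _prepare_context name, the 800-character cap, and the JSON serialization are illustrative choices, not part of the original setup:

import json
from typing import Any


def _prepare_context(rows: list[dict[str, Any]], *, max_chunk_chars: int = 800) -> str:
    """Prune each retrieved row, then serialize only the slimmed-down payload."""
    pruned = [_prune_chunk_payload(row, max_chunk_chars=max_chunk_chars) for row in rows]
    return json.dumps(pruned, ensure_ascii=False, default=str)

Whatever the exact cap, the point is that the size of what reaches the model is decided here, not by whatever the database happens to return.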

Log token usage per request (especially in streaming)

Pruning is step one. Step two is visibility. If you don’t measure token usage, you will eventually ship a change that multiplies your spend and you won’t notice until the invoice arrives.

If you use streaming responses, capture usage from the response.completed event:

import json
import logging

final_usage: dict | None = None  # populated once the stream reports usage

async for event in streamed.stream_events():
    ...
    if getattr(event, "type", None) == "response.completed":
        resp = getattr(event, "response", None)
        usage = getattr(resp, "usage", None)
        if usage is not None:
            if hasattr(usage, "model_dump"):
                final_usage = usage.model_dump()  # type: ignore[attr-defined]
            elif isinstance(usage, dict):
                final_usage = usage
            else:
                final_usage = {"usage": str(usage)}

if final_usage:
    # Typical keys: input_tokens, output_tokens, total_tokens
    logging.info("Total usage: %s", json.dumps(final_usage, default=str))
else:
    logging.warning("No usage info captured from streamed events.")

A typical log line looks like this:

Total usage: {
  "input_tokens": 10480,
  "input_tokens_details": {"cached_tokens": 5504},
  "output_tokens": 1202,
  "output_tokens_details": {"reasoning_tokens": 283},
  "total_tokens": 11682
}

From tokens to money: compute the request cost

Once you have per-request usage, the cost calculation is straightforward—assuming you know the current pricing for your model.

FYI: in this example I used gpt-5.2 with the following pricing per 1M tokens (USD): Input 1.75, Cached input 0.175, Output 14.00.

For the GraphRAG run above, keep in mind that cached_tokens are a subset of input_tokens: they are billed at the cached rate instead of the full input rate, not on top of it.

((10480 - 5504) * 1.75 + 5504 * 0.175 + 1202 * 14) / 1000000 = 0.0264992

In other words, that single query cost roughly $0.026 (under 3 cents) using a GraphRAG-style retrieval method.

For an RRT run, the usage looked like this:

Total usage: {
  "input_tokens": 10625,
  "input_tokens_details": {"cached_tokens": 0},
  "output_tokens": 1552,
  "output_tokens_details": {"reasoning_tokens": 637},
  "total_tokens": 12177
}

And the cost:

((10625 - 0) * 1.75 + 0 * 0.175 + 1552 * 14) / 1000000 = 0.04032175

That comes to about $0.040, again just a few cents. The input and output volumes are in the same ballpark, which is not surprising: both retrieval functions use a comparable delimiter/data packaging strategy. The RRT run still comes out noticeably pricier because none of its input tokens were cache hits, whereas more than half of the GraphRAG input was served from the cache.
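
If you would rather compute this in code right next to the usage log, a small helper along these lines will do. It is a sketch that assumes the usage dict shape shown above, with the per-1M prices quoted earlier as defaults:

def estimate_cost_usd(
    usage: dict,
    *,
    input_per_1m: float = 1.75,
    cached_per_1m: float = 0.175,
    output_per_1m: float = 14.0,
) -> float:
    input_tokens = usage.get("input_tokens", 0)
    cached_tokens = (usage.get("input_tokens_details") or {}).get("cached_tokens", 0)
    output_tokens = usage.get("output_tokens", 0)
    # Cached tokens are billed at the cached rate instead of the full input rate.
    return (
        (input_tokens - cached_tokens) * input_per_1m
        + cached_tokens * cached_per_1m
        + output_tokens * output_per_1m
    ) / 1_000_000

Logging estimate_cost_usd(final_usage) alongside the usage line above turns every request into a dollar figure in your logs.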

Don’t forget the “hidden” costs

The total request cost is not just input + output. In most real systems you should also account for:

  • Guardrail processing (input moderation, safety filters, policy checks)
  • Query vectorization (embedding the user query)

In practice, these are typically an order of magnitude smaller than the main completion cost—but they are not zero, and at scale they matter.
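
As a rough illustration for the query-vectorization line item: the embeddings endpoint reports token usage too, so you can price it the same way. The model name and price constant below are placeholders, not values from this setup:

from openai import AsyncOpenAI

client = AsyncOpenAI()

EMBEDDING_PRICE_PER_1M = 0.02  # placeholder: check the current price for your embedding model


async def embed_query(query: str) -> tuple[list[float], float]:
    response = await client.embeddings.create(
        model="text-embedding-3-small",  # placeholder model name
        input=query,
    )
    # usage.total_tokens covers the embedded query text.
    cost = response.usage.total_tokens * EMBEDDING_PRICE_PER_1M / 1_000_000
    return response.data[0].embedding, cost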

One more lever: use a cheaper model

The most direct way to reduce cost is to switch to a cheaper model—for example gpt-5-mini. In many workloads this reduces per-query cost by more than 10×. The tradeoff is predictable: output quality may drop somewhat, and you may need better prompting, tighter pruning, or more aggressive retrieval constraints to keep accuracy acceptable.
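
Since the agent above reads its model from AI_MODEL_TYPE, one low-friction way to run that experiment is to make the model a deployment-time setting instead of a code change (a sketch; reading it from an environment variable is an assumption):

import os

# Default to the cheaper model; point AI_MODEL_TYPE at the bigger one only where quality demands it.
AI_MODEL_TYPE = os.getenv("AI_MODEL_TYPE", "gpt-5-mini")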

Takeaway

One of the biggest pitfalls in RAG is not correctness—it’s cost. If you let retrieval return large payloads and you don’t measure usage, you can “vacuum” your OpenAI budget astonishingly fast.

Prune the payload, log usage per request, and make model choice a conscious design decision. Your future self (and your invoice) will thank you.