AI Engineering

RAG vs SQL: Why LLMs Shouldn’t Do Math

June 8, 2026·4 min read

A client once asked their new RAG assistant a simple question: "what's the average order value for our top customer segment this quarter?" The assistant answered instantly, with a clean, specific-looking number. It was wrong, not in an obvious way, but in the quiet way that almost ends up quoted in a board deck before anyone double-checks it.

We caught it during an unrelated review of the eval set. An hour of digging later, the cause was clear. The retriever had pulled back the dozen most relevant chunks it could find. The model had dutifully averaged the prices in those twelve chunks and called it a day. The real dataset had over four hundred matching orders. Nobody had told the model it was looking at a sliver of the picture.

That's the failure mode nobody warns you about when they hand you a RAG tutorial.

What retrieval is actually good at

RAG earns its keep when the answer already exists as text somewhere in the knowledge base, and the job is simply to find it, read it, and restate it clearly:

"What does our refund policy say about international orders?"
"Which support tickets describe this exact error pattern?"
"Summarize the key terms in this contract."

In each case, a human already did the reasoning when they wrote the source document. The model's job is comprehension, not computation. That distinction is the whole story.

Retrieval finds text and explains it. Arithmetic over that text is a different job entirely, and asking an LLM to do both at once is where the wheels come off.

The moment a question requires aggregating, ranking, or counting across a full dataset, retrieval stops being the right tool. It tends to fail in the same three ways, every time.

Sample bias. The model computes over whichever top-K chunks the vector index happened to surface, then presents that small subset as if it were the whole population.

Phantom precision. A number like "$47.23" reads as authoritative. The model has no way to flag that it was computed from eight documents instead of the four hundred rows actually sitting in the table.

Aggregation drift. Ask for "total revenue last quarter" and the model may simply sum whatever dollar figures it sees across a handful of retrieved documents, some of which could be projections, discounts, or single line items from larger invoices. The result has no real relationship to the true total.

Letting the database do what it's good at

The fix isn't a smarter prompt. It's a different architecture, one where the LLM never tries to compute anything itself. Instead, it translates the question into a query and hands the arithmetic to the system built for exactly that job:

User question (natural language)
        ↓
LLM parses intent → generates a query
        ↓
Database executes the query → exact result
        ↓
LLM explains the result in plain language

Applied to the order-value question, the model doesn't try to average anything. It writes:

SELECT AVG(order_value) AS avg_value, COUNT(*) AS n
FROM orders
WHERE segment = 'top_tier'
  AND placed_at >= date_trunc('quarter', now());

The database returns an exact figure over every matching row, and the COUNT(*) gives the client something a retrieval pipeline never could: a defensible answer to "average over how many orders, exactly?" It's auditable, too. If a number looks off, the SQL is right there to read. A retrieval-then-generation chain leaves no comparable paper trail.

Most real systems need both

Treating "RAG or SQL" as a binary choice is a useful way to explain the problem, not a useful way to build the system. A request like "show me reviews for our five best-selling products last quarter" genuinely needs both halves: a query to rank products by sales, and retrieval to surface the actual review text for whichever five products come back.

Order matters here. The structured query runs first, to resolve which entities are even in scope. Retrieval runs second, scoped to those specific IDs. The decision rule we use when designing these systems for clients is short enough to keep on a sticky note:

Aggregation, ranking, or counting. Query the database directly.
Interpretation, summarization, or matching on meaning. Retrieve and generate.
Both at once. Let the query narrow the scope first, then retrieve within it.

A note on tooling

If you're building this yourself, a few tools are worth knowing. Open-source text-to-SQL frameworks like Vanna AI work well for prototyping against a known schema. Once more than a couple of people are asking questions against the same metrics, a semantic layer such as Cube, dbt's Semantic Layer, or MetricFlow earns its setup cost: it defines what "revenue" or "active user" means exactly once, instead of leaving every prompt to redefine it slightly differently. Off-the-shelf chains like LangChain's SQLDatabaseChain are fine for a demo and tend to break down fast against any schema with real complexity.

The pattern we see teams fall into most often: they hand-roll a natural-language-to-SQL pipeline, hit the first gnarly schema edge case, and bolt on retry logic instead of stepping back and introducing a semantic layer that would have prevented the ambiguity in the first place.

LLMs are excellent at understanding intent and explaining results in plain language. They are not calculators. A production system should never ask them to be one.