Ask your database in English, locally

A local-only consulting demo that shows Australian businesses they can run an AI assistant against their own data with zero data leaving the machine. We demonstrate two variants of the same underlying idea: using a locally run LLM to query a local relational database with natural language. Each version has different performance and accuracy characteristics.

Notes

This is a demo, not a product. It is not designed for production use, and it has not been tested for security, robustness, or edge cases. It is a sketch of what is possible with current technology, and a starting point for discussion about how to build real-world applications that use LLMs to query databases.

We use Qwen 2.5-14B-Instruct at Q4 quantisation for the LLM, served by llama-server on the same machine as the database. This model is particulary suidted to the task because of its strong instruction-following capabilities and good performance at writing SQL. In addition, it allows a wide context window that can accommodate both the user's question and the returned rows for post-processing. The database is a SQLite file seeded with fictional Queensland property-management data.

1. Direct SQL, raw data response

The simplest path. The LLM reads the question, emits one or more SQL statements, the app runs them read-only against SQLite, and the rows are rendered in a table. The user reads the table.

This version is fast — one LLM call in the whole loop. The cost is that raw rows are not always an answer. For some questions the table is exactly what was asked for; for others the user has to do the interpretation themselves.

The App brokers everything. The LLM writes SQL; the database speaks only to the App.

In practice it looks like this. The model is Qwen2.5-14B-Instruct at Q4 quantisation, served by llama-server on the same machine; the database is a SQLite file seeded with fictional Queensland property-management data.

Two queries from the demo. Latency varies with query complexity: 12.9 s for a multi-table join (tenant + pet), 3.9 s for a simple range filter.

Result-set size barely changes things. A query that returns 100 rows comes back in 5.3 s — only marginally slower than the single-row queries above, because the LLM's job (writing one SQL statement) is the same regardless of how many rows it ends up matching.

Top and tail of the same 100-row response. 5.3 s end-to-end. The LLM writes one SQL statement; SQLite does the rest.

2. Tool-calling, no SQL written

The LLM is not asked to write SQL at all. Instead it is given a small set of typed tools, exposed by a layer the demo calls the Domain Tool Engine — functions that wrap the database with named parameters, validation, and a constrained surface. The model decides which tool to call and with what arguments; the engine runs the actual query and returns structured results.

This shifts the contract. The model no longer has to get SQL syntactically and semantically right; it only has to pick a tool and fill in its arguments. In practice this is the most accurate of the three — the engine's schema does the work the model would otherwise have to do, and the surface area for hallucinated table or column names collapses.

The Domain Tool Engine sits between the App and SQLite. The LLM produces tool calls; the engine is the only thing that writes SQL.

The downsides of this approach become immediately visible. Consider this example:

Same question as before. 1.5 s — but the answer drops the “animal” half of the question.

What we gained in accuracy we lose in flexibility. The engine selects the tool that is most suitable to the task, and simultaneously has to make compromises — here, a tool that returns property and tenant data, with no slot for the pet, so the pet quietly drops out of the answer.

So we try again, this time being a little more explicit:

2.2 s. The model invokes a separate pet-records tool — but selects a property that has no pet, so the lookup returns nothing.

This time the model attempts to get information about the pet by invoking a different function, but it omits to set the pet constraint — so a property is returned that does not actually have a pet. This immediately exposes a new limitation: the model is not trained for the specific tool-calling required for our database. It can pick tools, but it does not know that “with an animal” should restrict the property selection to ones with a pet record on file. The constraint is a domain rule, and the off-the-shelf model has no way to know it.

Where each one fits

Version	Latency	Scales with result size	Accuracy	Question shape
Direct SQL, raw rows	Fast	Yes (5.3 s for 100 rows)	Depends on the SQL	Anything the schema can answer
Tool-calling	Fastest on small queries (1.5 s)	Yes	Highest, by construction	Only what the engine has tools for

Summary

Two viable patterns came out of the demo. They are not equivalent, and neither is shippable on its own.

Direct SQL, raw rows trades narration for flexibility. The LLM only has to write a single SQL statement, which keeps it fast at any result-set size — 100 rows came back in 5.3 s — and lets the user ask anything the schema can answer. The cost is that the result is a table; turning that into a sentence is an exercise for the reader.

Tool-calling via the Domain Tool Engine trades flexibility for accuracy. Hallucinated columns are impossible by construction, and small queries return in 1–2 s. The cost is that the model can only ask the questions the engine has tools for, and the questions it asks are bounded by what the model already knows about the schema — which is, currently, nothing specific to our database.

Where this would go next

Agentic loops. Both patterns above are single-shot: the LLM gets one chance to write the SQL, or one chance to pick a tool. An agentic version — multiple LLM calls, the model checking its own work and retrying when the result does not match the question — would address both kinds of failure on the page. In direct SQL it would catch a query that returned nothing useful and revise it; in tool-calling it would chain tools correctly, picking a property that has a pet before looking up the pet record.
More capable models, in a sovereign region. Agentic loops mean more LLM calls per question, and that is the cost. A 14B model running locally on an M1 starts to feel slow under multi-step reasoning — we ran experiments that confirmed this; they are not shown here. The natural next step is a more capable model hosted in a region-locked cloud environment: an Australian-resident hyperscaler region, or a sovereign AI provider. The “no data leaves the machine” purity goes; the “no data leaves the country” property is preserved, and the model has the headroom to plan, retry, and chain tool calls without timing out.