The Problem With Prompt Stuffing
Most text-to-SQL tutorials start with a simple idea: dump your database schema into the LLM prompt. For a toy database with 3 tables, this works fine. For a production database with 50+ tables, 200+ columns, and foreign key relationships — you're burning tokens and confusing the model.
I hit this wall building StellarMIND. The schema context was so large that GPT-4 would hallucinate column names, confuse similar table prefixes, and miss join conditions. The accuracy on complex queries dropped below 40%.
The root cause isn't the LLM — it's the retrieval. When you stuff everything into the prompt, the model has to figure out which 5% of the schema is relevant. That's a needle-in-a-haystack problem that language models are notoriously bad at.
RAG-Based Schema Retrieval
The fix is RAG (Retrieval-Augmented Generation) applied to schema metadata. Instead of sending the entire schema, you:
- Chunk your schema into semantic units — one chunk per table, including column names, types, constraints, and sample values
- Embed each chunk using an embedding model (we use OpenAI's text-embedding-3-small)
- Store embeddings in pgvector alongside the original text
- Retrieve the top-k most relevant chunks based on the user's natural language question
- Generate SQL using only the retrieved context
This means when a user asks "What were last month's sales by region?", the retriever pulls only the sales, regions, and date_dim table schemas — not the entire 200-table dump.
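To make the chunking step concrete, here is a minimal sketch of what a per-table document could look like before embedding. The `SchemaChunker` class, `Column` record, and text layout are illustrative assumptions, not StellarMIND's actual code:

```java
import java.util.List;

// Illustrative sketch: render one embeddable text chunk per table.
// The record shape and formatting here are assumptions, not StellarMIND's code.
public class SchemaChunker {

    record Column(String name, String type, String constraint) {}

    // Flatten a table's metadata (plus a few sample rows) into plain text
    // suitable for an embedding model.
    static String chunkFor(String table, List<Column> columns, List<String> sampleRows) {
        StringBuilder sb = new StringBuilder("Table: " + table + "\nColumns:\n");
        for (Column c : columns) {
            sb.append("  - ").append(c.name()).append(" ").append(c.type());
            if (!c.constraint().isEmpty()) sb.append(" (").append(c.constraint()).append(")");
            sb.append("\n");
        }
        sb.append("Sample rows:\n");
        for (String row : sampleRows) sb.append("  ").append(row).append("\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample values like "US-WEST" give cryptic columns semantic weight.
        String chunk = chunkFor(
                "sales",
                List.of(new Column("rgn_cd", "varchar(10)", "fk -> regions.code"),
                        new Column("amt_usd", "numeric(12,2)", "")),
                List.of("US-WEST | 1234.56", "EU-NORTH | 987.00"));
        System.out.print(chunk);
    }
}
```

Each chunk is then embedded and stored as one document, so a question about "sales by region" lands on this table even though its columns are abbreviated.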
Implementation with Spring AI and pgvector
StellarMIND uses Spring AI's VectorStore abstraction with pgvector as the backend. The setup:
- Embedding model: OpenAI text-embedding-3-small (1536 dimensions)
- Vector store: PostgreSQL with pgvector extension
- Similarity search: Cosine similarity, top-5 retrieval
- Schema chunking: One document per table with columns, types, constraints, and 3 sample rows
The key insight is including sample data in the embeddings. Column names like rgn_cd or amt_usd are cryptic — but when the embedding includes sample values like "US-WEST" or "1,234.56", the semantic search becomes dramatically more accurate.
Spring AI's VectorStore abstraction keeps the retrieval code short.
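A sketch of the retrieval call, following Spring AI's VectorStore API (exact builder and accessor names vary across Spring AI versions, so treat this as illustrative rather than a drop-in snippet):

```java
// Illustrative Spring AI retrieval sketch; method names vary by version.
List<Document> relevantTables = vectorStore.similaritySearch(
        SearchRequest.builder()
                .query(userQuestion)   // e.g. "What were last month's sales by region?"
                .topK(5)               // top-5 table chunks by cosine similarity
                .build());

// Join the retrieved table chunks into the prompt context.
String schemaContext = relevantTables.stream()
        .map(Document::getText)
        .collect(Collectors.joining("\n\n"));
```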
The retrieved documents go straight into the prompt as context, replacing the full schema dump.
Results: Prompt Stuffing vs RAG
On StellarMIND's test suite of 50 natural language queries against a 47-table database:
| Approach | Accuracy | Avg Tokens Used | Latency |
|----------|----------|-----------------|---------|
| Full schema in prompt | 38% | ~12,000 | 4.2s |
| RAG (top-5 retrieval) | 79% | ~2,800 | 1.8s |
| RAG + sample data | 87% | ~3,200 | 2.1s |
RAG more than doubled accuracy while cutting token usage by 75%. Adding sample data in embeddings pushed it to 87% — the cost of those extra 400 tokens per query is negligible compared to the accuracy gain.
Safety: Read-Only Enforcement
Accuracy isn't enough. An LLM generating SQL against your production database needs guardrails.
StellarMIND enforces read-only access at the query level — not just at the database role level. The generated SQL is parsed before execution, and only SELECT and WITH (CTE) statements are allowed. Any INSERT, UPDATE, DELETE, DROP, or ALTER gets rejected before it touches the database.
This is defense-in-depth: even if the LLM hallucinates a destructive query, it never executes.
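A simplified version of that guard might look like the sketch below. StellarMIND parses the generated SQL; this keyword-based stand-in is only illustrative — a production check should use a real SQL parser (e.g. JSqlParser) so that statements buried inside CTEs or comments are caught reliably:

```java
import java.util.Set;
import java.util.regex.Pattern;

// Simplified sketch of an application-level read-only guard.
// Illustrative only: a real implementation should parse the SQL rather
// than match keywords, to handle string literals and comments correctly.
public class ReadOnlyGuard {

    private static final Set<String> FORBIDDEN =
            Set.of("INSERT", "UPDATE", "DELETE", "DROP", "ALTER",
                   "TRUNCATE", "CREATE", "GRANT");

    static boolean isReadOnly(String sql) {
        String normalized = sql.strip().toUpperCase();
        // Only SELECT and WITH (CTE) statements may start a query...
        if (!normalized.startsWith("SELECT") && !normalized.startsWith("WITH")) {
            return false;
        }
        // ...and no destructive keyword may appear anywhere in it,
        // which also rejects e.g. "WITH x AS (...) DELETE FROM t".
        for (String kw : FORBIDDEN) {
            if (Pattern.compile("\\b" + kw + "\\b").matcher(normalized).find()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isReadOnly("SELECT * FROM sales"));                    // true
        System.out.println(isReadOnly("DELETE FROM sales"));                      // false
        System.out.println(isReadOnly("WITH t AS (SELECT 1) DELETE FROM sales")); // false
    }
}
```

Erring on the side of rejection is fine here: a false positive blocks a legitimate read, which is a UX annoyance; a false negative executes a destructive write.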
Key Takeaways
- Don't stuff schemas into prompts — it wastes tokens and kills accuracy on real databases
- pgvector + Spring AI is the simplest RAG stack for Java/Spring Boot teams
- Include sample data in embeddings — column names alone aren't semantically rich enough
- Enforce read-only at the application layer — database roles aren't sufficient for LLM-generated queries
- Measure accuracy on real queries — toy benchmarks don't reflect production complexity