1. Small Models Can Outperform Giants (With the Right Retrieval)
The assumption that larger language models always deliver superior results is wrong. Retrieval quality matters more than model size. A well-engineered retrieval pipeline feeding relevant context to a smaller model can outperform an expensive frontier model fed by poor retrieval.
The strategic implication is clear: invest engineering resources in robust retrieval systems rather than defaulting to the costliest models available.
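To make the claim concrete, here is a minimal sketch of that setup: a retrieval step feeding context into a small, inexpensive model. The retrieve function is a placeholder for your own pipeline, and the client setup and model name (shown with the OpenAI Python SDK) are illustrative, not prescriptive.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(query: str, retrieve) -> str:
    # The retriever, not the model, does the heavy lifting: it is assumed
    # to return the top-k most relevant text chunks for the query.
    context = "\n\n".join(retrieve(query, k=5))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # a small, inexpensive model
        messages=[
            {"role": "system", "content": "Answer using ONLY the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```

Every improvement to what retrieve returns shows up directly in answer quality, with no change to the model at all.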
2. Similarity Is Not Relevance: The Unseen Power of Re-ranking
Vector search finds mathematically similar content, but this doesn’t guarantee contextual relevance. A query for “Apple phone price” might retrieve fruit-related content because the query and the documents both contain the word “apple.”
Production systems address this through multi-step processes: broad candidate retrieval, then cross-encoder re-ranking that scores actual relevance before passing results to the language model. This coarse-to-fine approach prevents the model from being confused by irrelevant but mathematically similar information.
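A minimal coarse-to-fine sketch, using the CrossEncoder class from the sentence-transformers library for stage two; vector_search stands in for whatever broad first-stage retriever you already run.

```python
from sentence_transformers import CrossEncoder

# Stage-two model: scores (query, document) pairs jointly for relevance.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, vector_search, top_n: int = 50, top_k: int = 5):
    # Stage one: broad, cheap candidate retrieval by embedding similarity.
    candidates = vector_search(query, top_n)  # assumed to return a list of strings
    # Stage two: precise relevance scoring before anything reaches the LLM.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```

Because the cross-encoder reads the query and each candidate together, it judges relevance directly instead of relying on the geometry of separately computed embeddings — which is exactly the gap between similarity and relevance.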
3. Your AI Needs a Database, Not Just a Library
Standard RAG systems cannot perform calculations on structured data. A spreadsheet ingested as text chunks cannot answer aggregate questions like “How many students total?”
The solution is Hybrid RAG: route queries intelligently, sending unstructured content lookups to vector search and statistical or aggregate questions to SQL queries against traditional databases. The router decides which path to take based on the nature of the query.
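A minimal sketch of such a router, assuming a SQLite database with an illustrative students table; production routers often use an LLM classifier rather than keyword heuristics, but the shape is the same.

```python
import sqlite3

AGGREGATE_HINTS = ("how many", "total", "average", "sum", "count")

def route(query: str) -> str:
    # Crude heuristic router; production systems often use an LLM classifier here.
    q = query.lower()
    return "sql" if any(hint in q for hint in AGGREGATE_HINTS) else "vector"

def hybrid_answer(query: str, vector_search, db_path: str = "school.db"):
    if route(query) == "sql":
        # Structured path: a real aggregate computed by the database.
        # The database file and students table are illustrative.
        with sqlite3.connect(db_path) as conn:
            (count,) = conn.execute("SELECT COUNT(*) FROM students").fetchone()
        return f"There are {count} students in total."
    # Unstructured path: ordinary vector retrieval.
    return vector_search(query, top_k=5)
```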
4. RAG as a “Learning Mechanism,” Not a Static Knowledge Base
Advanced systems maintain curated example databases. When a new query arrives, the system retrieves relevant historical examples and injects them as few-shot guidance into the prompt.
As the system makes mistakes, engineers add targeted examples to the database, creating a learning loop without code changes. The system improves through curation, not retraining.
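A minimal sketch of this pattern; example_store.search is a placeholder for a vector search over the curated example database.

```python
def build_prompt(query: str, example_store, k: int = 3) -> str:
    # Retrieve the k curated (question, ideal answer) pairs most similar
    # to the incoming query and inject them as few-shot guidance.
    examples = example_store.search(query, k)  # -> [(question, answer), ...]
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"Follow the style of these examples:\n\n{shots}\n\nQ: {query}\nA:"
```

When the system answers a query badly, the fix is to add a corrected example to the store; the next similar query retrieves it automatically.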
5. Scaling Up? You Don’t Need a Bigger Script, You Need an Orchestrator
Linear ingestion scripts fail at scale due to memory issues, file errors, and API rate limits. Professional approaches use orchestrator workflows that manage batches, track job status in queue databases, and include automatic error handlers for failed jobs.
This ensures reliable, high-throughput processing of thousands of documents per hour — the difference between a demo and a production system.
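A minimal single-process sketch of the pattern, using a SQLite table as the job queue; real deployments typically hand this to an orchestrator such as Airflow, Celery, or Temporal, which adds parallel workers, scheduling, and observability on top of the same idea.

```python
import sqlite3
import time

MAX_ATTEMPTS = 3

def setup(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               path     TEXT PRIMARY KEY,
               status   TEXT    DEFAULT 'pending',   -- pending | done | failed
               attempts INTEGER DEFAULT 0,
               error    TEXT
           )"""
    )

def worker(conn: sqlite3.Connection, ingest) -> None:
    # Drain the queue one job at a time. A bad file or a rate-limit error
    # marks that job for retry instead of killing the whole run.
    while True:
        row = conn.execute(
            "SELECT path, attempts FROM jobs "
            "WHERE status = 'pending' AND attempts < ? LIMIT 1",
            (MAX_ATTEMPTS,),
        ).fetchone()
        if row is None:
            break  # queue drained
        path, attempts = row
        try:
            ingest(path)  # parse, chunk, embed, index -- supplied by the caller
            conn.execute("UPDATE jobs SET status = 'done' WHERE path = ?", (path,))
        except Exception as exc:
            status = "failed" if attempts + 1 >= MAX_ATTEMPTS else "pending"
            conn.execute(
                "UPDATE jobs SET status = ?, attempts = ?, error = ? WHERE path = ?",
                (status, attempts + 1, str(exc), path),
            )
            time.sleep(1)  # crude backoff before the next attempt
        conn.commit()
```

Failed jobs stay visible in the table instead of vanishing with a crashed script, which is what makes reprocessing thousands of documents tractable.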
Conclusion
Effective RAG systems require thoughtful architectural decisions — prioritising retrieval quality, ensuring true relevance through re-ranking, integrating hybrid data sources, building feedback loops, and engineering scalable pipelines. These choices distinguish fragile prototypes from production-grade applications.
