The 2024 model generation didn’t just get bigger — it started thinking differently. OpenAI’s o1 family, Anthropic’s extended-thinking Claude, and DeepSeek-R1 have given practitioners something genuinely new: models that spend real compute time deliberating before they answer. For enterprises, the interesting question is not whether this is impressive. It’s which of your real workloads this actually changes.
At the same time, the pattern we call RAG has quietly grown up. “RAG 2.0” isn’t a new technique so much as a maturation of the old one — hybrid retrieval, reranking as a first-class concern, agentic retrieval loops, and evaluation that actually measures retrieval quality rather than hand-waving at it. The teams shipping reliable knowledge-grounded systems in 2025 will look nothing like the teams that shipped the first RAG demos in 2023.

Five shifts the 2024 model generation forces on your 2025 plan:
- Model routing becomes a product, not a detail. Reasoning models are expensive and slow — minutes per response, sometimes. Routing a request to the right model tier based on what the task actually needs is no longer an engineering indulgence; it’s the difference between a profitable AI feature and a runaway inference bill.
- Your eval suite needs a reasoning track. Tasks that cheap models were passing reliably can fail in interesting new ways under reasoning models — and tasks you had given up on now pass. The only way to know which is which in your specific domain is to run both tiers against the same eval suite and read the differences.
- Hybrid retrieval is the default, not the upgrade. BM25 plus dense embeddings plus a small reranker, wired together cleanly, beats pure-embedding RAG on almost every realistic enterprise task we’ve measured in the last year. If your retrieval stack is still “cosine similarity over an embedding index”, you’re leaving accuracy on the table.
- Grounding quality, not answer quality, is the real metric. Any halfway-capable model can produce a confident-sounding answer. What matters is whether each claim in that answer is actually supported by the retrieved context. Citation-level evaluation is replacing output-level evaluation on the projects we’re shipping.
- Budget for inference cost, not just hosting. Reasoning models can burn five hundred thousand tokens on a single hard question. Caching, deduplication, prompt compression, and routing the easy cases to cheaper models up front are now core architecture concerns — not bonus optimisations for after launch.
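The routing point above can be sketched as a minimal tier selector. Everything here is illustrative: the model names, the per-token prices, and the `needs_multistep_reasoning` flag (which a cheap upstream classifier or heuristic would set) are assumptions, not any vendor's real API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    text: str
    needs_multistep_reasoning: bool  # hypothetical: set by a cheap upstream classifier


# Hypothetical tiers; names and prices are placeholders, not real offerings.
TIERS = {
    "fast": {"model": "small-chat", "usd_per_1k_tokens": 0.0002},
    "reasoning": {"model": "deep-reasoner", "usd_per_1k_tokens": 0.06},
}


def route(req: Request) -> str:
    """Send only requests that plausibly need deliberation to the expensive tier."""
    if req.needs_multistep_reasoning or len(req.text) > 4000:
        return TIERS["reasoning"]["model"]
    return TIERS["fast"]["model"]
```

The point of making this a first-class component is that the predicate inside `route` becomes something you measure and tune against your eval suite, rather than a hard-coded default buried in a client wrapper.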
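The "reasoning track" for your evals amounts to running both tiers over the same cases and bucketing the disagreements. A minimal sketch, assuming `run_cheap` and `run_reasoning` are your own pass/fail harness callbacks (names are ours, not from any framework):

```python
def tier_diff(eval_cases, run_cheap, run_reasoning):
    """Run both model tiers on the same eval cases and bucket the disagreements.

    Returns (newly_passing, newly_failing): cases the reasoning tier fixes,
    and cases it regresses relative to the cheap tier.
    """
    newly_passing, newly_failing = [], []
    for case in eval_cases:
        cheap_ok = run_cheap(case)
        deep_ok = run_reasoning(case)
        if deep_ok and not cheap_ok:
            newly_passing.append(case)
        elif cheap_ok and not deep_ok:
            newly_failing.append(case)
    return newly_passing, newly_failing
```

Reading those two lists, case by case, is how you find out where the reasoning tier earns its cost in your domain — and where it regresses.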
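One common way to wire BM25 and dense retrieval together, before the reranker even runs, is reciprocal rank fusion: merge the two ranked lists by rank position rather than trying to compare incomparable scores. A minimal sketch (the constant `k=60` is the conventional default from the RRF literature):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids (best first) from multiple retrievers.

    Each document scores 1 / (k + rank + 1) per list it appears in;
    rank positions, not raw scores, drive the fusion.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that both retrievers rank highly float to the top; documents only one retriever finds still survive into the candidate set the reranker sees.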
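Citation-level evaluation can start cruder than it sounds. A lexical-overlap sketch of the idea — what fraction of answer sentences have most of their content words in at least one retrieved chunk — is shown below. This is a deliberately simple proxy; production setups typically score each claim with an NLI or judge model instead, and the function names and `threshold` are our own.

```python
def _tokens(text):
    """Lowercased word set, with trailing punctuation stripped."""
    return {w.lower().strip(".,;:!?") for w in text.split()}


def grounding_score(answer_sentences, retrieved_chunks, threshold=0.5):
    """Fraction of answer sentences whose words mostly appear in some chunk."""
    chunk_tokens = [_tokens(c) for c in retrieved_chunks]
    supported = 0
    for sent in answer_sentences:
        sent_tokens = _tokens(sent)
        if not sent_tokens:
            continue
        best_overlap = max(len(sent_tokens & ct) / len(sent_tokens) for ct in chunk_tokens)
        if best_overlap >= threshold:
            supported += 1
    return supported / max(len(answer_sentences), 1)
```

Even this crude version catches the failure mode that output-level evals miss entirely: a fluent, confident answer whose claims appear nowhere in the retrieved context.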
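The cheapest of the cost controls above, caching with deduplication, can be sketched as a thin wrapper that keys on a hash of (model, prompt) and only forwards misses. `call_model` here is a stand-in for your real client function, not a real library call:

```python
import hashlib


class ResponseCache:
    """Deduplicate identical (model, prompt) calls before they reach the API."""

    def __init__(self, call_model):
        self._call = call_model  # your real client function goes here
        self._store = {}
        self.hits = 0

    def complete(self, model, prompt):
        # Hash the pair so the cache key stays small regardless of prompt size.
        key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = self._call(model, prompt)
        self._store[key] = result
        return result
```

In practice you would add an eviction policy and a TTL, but even this shape makes the point: the cache sits in the architecture, in front of the model tiers, not as a launch-week afterthought.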

The biggest shift in the conversation we’re having with customers right now is the move from “let’s try AI for this” to “let’s rebuild the workflow around what AI can actually do.” The teams that spent 2023 and 2024 building careful foundations — eval harnesses, observability, boring infrastructure — are the ones now moving fastest, because they can adopt the new generation of models without tearing up the old one. The teams that skipped that work are, quietly, at a standstill.
