OpenAI announced GPT-4 last week, and the ripples have reached every enterprise architecture meeting we’ve sat in since. Now that the dust is settling on the demos and the Twitter threads, what actually changes for teams building AI into real products? We’ve spent the last few days running our own benchmarks, and a few things stand out.
The short version: GPT-4 makes the line between impressive research and production-ready enterprise tool considerably harder to draw. That doesn’t mean you should rip out everything you built on GPT-3.5 — it means the ceiling has moved, and your roadmap has almost certainly moved with it.

Five things GPT-4 changes for teams deploying AI this year:
- Longer context, meaningfully used. The 8k and 32k token variants unlock document-length reasoning — contract review, full-transcript Q&A, codebase exploration — without the retrieval gymnastics we’ve all been writing for the last year. Your RAG layer doesn’t go away, but the shape of the problems it needs to solve gets easier.
- Multimodal inputs enter the chat. The ability to accept images as inputs shifts the workflow model for engineering, healthcare, and operations teams where the source of truth is a diagram, a chart, or a scanned form rather than a clean block of text.
- Structured output gets reliable enough to deploy. For tasks that require strict JSON, function calls, or generated SQL, the reliability gap between 3.5-turbo and GPT-4 is large enough to reshape build-vs-buy decisions teams made only six months ago.
- Cost and latency still matter. GPT-4 is roughly 10× more expensive and 2–3× slower than 3.5-turbo. Model routing — a cheap model for easy requests, a big model for hard ones — will be the most consequential architecture choice most teams make this year.
- Governance just got harder. The same improvements that make GPT-4 useful also make it more dangerous to deploy carelessly. Start with red-teaming, structured logging, and a rollback path before you ship the first production feature — not after a post-mortem.
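On the structured-output point: even with a more reliable model, most teams keep a validate-and-retry loop in front of any JSON the model produces. Here's a minimal sketch of that pattern; `call_model` is a stand-in for whatever client function you already use (prompt in, raw text out), not a specific SDK call.

```python
import json


def get_json(call_model, prompt, max_retries=2):
    """Ask a model for JSON and re-prompt on parse failure.

    `call_model` is a placeholder for your own client wrapper --
    this sketch assumes nothing about the underlying SDK.
    """
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # Feed the parse error back so the retry can self-correct.
            prompt = (f"{prompt}\n\nYour previous reply was not valid JSON "
                      f"({err}). Reply with a single JSON object only.")
    raise ValueError("model never returned valid JSON")
```

The same loop works for generated SQL or function-call arguments: swap `json.loads` for your own validator and keep the error-feedback step, which is what makes the retry better than a blind resend.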
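The routing idea above can start as something embarrassingly simple. The sketch below routes on request traits; the markers, length threshold, and model names are illustrative placeholders you'd tune against your own eval set, not a recommended policy.

```python
def pick_model(prompt: str, needs_hard_reasoning: bool = False) -> str:
    """Route a request to a cheap or an expensive model.

    All thresholds and keywords here are illustrative -- replace
    them with signals learned from your own traffic and evals.
    """
    # Hypothetical markers for requests that tend to need the big model.
    HARD_MARKERS = ("contract", "sql", "step by step")

    if needs_hard_reasoning:
        return "gpt-4"
    if len(prompt) > 6000:  # long inputs lean on the larger context window
        return "gpt-4"
    if any(marker in prompt.lower() for marker in HARD_MARKERS):
        return "gpt-4"
    return "gpt-3.5-turbo"
```

In production this heuristic usually evolves into a learned classifier, but even the crude version captures most of the cost savings, and it gives you a single choke point to instrument when you want to measure how often the cheap model would have sufficed.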
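And on governance: "structured logging" concretely means that every model call leaves an auditable record before you need one. A minimal sketch of such a wrapper, assuming nothing beyond the standard library; `log_sink` is any callable that accepts a line (your logger, a file, a queue), and the field names are illustrative.

```python
import json
import time
import uuid


def logged_call(call_model, prompt, model_name, log_sink):
    """Wrap a model call so every request/response pair is auditable.

    `call_model` and `log_sink` are stand-ins for your own client
    and logging pipeline; the record schema is a starting point,
    not a standard.
    """
    started = time.time()
    record = {
        "id": str(uuid.uuid4()),   # correlate with user reports later
        "model": model_name,
        "prompt": prompt,
        "ts": started,
    }
    reply = call_model(prompt)
    record["reply"] = reply
    record["latency_s"] = time.time() - started
    log_sink(json.dumps(record))
    return reply
```

Having this in place from day one is what makes red-teaming findings actionable and a rollback decision defensible: you can answer "what did the model actually say, to whom, and when" without reconstructing it from memory.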

The companies that benefited most from the GPT-3 era weren’t the ones with the flashiest demos — they were the ones that built honest internal benchmarks and measured improvement quarter over quarter. That advice hasn’t changed. If you can’t tell the difference between GPT-4 and your existing model on the tasks that matter to your business, you don’t yet have the right evaluation harness. Build that first. The capabilities will keep moving; your ability to measure them is the compounding investment.
