OpenAI finally released GPT-4.5, hot on the heels of new SOTA models from Anthropic and xAI. As always, OpenAI hyped it up in the lead-up to the launch.
Sam Altman himself fueled the flames of expectation, describing it as “the first model that feels like talking to a thoughtful person to me” and hinting at capabilities edging closer to artificial general intelligence than ever before.
As you can see, it’s a giant and expensive model. Andrej Karpathy reckons that every 0.5 increment in the GPT series comes with roughly a 10X increase in training compute. So 4.5 needed 10X more compute than 4, which in turn needed 10X more than 3.5, and so on.
So it’s reasonable to expect some sort of step jump from 4 to 4.5, the same way we saw with previous upgrades, right? Right?

Let’s Have A Look At The Numbers
For all the computational resources poured into GPT-4.5, the performance improvements over GPT-4 are surprisingly modest. Let’s examine the actual benchmark data:
On the Massive Multitask Language Understanding (MMLU) test – a comprehensive evaluation of knowledge across domains – GPT-4.5 scored approximately 89.6% versus GPT-4’s already impressive 86.4%. That’s a gain of barely three percentage points for what likely represents a 10X increase in computational resources.
The pattern of modest gains continues across other benchmarks:
- HumanEval (code generation): GPT-4.5 achieves 88.6% accuracy, only slightly edging out GPT-4’s already near-human 86.6%
- MGSM (math problems): GPT-4.5 shows comparable performance to GPT-4 (86.9% vs 85.1%), with only modest improvements
- DROP (reasoning): GPT-4.5 scored 83.4%, a little better than GPT-4’s 81.5%.
The other interesting thing is that some of these scores are lower than those of OpenAI’s smaller, specialized reasoning models, especially the o3 series, which scores above 90% on some of these tests.
So the data tells us that GPT-4.5 is better than GPT-4, but only incrementally so – and in some domains, it’s outperformed by more specialized, less computationally intensive models.
Now, some people say that these benchmarks aren’t the best tests and that we need better ones. And it can be argued that at such high numbers, every 1% increase is significant: going from 86.4% to 89.6% on MMLU cuts the error rate from 13.6% to 10.4%, roughly a quarter fewer mistakes.
Ok, I agree. But to me, the real test of a model is whether the end user (you and me) finds it valuable. So, let’s judge for ourselves.
The “Emotional Intelligence” Test: You Be the Judge
The most intriguing claim about GPT-4.5 is its supposedly enhanced “emotional intelligence” and conversational abilities. Sam Altman’s assertion that it feels like “talking to a thoughtful person” suggests a qualitative leap in how the model handles nuanced human interaction.
On Twitter, Andrej Karpathy ran GPT-4.5 and GPT-4o through the same set of questions and asked his audience which gave better results.
I took inspiration from that and decided to give similar tests to GPT-4.5 and four other SOTA models for comparison: Claude 3.7, Grok 3, Gemini 2.0 Flash, and Meta’s Llama 3.3.
To run this test, I built a little app that calls the APIs of all these models simultaneously and records the response time and cost of each call. This adds a layer of objectivity: if two models give me the same answer and one was faster and cheaper, that’s the better model.
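I won’t paste the whole app here, but a minimal sketch of the fan-out pattern looks something like this. The model names, endpoint URL, and per-token prices are assumptions you’d check against each provider’s docs; several providers expose OpenAI-compatible endpoints, which keeps the code short.

```python
# Minimal sketch: send the same prompt to several chat APIs at once,
# timing each call and estimating cost from assumed per-token prices.
import os
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI  # several providers expose OpenAI-compatible endpoints

PROVIDERS = {
    "gpt-4.5-preview": dict(
        client=OpenAI(api_key=os.environ["OPENAI_API_KEY"]),
        price_in=75.0, price_out=150.0,   # USD per 1M tokens (assumed)
    ),
    "grok-3": dict(
        client=OpenAI(api_key=os.environ["XAI_API_KEY"],
                      base_url="https://api.x.ai/v1"),
        price_in=3.0, price_out=15.0,     # assumed
    ),
}

def ask(model: str, cfg: dict, prompt: str) -> dict:
    """Call one model, returning its answer plus latency and estimated cost."""
    start = time.perf_counter()
    resp = cfg["client"].chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    elapsed = time.perf_counter() - start
    usage = resp.usage
    cost = (usage.prompt_tokens * cfg["price_in"]
            + usage.completion_tokens * cfg["price_out"]) / 1_000_000
    return {"model": model, "seconds": round(elapsed, 2),
            "usd": round(cost, 4), "answer": resp.choices[0].message.content}

prompt = "Describe a color that doesn't exist but would be beautiful if it did."
with ThreadPoolExecutor() as pool:
    results = pool.map(lambda kv: ask(kv[0], kv[1], prompt), PROVIDERS.items())
for r in results:
    print(r["model"], r["seconds"], "s", f"${r['usd']}")
```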
Here are some examples of responses:
Q1: Invent a new literary genre blending cyberpunk, magical realism, and ancient mythology. Briefly describe the genre, name it, and provide a short sample narrative





Q2: Describe a color that doesn’t exist but would be beautiful if it did.





Q3: How would you console someone who just lost their job after 20 years at the same company?





Q4: Analyze this statement for underlying emotions: ‘I’m fine with whatever you want to do. It doesn’t matter to me. You decide.’



Q5: A self-driving car must decide between hitting three elderly pedestrians or swerving and hitting a child. Discuss the moral complexities.
Here’s the full video if you want to see all the questions, answers, response times and costs.
My Opinion
I think Gemini and Meta do really well (surprisingly well) across the board. Meta got the math question wrong (which you can see in the video) but I loved the detailed answers to creative and EQ questions. Gemini made an assumption with the Brurberry question but got it right.
If you factor in the response times and costs, my winner here is Gemini 2.0 Flash, with Meta’s Llama 3.3 a close second. That being said, OpenAI’s o3 is still the best for reasoning, while Claude and Grok are the best for coding.
The Price of Incremental Progress
I don’t know about you, but I wouldn’t say 4.5 is any better than the other leading models, especially considering how slow and expensive it is.
Which brings us to its cost. As you may have noticed in the video, I also track how much each API call costs. OpenAI has priced GPT-4.5 at $75 per million input tokens and $150 per million output tokens, roughly 15 to 30 times more expensive than GPT-4o and far pricier than the other SOTA models.
For perspective, a typical business use case involving moderate API usage could easily cost thousands of dollars per month on GPT-4.5, compared to hundreds for GPT-4o. Even access through ChatGPT initially required subscribing to the premium ChatGPT Pro tier at $200 per month, although they say it will soon be available at lower tiers.
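To make that concrete, here’s a rough back-of-the-envelope calculation. The monthly usage figures are my own assumptions for a “moderate” workload, and GPT-4o’s prices are its list prices at the time of writing:

```python
# Back-of-the-envelope monthly API cost at list prices (USD per 1M tokens).
# Usage is an assumed "moderate" workload: 50M input + 10M output tokens/month.
PRICES = {
    "gpt-4.5": {"in": 75.00, "out": 150.00},
    "gpt-4o":  {"in": 2.50,  "out": 10.00},   # list price at time of writing
}
IN_TOKENS_M, OUT_TOKENS_M = 50, 10  # millions of tokens per month (assumed)

for model, p in PRICES.items():
    monthly = IN_TOKENS_M * p["in"] + OUT_TOKENS_M * p["out"]
    print(f"{model}: ${monthly:,.0f}/month")
# gpt-4.5: $5,250/month
# gpt-4o:  $225/month
```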
Credit Where It’s Due: Real Improvements
Despite the underwhelming benchmark performance and concerning cost structure, GPT-4.5 does deliver meaningful improvements in two key areas: context window size and factual accuracy.
The expanded context window of 128,000 tokens (quadrupling GPT-4’s 32,000) represents a genuine breakthrough for applications involving long documents or complex, multi-step interactions. Analysts, researchers, and content creators can now process entire reports, books, or codebases in a single prompt, eliminating the need for chunking and summarization workarounds.
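If you want to check whether a document actually fits before sending it, a quick token count is enough. Here’s a minimal sketch using tiktoken; the o200k_base encoding is an assumption (it’s what recent OpenAI models use, so treat it as an approximation for GPT-4.5), and the file name is hypothetical:

```python
# Rough check of whether a document fits in a 128k-token context window.
import tiktoken

CONTEXT_WINDOW = 128_000
enc = tiktoken.get_encoding("o200k_base")  # approximation, not GPT-4.5-specific

with open("annual_report.txt") as f:   # hypothetical document
    text = f.read()

n_tokens = len(enc.encode(text))
if n_tokens <= CONTEXT_WINDOW:
    print(f"{n_tokens:,} tokens: fits in one prompt, no chunking needed")
else:
    print(f"{n_tokens:,} tokens: still needs chunking or summarization")
```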
More impressive is the reduction in hallucinations – those plausible-sounding but factually incorrect outputs that have plagued large language models since their inception. On OpenAI’s internal “SimpleQA” evaluation, GPT-4.5 delivered the correct answer 62.5% of the time, compared to only 38% for GPT-4o. The hallucination rate nearly halved, from approximately 62% (GPT-4o) to 37% (GPT-4.5).
This improved factual reliability could prove transformative for certain high-stakes applications in medicine, law, or finance, where accuracy is paramount. It represents a genuine step toward more trustworthy AI systems, even if the overall intelligence gain is modest.
Making the Business Decision: When Is GPT-4.5 Worth It?
For organizations weighing whether to adopt GPT-4.5, the decision comes down to a careful cost-benefit analysis. The model may be justified in scenarios where:
- Factual accuracy is paramount – In medical, legal, or financial contexts where errors could have serious consequences, the reduced hallucination rate might justify the premium.
- Long-context processing is essential – Applications requiring analysis of entire documents or complex multi-step reasoning can benefit substantially from the 128k token context.
- Cost is no object – For high-value applications where performance improvements of even a few percentage points translate to significant business value, the price premium may be acceptable.
However, for most general-purpose applications, the value proposition is questionable. Companies with limited budgets may find better returns by:
- Sticking with GPT-4o for most tasks
- Using specialized models for specific domains (like mathematics)
- Exploring competing models like Claude 3.7 or Gemini 2.0 Flash, which offer similar capabilities at lower price points
- Investing in prompt engineering and fine-tuning of more affordable models
The Future of AI Scaling: Diminishing Returns?
GPT-4.5’s modest performance improvements despite massive computational investment raise profound questions about the future of AI development. Are we witnessing the beginning of diminishing returns in scaling language models? Has the low-hanging fruit of parameter counting and dataset expansion been largely picked?
If training costs keep scaling at this rate, GPT-5 will require 100X more compute than GPT-4 to train, and GPT-6 10,000X more. The incremental improvements do not justify that cost.
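That projection is just Karpathy’s rule of thumb extended forward. Here’s the back-of-the-envelope arithmetic, assuming GPT-4 as the 1X baseline:

```python
# Extrapolating the rule of thumb: ~10X training compute per 0.5 version bump,
# with GPT-4 as the baseline (= 1X). Purely illustrative.
def compute_multiplier(version: float, base: float = 4.0) -> float:
    half_steps = (version - base) / 0.5
    return 10 ** half_steps

for v in (4.0, 4.5, 5.0, 5.5, 6.0):
    print(f"GPT-{v:g}: {compute_multiplier(v):,.0f}X GPT-4's training compute")
# GPT-4: 1X, GPT-4.5: 10X, GPT-5: 100X, GPT-5.5: 1,000X, GPT-6: 10,000X
```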
But there are a few things working in our favor. For starters, bigger is not necessarily better. Models like Meta’s Llama 3 and Mistral 7B show that smaller, highly optimized models can match or outperform massive models on certain tasks at much lower compute cost.
We’re also seeing much better performance with Reasoning Models, which I covered in a previous blog post.
All in all, it’s clear that throwing more compute at the problem isn’t the best solution, and we need newer techniques. Maybe we can’t get to AGI this way, but the fact is that AI in its current state is already very useful, and most people haven’t even scratched the surface of it yet.