Yesterday OpenAI rolled out o3, the first reasoning model that is also agentic. Reasoning models have been around for a while, and o3 has been available in its mini version as well.
However, the full release yesterday showed us a model that not only reasons but can also browse, run Python, and look at your images across multiple thought loops. It behaves differently from the reasoning models we’ve seen so far, and that makes it unique.
OpenAI even hinted it “approaches AGI—with caveats.” Of course, OpenAI has been saying this for four years with every new model release, so take it with a pinch of salt. That being said, I did want to test it out and compare it against the current top model (Gemini 2.5 Pro) to see if it’s better.
What the experts and the numbers say
Before we get into the 4 tests I ran both models through, let’s look at the benchmarks and a snapshot of what o3 can do.
| Capability | o3 highlights |
|---|---|
| Benchmarks | 22.8% jump on SWE‑Bench Verified coding tasks and only one missed question on AIME 2024 math. |
| Vision reasoning | Rotates, crops, zooms, and then reasons over the edited view. It can “think with images”. |
| Full‑stack tool use | Seamlessly chains browsing, Python, image generation, and file analysis (no plug‑in wrangling required). |
| Access & price | Live for Plus, Pro, and Team; o3‑mini even shows up in the free tier with light rate limits. |
Field‑testing o3 against Gemini 2.5 Pro
Benchmarks are great, but I’ve stopped paying much attention to them recently. What really counts is whether a model can do what I want it to do.
Below are four experiments I ran, pitting o3 against Google’s best reasoning model in areas like research, vision, coding, and data science.
Deep‑dive research
I started with a basic research and reasoning test. I asked both models the same prompt: “What are people saying about ChatGPT o3? Find everything you can and interesting things it can do.”
Gemini started by thinking about the question, formulating a search plan, and executing against it. Because o3 is a brand new model, it’s not in Gemini’s training data, so it wasn’t sure if I meant o3 or ChatGPT-3 or 4o (yeah OpenAI’s naming confuses even the smartest AI models).
So to cover all bases, Gemini came up with 4 search queries and ran them in parallel. When the answers came back, it combined them all and gave me a final response.

o3, on the other hand, took the Sherlock route – search, read, reason, search again, fill a gap, repeat. The final response stitched together press reactions, Reddit hot takes, and early benchmark chatter.

This is where that agentic behaviour of o3 shines. As o3 found answers to its initial searches, it reasoned more and ran newer searches to plug gaps in the response. The final answer was well-rounded and solved my initial query.
Gemini only reasoned initially, and then after running the searches it combined everything into an answer. The problem is, because it wasn’t sure what o3 was when it first reasoned, one of the search queries was “what can ChatGPT do” instead of “what can o3 do”. So when it gave me the final answer, it didn’t quite solve my initial query.
Takeaway: Research isn’t a single pull‑request; it’s a feedback loop. o3 bakes that loop into the core model instead of outsourcing it to external agents or browser plug‑ins. When the question is fuzzy and context keeps shifting, that matters.
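To make the difference concrete, here’s a minimal sketch of the two research styles I saw. It’s illustrative only: `web_search` and `found_gaps` are stand-ins for whatever search tool and self-critique step the models actually use internally, not real APIs.

```python
def web_search(query: str) -> str:
    """Stand-in for a real search tool; returns a blob of result text."""
    return f"results for: {query}"

def found_gaps(notes: list[str]) -> list[str]:
    """Stand-in for the model asking itself 'what am I still missing?'."""
    return []  # pretend nothing is missing, so the loop terminates

def plan_then_combine(question: str) -> str:
    # Gemini-style: plan a fixed set of queries up front, run them, combine once.
    queries = [f"{question} reviews", f"{question} benchmarks",
               f"{question} reddit", f"{question} capabilities"]
    notes = [web_search(q) for q in queries]
    return "\n".join(notes)  # one-shot synthesis, no second pass

def search_reason_repeat(question: str, max_rounds: int = 5) -> str:
    # o3-style: search, read, look for gaps, and search again until satisfied.
    notes, queries = [], [question]
    for _ in range(max_rounds):
        notes += [web_search(q) for q in queries]
        queries = found_gaps(notes)   # new queries targeting what's still missing
        if not queries:               # nothing left to chase down
            break
    return "\n".join(notes)

print(search_reason_repeat("What are people saying about ChatGPT o3?"))
```

The second loop is what makes the final answer feel well-rounded: every round of reading feeds the next round of searching.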
Image sleuthing
Now, if you’ve used AI as much as I have, you might be thinking that o3’s research works almost like Deep Research, a feature that Gemini also has. And you’re right, it does.
But search isn’t the only tool o3 has in its arsenal. It can also use Python, and work with images, files, and more.
So my next test was to see if it could analyze and manipulate images. I tossed both models a picture of me taken in the Japan Pavilion at EPCOT, Disney World. I thought the Japanese backdrop might trip them up.

Ninety seconds later o3 not only pinned the location but pointed out a pin‑sized glimpse of Spaceship Earth peeking over the trees far in the background, something I’d missed entirely.
I was surprised it noticed that, so I asked it to point it out to me. Using Python, it identified the object, calculated its coordinates, and put a red circle right where the dome is! It was able to do this because it went through multiple steps of reasoning and tool use, showcasing its agentic capabilities.
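For the curious, the annotation step itself is simple once you have the coordinates. Here’s a rough sketch with Pillow; the file name, centre point, and radius are made-up values for illustration, not what o3 actually computed.

```python
from PIL import Image, ImageDraw

# Illustrative only: the path and the (x, y) centre of the dome are placeholders.
img = Image.open("epcot_japan.jpg").convert("RGB")
draw = ImageDraw.Draw(img)

cx, cy, r = 1480, 220, 40  # hypothetical centre and radius in pixels
draw.ellipse((cx - r, cy - r, cx + r, cy + r), outline="red", width=6)

img.save("epcot_japan_marked.jpg")  # the returned image with the red circle
```

The hard part isn’t the drawing; it’s the reasoning loop that finds the dome and works out where to put the circle in the first place.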
Gemini also got the location right, but it only identified the pagoda and torii gate, not Spaceship Earth. When I asked it to mark the torii gate, it could only describe its position in the image, but it couldn’t edit and send me back the image.
Takeaway: o3’s “vision ↔ code ↔ vision” loop unlocks practical image tasks like quality‑control checks, UI audits, or subtle landmark tagging. Any workflow that mixes text, numbers, code, and images can hand the grunt work to o3 while the human focuses on decision‑making.
Coding with bleeding‑edge libraries
Next up, I wanted to see how well it does with coding. Reasoning models by their nature are good at this, and Gemini has been my go-to recently.
I asked them both to “Build a tiny web app. One button starts a real‑time voice AI conversation and returns the transcript.”
The reason I chose this specific prompt is because Voice AI has improved a lot in recent weeks, and we’ve had some new libraries and SDKs come out around it. A lot of the newer stuff is beyond the cutoff date of these models.
So I wanted to see how well it does with gathering newer documentation and using that in its code versus what it already knows in its training data.
o3 researched the latest streaming speech API that dropped after its training cutoff, generated starter code, and offered the older text‑to‑speech fallback.
Gemini defaulted to last year’s speech‑to‑text loop and Google Cloud calls.
While both were technically correct and their code does work, o3 came back with the more up-to-date answer. Now, I could have pointed Gemini in the right direction and it would have coded something better, but that’s still an extra step that o3 eliminated out of the box.
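For context, the “last year” approach both models already know is essentially a record, transcribe, reply, speak round trip rather than a true streaming conversation. Here’s a minimal sketch of that pattern using the OpenAI Python SDK; the model names and file paths are my own examples, not what either model generated.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# 1. Transcribe a recorded question (speech-to-text).
with open("question.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. Generate a reply to the transcript.
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Speak the reply back (text-to-speech) and save it for playback.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
with open("answer.mp3", "wb") as out:
    out.write(speech.read())

print("User said:", transcript.text)
print("Assistant replied:", answer)
```

The newer streaming APIs collapse those three steps into one continuous conversation, which is exactly the documentation gap o3 went and researched on its own.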
Takeaway: o3’s autonomous web search makes it less likely to hand you stale SDK calls or older documentation.
Data analysis + forecasting
Finally, I wanted to put all the tools together into one test. I asked both models: “Chart how Canadian tourism to the U.S. is trending this year vs. last, then forecast to July 1.”
This combines search, image analysis, data analysis, Python, and chart creation. o3’s agentic loop served it well again: it searched, found data, identified gaps, and searched again until it could give me a bar chart.
Initially, it only found data for January 2025, so it only plotted that. When I asked it for data on February and March, it reasoned a lot longer, ran multiple searches, found various data, and eventually computed an answer.

Gemini found numbers for January and March, but nothing for February, and since it doesn’t have that agentic loop, it didn’t explore further and try to estimate the numbers from other sources like o3 did.
The most impressive part though was when I asked both to forecast the numbers into summer. Gemini couldn’t find data and couldn’t make the forecast. o3 on the other hand did more research, looked at broader trends like the tariffs and border issues, school breaks, airline discount season, even the NBA finals, and made assumptions around how that would impact travel going into summer.
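Mechanically, the plot-and-extrapolate step is not the hard part. Here’s a rough sketch of the simplest version with pandas and matplotlib; the monthly figures are placeholders, not the numbers either model found, and o3’s actual forecast folded in qualitative factors rather than a straight-line trend.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder figures for illustration only.
actual = pd.Series({"2025-01": 950_000, "2025-02": 900_000, "2025-03": 870_000})

# Naive forecast: carry the average month-over-month change forward to July.
step = actual.diff().mean()
level = actual.iloc[-1]
forecast = {}
for month in ["2025-04", "2025-05", "2025-06", "2025-07"]:
    level += step
    forecast[month] = level
forecast = pd.Series(forecast)

# Plot actuals and forecast in different colours on one bar chart.
combined = pd.concat([actual, forecast])
colors = ["steelblue"] * len(actual) + ["lightgray"] * len(forecast)
plt.bar(combined.index, combined.values, color=colors)
plt.ylabel("Canadian visitors to the U.S. per month (illustrative)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("tourism_forecast.png")
```

The interesting part is everything before this: hunting down the missing months and deciding what assumptions to bake into the trend.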

Takeaway: o3 feels like a junior quant who refuses to stop until every cell in the spreadsheet is filled (or at least justified). This search, reason, and analyze loop is invaluable for fields like investing, economics, finance, accounting, or anything else that deals with data.
Strengths, quirks, and when to reach for o3
Where it shines
- Multi‑step STEM problems, data wrangling, and “find the blind spot” research.
- Vision workflows that need both explanation and a marked‑up return image.
- Rapid prototyping with APIs newer than the model’s cutoff.
Where it still lags
- Creative long‑form prose: I still think Claude 3.7 is the better novelist, but that’s personal preference.
- Sheer response latency: the deliberative pass can stretch beyond a minute.
- Token thrift: the reasoning trace costs compute; budget accordingly.
- Personal advice: ChatGPT tends to be a bit of a sycophant, so if you’re using it as a therapist or life coach, take whatever it says with a big pinch of salt.
Final thoughts
I’d love to continue testing o3 out for coding and see if it can replace Gemini 2.5 Pro, but I do think it is already stronger with research and reasoning. It’s the employee who keeps researching after everyone heads to lunch, circles details no one else spotted, and checks the changelog before committing code.
If your work involves any mix of data, code, images, or the open web (and whose work doesn’t?), you’ll want that kind of persistence on tap. Today, that persistence is spelled o‑3.
Get more deep dives on AI
Like this post? Sign up for my newsletter and get notified every time I do a deep dive like this one.