Gemini 3 Pro: The Best AI Model, by a Mile

I’m really excited by this one. When Gemini 2.5 Pro came out months ago, it was incredible, but Anthropic and OpenAI caught up quickly.

Gemini 3 is something else altogether. It’s head and shoulders above the rest.

I built a personal AI boxing coach that tracks my hand movements through my computer’s camera in real time and gives me feedback on my punching combinations. Gemini 3 generated the entire working app in about two minutes from a single vague prompt.

I’ll show you exactly how that works later in this post. But first, let’s look at what makes Gemini 3 different from previous models, and how it compares to Claude and GPT-5.

Benchmarks

Ok, benchmarks can be gamed and shouldn’t be used as the only metric for model selection, but they still give us a directionally correct view of a model’s capabilities and how it compares with others.

Gemini 3 Pro hit 37.5% on Humanity’s Last Exam without using any tools. This benchmark tests PhD-level reasoning across science, math, and complex problem-solving. A score of 37.5% means it’s solving problems that would stump most humans with advanced degrees. For context, GPT-5 scored 26.5% on the same test.

The GPQA Diamond benchmark tells us even more about the model’s capabilities. Gemini 3 scored 91.9% on questions requiring graduate-level knowledge in physics, chemistry, and biology, putting it well ahead of the others.

The 23.4% score on MathArena Apex is particularly impressive because this benchmark specifically tests advanced mathematical reasoning. Other models struggled to get out of single digits on this test.

This matters more than you might think. Mathematical reasoning underlies so much of what we ask AI to do, from analyzing data to writing algorithms to solving optimization problems. A model that can handle complex math can handle the logical reasoning required for most technical tasks.

But the benchmark that matters most for my work is coding performance. Gemini 3 Pro scored 54.2% on Terminal-Bench 2.0, which tests a model’s ability to operate a computer through the terminal, putting it far ahead of the next-best model. This benchmark is about knowing how to navigate file systems, run commands, debug errors, install dependencies, and actually operate the way a developer would.

How It Compares to Claude and GPT-5

Before Gemini 3, my workflow was split between models based on their strengths. Claude 4.5 Sonnet was my primary coding and writing model. The reasoning was solid, the code quality was reliable, and it rarely needed multiple iterations to get things right. It understood context well and made reasonable architectural decisions.

GPT-5 handled everything else. Data analysis, structured tasks, anything that required processing large amounts of information quickly and presenting it in organized formats.

Now with Gemini 3, I’m testing whether I can consolidate to a single model. The early signs are promising. The coding quality matches or exceeds Claude for the tests I’ve run so far. The reasoning feels tighter and more consistent than GPT-5. The multimodal understanding (working with images, video, and text simultaneously) is better than either competitor.

And it’s cheaper.

I’ll spend the next few days pushing it harder to see if these early positive impressions hold up under sustained use, but this is the first model in months that feels like it might be genuinely all-in-one capable rather than best-in-class for specific tasks.

What I Actually Built With It

To properly test Gemini 3’s capabilities, I needed to move beyond simple prompts and build something with real complexity. I wanted to see how it handled tasks that require understanding vague requirements, making architectural decisions, and implementing features that involve multiple moving parts.

The Boxing Coach Demo

I gave it this prompt: “Build me an app that is a boxing teacher, use my computer’s camera to track my hands, display on the screen some image to tell me what combination to throw, maybe paddles, and then track my hand hitting the objects.”

This is a deliberately vague prompt. I’m describing the outcome I want without specifying the implementation details. I’m not telling it which computer vision library to use, how to structure the tracking logic, what the UI should look like, or how to handle the timing of combinations.

Gemini 3 understood what I was asking for and went several steps further. It built real-time computer vision tracking using the computer’s camera, which is non-trivial to implement correctly. It overlaid target indicators on the screen that show where to punch.
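
To be concrete about what that tracking layer involves: the app runs entirely in the browser, grabbing the webcam stream and running a hand-landmark model on every frame. I haven’t audited every line Gemini wrote, so treat this as my own minimal reconstruction of that layer, assuming MediaPipe’s HandLandmarker (the usual choice for in-browser hand tracking) rather than whatever the generated code actually uses.

```typescript
import { FilesetResolver, HandLandmarker } from "@mediapipe/tasks-vision";

// Illustrative reconstruction, not the generated code: webcam capture plus
// per-frame hand-landmark detection, all running in the browser.
async function startTracking(
  video: HTMLVideoElement,
  onHand: (x: number, y: number) => void, // called with normalized wrist coordinates
): Promise<void> {
  // 1. Ask for the camera and pipe it into a <video> element.
  video.srcObject = await navigator.mediaDevices.getUserMedia({ video: true });
  await video.play();

  // 2. Load an in-browser hand-landmark model (CDN and asset paths are placeholders).
  const vision = await FilesetResolver.forVisionTasks(
    "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm",
  );
  const landmarker = await HandLandmarker.createFromOptions(vision, {
    baseOptions: { modelAssetPath: "hand_landmarker.task" },
    runningMode: "VIDEO",
    numHands: 2,
  });

  // 3. Every animation frame, detect hands and report normalized (x, y)
  //    coordinates so the game layer can compare them against its targets.
  const loop = () => {
    const result = landmarker.detectForVideo(video, performance.now());
    for (const hand of result.landmarks) {
      onHand(hand[0].x, hand[0].y); // landmark 0 is the wrist
    }
    requestAnimationFrame(loop);
  };
  requestAnimationFrame(loop);
}
```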

But it also recognized that this was meant to be a training tool, not just a detection system, so it added a complete scoring system to track accuracy, a streak counter to gamify the experience and keep you motivated, estimated calorie burn based on the activity, and multiple difficulty levels labeled “light,” “fighter,” and “champion.”
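
The gamification layer on top is plain bookkeeping once the tracking reports hand positions. The sketch below is my own reconstruction of that logic, not the code Gemini produced; the hit radius, difficulty windows, and MET value are numbers I made up, though the calorie formula itself (MET times 3.5 times body weight in kilograms, divided by 200, per minute) is the standard one from exercise physiology.

```typescript
// My back-of-the-envelope version of the game logic, not Gemini's code.
// Coordinates are normalized to [0, 1], so distances are fractions of the frame.
type Difficulty = "light" | "fighter" | "champion";

// How long a target stays on screen before it counts as a miss (placeholder values).
const HIT_WINDOW_MS: Record<Difficulty, number> = {
  light: 2000,
  fighter: 1200,
  champion: 700,
};

interface Session {
  hits: number;
  attempts: number;
  streak: number;
  startedAt: number; // epoch ms
}

// A punch lands when a tracked hand point gets within `radius` of the target.
function isHit(
  hand: { x: number; y: number },
  target: { x: number; y: number },
  radius = 0.08,
): boolean {
  return Math.hypot(hand.x - target.x, hand.y - target.y) < radius;
}

// A target expires (counts as a miss) once its difficulty window has elapsed.
function isExpired(shownAt: number, difficulty: Difficulty): boolean {
  return Date.now() - shownAt > HIT_WINDOW_MS[difficulty];
}

function recordAttempt(s: Session, landed: boolean): void {
  s.attempts += 1;
  if (landed) {
    s.hits += 1;
    s.streak += 1;
  } else {
    s.streak = 0;
  }
}

// Rough calorie estimate from the standard MET formula:
// kcal per minute = MET * 3.5 * weightKg / 200. The MET value here is a guess.
function estimateCalories(s: Session, weightKg: number, met = 7.5): number {
  const minutes = (Date.now() - s.startedAt) / 60_000;
  return (met * 3.5 * weightKg * minutes) / 200;
}
```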

The entire implementation took about two minutes and worked on the first try. No debugging. No iteration. It one-shotted a complex implementation that involved computer vision, real-time tracking, UI overlay, game logic, scoring mechanics, and even some basic exercise physiology calculations.

The Personal Finance Tracker

For the second test, I wanted to see how it handled a more practical business application. I asked it to build a personal finance expense tracker that uses AI to read screenshots or uploaded receipts and automatically categorize expenses.

Gemini 3 figured out the architecture it would need (frontend interface for uploading receipts, backend processing to handle the files, AI integration for optical character recognition and categorization logic), and started building all the components.
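
The heart of the AI integration is a single multimodal call: send the receipt image plus a prompt asking for structured fields, get JSON back. I didn’t keep the exact code the CLI generated, so this is an approximation written against Google’s @google/genai JavaScript SDK; the model ID, prompt wording, and response handling are my assumptions, not what Gemini actually wrote.

```typescript
import { readFile } from "node:fs/promises";
import { GoogleGenAI } from "@google/genai";

// Approximation of the AI-integration step, not the CLI's generated code.
// Assumes the @google/genai SDK; older SDKs name these fields differently.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

interface Expense {
  merchant: string;
  amount: number;
  date: string;     // ISO 8601
  category: string; // e.g. "groceries", "dining", "travel"
}

async function extractExpense(path: string): Promise<Expense> {
  // Baked-in input assumption: the image is a JPEG. Anything else
  // would need converting before this call.
  const data = (await readFile(path)).toString("base64");

  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash", // placeholder: swap in the Gemini 3 model ID you have access to
    contents: [
      { inlineData: { mimeType: "image/jpeg", data } },
      {
        text:
          "Extract the merchant, total amount, date (ISO 8601) and a " +
          "one-word expense category from this receipt. Return JSON only.",
      },
    ],
    config: { responseMimeType: "application/json" },
  });

  return JSON.parse(response.text ?? "{}") as Expense;
}
```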

The receipt scanning hit an edge case during my demo: I uploaded an image in Apple’s HEIC format, and the code expected JPEG. So it’s not a God model, but this is also the kind of thing that’s trivial to fix once you identify it.

When I uploaded a JPEG instead, it worked correctly. The model extracted the merchant name, the amount, the date, and made a reasonable guess at categorizing the expense.

This tells me something important about the current state of AI coding assistants. Gemini 3 can build production-quality architecture and implement complex features correctly. It understands the problem domain well enough to make good decisions about structure and flow. But it still makes assumptions about inputs and edge cases that you’d catch in code review. It’s not replacing careful testing and validation, but it’s dramatically reducing the time from idea to working prototype.

The Five Ways to Access Gemini 3

Google, being Google, has six or seven different apps and platforms from which you can access the model, and some of them have the same name, which is confusing as hell. But I digress.

AI Mode in Google Search

This is the first time Google has shipped a new Gemini model directly into Google Search on day one. That’s a significant shift in strategy. Previous models launched in limited betas, gradually rolling out to small groups of users while Google monitored for problems. This is a full production deployment to billions of users immediately, which signals a level of confidence in the model’s reliability that wasn’t there for previous releases.

AI Mode introduces “generative interfaces” that automatically design customized user experiences based on your prompt. Upload a PDF about DNA replication and it might generate both a text explanation and an interactive simulation showing how base pairs split and replicate. Ask about travel planning and it generates a magazine-style interface with photos, modules, and interactive prompts asking about your preferences for activities and dining.

The model is making UI decisions on the fly. It’s deciding “this question would be better answered with an interactive calculator” or “this needs a visual timeline” and then building those interfaces in real-time. This is something that Perplexity has been trying to do for a while, and Google just came out and nailed it.

The Gemini App

This is the ChatGPT-equivalent interface available at gemini.google.com. You’ll want to select “Thinking” mode to use Gemini 3 Pro rather than the faster but less capable Flash model.

I tested the creative writing capabilities by asking it to write about Gemini 3 in the style of a science fiction novel. The output started with “The whispers began as a faint hum, a resonance in the deep network…” and maintained that tone throughout several paragraphs.

What struck me was how it avoided the typical AI writing tells. You know the ones I’m talking about. The “it’s not just X, it’s Y” construction that appears in every ChatGPT essay. The em-dashes sprinkled in far more often than any human writer actually uses them. The breathless hype that creeps into every topic, making even mundane subjects sound like earth-shattering revelations.

Gemini 3’s output felt notably cleaner. More measured. Less like it was trying to convince me how excited I should be about the topic.

I still wouldn’t publish it without editing (it’s AI-generated prose, not literature) but it doesn’t immediately announce itself as AI-written the way GPT outputs tend to do. That matters if you’re using AI as part of your writing process rather than as a complete replacement for human writing.

AI Studio for Rapid Prototyping

This is Google’s developer playground with a “Build Mode” that’s particularly useful for quick prototyping. If you’re a product manager who needs to see three variations of a feature before your next standup, or a designer who wants to test an interaction pattern before committing to a full implementation, this is where you go.

Everything runs in the browser. You can test it immediately, see what works and what doesn’t, modify the code inline, and then download the result or push it directly to GitHub. The iteration loop is fast enough that you can explore multiple approaches in the time it would normally take to carefully code one version.

This is where I built the boxing coach demo. I pasted in my prompt, it generated all the code, and I could immediately test it in the browser to see the camera tracking and UI overlays working in real-time.

Gemini CLI for Development Work

The Gemini CLI is similar to Claude Code: a command-line interface where you ask it to build applications and it creates all the necessary files, writes the code, and sets up the project structure.

This is where I built the personal finance tracker. I gave it one prompt describing what I wanted, and it figured out the requirements, came up with an implementation plan, asked for my Google Gemini API key (which it would need for the receipt processing functionality), and started generating files.

The CLI is better than AI Studio for anything beyond frontend prototypes. If you need backend services, database schemas, API integrations, or multi-file projects with proper separation of concerns, this is the right tool for the job. It understands project structure and can scaffold out complete applications rather than single-file demos.
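
To make “beyond frontend prototypes” concrete, here’s roughly the shape of the upload route a scaffolded backend like the finance tracker’s ends up with. This is a hypothetical Node/Express reconstruction, not the code the CLI generated; the route path, field name, and MIME-type check are my own choices.

```typescript
import express from "express";
import multer from "multer";

// Hypothetical sketch of a scaffolded backend route, not generated code.
const app = express();
const upload = multer({ storage: multer.memoryStorage() });

app.post("/api/receipts", upload.single("receipt"), (req, res) => {
  const file = req.file;
  if (!file) {
    return res.status(400).json({ error: "no file uploaded" });
  }
  // Validating inputs up front is exactly the guard the generated code skipped;
  // this is where an unexpected format like HEIC would get rejected or converted.
  if (!["image/jpeg", "image/png"].includes(file.mimetype)) {
    return res.status(415).json({ error: `unsupported format: ${file.mimetype}` });
  }
  // Hand file.buffer to the Gemini extraction step sketched earlier, persist
  // the result, then return it. Stubbed here to keep the sketch short.
  res.json({ status: "received", bytes: file.buffer.length });
});

app.listen(3000);
```

None of this is complicated on its own, but it’s exactly the multi-file, backend-plus-validation kind of work that AI Studio’s single-page prototypes aren’t built for.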

Google Antigravity

Antigravity is Google’s new agentic development platform where AI agents can autonomously plan and execute complex software tasks across your editor, terminal, and browser.

It looks like a Visual Studio Code fork: file explorer on the left, code editor in the middle, agent chat panel on the right. The interface is familiar if you’ve used any modern IDE. You can power it with Gemini 3, Anthropic’s Claude Sonnet 4.5, or OpenAI’s GPT-OSS models, which is an interesting choice. Google built an IDE and made it model-agnostic rather than locking it to its own models.

The feature that sets Antigravity apart is Agent Manager mode. Instead of working directly in the code editor with AI assistance responding to your prompts, you can spin up multiple independent agents that run tasks in parallel. You could have one agent researching best practices for building personal finance apps, another working on the frontend implementation, and a third handling backend architecture, all running simultaneously without you needing to context-switch between them.

This isn’t drastically different from running multiple tasks sequentially in the CLI. The underlying capability is similar. The value is in the interface. You can see what’s happening across all the agents in one view, manage them from a single place, and stay in the development environment instead of switching between terminal windows. It’s the same core capability wrapped in significantly better user experience.

I’m planning a full deep dive on Antigravity because there’s more to explore here. Subscribe below to read it.

Where This Fits in the AI Race

The AI model race is now operating on a cadence where major releases from all three companies happen within weeks of each other. Each release raises the baseline for what’s expected from frontier models. Features that were impressive and novel six months ago are now table stakes that every competitive model needs to match.

What’s interesting about Gemini 3 is that it’s not just incrementally better in one dimension. It’s showing meaningful improvements across multiple dimensions simultaneously. Better reasoning, better coding, better multimodal understanding, better interface generation.

That’s rare. Usually you get big improvements in one area at the cost of regressions elsewhere, or small improvements across the board. Genuine leaps across multiple capabilities at once are uncommon.

What I’m Testing Next

I’m planning to use Gemini 3 as my primary model for the next week to see if the early positive impressions hold up under sustained use. The areas I’m specifically testing are code quality on complex refactoring tasks, reasoning performance on strategic planning problems, and reliability when building multi-file projects with proper architecture.

I’m also diving deeper into Antigravity to understand how the multi-agent system handles coordination, how agents resolve conflicts when they’re working on related code, and how reliable they are when running unsupervised for extended periods.

The boxing coach and finance tracker were quick tests to see if it could handle real-time complexity and practical business logic. Next I want to see how it performs on the kind of work I do daily: building AI agents, writing technical documentation, debugging production issues, and architecting new systems from scratch.

If it holds up across these more demanding tests, this might actually become the all-in-one model I’ve been waiting for. The real test is whether it’s still impressive after a week of daily use when the novelty has worn off.

Have you tried Gemini 3 yet? What are you planning to build with it?

Get more deep dives on AI

Like this post? Sign up for my newsletter and get notified every time I do a deep dive like this one.