Mathematical reasoning isn't won by a single champion anymore. It's won by those who know when to use which model for which problem.
I refreshed the Math Arena this morning and did a double-take. For the first time since I started tracking these rankings, OpenAI is no longer sitting at the top. Google's Gemini 3 Pro has seized the crown in mathematical reasoning, and the story only gets stranger from there. A Beijing-based startup called Moonshot just landed on the podium with a model most Western developers haven't even tried. After weeks of stress-testing the top contenders on everything from olympiad combinatorics to graduate-level real analysis, here's what the February data tells us about where mathematical AI is actually heading.
The Math Leaderboard
Mathematics remains the most honest benchmark in AI. You cannot charm your way through a differential equation or hallucinate a correct proof. An answer is right or it isn't. That binary clarity is what makes the Math Arena the benchmark I trust most when evaluating whether a model can truly reason. Here are all 60 ranked models as of February 2026.
| Rank | Model | Score | Votes | Organization |
|---|---|---|---|---|
| 🥇 | Gemini 3 Pro | 1484 | 2,252 | Google |
| 🥈 | Gemini 3 Flash | 1475 | 1,616 | Google |
| 🥉 | Kimi K2.5 Thinking | 1475 | 413 | Moonshot |
| #4 | Gpt 5.2 High | 1469 | 952 | OpenAI |
| #5 | Claude Opus 4 5 20251101 | 1469 | 1,879 | Anthropic |
| #6 | Gpt 5.1 High | 1467 | 1,862 | OpenAI |
| #7 | Claude Opus 4 5 20251101 Thinking 32k | 1467 | 1,585 | Anthropic |
| #8 | Gemini 3 Flash (thinking Minimal) | 1464 | 1,038 | Google |
| #9 | Ernie 5.0 0110 | 1462 | 580 | Baidu |
| #10 | Claude Sonnet 4 5 20250929 Thinking 32k | 1458 | 2,657 | Anthropic |
| #11 | O3 2025 04 16 | 1453 | 3,885 | OpenAI |
| #12 | Gemini 2.5 Pro | 1451 | 5,845 | Google |
| #13 | Grok 4.1 Thinking | 1450 | 2,058 | xAI |
| #14 | Claude Opus 4 1 20250805 Thinking 16k | 1446 | 3,059 | Anthropic |
| #15 | Qwen3 Max Preview | 1442 | 1,539 | Alibaba |
| #16 | Kimi K2 Thinking Turbo | 1440 | 1,949 | Moonshot |
| #17 | Gpt 5 High | 1439 | 1,939 | OpenAI |
| #18 | Gpt 5.2 | 1438 | 698 | OpenAI |
| #19 | Grok 4 0709 | 1438 | 2,309 | xAI |
| #20 | Claude Opus 4 1 20250805 | 1435 | 4,553 | Anthropic |
| #21 | Qwen3 Max 2025 09 23 | 1434 | 586 | Alibaba |
| #22 | Grok 4.1 | 1433 | 2,552 | xAI |
| #23 | Glm 4.7 | 1433 | 720 | Z.ai |
| #24 | Grok 4 Fast Chat | 1430 | 403 | xAI |
| #25 | Deepseek V3.2 Exp Thinking | 1429 | 478 | DeepSeek |
| #26 | Deepseek V3.2 | 1429 | 1,680 | DeepSeek |
| #27 | Claude Sonnet 4 5 20250929 | 1427 | 2,681 | Anthropic |
| #28 | Deepseek V3.2 Exp | 1426 | 785 | DeepSeek |
| #29 | Glm 4.6 | 1425 | 2,132 | Z.ai |
| #30 | Qwen3 235b A22b Instruct 2507 | 1424 | 4,158 | Alibaba |
| #31 | Longcat Flash Chat | 1424 | 694 | Meituan |
| #32 | Qwen3 Next 80b A3b Instruct | 1423 | 1,232 | Alibaba |
| #33 | Deepseek V3.1 Thinking | 1421 | 673 | DeepSeek |
| #34 | Gpt 5.1 | 1421 | 2,191 | OpenAI |
| #35 | Claude Opus 4 20250514 Thinking 16k | 1421 | 2,355 | Anthropic |
| #36 | O4 Mini 2025 04 16 | 1419 | 3,042 | OpenAI |
| #37 | Deepseek V3.1 | 1419 | 1,010 | DeepSeek |
| #38 | Glm 4.5 | 1418 | 1,455 | Z.ai |
| #39 | Kimi K2 0905 Preview | 1417 | 763 | Moonshot |
| #40 | Gpt 5 Chat | 1417 | 1,813 | OpenAI |
| #41 | Deepseek V3.1 Terminus Thinking | 1416 | 203 | DeepSeek |
| #42 | Gemini 2.5 Flash Preview 09 2025 | 1415 | 1,955 | Google |
| #43 | Qwen3 Vl 235b A22b Instruct | 1415 | 714 | Alibaba |
| #44 | Grok 4 Fast Reasoning | 1415 | 1,085 | xAI |
| #45 | Grok 4 1 Fast Reasoning | 1415 | 1,677 | xAI |
| #46 | Gemini 2.5 Flash | 1414 | 6,074 | Google |
| #47 | Gpt 4.5 Preview 2025 02 27 | 1414 | 1,384 | OpenAI |
| #48 | Gpt 5 Mini High | 1413 | 1,460 | OpenAI |
| #49 | Deepseek R1 | 1413 | 1,609 | DeepSeek |
| #50 | Ernie 5.0 Preview 1203 | 1413 | 632 | Baidu |
| #51 | Ernie 5.0 Preview 1022 | 1412 | 268 | Baidu |
| #52 | O1 2024 12 17 | 1412 | 2,980 | OpenAI |
| #53 | Qwen3 Vl 235b A22b Thinking | 1411 | 419 | Alibaba |
| #54 | Mistral Large 3 | 1410 | 1,471 | Mistral |
| #55 | O3 Mini High | 1409 | 1,906 | OpenAI |
| #56 | Deepseek V3.2 Thinking | 1409 | 1,273 | DeepSeek |
| #57 | Claude Sonnet 4 20250514 Thinking 32k | 1407 | 2,131 | Anthropic |
| #58 | Qwen3 235b A22b Thinking 2507 | 1406 | 506 | Alibaba |
| #59 | Hunyuan T1 20250711 | 1406 | 242 | Tencent |
| #60 | Mistral Medium 2508 | 1405 | 3,912 | Mistral |
Google Takes the Crown
I've watched Google's mathematical AI evolve for three years, and what they've pulled off this month is nothing short of remarkable. Gemini 3 Pro didn't just edge into Gold. It arrived nine points clear of the field, at 1484 against 1475 for the next two models. But the real power move? Gemini 3 Flash sitting right behind it at Silver. Google now holds both Gold and Silver simultaneously in the Math Arena. That has never happened before.
What makes this significant goes beyond rankings. It's the architecture strategy. Gemini 3 Pro is the heavyweight, built for maximum reasoning depth, the kind of model you point at research-level proofs and multi-step derivations. Gemini 3 Flash is optimized for speed and cost. The fact that a speed-optimized model can compete at the Silver level tells us Google has cracked something fundamental about how to make mathematical reasoning faster without sacrificing accuracy. The thinking-minimal variant at #8 offers yet another price-performance tradeoff, and older workhorses like Gemini 2.5 Pro at #12 and Gemini 2.5 Flash at #46 continue to serve reliably.
Google places six models in the top 60 across two generations (Gemini 3 and Gemini 2.5) and multiple price tiers. They aren't building one great math model. They're building an entire mathematical reasoning stack, from affordable Flash to flagship Pro, all sharing the same underlying advances.
My prediction: Google will hold this lead through at least mid-2026. Their approach of embedding mathematical reasoning as a core capability across the product line, rather than concentrating it in one flagship, is paying compounding dividends. If you're building anything that requires reliable mathematical computation, from financial modeling to scientific simulation, Gemini should be your first call right now.
The Moonshot Surprise
Here's the story nobody was writing three months ago. Moonshot's Kimi K2.5 Thinking has landed at #3, tied on points with Gemini 3 Flash for the Silver position. Let that register. A model from a startup founded in 2023 is mathematically level with Google's second-best offering.
I've been testing Kimi K2.5 Thinking extensively, and what strikes me is its approach to extended reasoning. Where other thinking models sometimes produce verbose chains of thought that circle a problem before landing, Kimi's reasoning feels almost unnervingly direct. It identifies the core mathematical structure quickly, then builds toward the solution with minimal detours. For competition-style problems where you need both accuracy and a clean logical chain, that directness is a genuine advantage.
Moonshot places three models in the top 60: Kimi K2.5 Thinking at #3, Kimi K2 Thinking Turbo at #16, and Kimi K2 0905 Preview at #39. Three tiers, one architecture philosophy. This kind of multi-tier presence from a startup is unprecedented. The message is clear: the era when only trillion-dollar companies could build world-class mathematical AI is over. Focused research investment in reasoning architecture can compete with massive compute budgets. Expect more labs to follow this playbook throughout 2026.
OpenAI After the Throne
Let me be direct. GPT-5.2 High, which had held Gold since its debut, now sits at #4, tied on points with Claude Opus 4.5. The crown has been taken. But before anyone writes the obituary, look at the full picture.
OpenAI still places twelve models in the top 60, more than any other organization. That's not a company in crisis. That's a company with such ecosystem depth that even losing #1 leaves it dominating the middle and upper tiers. GPT-5.1 High holds #6. The o3 reasoning model at #11 remains my go-to for competition-level problems that demand deep multi-step computation. GPT-5 High at #17, the standard GPT-5.2 at #18, and o4-mini at #36 give builders options across every price tier and latency requirement.
The o-Series Advantage
OpenAI's dedicated reasoning models (o3, o4-mini, o1, o3-mini) occupy four positions in the top 60. For problems that require extended computation, such as proving inequalities, constraint satisfaction, or combinatorial arguments, the o-series' adjustable thinking time remains a powerful tool. Few other providers expose this much control over reasoning depth.
Looking ahead, I believe OpenAI's response will come fast. The gap between GPT-5.2 High and Gemini 3 Pro is not insurmountable, and OpenAI's pattern has always been to iterate aggressively after losing ground. I would not be surprised to see a GPT-5.3 or a significant reasoning update before summer. The deeper story here isn't a fall. It's that the top of the Math Arena is now so fiercely competitive that holding #1 demands continuous innovation, not a single strong release.
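To put "not insurmountable" in numbers, here is a minimal sketch that converts the score gap into a head-to-head win rate. It assumes the leaderboard scores behave like Elo ratings on the usual 400-point logistic scale, which the data source does not confirm. The function name `expected_win_rate` is made up for this illustration.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    Assumption: Math Arena scores are Elo-style ratings on a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Scores copied from the leaderboard table above.
gemini_3_pro = 1484
gpt_5_2_high = 1469

p = expected_win_rate(gemini_3_pro, gpt_5_2_high)
print(f"Gemini 3 Pro expected win rate vs GPT-5.2 High: {p:.1%}")
```

Under that assumption, a 15-point lead works out to a little over 52% of head-to-head votes, which is why a single strong release could plausibly flip the order.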
The Thinking Model Revolution
Scan the top 10 of this leaderboard and count how many model names include the word "thinking." The answer is telling: Kimi K2.5 Thinking at #3, Claude Opus 4.5 Thinking at #7, Gemini 3 Flash thinking-minimal at #8, Claude Sonnet 4.5 Thinking at #10. Expand to the top 20 and they're everywhere. This is the single biggest structural shift in mathematical AI over the past year.
These models allocate additional compute at inference time to work through problems step by step before committing to an answer. It's the AI equivalent of a mathematician reaching for scratch paper before writing the final proof. The results are encouraging but not uniform: Claude Sonnet 4.5, Claude Opus 4.1, and Grok 4.1 all gain double-digit points from their thinking variants, while DeepSeek V3.2 and the Qwen3 235B pairs go the other way.
Anthropic's implementation tells this story especially well. Claude Sonnet 4.5 Thinking at #10 punches well above its weight class, cracking the top 10 despite being a mid-tier model by design, 31 points ahead of the standard Sonnet 4.5 at #27. The flagship is the exception: Claude Opus 4.5 Thinking-32k at #7 sits two points behind the standard Opus 4.5 at #5, a gap small enough to call a tie. Anthropic places eight models total in the top 60, and their hallmark remains pedagogical clarity. When I need a model that doesn't just solve a problem but explains why the solution works in a way a student could genuinely learn from, Claude is still unmatched.
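The thinking-versus-standard comparison can be checked directly against the table. The sketch below pairs each thinking variant with the model I am treating as its standard counterpart. That pairing is my own assumption, especially for Qwen3, where the "Instruct" model stands in for the standard one. The scores are copied from the leaderboard above.

```python
# (family, thinking score, standard score), copied from the leaderboard table.
# Which model counts as the "standard counterpart" is an assumption.
pairs = [
    ("Claude Opus 4.5", 1467, 1469),
    ("Claude Sonnet 4.5", 1458, 1427),
    ("Claude Opus 4.1", 1446, 1435),
    ("Grok 4.1", 1450, 1433),
    ("DeepSeek V3.2", 1409, 1429),
    ("DeepSeek V3.2 Exp", 1429, 1426),
    ("DeepSeek V3.1", 1421, 1419),
    ("Qwen3 235B A22B 2507", 1406, 1424),
    ("Qwen3 VL 235B A22B", 1411, 1415),
]

# Positive delta means the thinking variant scores higher.
deltas = {name: thinking - standard for name, thinking, standard in pairs}
wins = sum(1 for d in deltas.values() if d > 0)

for name, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {d:+d}")
print(f"thinking ahead in {wins} of {len(pairs)} pairs")
```

On these nine pairs the thinking variant is ahead in five, with the largest gain on Claude Sonnet 4.5 and the largest deficit on DeepSeek V3.2. It is worth checking the specific pair you plan to deploy rather than assuming the uplift.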
My prediction: by the end of 2026, the distinction between "standard" and "thinking" models will disappear. Every model will dynamically allocate reasoning time based on problem complexity. The current generation of explicitly labeled thinking variants is a transitional step toward universally adaptive reasoning.
The practical takeaway is simple: if accuracy matters more than latency, start with the thinking variant. The mathematical uplift is real for most model families, though not universal, so compare the specific pair you plan to deploy. For production applications where response time is critical, standard variants remain excellent. But for research, education, or any scenario where getting the right answer is paramount, thinking models are the present and the future.
The Global Math Landscape
Pull the camera back and the geography of this leaderboard tells its own story. Of the 60 ranked models, 26 come from Chinese organizations. That's 43% of the entire field. American labs hold 32 spots at 53%, and Mistral brings European representation with two models. Mathematical AI capability is now genuinely multipolar, and that shift has accelerated faster than almost anyone predicted.
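The regional split is easy to reproduce from the Organization column. This short sketch uses per-organization counts tallied from the table above. The region assigned to each organization is my own labeling, based on where each company is headquartered.

```python
from collections import Counter

# Models per organization, tallied from the leaderboard table above.
models_per_org = {
    "OpenAI": 12, "Anthropic": 8, "Google": 6, "xAI": 6,
    "DeepSeek": 8, "Alibaba": 7, "Moonshot": 3, "Z.ai": 3,
    "Baidu": 3, "Meituan": 1, "Tencent": 1,
    "Mistral": 2,
}

# Region labels are an assumption based on company headquarters.
region_of = {
    "OpenAI": "US", "Anthropic": "US", "Google": "US", "xAI": "US",
    "DeepSeek": "China", "Alibaba": "China", "Moonshot": "China",
    "Z.ai": "China", "Baidu": "China", "Meituan": "China", "Tencent": "China",
    "Mistral": "Europe",
}

totals = Counter()
for org, count in models_per_org.items():
    totals[region_of[org]] += count

total = sum(totals.values())
shares = {region: round(100 * n / total) for region, n in totals.items()}

for region, n in totals.most_common():
    print(f"{region:<7} {n:>2} models  {shares[region]}%")
```

The counts sum to 60, and the rounded shares match the figures quoted above: 53% US, 43% China, 3% Europe.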
DeepSeek stands out with eight models in the top 60, tied with Anthropic for the second-highest count after OpenAI. The v3.2 family across positions #25, #26, #28, and #56 offers an impressive range, while the v3.1 series and the battle-tested DeepSeek R1 at #49 fill out the middle tiers. What makes DeepSeek remarkable is the cost-to-capability ratio. In my testing, DeepSeek V3.2 delivers top-30 mathematical performance at roughly a fifth of what flagship models charge. For teams operating at scale with budget constraints, that ratio is transformative.
Alibaba's Qwen3 family contributes seven models, from Qwen3 Max Preview at #15 down through open-weight variants that developers can fine-tune on their own infrastructure. That open-weight strategy matters for industries with data sovereignty requirements, and it's a deliberate ecosystem play. xAI's Grok family places six models, led by Grok 4.1 Thinking at #13, which continues to find elegant shortcuts in proof-style problems. Z.ai's GLM series holds three spots, Baidu contributes three ERNIE variants, and we see entries from Meituan and Tencent as well.
The depth and breadth of participation tells me where mathematical AI is heading: this is no longer a race between two or three frontrunners. It's an ecosystem, and the ecosystem is getting richer by the month. No single country, company, or research tradition can claim a monopoly on mathematical reasoning anymore. And for those of us building on these tools, that competition is the best thing that could happen.
My Field Guide
After years of testing these models on everything from olympiad problems to real-world engineering calculations, here's the question builders keep asking me: which model should I actually use? The honest answer depends entirely on what you're building.
Research-Grade Accuracy
Gemini 3 Pro at #1. Google's flagship leads in raw mathematical capability. My first choice for novel problems where correctness is non-negotiable.
Speed Without Sacrifice
Gemini 3 Flash at #2. Podium-level accuracy at significantly lower latency and cost. Perfect for production math pipelines that need both quality and throughput.
The Dark Horse
Kimi K2.5 Thinking at #3. Moonshot's reasoning approach is remarkably efficient. Worth exploring seriously if you haven't yet, particularly for competition-style problems.
Ecosystem Depth
OpenAI with twelve models across every tier. The o-series for competition math, GPT-5.x for general reasoning. No other provider offers this range.
Best Explanations
Claude with eight models in the top 60. When understanding why an answer is correct matters as much as the answer itself. Unmatched pedagogical clarity.
Budget Champion
DeepSeek with eight models in the top 60. Top-30 capability at a fraction of the cost. Essential for teams building at scale or in cost-sensitive environments.
There is no single best mathematical AI. The winning strategy in 2026 is orchestration: Gemini for top-tier accuracy and speed, OpenAI's o-series for deep reasoning, Claude for explainability, DeepSeek and Kimi for efficiency. Build your pipeline with multiple providers and you will consistently outperform any single model.
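As a rough illustration of what orchestration can look like, here is a minimal routing sketch that maps a task category to one of the field guide picks. The category names, the `pick_model` helper, and the choice of a specific model within each family are all hypothetical. The strings are display names from this article, not real API model identifiers.

```python
# Hypothetical routing table based on the field guide above.
# Values are display names from the leaderboard, not provider API model IDs.
ROUTES = {
    "research": "Gemini 3 Pro",        # research-grade accuracy
    "throughput": "Gemini 3 Flash",    # speed and cost
    "competition": "o3",               # deep multi-step reasoning
    "explanation": "Claude Opus 4.5",  # pedagogical clarity
    "budget": "DeepSeek V3.2",         # cost-sensitive workloads
}

# Fall back to the top-ranked model when the category is unknown.
DEFAULT_MODEL = "Gemini 3 Pro"


def pick_model(task_kind: str) -> str:
    """Return the model name to use for a task category (case-insensitive)."""
    return ROUTES.get(task_kind.strip().lower(), DEFAULT_MODEL)


for kind in ("research", "Explanation", "budget", "something-else"):
    print(f"{kind:<15} -> {pick_model(kind)}")
```

A real pipeline would sit behind each provider's own client library and would likely add fallbacks and cost limits. The point of the sketch is only that the routing decision itself can start as a plain lookup table.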
Data Source: Rankings from AI Arena Math Leaderboard, February 6, 2026.