Released Feb 19, 2026

Gemini 3.1 Pro Benchmark Showdown

A comprehensive side-by-side comparison of frontier AI models: see how Google's latest stacks up against Claude Opus 4.6, GPT-5.2, Codex, and GLM-5.

Models compared:
- Gemini 3.1 Pro (Google)
- Claude Opus 4.6 (Anthropic)
- GPT-5.2 (OpenAI)
- Codex (OpenAI)
- GLM-5 (Zhipu AI)
Headline numbers:
🧠 ARC-AGI-2 (novel reasoning benchmark): 77.1%, up 2x
💻 LiveCodeBench Elo (competitive coding): 2,887, up 18%
🔬 GPQA Diamond (graduate-level science): 94.3%
🏆 Benchmarks won: 11 / 17, #1 position overall
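
Since the LiveCodeBench number above is an Elo rating rather than a percentage, it helps to know what a rating gap means. Here is a minimal sketch using the standard Elo expected-score formula; the 2,700-rated rival is a hypothetical comparison point, not a figure from these results:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 2,887-rated model vs. a hypothetical 2,700-rated rival:
print(f"{elo_expected_score(2887, 2700):.1%}")  # 74.6% expected win rate
```

In other words, under the Elo model a 187-point lead translates to winning roughly three head-to-head matchups out of four.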

🧠 Reasoning & Knowledge Benchmarks

How each model performs on reasoning-intensive tasks

💻 Coding & Software Engineering

Performance on coding benchmarks and real-world SWE tasks

📊 Multi-Dimensional Performance

Comparing capabilities across all dimensions

Key Insights

🎯 Gemini 3.1 Pro Dominates Reasoning
77.1% on ARC-AGI-2 is the largest single-generation reasoning gain among frontier models.

Opus 4.6 Still Wins at SWE
Claude Opus 4.6 narrowly leads Gemini 3.1 Pro on SWE-Bench Verified (80.8% vs 80.6%) and on expert-level tasks.

🔧 Codex for Specialized Coding
Codex (GPT-5.3) leads Terminal-Bench 2.0 at 77.3%, making it the strongest pick for dedicated coding workflows.

🏆 Benchmark Leaderboard

Overall ranking across all tested benchmarks

Top Performing Models (18 benchmarks tested)

1. Gemini 3.1 Pro (Google): 11 wins (new entry)
2. Claude Opus 4.6 (Anthropic): 3 wins (down 1)
3. Codex (GPT-5.3, OpenAI): 2 wins
4. GPT-5.2 (OpenAI): 0 wins (down 2)
5. GLM-5 (Zhipu AI): 0 wins
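
For readers who want to reproduce this kind of ranking, here is a minimal sketch of the tally: a win is counted as the top score on each benchmark, and models are ranked by win count. Except where this article states them, the per-benchmark scores below are placeholders for illustration, not the actual results:

```python
from collections import Counter

# Per-benchmark scores (higher is better). SWE-Bench Verified and the two
# Gemini/Codex figures come from this article; all other values are
# placeholders for illustration only.
scores = {
    "ARC-AGI-2":          {"Gemini 3.1 Pro": 77.1, "Claude Opus 4.6": 70.0, "GPT-5.2": 65.0},
    "SWE-Bench Verified":  {"Gemini 3.1 Pro": 80.6, "Claude Opus 4.6": 80.8, "GPT-5.2": 74.0},
    "GPQA Diamond":        {"Gemini 3.1 Pro": 94.3, "Claude Opus 4.6": 92.0, "GPT-5.2": 90.0},
}

# A "win" is the top score on a benchmark.
wins = Counter(max(row, key=row.get) for row in scores.values())

# Rank by win count, descending (models with zero wins don't appear).
for rank, (model, n) in enumerate(sorted(wins.items(), key=lambda kv: -kv[1]), start=1):
    print(f"{rank}. {model}: {n} win(s)")
```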

💰 Pricing Comparison

Cost per 1M tokens (input/output)

Gemini 3.1 Pro (Google): $2 / $12 per 1M tokens. 1M context window, 64K output tokens, 3 thinking levels, 75% cache savings.
Claude Opus 4.6 (Anthropic): $5 / $25 per 1M tokens. 1M context window, 64K output tokens, extended thinking, MCP integration.
GPT-5.2 (OpenAI): $2.50 / $10 per 1M tokens. 200K context window, 128K output tokens, native execution, general purpose.
Codex (GPT-5.3, OpenAI): $2.50 / $10 per 1M tokens. 400K context window, 128K output tokens, native execution, specialized for coding.
GLM-5 (Zhipu AI): $1 / $3.20 per 1M tokens. 200K context window, 128K output tokens, Chinese optimized, budget friendly.
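
To make the per-token prices concrete, here is a minimal sketch that estimates single-request cost from the list prices above. The 200K-token prompt, 8K-token response, and the flat-discount cache model are illustrative assumptions; real provider billing may differ:

```python
# List prices from the table above, in USD per 1M tokens (input, output).
PRICES = {
    "Gemini 3.1 Pro":  (2.00, 12.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.2":         (2.50, 10.00),
    "Codex (GPT-5.3)": (2.50, 10.00),
    "GLM-5":           (1.00, 3.20),
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0, cache_discount: float = 0.75) -> float:
    """Estimated cost in USD for one request.

    cache_discount is an illustrative flat discount on cached input tokens
    (the article cites 75% cache savings for Gemini 3.1 Pro); actual
    provider billing rules may differ.
    """
    in_price, out_price = PRICES[model]
    fresh = input_tokens - cached_input_tokens
    return (fresh * in_price
            + cached_input_tokens * in_price * (1 - cache_discount)
            + output_tokens * out_price) / 1_000_000

# Example: 200K-token prompt, 8K-token response, no caching.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 200_000, 8_000):.3f}")

# Same request with half the prompt served from cache.
print("Gemini 3.1 Pro, half the prompt cached: "
      f"${request_cost('Gemini 3.1 Pro', 200_000, 8_000, cached_input_tokens=100_000):.3f}")
```

Under these assumptions the uncached request costs about $0.50 on Gemini 3.1 Pro versus $1.20 on Claude Opus 4.6, and caching half the prompt brings the Gemini figure down to roughly $0.35.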

📈 Score Comparison by Benchmark

Detailed breakdown across all major benchmarks