Released Feb 19, 2026
Gemini 3.1 Pro Benchmark Showdown
A comprehensive head-to-head comparison of frontier AI models. See how Google's latest model stacks up against Claude Opus 4.6, GPT-5.2, Codex (GPT-5.3), and GLM-5.
🧠 ARC-AGI-2 Score: 77.1% (↑ 2×) · novel reasoning benchmark
💻 LiveCodeBench Elo: 2,887 (↑ 18%) · competitive coding
🔬 GPQA Diamond: 94.3% · graduate-level science
🏆 Benchmarks Won: 11 / 17 · #1 position overall
🧠 Reasoning & Knowledge Benchmarks
How each model performs on reasoning-intensive tasks
💻 Coding & Software Engineering
Performance on coding benchmarks and real-world SWE tasks
📊 Multi-Dimensional Performance
Comparing capabilities across all dimensions
Key Insights
🎯 Gemini 3.1 Pro Dominates Reasoning: 77.1% on ARC-AGI-2 is the largest single-generation reasoning gain among frontier models.
⚡ Opus 4.6 Still Wins at SWE: Claude Opus 4.6 narrowly leads on SWE-Bench Verified (80.8% vs. 80.6%) and on expert-level tasks.
🔧 Codex for Specialized Coding: Codex (GPT-5.3) leads Terminal-Bench 2.0 at 77.3%, making it the pick for dedicated coding workflows.
🏆 Benchmark Leaderboard
Overall ranking across all tested benchmarks
Top Performing Models
18 benchmarks tested
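As a rough illustration of how a leaderboard like this can be tallied (assuming the ranking simply counts first-place finishes per benchmark, which this page implies but does not state outright), the sketch below computes win counts from a score table. Only the handful of scores quoted on this page are included, so the counts are partial, and the SCORES table and function name are illustrative rather than the site's actual methodology.

```python
# Illustrative win-count tally for a benchmark leaderboard.
# Assumption: the overall ranking counts first-place finishes per benchmark.
# Only scores quoted on this page are included; benchmarks with a single
# entry trivially go to the listed model, so treat the output as a sketch.

from collections import Counter

# benchmark -> {model: score}; higher is better for every entry shown here
SCORES = {
    "ARC-AGI-2":          {"Gemini 3.1 Pro": 77.1},
    "GPQA Diamond":       {"Gemini 3.1 Pro": 94.3},
    "SWE-Bench Verified": {"Claude Opus 4.6": 80.8, "Gemini 3.1 Pro": 80.6},
    "Terminal-Bench 2.0": {"Codex (GPT-5.3)": 77.3},
}

def tally_wins(scores: dict[str, dict[str, float]]) -> Counter:
    """Count how many benchmarks each model places first on."""
    wins = Counter()
    for results in scores.values():
        winner = max(results, key=results.get)
        wins[winner] += 1
    return wins

if __name__ == "__main__":
    for model, wins in tally_wins(SCORES).most_common():
        print(f"{model}: {wins} benchmark(s) won")
```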
💰 Pricing Comparison
Cost per 1M tokens (input/output)
Gemini 3.1 Pro (Google): $2 input / $12 output per 1M tokens
✓ 1M context window
✓ 64K output tokens
✓ 3 thinking levels
✓ 75% cache savings
Claude Opus 4.6 (Anthropic): $5 input / $25 output per 1M tokens
✓ 1M context window
✓ 64K output tokens
✓ Extended thinking
✓ MCP integration
GPT-5.2 (OpenAI): $2.50 input / $10 output per 1M tokens
✓ 200K context window
✓ 128K output tokens
✓ Native execution
✓ General purpose
Codex (GPT-5.3, OpenAI): $2.50 input / $10 output per 1M tokens
✓ 400K context window
✓ 128K output tokens
✓ Native execution
✓ Specialized coding
GLM-5 (Zhipu AI): $1 input / $3.20 output per 1M tokens
✓ 200K context window
✓ 128K output tokens
✓ Optimized for Chinese
✓ Budget friendly
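To make the per-1M-token prices above concrete, here is a minimal cost sketch that multiplies a request's input and output token counts by each model's listed rates. The rates are the ones quoted above; applying the "75% cache savings" as a discount on cached input tokens only is an assumption about how that discount works, and the function name and example token counts are illustrative.

```python
# Minimal request-cost estimator using the per-1M-token prices listed above.
# Assumption: the "75% cache savings" (Gemini 3.1 Pro) discounts cached
# *input* tokens only; output tokens are always billed at the full rate.

# model -> (input $/1M tokens, output $/1M tokens)
PRICES = {
    "Gemini 3.1 Pro":  (2.00, 12.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.2":         (2.50, 10.00),
    "Codex (GPT-5.3)": (2.50, 10.00),
    "GLM-5":           (1.00, 3.20),
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0, cache_discount: float = 0.75) -> float:
    """Estimate the dollar cost of one request; pass cached_tokens only
    for providers that actually offer a cache discount."""
    input_rate, output_rate = PRICES[model]
    fresh_tokens = input_tokens - cached_tokens
    cost = (
        fresh_tokens * input_rate
        + cached_tokens * input_rate * (1 - cache_discount)  # discounted cache reads
        + output_tokens * output_rate
    ) / 1_000_000
    return cost

if __name__ == "__main__":
    # Example: 100K input tokens (half served from cache) and 4K output tokens
    print(f"Gemini 3.1 Pro: ${request_cost('Gemini 3.1 Pro', 100_000, 4_000, cached_tokens=50_000):.3f}")
    print(f"Claude Opus 4.6: ${request_cost('Claude Opus 4.6', 100_000, 4_000):.3f}")
    print(f"GLM-5: ${request_cost('GLM-5', 100_000, 4_000):.3f}")
```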
📈 Score Comparison by Benchmark
Detailed breakdown across all major benchmarks