Released Feb 19, 2026

Gemini 3.1 Pro Benchmark Showdown

A comprehensive side-by-side comparison of frontier AI models: see how Google's latest stacks up against Claude Opus 4.6, GPT-5.2, Codex, and GLM-5.

Models compared:
- Gemini 3.1 Pro (Google)
- Claude Opus 4.6 (Anthropic)
- GPT-5.2 (OpenAI)
- Codex (OpenAI)
- GLM-5 (Zhipu AI)
Headline numbers:
🧠 ARC-AGI-2 (novel reasoning benchmark): 77.1%, up 2x
💻 LiveCodeBench Elo (competitive coding): 2,887, up 18%
🔬 GPQA Diamond (graduate-level science): 94.3%
🏆 Benchmarks won: 11 / 17, #1 position overall
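
Since the LiveCodeBench number above is an Elo rating rather than a percentage, it helps to know what a rating gap means. Here is a minimal sketch using the standard Elo expected-score formula; the 2,700-rated rival is a hypothetical comparison point, not a figure from these results:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 2,887-rated model vs. a hypothetical 2,700-rated rival:
print(f"{elo_expected_score(2887, 2700):.1%}")  # 74.6% expected win rate
```

In other words, under the Elo model a 187-point lead translates to winning roughly three head-to-head matchups out of four.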

🧠 Reasoning & Knowledge Benchmarks

How each model performs on reasoning-intensive tasks

💻 Coding & Software Engineering

Performance on coding benchmarks and real-world SWE tasks

📊 Multi-Dimensional Performance

Comparing capabilities across all dimensions

Key Insights

🎯 Gemini 3.1 Pro Dominates Reasoning
77.1% on ARC-AGI-2 is the largest single-generation reasoning gain among frontier models.

Opus 4.6 Still Wins at SWE
Claude Opus 4.6 narrowly leads Gemini 3.1 Pro on SWE-Bench Verified (80.8% vs 80.6%) and on expert-level tasks.

🔧 Codex for Specialized Coding
Codex (GPT-5.3) leads Terminal-Bench 2.0 at 77.3%, making it the strongest pick for dedicated coding workflows.

🏆 Benchmark Leaderboard

Overall ranking across all tested benchmarks

Top Performing Models (18 benchmarks tested)

1. Gemini 3.1 Pro (Google): 11 wins (new entry)
2. Claude Opus 4.6 (Anthropic): 3 wins (down 1)
3. Codex (GPT-5.3, OpenAI): 2 wins
4. GPT-5.2 (OpenAI): 0 wins (down 2)
5. GLM-5 (Zhipu AI): 0 wins
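
For readers who want to reproduce this kind of ranking, here is a minimal sketch of the tally: a win is counted as the top score on each benchmark, and models are ranked by win count. Except where this article states them, the per-benchmark scores below are placeholders for illustration, not the actual results:

```python
from collections import Counter

# Per-benchmark scores (higher is better). SWE-Bench Verified and the two
# Gemini/Codex figures come from this article; all other values are
# placeholders for illustration only.
scores = {
    "ARC-AGI-2":          {"Gemini 3.1 Pro": 77.1, "Claude Opus 4.6": 70.0, "GPT-5.2": 65.0},
    "SWE-Bench Verified":  {"Gemini 3.1 Pro": 80.6, "Claude Opus 4.6": 80.8, "GPT-5.2": 74.0},
    "GPQA Diamond":        {"Gemini 3.1 Pro": 94.3, "Claude Opus 4.6": 92.0, "GPT-5.2": 90.0},
}

# A "win" is the top score on a benchmark.
wins = Counter(max(row, key=row.get) for row in scores.values())

# Rank by win count, descending (models with zero wins don't appear).
for rank, (model, n) in enumerate(sorted(wins.items(), key=lambda kv: -kv[1]), start=1):
    print(f"{rank}. {model}: {n} win(s)")
```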

💰 Pricing Comparison

Cost per 1M tokens (input/output)

Gemini 3.1 Pro (Google): $2 / $12 per 1M tokens. 1M context window, 64K output tokens, 3 thinking levels, 75% cache savings.
Claude Opus 4.6 (Anthropic): $5 / $25 per 1M tokens. 1M context window, 64K output tokens, extended thinking, MCP integration.
GPT-5.2 (OpenAI): $2.50 / $10 per 1M tokens. 200K context window, 128K output tokens, native execution, general purpose.
Codex (GPT-5.3, OpenAI): $2.50 / $10 per 1M tokens. 400K context window, 128K output tokens, native execution, specialized for coding.
GLM-5 (Zhipu AI): $1 / $3.20 per 1M tokens. 200K context window, 128K output tokens, Chinese optimized, budget friendly.
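
To make the per-token prices concrete, here is a minimal sketch that estimates single-request cost from the list prices above. The 200K-token prompt, 8K-token response, and the flat-discount cache model are illustrative assumptions; real provider billing may differ:

```python
# List prices from the table above, in USD per 1M tokens (input, output).
PRICES = {
    "Gemini 3.1 Pro":  (2.00, 12.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.2":         (2.50, 10.00),
    "Codex (GPT-5.3)": (2.50, 10.00),
    "GLM-5":           (1.00, 3.20),
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0, cache_discount: float = 0.75) -> float:
    """Estimated cost in USD for one request.

    cache_discount is an illustrative flat discount on cached input tokens
    (the article cites 75% cache savings for Gemini 3.1 Pro); actual
    provider billing rules may differ.
    """
    in_price, out_price = PRICES[model]
    fresh = input_tokens - cached_input_tokens
    return (fresh * in_price
            + cached_input_tokens * in_price * (1 - cache_discount)
            + output_tokens * out_price) / 1_000_000

# Example: 200K-token prompt, 8K-token response, no caching.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 200_000, 8_000):.3f}")

# Same request with half the prompt served from cache.
print("Gemini 3.1 Pro, half the prompt cached: "
      f"${request_cost('Gemini 3.1 Pro', 200_000, 8_000, cached_input_tokens=100_000):.3f}")
```

Under these assumptions the uncached request costs about $0.50 on Gemini 3.1 Pro versus $1.20 on Claude Opus 4.6, and caching half the prompt brings the Gemini figure down to roughly $0.35.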

📈 Score Comparison by Benchmark

Detailed breakdown across all major benchmarks