Tendril
Real benchmark scores from LMSYS Arena, GPQA Diamond, AIME 2025, SWE-bench, MMMLU, and Humanity's Last Exam. No marketing benchmarks.
| Rank | Model | Overall | Arena Elo | GPQA Diamond | AIME 2025 | SWE-bench | Humanity's Last Exam | MMMLU | Context | Model cost |
|---|---|---|---|---|---|---|---|---|---|---|
| #1 | Claude Opus 4.7 (Anthropic, Apr 2026) | 90.2 | 1487 | 94.2% | 98.5% | 87.6% | 42.1% | 91.5% | 1M | $5 / $25 per 1M |
| #2 | GPT-5.5 (OpenAI, Apr 2026) | 89 | 1495 | 93.6% | 97.2% | 82.4% | 41.4% | 90.2% | 256K | $5 / $30 per 1M |
| #3 | Gemini 3.1 Pro (Google, Apr 2026) | 88.9 | 1487 | 91.9% | 100% | 78.3% | 45.8% | 91.8% | 10M | $2 / $12 per 1M |
| #4 | Kimi K2.5 Thinking (Kimi, Apr 2026) | 84 | 1445 | 88.5% | 99.1% | 72.1% | 44.9% | 88.2% | 256K | $0.6 / $2.5 per 1M |
| #5 | Grok 4.20 (xAI, Mar 2026) | 80.3 | 1456 | 86.3% | 92.4% | 68.5% | 35.2% | 86.7% | 2M | $3 / $15 per 1M |
| #6 | DeepSeek R1 (DeepSeek, Jan 2026) | 80 | 1424 | 85.7% | 96.3% | 71.2% | 38.5% | 85.1% | 128K | $0.55 / $2.19 per 1M |
| #7 | Llama 4 Scout (Meta, Nov 2025) | 66.2 | 1380 | 78.2% | 72.5% | 58.3% | 22.1% | 80.5% | 10M | Free (self-host) |
| #8 | Mistral Large 3 (Mistral, Feb 2026) | 64 | 1370 | 76.5% | 68.2% | 55.8% | 20.5% | 82.3% | 128K | ~$2 / ~$6 per 1M |
| #9 | Command R+ (Cohere, 2025) | 41.8 | 1250 | 58.2% | 35.4% | 38.5% | 8.2% | 72.1% | 128K | $2.5 / $10 per 1M |
| #10 | Sonar Pro (Perplexity, 2025) | 37.6 | 1280 | 52.1% | 28.3% | 25.4% | 5.1% | 68.5% | 128K | $20/mo Pro |
| #11 | Jamba 1.6 (AI21, 2025) | 29.6 | 1180 | 45.2% | 22.1% | 28.3% | 4.5% | 62.3% | 256K | API / Custom |