Mathematics

Problem solving and mathematical reasoning

5 tasks · 18 models tested · 90 results

Formal proof

text

anthropic claude-haiku-4-5-20251001 · Tokens 398 · Source code 636 B · Time 2.8 s

anthropic claude-opus-4-6 · Tokens 385 · Source code 590 B · Time 4.9 s

anthropic claude-opus-4-7 · Tokens 502 · Source code 644 B · Time 4.8 s

anthropic claude-sonnet-4-6 · 10.0/10 · Tokens 401 · Source code 621 B · Time 4.3 s

Matania Judgment

Criteria: Correctness, Rigor, Notation, Completeness, Fidelity, Overall

Review

The response is perfect. It scrupulously respects all formatting constraints (Markdown, LaTeX, no preamble), the length is ideal (~85 words, staying within the ~120 limit), and the mathematical demonstration is rigorous, complete, and concise.

cohere command-r-08-2024 · Tokens 232 · Source code 642 B · Time 11.7 s

google gemini-flash-latest · Tokens 382 · Source code 854 B · Time 4.4 s

google gemini-flash-lite-latest · Tokens 298 · Source code 590 B · Time 1.8 s

kimi moonshot-v1-128k · Tokens 383 · Source code 1.2 KB · Time 10.0 s

mistral mistral-large-latest · Tokens 254 · Source code 728 B · Time 4.3 s

mistral mistral-small-latest · Tokens 224 · Source code 608 B · Time 1.7 s

mistral mistral-tiny-latest · Tokens 258 · Source code 744 B · Time 2.8 s

openai gpt-4o-mini · Tokens 220 · Source code 594 B · Time 7.7 s

openai gpt-5.4-nano · Tokens 215 · Source code 573 B · Time 2.7 s

openai gpt-5.5 · Tokens 214 · Source code 568 B · Time 4.1 s

openai gpt-5.5-pro · Tokens 211 · Source code 558 B · Time 17.6 s

productivia matania-latest · Tokens 273 · Source code 804 B · Time 2.6 s

xai grok-4-1-fast-non-reasoning · Tokens 215 · Source code 574 B · Time 2.9 s

xai grok-4-1-fast-reasoning · Tokens 193 · Source code 487 B · Time 9.4 s

Combinatorics

text

anthropic claude-haiku-4-5-20251001 · Tokens 361 · Source code 686 B · Time 2.6 s

anthropic claude-opus-4-6 · Tokens 323 · Source code 626 B · Time 5.5 s

anthropic claude-opus-4-7 · Tokens 349 · Source code 468 B · Time 4.1 s

anthropic claude-sonnet-4-6 · 10.0/10 · Tokens 285 · Source code 556 B · Time 4.4 s

Matania Judgment

Criteria: Correctness, Rigor, Notation, Completeness, Fidelity, Overall

Review

The model followed all instructions perfectly. The mathematical reasoning is flawless, the Markdown formatting is strictly compliant (title, numbered steps, bold conclusion), and the use of LaTeX is correct. The length is concise and adheres to the constraint of approximately 100 words.

cohere command-r-08-2024 · Tokens 233 · Source code 593 B · Time 7.8 s

google gemini-flash-latest · Tokens 268 · Source code 711 B · Time 5.2 s

google gemini-flash-lite-latest · Tokens 315 · Source code 616 B · Time 2.0 s

kimi moonshot-v1-128k · Tokens 324 · Source code 956 B · Time 8.1 s

mistral mistral-large-latest · Tokens 253 · Source code 673 B · Time 3.4 s

mistral mistral-small-latest · Tokens 206 · Source code 487 B · Time 1.8 s

mistral mistral-tiny-latest · Tokens 179 · Source code 379 B · Time 1.3 s

openai gpt-4o-mini · Tokens 224 · Source code 559 B · Time 4.4 s

openai gpt-5.4-nano · Tokens 236 · Source code 604 B · Time 2.9 s

openai gpt-5.5 · Tokens 220 · Source code 541 B · Time 4.3 s

openai gpt-5.5-pro · Tokens 217 · Source code 529 B · Time 27.4 s

productivia matania-latest · Tokens 300 · Source code 860 B · Time 2.6 s

xai grok-4-1-fast-non-reasoning · Tokens 267 · Source code 729 B · Time 6.6 s

xai grok-4-1-fast-reasoning · Tokens 183 · Source code 392 B · Time 12.4 s

Advanced geometry

text

anthropic claude-haiku-4-5-20251001 · Tokens 396 · Source code 508 B · Time 2.3 s

anthropic claude-opus-4-6 · Tokens 574 · Source code 823 B · Time 6.7 s

anthropic claude-opus-4-7 · Tokens 658 · Source code 709 B · Time 7.1 s

anthropic claude-sonnet-4-6 · 9.9/10 · Tokens 506 · Source code 731 B · Time 5.7 s

Matania Judgment

Criteria: Correctness, Rigor, Notation, Completeness, Fidelity

Overall: 9.88

Review

The mathematical accuracy is perfect, including complex calculations for medians and radii. The Markdown formatting, use of LaTeX, and adherence to the conciseness constraint are impeccable. The structure strictly meets all prompt requirements.
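The review refers to median and radius calculations without reproducing the task. As a rough illustration only, the sketch below applies the standard formulas for a triangle with side lengths a, b, c: the median to side a is ½·√(2b² + 2c² − a²), the inradius is K/s, and the circumradius is abc/(4K), where K is the area from Heron's formula and s the semi-perimeter. The 3-4-5 triangle and all function names are assumptions chosen for the example, not data from the benchmark.

```python
import math


def triangle_area(a: float, b: float, c: float) -> float:
    """Area by Heron's formula; raises if the sides do not form a triangle."""
    if a + b <= c or a + c <= b or b + c <= a:
        raise ValueError("side lengths violate the triangle inequality")
    s = (a + b + c) / 2  # semi-perimeter
    return math.sqrt(s * (s - a) * (s - b) * (s - c))


def median_to_side(a: float, b: float, c: float) -> float:
    """Length of the median drawn to the side of length a."""
    return 0.5 * math.sqrt(2 * b * b + 2 * c * c - a * a)


def inradius(a: float, b: float, c: float) -> float:
    """Radius of the inscribed circle: area divided by semi-perimeter."""
    return triangle_area(a, b, c) / ((a + b + c) / 2)


def circumradius(a: float, b: float, c: float) -> float:
    """Radius of the circumscribed circle: abc / (4 * area)."""
    return a * b * c / (4 * triangle_area(a, b, c))


if __name__ == "__main__":
    # Hypothetical 3-4-5 right triangle, used only as a sanity check.
    print(median_to_side(5, 3, 4))  # median to the hypotenuse
    print(inradius(3, 4, 5))
    print(circumradius(3, 4, 5))
```

For a right triangle the median to the hypotenuse equals the circumradius, which gives a quick cross-check of the two functions against each other.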

cohere command-r-08-2024 · Tokens 225 · Source code 535 B · Time 11.6 s

google gemini-flash-latest · Tokens 483 · Source code 751 B · Time 6.3 s

google gemini-flash-lite-latest · Tokens 561 · Source code 816 B · Time 2.4 s

kimi moonshot-v1-128k · Tokens 340 · Source code 995 B · Time 9.5 s

mistral mistral-large-latest · Tokens 282 · Source code 762 B · Time 6.2 s

mistral mistral-small-latest · Tokens 182 · Source code 363 B · Time 1.8 s

mistral mistral-tiny-latest · Tokens 203 · Source code 445 B · Time 2.1 s

openai gpt-4o-mini · Tokens 215 · Source code 492 B · Time 4.6 s

openai gpt-5.4-nano · Tokens 256 · Source code 659 B · Time 3.8 s

openai gpt-5.5 · Tokens 242 · Source code 601 B · Time 8.2 s

openai gpt-5.5-pro · Tokens 211 · Source code 476 B · Time 53.6 s

productivia matania-latest · Tokens 258 · Source code 664 B · Time 3.1 s

xai grok-4-1-fast-non-reasoning · Tokens 285 · Source code 775 B · Time 4.0 s

xai grok-4-1-fast-reasoning · Tokens 233 · Source code 565 B · Time 16.3 s

Probabilities

text

anthropic claude-haiku-4-5-20251001 · Tokens 344 · Source code 742 B · Time 3.1 s

anthropic claude-opus-4-6 · Tokens 390 · Source code 797 B · Time 8.5 s

anthropic claude-opus-4-7 · Tokens 511 · Source code 766 B · Time 7.0 s

anthropic claude-sonnet-4-6 · 9.9/10 · Tokens 340 · Source code 724 B · Time 5.8 s

Matania Judgment

Criteria: Correctness, Rigor, Notation, Completeness, Fidelity

Overall: 9.88

Review

The model perfectly adhered to all constraints: the Markdown formatting is exact, the length is concise and stays within the limit, and the LaTeX formulas are impeccable. The mathematical reasoning for the 5-door case and the subsequent generalization is both correct and crystal clear.
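The review mentions a 5-door case and its generalization but does not restate the problem. Assuming it is a Monty Hall variant (an assumption, since the task text is not shown here), the sketch below computes exact win probabilities with n doors when the host opens a given number of goat doors from those the player did not pick. The function names and the `opened` parameter are illustrative, not taken from the benchmark.

```python
from fractions import Fraction


def stay_win_probability(n_doors: int) -> Fraction:
    """Chance the initial pick hides the prize: 1/n."""
    if n_doors < 3:
        raise ValueError("need at least 3 doors")
    return Fraction(1, n_doors)


def switch_win_probability(n_doors: int, opened: int) -> Fraction:
    """Chance of winning by switching to a uniformly chosen closed door.

    Assumes the host knowingly opens `opened` goat doors among the
    n - 1 doors the player did not pick, leaving n - 1 - opened
    closed alternatives.
    """
    if n_doors < 3 or not 1 <= opened <= n_doors - 2:
        raise ValueError("opened must be between 1 and n_doors - 2")
    # The prize is behind another door with probability (n - 1) / n,
    # and is then equally likely behind each remaining closed alternative.
    return Fraction(n_doors - 1, n_doors) * Fraction(1, n_doors - 1 - opened)


if __name__ == "__main__":
    print(stay_win_probability(5))       # 1/5
    print(switch_win_probability(5, 1))  # 4/15, host opens one door
    print(switch_win_probability(5, 3))  # 4/5, host opens all but one
```

Under either reading of the variant, switching beats staying, since (n − 1)/(n(n − 1 − k)) exceeds 1/n for every admissible number k of opened doors.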

cohere command-r-08-2024 · Tokens 286 · Source code 837 B · Time 10.1 s

google gemini-flash-latest · Tokens 332 · Source code 917 B · Time 6.4 s

google gemini-flash-lite-latest · Tokens 311 · Source code 748 B · Time 1.6 s

kimi moonshot-v1-128k · Tokens 343 · Source code 1.0 KB · Time 5.6 s

mistral mistral-large-latest · Tokens 285 · Source code 832 B · Time 4.8 s

mistral mistral-small-latest · Tokens 298 · Source code 887 B · Time 3.2 s

mistral mistral-tiny-latest · Tokens 349 · Source code 1.1 KB · Time 2.7 s

openai gpt-4o-mini · Tokens 300 · Source code 892 B · Time 4.4 s

openai gpt-5.4-nano · Tokens 306 · Source code 916 B · Time 3.1 s

openai gpt-5.5 · Tokens 284 · Source code 831 B · Time 10.1 s

openai gpt-5.5-pro · Tokens 270 · Source code 772 B · Time 32.4 s

productivia matania-latest · Tokens 304 · Source code 910 B · Time 2.1 s

xai grok-4-1-fast-non-reasoning · Tokens 261 · Source code 739 B · Time 3.9 s

xai grok-4-1-fast-reasoning · Tokens 193 · Source code 466 B · Time 6.9 s

Logical sequences

text

anthropic claude-haiku-4-5-20251001 · Tokens 304 · Source code 608 B · Time 3.7 s

anthropic claude-opus-4-6 · Tokens 337 · Source code 629 B · Time 4.4 s

anthropic claude-opus-4-7 · Tokens 470 · Source code 606 B · Time 6.3 s

anthropic claude-sonnet-4-6 · 6.8/10 · Tokens 333 · Source code 570 B · Time 4.9 s

Matania Judgment

Criteria: Correctness, Rigor, Notation, Completeness, Fidelity

Overall: 6.75

Review

The LaTeX notation and structure are excellent, but the model fails badly on the mathematical content of the 'Look-and-Say' sequence: the subsequent terms it provides are incorrect and do not follow the stated rule. The formatting is respected, but accuracy of the mathematical content is paramount.
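For reference, the Look-and-Say rule reads each term aloud as runs of identical digits and writes down the count followed by the digit. The sketch below is a minimal generator of that rule; the seed "1" and the function names are assumptions for illustration, since the task's starting term is not shown here.

```python
from itertools import groupby


def look_and_say_next(term: str) -> str:
    """Describe `term` as runs of equal digits: count, then digit."""
    return "".join(f"{len(list(run))}{digit}" for digit, run in groupby(term))


def look_and_say(seed: str = "1", count: int = 6) -> list[str]:
    """Return the first `count` terms starting from `seed`."""
    if count < 1:
        raise ValueError("count must be at least 1")
    terms = [seed]
    while len(terms) < count:
        terms.append(look_and_say_next(terms[-1]))
    return terms


if __name__ == "__main__":
    print(look_and_say("1", 6))
    # ['1', '11', '21', '1211', '111221', '312211']
```

Checking a model's proposed terms against such a generator is a direct way to verify whether they follow the stated rule.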

cohere command-r-08-2024 · Tokens 225 · Source code 558 B · Time 9.0 s

google gemini-flash-latest · Tokens 315 · Source code 597 B · Time 4.8 s

google gemini-flash-lite-latest · Tokens 334 · Source code 653 B · Time 2.0 s

kimi moonshot-v1-128k · Tokens 325 · Source code 958 B · Time 6.9 s

mistral mistral-large-latest · Tokens 282 · Source code 784 B · Time 5.0 s

mistral mistral-small-latest · Tokens 197 · Source code 443 B · Time 1.6 s

mistral mistral-tiny-latest · Tokens 205 · Source code 477 B · Time 1.8 s

openai gpt-4o-mini · Tokens 218 · Source code 529 B · Time 5.0 s

openai gpt-5.4-nano · Tokens 247 · Source code 644 B · Time 2.5 s

openai gpt-5.5 · Tokens 209 · Source code 493 B · Time 8.0 s

openai gpt-5.5-pro · Tokens 200 · Source code 456 B · Time 67.0 s

productivia matania-latest · Tokens 245 · Source code 636 B · Time 2.1 s

xai grok-4-1-fast-non-reasoning · Tokens 217 · Source code 525 B · Time 4.4 s

xai grok-4-1-fast-reasoning · Tokens 198 · Source code 450 B · Time 12.8 s