Mathematics

Problem solving and mathematical reasoning

5 tasks · 18 models tested · 90 results

Formal proof

text
| Provider | Model | Tokens | Source code | Time |
|---|---|---|---|---|
| anthropic | claude-haiku-4-5-20251001 | 398 | 636 B | 2.8 s |
| anthropic | claude-opus-4-6 | 385 | 590 B | 4.9 s |
| anthropic | claude-opus-4-7 | 502 | 644 B | 4.8 s |
| anthropic | claude-sonnet-4-6 | 401 | 621 B | 4.3 s |
| cohere | command-r-08-2024 | 232 | 642 B | 11.7 s |
| google | gemini-flash-latest | 382 | 854 B | 4.4 s |
| google | gemini-flash-lite-latest | 298 | 590 B | 1.8 s |
| kimi | moonshot-v1-128k | 383 | 1.2 KB | 10.0 s |
| mistral | mistral-large-latest | 254 | 728 B | 4.3 s |
| mistral | mistral-small-latest | 224 | 608 B | 1.7 s |
| mistral | mistral-tiny-latest | 258 | 744 B | 2.8 s |
| openai | gpt-4o-mini | 220 | 594 B | 7.7 s |
| openai | gpt-5.4-nano | 215 | 573 B | 2.7 s |
| openai | gpt-5.5 | 214 | 568 B | 4.1 s |
| openai | gpt-5.5-pro | 211 | 558 B | 17.6 s |
| productivia | matania-latest | 273 | 804 B | 2.6 s |
| xai | grok-4-1-fast-non-reasoning | 215 | 574 B | 2.9 s |
| xai | grok-4-1-fast-reasoning | 193 | 487 B | 9.4 s |

Matania Judgment (anthropic claude-sonnet-4-6, 10.0/10)
Correctness 10 · Rigor 10 · Notation 10 · Completeness 10 · Fidelity 10 · Overall 10
Review: The response is perfect. It scrupulously respects all formatting constraints (Markdown, LaTeX, no preamble), the length is ideal (~85 words, staying within the ~120 limit), and the mathematical demonstration is rigorous, complete, and concise.

Combinatorics

text
| Provider | Model | Tokens | Source code | Time |
|---|---|---|---|---|
| anthropic | claude-haiku-4-5-20251001 | 361 | 686 B | 2.6 s |
| anthropic | claude-opus-4-6 | 323 | 626 B | 5.5 s |
| anthropic | claude-opus-4-7 | 349 | 468 B | 4.1 s |
| anthropic | claude-sonnet-4-6 | 285 | 556 B | 4.4 s |
| cohere | command-r-08-2024 | 233 | 593 B | 7.8 s |
| google | gemini-flash-latest | 268 | 711 B | 5.2 s |
| google | gemini-flash-lite-latest | 315 | 616 B | 2.0 s |
| kimi | moonshot-v1-128k | 324 | 956 B | 8.1 s |
| mistral | mistral-large-latest | 253 | 673 B | 3.4 s |
| mistral | mistral-small-latest | 206 | 487 B | 1.8 s |
| mistral | mistral-tiny-latest | 179 | 379 B | 1.3 s |
| openai | gpt-4o-mini | 224 | 559 B | 4.4 s |
| openai | gpt-5.4-nano | 236 | 604 B | 2.9 s |
| openai | gpt-5.5 | 220 | 541 B | 4.3 s |
| openai | gpt-5.5-pro | 217 | 529 B | 27.4 s |
| productivia | matania-latest | 300 | 860 B | 2.6 s |
| xai | grok-4-1-fast-non-reasoning | 267 | 729 B | 6.6 s |
| xai | grok-4-1-fast-reasoning | 183 | 392 B | 12.4 s |

Matania Judgment (anthropic claude-sonnet-4-6, 10.0/10)
Correctness 10 · Rigor 10 · Notation 10 · Completeness 10 · Fidelity 10 · Overall 10
Review: The model followed all instructions perfectly. The mathematical reasoning is flawless, the Markdown formatting is strictly compliant (title, numbered steps, bold conclusion), and the use of LaTeX is correct. The length is concise and adheres to the constraint of approximately 100 words.

Advanced geometry

text
| Provider | Model | Tokens | Source code | Time |
|---|---|---|---|---|
| anthropic | claude-haiku-4-5-20251001 | 396 | 508 B | 2.3 s |
| anthropic | claude-opus-4-6 | 574 | 823 B | 6.7 s |
| anthropic | claude-opus-4-7 | 658 | 709 B | 7.1 s |
| anthropic | claude-sonnet-4-6 | 506 | 731 B | 5.7 s |
| cohere | command-r-08-2024 | 225 | 535 B | 11.6 s |
| google | gemini-flash-latest | 483 | 751 B | 6.3 s |
| google | gemini-flash-lite-latest | 561 | 816 B | 2.4 s |
| kimi | moonshot-v1-128k | 340 | 995 B | 9.5 s |
| mistral | mistral-large-latest | 282 | 762 B | 6.2 s |
| mistral | mistral-small-latest | 182 | 363 B | 1.8 s |
| mistral | mistral-tiny-latest | 203 | 445 B | 2.1 s |
| openai | gpt-4o-mini | 215 | 492 B | 4.6 s |
| openai | gpt-5.4-nano | 256 | 659 B | 3.8 s |
| openai | gpt-5.5 | 242 | 601 B | 8.2 s |
| openai | gpt-5.5-pro | 211 | 476 B | 53.6 s |
| productivia | matania-latest | 258 | 664 B | 3.1 s |
| xai | grok-4-1-fast-non-reasoning | 285 | 775 B | 4.0 s |
| xai | grok-4-1-fast-reasoning | 233 | 565 B | 16.3 s |

Matania Judgment (anthropic claude-sonnet-4-6, 9.9/10)
Correctness 10 · Rigor 9 · Notation 10 · Completeness 10 · Fidelity 10 · Overall 9.88
Review: The mathematical accuracy is perfect, including complex calculations for medians and radii. The Markdown formatting, use of LaTeX, and adherence to the conciseness constraint are impeccable. The structure strictly meets all prompt requirements.

Probabilities

text
| Provider | Model | Tokens | Source code | Time |
|---|---|---|---|---|
| anthropic | claude-haiku-4-5-20251001 | 344 | 742 B | 3.1 s |
| anthropic | claude-opus-4-6 | 390 | 797 B | 8.5 s |
| anthropic | claude-opus-4-7 | 511 | 766 B | 7.0 s |
| anthropic | claude-sonnet-4-6 | 340 | 724 B | 5.8 s |
| cohere | command-r-08-2024 | 286 | 837 B | 10.1 s |
| google | gemini-flash-latest | 332 | 917 B | 6.4 s |
| google | gemini-flash-lite-latest | 311 | 748 B | 1.6 s |
| kimi | moonshot-v1-128k | 343 | 1.0 KB | 5.6 s |
| mistral | mistral-large-latest | 285 | 832 B | 4.8 s |
| mistral | mistral-small-latest | 298 | 887 B | 3.2 s |
| mistral | mistral-tiny-latest | 349 | 1.1 KB | 2.7 s |
| openai | gpt-4o-mini | 300 | 892 B | 4.4 s |
| openai | gpt-5.4-nano | 306 | 916 B | 3.1 s |
| openai | gpt-5.5 | 284 | 831 B | 10.1 s |
| openai | gpt-5.5-pro | 270 | 772 B | 32.4 s |
| productivia | matania-latest | 304 | 910 B | 2.1 s |
| xai | grok-4-1-fast-non-reasoning | 261 | 739 B | 3.9 s |
| xai | grok-4-1-fast-reasoning | 193 | 466 B | 6.9 s |

Matania Judgment (anthropic claude-sonnet-4-6, 9.9/10)
Correctness 10 · Rigor 9 · Notation 10 · Completeness 10 · Fidelity 10 · Overall 9.88
Review: The model perfectly adhered to all constraints: the Markdown formatting is exact, the length is concise and stays within the limit, and the LaTeX formulas are impeccable. The mathematical reasoning for the 5-door case and the subsequent generalization is both correct and crystal clear.

Logical sequences

text
| Provider | Model | Tokens | Source code | Time |
|---|---|---|---|---|
| anthropic | claude-haiku-4-5-20251001 | 304 | 608 B | 3.7 s |
| anthropic | claude-opus-4-6 | 337 | 629 B | 4.4 s |
| anthropic | claude-opus-4-7 | 470 | 606 B | 6.3 s |
| anthropic | claude-sonnet-4-6 | 333 | 570 B | 4.9 s |
| cohere | command-r-08-2024 | 225 | 558 B | 9.0 s |
| google | gemini-flash-latest | 315 | 597 B | 4.8 s |
| google | gemini-flash-lite-latest | 334 | 653 B | 2.0 s |
| kimi | moonshot-v1-128k | 325 | 958 B | 6.9 s |
| mistral | mistral-large-latest | 282 | 784 B | 5.0 s |
| mistral | mistral-small-latest | 197 | 443 B | 1.6 s |
| mistral | mistral-tiny-latest | 205 | 477 B | 1.8 s |
| openai | gpt-4o-mini | 218 | 529 B | 5.0 s |
| openai | gpt-5.4-nano | 247 | 644 B | 2.5 s |
| openai | gpt-5.5 | 209 | 493 B | 8.0 s |
| openai | gpt-5.5-pro | 200 | 456 B | 67.0 s |
| productivia | matania-latest | 245 | 636 B | 2.1 s |
| xai | grok-4-1-fast-non-reasoning | 217 | 525 B | 4.4 s |
| xai | grok-4-1-fast-reasoning | 198 | 450 B | 12.8 s |

Matania Judgment (anthropic claude-sonnet-4-6, 6.8/10)
Correctness 6 · Rigor 8 · Notation 10 · Completeness 10 · Fidelity 5 · Overall 6.75
Review: The LaTeX notation and structure are excellent, but the model fails heavily on the mathematical fidelity of the 'Look-and-Say' sequence. The subsequent terms provided for this sequence are completely incorrect and do not follow the stated rule. Furthermore, while the formatting is respected, the accuracy of the mathematical content is paramount.
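
For reference, the Look-and-Say rule cited in the claude-sonnet-4-6 review for this task builds each term by reading off the runs of identical digits in the previous term (for example, "1211" is read as "one 1, one 2, two 1s", giving "111221"). The sketch below is illustrative only: the task prompt is not shown on this page, so the seed "1" is an assumption based on the conventional form of the sequence, and the function name `look_and_say` is hypothetical.

```python
from itertools import groupby


def look_and_say(seed: str = "1", count: int = 6) -> list[str]:
    """Return the first `count` terms of the Look-and-Say sequence.

    The seed "1" is an assumption (the conventional starting term);
    the benchmark task may have used a different one.
    """
    terms = [seed]
    while len(terms) < count:
        previous = terms[-1]
        # groupby yields each run of identical digits; emit "<run length><digit>".
        next_term = "".join(
            f"{len(list(run))}{digit}" for digit, run in groupby(previous)
        )
        terms.append(next_term)
    return terms


if __name__ == "__main__":
    print(look_and_say())  # ['1', '11', '21', '1211', '111221', '312211']
```

A generator like this can be used to check a model's proposed continuation term by term against the stated rule.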
Code