Sciences

Scientific explanation and reasoning

5 tasks · 18 models tested · 90 results

Science popularization

anthropic claude-haiku-4-5-20251001
9.8/10 4.2 s
Tokens 407
Source code 1.5 KB
Time 4.2 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
9
Fidelity
10
Overall
9.75
Review
The model perfectly adheres to all formatting, structural, and length constraints. The pedagogical progression is excellent, moving from a simple metaphor to a rigorous mathematical and conceptual explanation without sacrificing scientific precision.
anthropic claude-opus-4-6
9.8/10 8.1 s
Tokens 360
Source code 1.1 KB
Time 8.1 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
9
Fidelity
10
Overall
9.75
Review
The model perfectly adheres to all format, structure, and length constraints. The pedagogical progression is excellent, moving from a simple metaphor to a rigorous mathematical description without sacrificing scientific accuracy. The use of technical terminology for the 'Physicist' level is entirely appropriate.
anthropic claude-opus-4-7
9.8/10 8.0 s
Tokens 487
Source code 1.1 KB
Time 8.0 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
9
Fidelity
10
Overall
9.75
Review
The model perfectly adheres to all prompt constraints: the Markdown format, the three-section structure, and the word count are all respected. The pedagogical progression is exemplary, moving from a simple analogy to a rigorous mathematical description (Hilbert-space vectors, Bell inequalities) without sacrificing scientific accuracy.
anthropic claude-sonnet-4-6
9.8/10 6.5 s
Tokens 328
Source code 1016 B
Time 6.5 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
9
Fidelity
10
Overall
9.75
Review
The model perfectly adheres to all format, structure, and length constraints. The pedagogical progression is excellent, moving from a simple metaphor to rigorous mathematical formalization (Dirac notation) suitable for a physicist's level. The scientific accuracy is irreproachable, particularly regarding the distinction between correlation and information transfer.
cohere command-r-08-2024
9.4/10 32.1 s
Tokens 375
Source code 1.2 KB
Time 32.1 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.38
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The pedagogical progression is excellent, moving from a simple metaphor to rigorous technical language without sacrificing scientific precision. The overall result is concise and meets the prompt's expectations exactly.
google gemini-flash-latest
9.6/10 5.0 s
Tokens 262
Source code 1.0 KB
Time 5.0 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
8
Fidelity
10
Overall
9.63
Review
The model perfectly adheres to all requested format, structure, and length constraints. The pedagogical progression is excellent, moving from a simple metaphor to a rigorous explanation using the mathematical formalism appropriate for a physicist's level. The scientific accuracy is irreproachable across every level of complexity.
google gemini-flash-lite-latest
9.8/10 1.9 s
Tokens 300
Source code 1.3 KB
Time 1.9 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
9
Fidelity
10
Overall
9.75
Review
The model strictly adheres to all requested formatting, structural, and length constraints. The pedagogical progression is excellent, moving from a simple metaphor to a rigorous technical explanation (Hilbert space, Bell's inequality) without sacrificing scientific accuracy.
kimi moonshot-v1-128k
9.3/10 7.9 s
Tokens 463
Source code 1.5 KB
Time 7.9 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.25
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The pedagogical progression is excellent, moving from a simple metaphor to a rigorous explanation of non-locality and state independence for the physicist. The scientific concepts are accurate and well-adapted to each target level.
mistral mistral-large-latest
6.3/10 7.3 s
Tokens 426
Source code 1.4 KB
Time 7.3 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
4
Overall
6.25
Review
The scientific content and pedagogical progression are excellent, perfectly respecting the three levels of complexity. However, the model fails heavily on prompt adherence: the text is truncated at the end of the Physicist section, which constitutes a major breach of the completeness instruction, and the word count is slightly exceeded.
mistral mistral-small-latest
6.3/10 2.7 s
Tokens 442
Source code 1.4 KB
Time 2.7 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
4
Overall
6.25
Review
The scientific and pedagogical quality is excellent, featuring a very well-managed progression across the three levels. However, prompt adherence is severely compromised by a sudden cutoff at the end of the text ('...and spac'), which leaves the final paragraph incomplete and fails to respect the structural constraint.
mistral mistral-tiny-latest
9.0/10 1.8 s
Tokens 286
Source code 859 B
Time 1.8 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
9
Balance
7
Fidelity
10
Overall
9
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The levels of complexity are clearly distinguished; although the 'Physicist' section remains somewhat superficial for a true expert, it nonetheless achieves the goal of tiered popularization.
openai gpt-4o-mini
9.3/10 5.1 s
Tokens 318
Source code 984 B
Time 5.1 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model perfectly adheres to all requested format, structure, and length constraints. The progression of complexity levels is excellent, moving from a simple analogy to precise technical language without sacrificing clarity. The scientific accuracy is impeccable for such concise popularization.
openai gpt-5.4-nano
9.8/10 4.6 s
Tokens 432
Source code 1.4 KB
Time 4.6 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
9
Fidelity
10
Overall
9.75
Review
The model perfectly adheres to all format, structure, and length constraints. The pedagogical progression is excellent, moving from a simple analogy to a rigorous mathematical description (non-separability, tensor product) without sacrificing scientific accuracy.
openai gpt-5.5
9.8/10 9.1 s
Tokens 318
Source code 986 B
Time 9.1 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
9
Fidelity
10
Overall
9.75
Review
The model perfectly adheres to all requested format, structure, and length constraints. The pedagogical progression is exemplary, moving from a simple metaphor to a rigorous explanation of non-factorizability and Bell inequalities. The scientific accuracy is irreproachable across every level of reading.
openai gpt-5.5-pro
9.8/10 26.7 s
Tokens 326
Source code 1017 B
Time 26.7 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
9
Fidelity
10
Overall
9.75
Review
The model perfectly adheres to all formatting, structural, and length constraints. The pedagogical progression is excellent, moving from a simple analogy to a rigorous mathematical definition (non-separability, Bell inequalities) without sacrificing scientific accuracy.
productivia matania-latest
2.6 s
Tokens 741
Source code 1.2 KB
Time 2.6 s
xai grok-4-1-fast-non-reasoning
9.8/10 5.4 s
Tokens 400
Source code 1.3 KB
Time 5.4 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
9
Fidelity
10
Overall
9.75
Review
The model perfectly adhered to all formatting, structural, and length constraints. The pedagogical progression is excellent, moving from a simple metaphor to a rigorous and precise mathematical formulation suitable for a physicist. The scientific accuracy is irreproachable across all three levels.
xai grok-4-1-fast-reasoning
9.3/10 8.3 s
Tokens 292
Source code 880 B
Time 8.3 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.25
Review
The model scrupulously adheres to all formatting, structural, and length constraints. The progression of complexity is excellent, moving from a simple analogy to a rigorous and precise mathematical formulation suitable for a physicist. Scientific accuracy is maintained at every stage despite the imposed conciseness.

Experimental protocol

anthropic claude-haiku-4-5-20251001
9.1/10 3.6 s
Tokens 291
Source code 867 B
Time 3.6 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.13
Review
The model perfectly adheres to all format, structure, and length constraints imposed by the prompt. The protocol is scientifically coherent, including relevant control variables and precise measurements such as dry biomass. Although the nuance regarding scientific uncertainty could be strengthened, the exercise is an exemplary success for such a concise task.
anthropic claude-opus-4-6
9.3/10 7.1 s
Tokens 339
Source code 933 B
Time 7.1 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model perfectly adheres to all format, structure, and length constraints imposed by the prompt. The scientific rigor is excellent, particularly with the inclusion of variable standardization and statistical analysis (t-test). Although the nuance regarding the uncertainty of the music's effect is limited by the requested brevity, the protocol is coherent and very well executed.
anthropic claude-opus-4-7
9.3/10 8.1 s
Tokens 500
Source code 954 B
Time 8.1 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.25
Review
The model perfectly adhered to all formatting, structural, and length constraints. The protocol is scientifically rigorous, including a control group and controlled confounding variables. The addition of a rock music group and the use of ANOVA demonstrate an excellent understanding of experimental methodology.
anthropic claude-sonnet-4-6
8.9/10 6.7 s
Tokens 305
Source code 926 B
Time 6.7 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
8
Balance
7
Fidelity
10
Overall
8.88
Review
The model perfectly adhered to all formatting, structural, and length constraints imposed by the prompt. The protocol is scientifically coherent, although the 'nuance' section is limited by the imposed brevity. The methodology is rigorous for a short format, including a control group and relevant statistical measures.
cohere command-r-08-2024
8.8/10 28.0 s
Tokens 308
Source code 943 B
Time 28.0 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
8
Balance
6
Fidelity
10
Overall
8.75
Review
The model perfectly adheres to all formatting, structural, and length constraints (approximately 145 words). The protocol is scientifically coherent and well-structured, although it lacks nuance regarding the lack of scientific consensus on the subject (balance). Fidelity to the prompt is exemplary.
google gemini-flash-latest
9.5/10 5.8 s
Tokens 356
Source code 1.2 KB
Time 5.8 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.5
Review
The model perfectly adhered to all formatting, structural, and length constraints. The protocol is scientifically rigorous, featuring relevant control variables and precise measurements (dry biomass, SPAD). The conciseness is exemplary while maintaining a high level of academic quality.
google gemini-flash-lite-latest
9.1/10 2.0 s
Tokens 316
Source code 1.1 KB
Time 2.0 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.13
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The scientific rigor is excellent, particularly regarding the use of Latin species names and the distinction between control variables. The protocol is coherent and well-structured, although a mention of the study's limitations could have added more nuance.
kimi moonshot-v1-128k
8.9/10 5.6 s
Tokens 366
Source code 1.1 KB
Time 5.6 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
8
Balance
7
Fidelity
10
Overall
8.88
Review
The model perfectly adhered to all formatting, structural, and length constraints requested by the prompt. The protocol is scientifically coherent and well-structured, although the hypothesis is somewhat vague (it merely predicts an impact without specifying a precise direction). The methodological precision is excellent for such a concise format.
mistral mistral-large-latest
9.1/10 6.0 s
Tokens 335
Source code 1.0 KB
Time 6.0 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.13
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The protocol is scientifically sound, including relevant control variables and precise measurement methods such as ANOVA and biomass. The only slight weakness lies in the lack of nuance regarding the biological mechanism (balance), but overall, it is an excellent, concise response.
mistral mistral-small-latest
8.9/10 2.2 s
Tokens 286
Source code 854 B
Time 2.2 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
8
Balance
7
Fidelity
10
Overall
8.88
Review
The model perfectly adhered to all formatting, structural, and length constraints. The protocol is scientifically coherent and well-structured, although the nuance regarding biological mechanisms (the "balance" aspect) is limited by the brevity imposed by the prompt.
mistral mistral-tiny-latest
8.8/10 1.6 s
Tokens 271
Source code 795 B
Time 1.6 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
8
Balance
6
Fidelity
10
Overall
8.75
Review
The model perfectly adheres to all requested formatting (Markdown), structural, and length constraints. The protocol is scientifically coherent and well-structured, although it lacks nuance regarding potential confounding variables or the complexity of biological mechanisms.
openai gpt-4o-mini
8.9/10 4.9 s
Tokens 274
Source code 804 B
Time 4.9 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
8
Balance
7
Fidelity
10
Overall
8.88
Review
The model perfectly adheres to all formatting, structural, and length constraints requested by the prompt. The protocol is scientifically coherent and well-structured, although it remains somewhat superficial regarding the underlying biological mechanisms due to the conciseness constraint. The methodology is clear, and the control variables are correctly identified.
openai gpt-5.4-nano
9.3/10 3.0 s
Tokens 354
Source code 1.1 KB
Time 3.0 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.25
Review
The model strictly adheres to all prompt constraints: Markdown format, specific sections, bulleted structure, and conciseness. The protocol is scientifically robust, including a relevant control group (sham audio) and essential control variables, while remaining within the approximate length limit.
openai gpt-5.5
9.0/10 5.3 s
Tokens 320
Source code 990 B
Time 5.3 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
9
Balance
7
Fidelity
10
Overall
9
Review
The model scrupulously adheres to all requested formatting (Markdown), structural, and conciseness constraints. The protocol is scientifically coherent, featuring a control group and controlled variables, although the depth regarding biological mechanisms is limited by the brevity constraint.
openai gpt-5.5-pro
8.9/10 18.3 s
Tokens 320
Source code 989 B
Time 18.3 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
8
Balance
7
Fidelity
10
Overall
8.88
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The protocol is scientifically coherent and well-structured, although it remains somewhat superficial regarding biological mechanisms for an in-depth research protocol. The application of control variables is exemplary for a short format.
productivia matania-latest
9.3/10 2.9 s
Tokens 749
Source code 1.2 KB
Time 2.9 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.25
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The content is scientifically rigorous, proposing control variables and precise measurements (biomass, size, vigor index) that go beyond simple superficial observation. The hypothesis is well-formulated, integrating the biological mechanism of mechanical vibrations.
xai grok-4-1-fast-non-reasoning
9.1/10 4.0 s
Tokens 310
Source code 951 B
Time 4.0 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.13
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The protocol is scientifically sound, including control variables and an appropriate statistical test. The only slight weakness lies in the lack of nuance regarding the exact nature of the influence (vibrations vs. music), but it remains excellent for a short format.
xai grok-4-1-fast-reasoning
9.1/10 6.6 s
Tokens 252
Source code 719 B
Time 6.6 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.13
Review
The model perfectly adheres to all requested format, structure, and length constraints. The protocol is scientifically coherent, including relevant control variables and statistical methods. Although the nuance regarding biological mechanisms (vibrations vs. stress) is concise, the response is highly effective given the required short format.

Scientific debate

anthropic claude-haiku-4-5-20251001
9.6/10 3.6 s
Tokens 314
Source code 1.1 KB
Time 3.6 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model adheres to the requested formatting, structural, and length constraints, apart from adding a conclusion, which is a slight departure from the prompt. The content is scientifically accurate, balanced, and provides a concise yet relevant analysis of the challenges surrounding nuclear energy. The extra conclusion in no way diminishes the quality of the execution.
anthropic claude-opus-4-6
9.6/10 6.2 s
Tokens 312
Source code 1.0 KB
Time 6.2 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model scrupulously adheres to all formatting (Markdown), structural, and length constraints. The content is scientifically accurate, balanced, and provides a concise yet relevant analysis of nuclear issues. Prompt fidelity is exemplary.
anthropic claude-opus-4-7
9.6/10 7.8 s
Tokens 494
Source code 1.1 KB
Time 7.8 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model adheres perfectly to all prompt constraints: the Markdown formatting is exact, the number of points per section is respected, and the conciseness is optimal. The content is scientifically rigorous, balanced, and provides high-quality analysis despite the imposed brevity.
anthropic claude-sonnet-4-6
9.6/10 5.4 s
Tokens 303
Source code 1.0 KB
Time 5.4 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model perfectly adhered to all formatting, structural, and length constraints (approximately 145 words). The arguments are scientifically accurate, balanced, and demonstrate excellent pedagogical clarity. Prompt fidelity is exemplary.
cohere command-r-08-2024
9.4/10 9.1 s
Tokens 347
Source code 1.1 KB
Time 9.1 s
Matania Judgment
Accuracy
10
Depth
7
Pedagogy
9
Balance
9
Fidelity
10
Overall
9.38
Review
The model perfectly adhered to all requested formatting, structural, and length constraints. The content is scientifically accurate and offers a very satisfying balance between pros and cons. Although the analysis remains somewhat superficial due to the conciseness constraint, it perfectly fulfills the established pedagogical objective.
google gemini-flash-latest
9.6/10 6.3 s
Tokens 342
Source code 1.4 KB
Time 6.3 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model perfectly adhered to all formatting, structural, and length constraints (approximately 170 words, which remains very close to the conciseness requirement). The content is scientifically accurate, balanced, and addresses crucial nuances such as water dependency and energy density. The writing quality is excellent for a synthetic format.
google gemini-flash-lite-latest
9.6/10 2.1 s
Tokens 261
Source code 1.1 KB
Time 2.1 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model perfectly adhered to all prompt constraints: the exact Markdown format, the requested number of bullet points, structure, and conciseness (approximately 140 words). The content is scientifically rigorous, balanced, and provides a relevant analysis of the challenges surrounding nuclear energy.
kimi moonshot-v1-128k
9.4/10 5.0 s
Tokens 321
Source code 1.0 KB
Time 5.0 s
Matania Judgment
Accuracy
10
Depth
7
Pedagogy
9
Balance
9
Fidelity
10
Overall
9.38
Review
The model strictly adheres to all requested formatting, structural, and length constraints. The content is scientifically accurate and well-balanced, although the analysis remains superficial due to the conciseness constraints imposed by the prompt.
mistral mistral-large-latest
5.5/10 5.2 s
Tokens 364
Source code 1.2 KB
Time 5.2 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
9
Fidelity
2
Overall
5.5
Review
The content is scientifically accurate and well-structured, but the model failed critically on the fidelity constraint. The text is cut off mid-sentence, leaving the final bullet point incomplete and violating the expected structure. Additionally, the word count slightly exceeds the requested ~150 words.
mistral mistral-small-latest
9.6/10 2.1 s
Tokens 284
Source code 899 B
Time 2.1 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model perfectly adhered to all prompt constraints: the exact Markdown format, the requested structure, the specific number of points, and the required conciseness (approximately 140 words). The content is scientifically accurate, balanced, and provides a relevant analysis of nuclear energy issues without being simplistic.
mistral mistral-tiny-latest
9.6/10 3.2 s
Tokens 376
Source code 1.2 KB
Time 3.2 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model perfectly adhered to all prompt constraints: Markdown formatting, heading structure, number of bullet points per section, and conciseness. The content is scientifically accurate, balanced, and provides an excellent synthesis of the issues surrounding nuclear energy.
openai gpt-4o-mini
9.5/10 17.8 s
Tokens 362
Source code 1.2 KB
Time 17.8 s
Matania Judgment
Accuracy
10
Depth
7
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.5
Review
The model perfectly adhered to all requested formatting, structural, and length constraints. The content is scientifically accurate, balanced, and provides an excellent synthesis despite the imposed brevity. The depth is slightly limited by the need for conciseness, but remains highly relevant to the task.
openai gpt-5.4-nano
9.6/10 3.8 s
Tokens 376
Source code 1.2 KB
Time 3.8 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model perfectly adhered to all prompt constraints: Markdown formatting, heading structure, number of bullet points, and conciseness. The content is scientifically accurate, nuanced, and provides an excellent synthesis of current energy issues.
openai gpt-5.5
9.6/10 7.8 s
Tokens 333
Source code 1.1 KB
Time 7.8 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model perfectly adhered to all prompt constraints: the Markdown format, the number of bullet points per section, conciseness, and the requested structure. The content is scientifically accurate, balanced, and provides a relevant analysis despite the imposed word limit.
openai gpt-5.5-pro
9.6/10 37.0 s
Tokens 320
Source code 1.0 KB
Time 37.0 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model scrupulously adheres to all prompt constraints: Markdown format, number of points per section, conciseness, and structure. The content is scientifically accurate, balanced, and provides a relevant analysis despite the imposed word limit.
productivia matania-latest
9.6/10 2.5 s
Tokens 721
Source code 1.2 KB
Time 2.5 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model perfectly adheres to all prompt constraints: the Markdown formatting is exact, the number of bullet points is respected, and the length is concise. The content is scientifically rigorous, balanced, and demonstrates excellent synthesis quality for the subject matter.
xai grok-4-1-fast-non-reasoning
9.6/10 3.4 s
Tokens 383
Source code 1.3 KB
Time 3.4 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model perfectly adhered to all prompt constraints: the Markdown formatting is correct, the number of bullet points per section was respected, and the length is nearly ideal (~140 words). The content is scientifically accurate, balanced, and offers a nuanced analysis of the climate and economic stakes.
xai grok-4-1-fast-reasoning
9.6/10 6.8 s
Tokens 253
Source code 773 B
Time 6.8 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
10
Fidelity
10
Overall
9.63
Review
The model perfectly adhered to all formatting (Markdown), structural, and length constraints. The content is scientifically accurate, balanced, and addresses the key points in a concise and relevant manner.

Molecular biology

anthropic claude-haiku-4-5-20251001
9.3/10 3.5 s
Tokens 312
Source code 981 B
Time 3.5 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model followed all instructions perfectly: the Markdown structure, the number of steps, and the word count constraint were all respected. The scientific explanation is accurate, covering essential components such as PAM and repair mechanisms, while remaining concise and highly readable.
anthropic claude-opus-4-6
9.5/10 6.9 s
Tokens 340
Source code 1009 B
Time 6.9 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
7
Fidelity
10
Overall
9.5
Review
The model followed all instructions perfectly: Markdown structure, number of steps (6), and word count constraints were strictly respected. The scientific content is highly accurate, including crucial details like the PAM sequence and the two repair pathways (NHEJ/HDR). While it lacks a discussion on ethical debates (balance), this is justified by the prompt's requirement for conciseness.
anthropic claude-opus-4-7
9.6/10 6.2 s
Tokens 473
Source code 962 B
Time 6.2 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
10
Balance
8
Fidelity
10
Overall
9.63
Review
The model perfectly adhered to all formatting, structural, and length constraints. The scientific accuracy is excellent, including crucial details such as the PAM motif and repair mechanisms (NHEJ/HDR). The content is concise, educational, and meets the prompt's expectations exactly.
anthropic claude-sonnet-4-6
9.3/10 5.8 s
Tokens 309
Source code 873 B
Time 5.8 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model perfectly adhered to all formatting constraints (Markdown, title, 6-step numbered list) and length requirements. The scientific content is highly precise, including essential nuances such as the role of the PAM and the distinction between NHEJ and HDR repair mechanisms. The conciseness is exemplary while remaining educational.
cohere command-r-08-2024
9.3/10 33.4 s
Tokens 295
Source code 971 B
Time 33.4 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model perfectly adheres to all formatting constraints (Markdown, title, numbered list) and length requirements. The content is scientifically rigorous, correctly explaining the distinction between NHEJ and HDR. The structure is concise and precisely meets the established pedagogical objective.
google gemini-flash-latest
9.3/10 4.7 s
Tokens 255
Source code 890 B
Time 4.7 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model perfectly adheres to all formatting constraints (Markdown, title, numbered list) and length requirements. The content is scientifically rigorous, including crucial details such as the PAM sequence, while remaining concise and educational. The only slight weakness lies in the lack of nuance regarding off-target mutation risks, but this remains consistent with the requested constraint of conciseness.
google gemini-flash-lite-latest
9.3/10 2.3 s
Tokens 236
Source code 829 B
Time 2.3 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model followed every instruction perfectly: the Markdown structure, the number of steps (5), and the word count constraints were all respected. Scientifically, the explanation is precise and covers both NHEJ and HDR mechanisms implicitly in the final step. The only minor omission is a discussion of ethical nuances, but given the strict "concise" and "steps" constraints, this was a logical trade-off.
kimi moonshot-v1-128k
9.3/10 7.7 s
Tokens 344
Source code 1.1 KB
Time 7.7 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model perfectly adhered to all formatting constraints (Markdown, title, 6-step numbered list) and length requirements. The scientific content is extremely accurate, correctly explaining the crucial distinction between NHEJ and HDR. The conciseness is exemplary without sacrificing pedagogical clarity.
mistral mistral-large-latest
9.3/10 4.7 s
Tokens 303
Source code 1002 B
Time 4.7 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model followed all instructions perfectly: it used the correct Markdown header, provided exactly 6 steps, and stayed within a concise word count. The scientific explanation is highly accurate, covering both NHEJ and HDR mechanisms, which adds necessary depth for a short summary.
mistral mistral-small-latest
9.3/10 2.1 s
Tokens 240
Source code 748 B
Time 2.1 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model perfectly adheres to all constraints regarding format, structure (Markdown title and numbered list), and length. The scientific accuracy is excellent, correctly detailing the repair mechanisms (NHEJ/HDR). Conciseness is well-managed without sacrificing pedagogical clarity.
mistral mistral-tiny-latest
9.3/10 1.6 s
mistral mistral-tiny-latest
Tokens 247
Source code 777 B
Time 1.6 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model perfectly adheres to all formatting constraints (Markdown, numbered list, length, and conciseness). The scientific explanation is accurate, covering the key steps of the Cas9 mechanism and the modes of cellular repair. The depth is sufficient for a concise response, although the nuance regarding the technology's limitations is somewhat underdeveloped in order to respect the word count.
openai gpt-4o-mini
9.3/10 5.0 s
openai gpt-4o-mini
Tokens 277
Source code 898 B
Time 5.0 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model scrupulously adheres to all formatting constraints (Markdown, title, 6-step numbered list) and length requirements. The scientific accuracy is excellent, correctly covering cleavage mechanisms and the two repair pathways (NHEJ and HDR). The content is concise and perfectly tailored to the prompt.
openai gpt-5.4-nano
9.3/10 4.0 s
openai gpt-5.4-nano
Tokens 270
Source code 870 B
Time 4.0 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model perfectly followed all constraints: the Markdown structure, the numbered list of 6 steps, and the word count (~140 words) are all spot on. Scientifically, the explanation is highly accurate, correctly identifying the importance of the PAM motif and the distinction between NHEJ and HDR repair mechanisms. The depth is appropriate for the requested conciseness.
openai gpt-5.5
9.3/10 5.9 s
openai gpt-5.5
Tokens 281
Source code 915 B
Time 5.9 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model strictly adhered to all constraints: Markdown structure, numbering, length, and conciseness. Scientific accuracy is excellent, providing a clear and correct overview of the CRISPR mechanism. While the depth is limited by the requested brevity, it remains highly pedagogical and precise.
openai gpt-5.5-pro
9.3/10 69.0 s
openai gpt-5.5-pro
Tokens 250
Source code 789 B
Time 69.0 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model strictly adheres to all formatting constraints (Markdown, title, 6-step numbered list) and length requirements. The scientific accuracy is excellent, correctly covering the roles of guide RNA, the Cas9 protein, and cellular repair mechanisms. The content is concise and educational, although it does not address ethical debates in order to remain within the requested word count.
productivia matania-latest
9.3/10 2.5 s
productivia matania-latest
Tokens 706
Source code 1.2 KB
Time 2.5 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model perfectly adhered to all formatting constraints (Markdown, title, 5-step numbered list) and length requirements. The scientific accuracy is impeccable, including crucial details such as the PAM motif and repair mechanisms (NHEJ/HDR) in a concise manner. The nuance regarding the two repair modes adds good depth despite the imposed brevity.
xai grok-4-1-fast-non-reasoning
9.3/10 4.0 s
xai grok-4-1-fast-non-reasoning
Tokens 285
Source code 928 B
Time 4.0 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model strictly followed all formatting and length constraints, including the Markdown header and the specific number of steps. The scientific content is highly accurate, covering essential mechanisms like PAM recognition and repair pathways (NHEJ/HDR). While it lacks deep nuance regarding off-target effects, it perfectly meets the 'concise' requirement of the prompt.
xai grok-4-1-fast-reasoning
9.3/10 7.3 s
xai grok-4-1-fast-reasoning
Tokens 231
Source code 714 B
Time 7.3 s
Matania Judgment
Accuracy
10
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.25
Review
The model followed all structural and length constraints perfectly, including the Markdown headers and the specific step count. Scientifically, the description is highly accurate, correctly identifying the PAM motif and the two main repair pathways (NHEJ and HDR). While it lacks a discussion on off-target effects (balance), it fulfills the 'concise' requirement of the prompt exceptionally well.

Astronomy

text
anthropic claude-haiku-4-5-20251001
7.1/10 6.7 s
anthropic claude-haiku-4-5-20251001
Tokens 378
Source code 1.6 KB
Time 6.7 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
7
Fidelity
6
Overall
7.13
Review
The scientific accuracy and writing quality are excellent, providing a deep and educational analysis of the consequences. However, the model fails on crucial formatting constraints: it uses an H1 header instead of the requested H2 (## Title) and significantly exceeds the word count (~230 words instead of the requested ~150), which heavily penalizes the fidelity score.
anthropic claude-opus-4-6
9.3/10 8.6 s
anthropic claude-opus-4-6
Tokens 337
Source code 1.2 KB
Time 8.6 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.25
Review
The model perfectly adheres to all formatting constraints (Markdown, headings, structure) and length requirements (approximately 160 words, very close to the ~150-word target). The scientific accuracy is excellent, effectively capturing the water vapor feedback effect. The analysis is in-depth and impactful despite the imposed brevity.
anthropic claude-opus-4-7
9.3/10 9.1 s
anthropic claude-opus-4-7
Tokens 566
Source code 1.3 KB
Time 9.1 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.25
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The scientific accuracy is excellent, particularly regarding the impact of climate feedback loops. The text is concise, impactful, and perfectly suited to the astronomy/climatology task.
anthropic claude-sonnet-4-6
9.5/10 7.5 s
anthropic claude-sonnet-4-6
Tokens 374
Source code 1.3 KB
Time 7.5 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.5
Review
The model strictly adheres to all requested formatting, structure, and length constraints. The scientific accuracy is excellent, particularly in its estimate of the additional energy input and its treatment of feedback mechanisms. The text is concise, impactful, and perfectly aligned with the prompt's expectations.
cohere command-r-08-2024
7.5/10 11.8 s
cohere command-r-08-2024
Tokens 474
Source code 1.6 KB
Time 11.8 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
8
Balance
8
Fidelity
7
Overall
7.5
Review
The scientific content is solid and well-structured. However, prompt adherence is penalized by the failure to respect the length constraint: the text is approximately 230 words, whereas the prompt required about 150 words. The Markdown formatting is correctly applied.
google gemini-flash-latest
9.1/10 7.4 s
google gemini-flash-latest
Tokens 286
Source code 1.2 KB
Time 7.4 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.13
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The scientific analysis is solid, although the scenario of a 'runaway' greenhouse effect might be debatable for an increase of only 10%; nonetheless, the text remains coherent and very well-structured.
google gemini-flash-lite-latest
9.1/10 2.4 s
google gemini-flash-lite-latest
Tokens 350
Source code 1.5 KB
Time 2.4 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.13
Review
The model scrupulously adheres to all requested formatting, structural, and length constraints. The scientific accuracy is excellent, although the analysis is heavily skewed toward a doomsday scenario without exploring the nuances of complex climate feedback loops. The response is concise, impactful, and perfectly aligned with the prompt.
kimi moonshot-v1-128k
8.9/10 5.6 s
kimi moonshot-v1-128k
Tokens 372
Source code 1.2 KB
Time 5.6 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
8
Balance
7
Fidelity
10
Overall
8.88
Review
The model perfectly adheres to all formatting constraints (Markdown, headings, sections) and length requirements. From a scientific standpoint, the consequences are correctly identified, although the analysis remains somewhat superficial in order to stay within the imposed word limit. The response is concise and addresses the problem statement directly.
mistral mistral-large-latest
6.3/10 6.3 s
mistral mistral-large-latest
Tokens 413
Source code 1.3 KB
Time 6.3 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
4
Overall
6.25
Review
The scientific accuracy and writing quality are excellent. However, the model fails significantly on prompt adherence: the text is truncated at the end (incomplete sentence), which violates the expected structure, and it slightly exceeds the imposed conciseness constraint.
mistral mistral-small-latest
7.3/10 2.8 s
mistral mistral-small-latest
Tokens 400
Source code 1.3 KB
Time 2.8 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
6
Overall
7.25
Review
The scientific content is accurate and well-structured, providing a good depth of analysis for the subject. However, the model failed the fidelity constraint regarding length: the text is cut off mid-sentence at the end, and it exceeds the requested ~150 words (it's closer to 200 words). The truncation is a significant penalty for fidelity.
mistral mistral-tiny-latest
4.9/10 1.9 s
mistral mistral-tiny-latest
Tokens 298
Source code 907 B
Time 1.9 s
Matania Judgment
Accuracy
6
Depth
5
Pedagogy
7
Balance
5
Fidelity
4
Overall
4.88
Review
The model fails on fidelity because it uses bullet points instead of the requested paragraphs and significantly exceeds the conciseness constraint (~150 words). From a scientific standpoint, the estimated temperature increase (1°C) is far too low for a 10% increase in luminosity, which compromises the accuracy of the analysis.
openai gpt-4o-mini
8.9/10 4.9 s
openai gpt-4o-mini
Tokens 349
Source code 1.1 KB
Time 4.9 s
Matania Judgment
Accuracy
9
Depth
7
Pedagogy
8
Balance
7
Fidelity
10
Overall
8.88
Review
The model perfectly adheres to all formatting, structural, and length constraints (approximately 160 words, very close to the target). The scientific accuracy is solid, although the analysis remains relatively superficial to respect the requested conciseness. The response is balanced and educational, though it lacks nuance regarding complex feedback loops.
openai gpt-5.4-nano
6.3/10 3.3 s
openai gpt-5.4-nano
Tokens 585
Source code 2.0 KB
Time 3.3 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
4
Overall
6.25
Review
Scientific accuracy and writing quality are excellent, providing a relevant analysis of feedbacks (albedo, water vapor). However, the model failed on the most critical length constraint: the text is approximately 350 words, significantly exceeding the requested limit of ~150 words. This major failure to remain concise heavily penalizes the fidelity score.
openai gpt-5.5
9.5/10 10.6 s
openai gpt-5.5
Tokens 379
Source code 1.2 KB
Time 10.6 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.5
Review
The model perfectly adheres to all formatting, structural, and length constraints. The scientific accuracy is excellent, particularly with the inclusion of the radiative forcing value. The analysis is profound and well-structured, although the nuance regarding climate uncertainties could have been slightly more pronounced.
openai gpt-5.5-pro
9.5/10 155.8 s
openai gpt-5.5-pro
Tokens 321
Source code 999 B
Time 155.8 s
Matania Judgment
Accuracy
10
Depth
9
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.5
Review
The model perfectly followed all constraints, including Markdown formatting, specific section headers, and the word count constraint (approx. 160 words, which is very close to the ~150 target). The scientific accuracy is excellent, correctly identifying the radiative forcing implications and the resulting biological and societal consequences.
productivia matania-latest
9.0/10 2.6 s
productivia matania-latest
Tokens 777
Source code 1.4 KB
Time 2.6 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
8
Balance
7
Fidelity
10
Overall
9
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The scientific accuracy is solid, although a nuance regarding the distinction between solar flux and surface temperature could have enriched the analysis. The content is concise, impactful, and precisely addresses the problem posed.
xai grok-4-1-fast-non-reasoning
9.1/10 4.3 s
xai grok-4-1-fast-non-reasoning
Tokens 433
Source code 1.4 KB
Time 4.3 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
7
Fidelity
10
Overall
9.13
Review
The model perfectly adheres to all formatting, structural, and length constraints (approximately 180 words, slightly over but remaining very concise). Scientifically, the analysis is robust, although the "runaway greenhouse" effect is an extreme extrapolation for a 10% increase; however, the text remains consistent with the scale of the described catastrophe. The Markdown structure is applied with exemplary precision.
xai grok-4-1-fast-reasoning
9.3/10 6.5 s
xai grok-4-1-fast-reasoning
Tokens 333
Source code 1.0 KB
Time 6.5 s
Matania Judgment
Accuracy
9
Depth
8
Pedagogy
9
Balance
8
Fidelity
10
Overall
9.25
Review
The model perfectly adheres to all requested formatting, structural, and length constraints. The scientific accuracy is excellent, particularly regarding feedback mechanisms (albedo, water vapor) and biological consequences. The content is concise, information-dense, and perfectly aligned with the prompt instructions.
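The page does not state how the Overall value is derived from the five criteria. The rows listed above are all consistent with one simple rule: average the four content criteria (Accuracy, Depth, Pedagogy, Balance), then average that result with Fidelity and round half up to two decimals. The sketch below is an inference from the displayed numbers, not a documented formula, and the function name overall_score is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def overall_score(accuracy, depth, pedagogy, balance, fidelity):
    """Inferred rule (an assumption, not documented on the page):
    overall = (mean of the four content criteria + fidelity) / 2,
    rounded half up to two decimals."""
    # Mean of the four content criteria, kept exact with Decimal.
    quality = Decimal(accuracy + depth + pedagogy + balance) / 4
    # Fidelity carries as much weight as the other four criteria combined.
    raw = (quality + Decimal(fidelity)) / 2
    # Half-up rounding reproduces displayed values such as 7.13 (from 7.125).
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Criteria taken from two rows of the Astronomy table above.
print(overall_score(9, 8, 9, 7, 6))   # first row, displayed Overall 7.13
print(overall_score(9, 8, 9, 8, 4))   # a row with Fidelity 4, displayed Overall 6.25
```

Under this reading, a low Fidelity score pulls the Overall value down sharply even when the four content criteria are high, which matches the truncated and over-length responses in the table.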
Code