LLM Leaderboard

Comparing top-tier general-purpose models on key reasoning and language benchmarks.

    Top Performers

1st Place

ChatGPT 5

OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

Overall Score: 97.3

2nd Place

Gemini 2.5 Pro

Google's flagship model with exceptional multimodal capabilities and a massive context window.

Overall Score: 95.7

3rd Place

Claude 4.1 Opus

Anthropic's most powerful model with exceptional reasoning and creative capabilities.

Overall Score: 95.5

Full Rankings

ChatGPT 5

OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

Top Tier Reasoning
Overall Score: 97.3
MMLU: 84.6
MMMU: 84.2
GPQA: 85.7
Coding: 56.5
TAU-Bench: 81.1
Multilingual: 88.8
AIME 2025: 92.6

Gemini 2.5 Pro

Google's flagship model with exceptional multimodal capabilities and a massive context window.

Top Tier Reasoning
Overall Score: 95.7
MMLU: 89.8
MMMU: 84.0
GPQA: 88.4
Coding: 52.3
TAU-Bench: 80.0
Multilingual: 89.0
AIME 2025: 89.0

Claude 4.1 Opus

Anthropic's most powerful model with exceptional reasoning and creative capabilities.

Best for Coding
Overall Score: 95.5
MMLU: 88.8
MMMU: 77.1
GPQA: 79.6
Coding: 58.9
TAU-Bench: 82.4
Multilingual: 89.5
AIME 2025: 78.0

Anthropic's most powerful model with exceptional reasoning and creative capabilities.

Best for Coding
Overall Score: 93.3
MMLU: 88.8
MMMU: 76.5
GPQA: 79.6
Coding: 55.9
TAU-Bench: 81.4
Multilingual: 88.8
AIME 2025: 75.5

OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

Overall Score: 91.6
MMLU: 85.6
MMMU: 82.9
GPQA: 83.3
Coding: 49.6
TAU-Bench: 70.4
Multilingual: 88.8
AIME 2025: 88.9

OpenAI's reasoning model optimized for complex problem-solving, mathematics, and coding tasks.

Overall Score: 89.5
MMLU: 89.3
MMMU: 78.2
GPQA: 78.0
Coding: 50.5
TAU-Bench: 73.5
Multilingual: 82.1
AIME 2025: 79.3

Anthropic's balanced model offering excellent performance across all domains.

Best for Coding
Overall Score: 88.3
MMLU: 74.4
MMMU: 74.4
GPQA: 75.4
Coding: 54.1
TAU-Bench: 80.5
Multilingual: 86.5
AIME 2025: 70.5
Qwen3 480B
Open Source

Alibaba's most powerful Qwen3 model with state-of-the-art performance across all benchmarks.

Overall Score: 87.4
MMLU: 82.3
MMMU: 82.4
GPQA: 78.3
Coding: 47.1
TAU-Bench: 70.9
Multilingual: 80.8
AIME 2025: 83.6

Anthropic's most powerful model with exceptional reasoning and creative capabilities.

Great for Creative Tasks
Overall Score: 87.2
MMLU: 88.8
MMMU: 75.0
GPQA: 68.0
Coding: 52.8
TAU-Bench: 81.2
Multilingual: 83.2
AIME 2025: 61.3

Google's optimized model balancing speed and performance for efficient deployment.

Top Tier Reasoning
Overall Score: 85.6
MMLU: 88.4
MMMU: 79.7
GPQA: 82.8
Coding: 42.6
TAU-Bench: 72.3
Multilingual: 87.2
AIME 2025: 72.0

OpenAI's omni-modal model with native audio, vision, and text capabilities.

Great for Creative Tasks
Overall Score: 83.7
MMLU: 88.7
MMMU: 69.1
GPQA: 53.6
Coding: 46.7
TAU-Bench: 78.0
Multilingual: 90.1
AIME 2025: 76.6

Mistral AI's most advanced model with superior multilingual and coding performance.

Great for Creative Tasks
Overall Score: 83.4
MMLU: 81.3
MMMU: 73.8
GPQA: 80.2
Coding: 40.9
TAU-Bench: 78.9
Multilingual: 86.4
AIME 2025: 72.6

OpenAI's enhanced multimodal model with improved reasoning and efficiency.

Overall Score: 80.2
MMLU: 74.8
MMMU: 71.8
GPQA: 66.3
Coding: 42.5
TAU-Bench: 68.0
Multilingual: 83.7
AIME 2025: 79.5

Anthropic's most capable model, excelling at coding, writing, and complex reasoning tasks.

Overall Score: 77.8
MMLU: 88.7
MMMU: 68.3
GPQA: 59.4
Coding: 54.6
TAU-Bench: 71.5
Multilingual: 79.2
AIME 2025: 16.0
DeepSeek-V3
Open Source

DeepSeek's advanced model with strong coding and reasoning capabilities.

Overall Score: 76.8
MMLU: 84.1
MMMU: 65.2
GPQA: 70.9
Coding: 48.6
TAU-Bench: 70.0
Multilingual: 71.8
AIME 2025: 32.0

Google's advanced model with a 2M-token context window and strong multimodal capabilities.

Great for Creative Tasks
Overall Score: 73.1
MMLU: 85.9
MMMU: 62.2
GPQA: 63.9
Coding: 36.5
TAU-Bench: 75.7
Multilingual: 88.0
AIME 2025: 40.0
Qwen2.5 72B
Open Source

Alibaba's flagship open-source model with exceptional multilingual and coding capabilities.

Overall Score: 71.5
MMLU: 72.3
MMMU: 75.2
GPQA: 49.8
Coding: 40.8
TAU-Bench: 72.4
Multilingual: 85.0
AIME 2025: 37.0
Llama 3.1 70B
Open Source

Meta's efficient large model offering strong performance with lower computational requirements.

Overall Score: 70.4
MMLU: 79.6
MMMU: 68.9
GPQA: 46.7
Coding: 34.5
TAU-Bench: 73.8
Multilingual: 60.3
AIME 2025: 70.2
DeepSeek-V3
Open Source

DeepSeek's advanced model with strong coding and reasoning capabilities.

Overall Score: 62.5
MMLU: 81.7
MMMU: 65.2
GPQA: 43.9
Coding: 29.4
TAU-Bench: 70.0
Multilingual: 71.8
AIME 2025: 32.0

xAI's most advanced model with real-time information access and enhanced reasoning.

Top Tier Reasoning
Overall Score: 54.7
MMLU: 87.6
MMMU: 72.1
GPQA: 87.5
Coding: -
TAU-Bench: -
Multilingual: 83.2
AIME 2025: 74.5

xAI's improved model with enhanced conversational abilities and real-time data access.

Overall Score: 54.7
MMLU: 85.7
MMMU: 76.0
GPQA: 80.2
Coding: -
TAU-Bench: -
Multilingual: 80.1
AIME 2025: 83.0

Anthropic's fastest model, optimized for speed while maintaining strong capabilities.

Overall Score: 53.0
MMLU: 75.2
MMMU: 46.4
GPQA: 33.3
Coding: 26.0
TAU-Bench: 61.8
Multilingual: 65.4
AIME 2025: 23.0
DeepSeek R1
Open Source

DeepSeek's advanced model with strong coding and reasoning capabilities.

Overall Score: 44.1
MMLU: -
MMMU: 76.0
GPQA: 71.5
Coding: 25.2
TAU-Bench: -
Multilingual: -
AIME 2025: 79.8
Llama 4 405B
Open Source

Meta's next-generation open-source model with state-of-the-art capabilities.

Overall Score: 41.3
MMLU: 85.5
MMMU: 73.4
GPQA: 69.8
Coding: -
TAU-Bench: -
Multilingual: 84.6
AIME 2025: -
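The rankings above are sorted by overall score, but a single benchmark often tells a different story. As a hypothetical illustration (this tooling is not part of the leaderboard; the Entry class and rank_by helper are invented for the sketch, and only the names and scores are copied from the listing), the entries can be loaded into a small data structure and re-ranked by any one benchmark, treating unreported "-" scores as missing:

```python
# Hypothetical sketch: re-ranking a few leaderboard entries by one benchmark.
# Entry and rank_by are invented for illustration; names and scores are
# copied from the listing above.
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    overall: float
    scores: dict  # benchmark name -> score; None where the listing shows "-"

entries = [
    Entry("ChatGPT 5", 97.3, {"MMLU": 84.6, "GPQA": 85.7, "Coding": 56.5}),
    Entry("Gemini 2.5 Pro", 95.7, {"MMLU": 89.8, "GPQA": 88.4, "Coding": 52.3}),
    Entry("Claude 4.1 Opus", 95.5, {"MMLU": 88.8, "GPQA": 79.6, "Coding": 58.9}),
]

def rank_by(entries, benchmark):
    """Sort descending by one benchmark, pushing unreported scores last."""
    return sorted(
        entries,
        key=lambda e: (e.scores.get(benchmark) is None,
                       -(e.scores.get(benchmark) or 0)),
    )

print([e.name for e in rank_by(entries, "Coding")])
```

Re-ranking by Coding puts Claude 4.1 Opus first (58.9), consistent with its "Best for Coding" badge, even though it sits third by overall score; ranking by MMLU instead puts Gemini 2.5 Pro on top.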