LLM Leaderboard

Comparing top-tier general-purpose models on key reasoning and language benchmarks.

    Top Performers

1st Place

ChatGPT 5

OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

Overall Score: 97.3

2nd Place

Gemini 2.5 Pro

Google's flagship model with exceptional multimodal capabilities and a massive context window.

Overall Score: 95.7

3rd Place

Claude 4.1 Opus

Anthropic's most powerful model with exceptional reasoning and creative capabilities.

Overall Score: 95.5

Full Rankings

ChatGPT 5

OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

Top Tier Reasoning
Overall Score: 97.3
MMLU: 84.6
MMMU: 84.2
GPQA: 85.7
Coding: 56.5
TAU-Bench: 81.1
Multilingual: 88.8
AIME 2025: 92.6

Gemini 2.5 Pro

Google's flagship model with exceptional multimodal capabilities and a massive context window.

Top Tier Reasoning
Overall Score: 95.7
MMLU: 89.8
MMMU: 84.0
GPQA: 88.4
Coding: 52.3
TAU-Bench: 80.0
Multilingual: 89.0
AIME 2025: 89.0

Claude 4.1 Opus

Anthropic's most powerful model with exceptional reasoning and creative capabilities.

Best for Coding
Overall Score: 95.5
MMLU: 88.8
MMMU: 77.1
GPQA: 79.6
Coding: 58.9
TAU-Bench: 82.4
Multilingual: 89.5
AIME 2025: 78.0

Anthropic's most powerful model with exceptional reasoning and creative capabilities.

Best for Coding
Overall Score: 93.3
MMLU: 88.8
MMMU: 76.5
GPQA: 79.6
Coding: 55.9
TAU-Bench: 81.4
Multilingual: 88.8
AIME 2025: 75.5

OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

Overall Score: 91.6
MMLU: 85.6
MMMU: 82.9
GPQA: 83.3
Coding: 49.6
TAU-Bench: 70.4
Multilingual: 88.8
AIME 2025: 88.9

OpenAI's reasoning model optimized for complex problem-solving, mathematics, and coding tasks.

Overall Score: 89.5
MMLU: 89.3
MMMU: 78.2
GPQA: 78.0
Coding: 50.5
TAU-Bench: 73.5
Multilingual: 82.1
AIME 2025: 79.3

Anthropic's balanced model offering excellent performance across all domains.

Best for Coding
Overall Score: 88.3
MMLU: 74.4
MMMU: 74.4
GPQA: 75.4
Coding: 54.1
TAU-Bench: 80.5
Multilingual: 86.5
AIME 2025: 70.5
Qwen3 480B
Open Source

Alibaba's most powerful Qwen3 model with state-of-the-art performance across all benchmarks.

Overall Score: 87.4
MMLU: 82.3
MMMU: 82.4
GPQA: 78.3
Coding: 47.1
TAU-Bench: 70.9
Multilingual: 80.8
AIME 2025: 83.6

Anthropic's most powerful model with exceptional reasoning and creative capabilities.

Great for Creative Tasks
Overall Score: 87.2
MMLU: 88.8
MMMU: 75.0
GPQA: 68.0
Coding: 52.8
TAU-Bench: 81.2
Multilingual: 83.2
AIME 2025: 61.3

Google's optimized model balancing speed and performance for efficient deployment.

Top Tier Reasoning
Overall Score: 85.6
MMLU: 88.4
MMMU: 79.7
GPQA: 82.8
Coding: 42.6
TAU-Bench: 72.3
Multilingual: 87.2
AIME 2025: 72.0

OpenAI's omni-modal model with native audio, vision, and text capabilities.

Great for Creative Tasks
Overall Score: 83.7
MMLU: 88.7
MMMU: 69.1
GPQA: 53.6
Coding: 46.7
TAU-Bench: 78.0
Multilingual: 90.1
AIME 2025: 76.6

Mistral AI's most advanced model with superior multilingual and coding performance.

Great for Creative Tasks
Overall Score: 83.4
MMLU: 81.3
MMMU: 73.8
GPQA: 80.2
Coding: 40.9
TAU-Bench: 78.9
Multilingual: 86.4
AIME 2025: 72.6

OpenAI's enhanced multimodal model with improved reasoning and efficiency.

Overall Score: 80.2
MMLU: 74.8
MMMU: 71.8
GPQA: 66.3
Coding: 42.5
TAU-Bench: 68.0
Multilingual: 83.7
AIME 2025: 79.5

Anthropic's most capable model, excelling at coding, writing, and complex reasoning tasks.

Overall Score: 77.8
MMLU: 88.7
MMMU: 68.3
GPQA: 59.4
Coding: 54.6
TAU-Bench: 71.5
Multilingual: 79.2
AIME 2025: 16.0
DeepSeek-V3
Open Source

DeepSeek's advanced model with strong coding and reasoning capabilities.

Overall Score: 76.8
MMLU: 84.1
MMMU: 65.2
GPQA: 70.9
Coding: 48.6
TAU-Bench: 70.0
Multilingual: 71.8
AIME 2025: 32.0

Google's advanced model with a 2M-token context window and strong multimodal capabilities.

Great for Creative Tasks
Overall Score: 73.1
MMLU: 85.9
MMMU: 62.2
GPQA: 63.9
Coding: 36.5
TAU-Bench: 75.7
Multilingual: 88.0
AIME 2025: 40.0
Qwen2.5 72B
Open Source

Alibaba's flagship open-source model with exceptional multilingual and coding capabilities.

Overall Score: 71.5
MMLU: 72.3
MMMU: 75.2
GPQA: 49.8
Coding: 40.8
TAU-Bench: 72.4
Multilingual: 85.0
AIME 2025: 37.0
Llama 3.1 70B
Open Source

Meta's efficient large model offering strong performance with lower computational requirements.

Overall Score: 70.4
MMLU: 79.6
MMMU: 68.9
GPQA: 46.7
Coding: 34.5
TAU-Bench: 73.8
Multilingual: 60.3
AIME 2025: 70.2
DeepSeek-V3
Open Source

DeepSeek's advanced model with strong coding and reasoning capabilities.

Overall Score: 62.5
MMLU: 81.7
MMMU: 65.2
GPQA: 43.9
Coding: 29.4
TAU-Bench: 70.0
Multilingual: 71.8
AIME 2025: 32.0

xAI's most advanced model with real-time information access and enhanced reasoning.

Top Tier Reasoning
Overall Score: 54.7
MMLU: 87.6
MMMU: 72.1
GPQA: 87.5
Coding: -
TAU-Bench: -
Multilingual: 83.2
AIME 2025: 74.5

xAI's improved model with enhanced conversational abilities and real-time data access.

Overall Score: 54.7
MMLU: 85.7
MMMU: 76.0
GPQA: 80.2
Coding: -
TAU-Bench: -
Multilingual: 80.1
AIME 2025: 83.0

Anthropic's fastest model, optimized for speed while maintaining strong capabilities.

Overall Score: 53.0
MMLU: 75.2
MMMU: 46.4
GPQA: 33.3
Coding: 26.0
TAU-Bench: 61.8
Multilingual: 65.4
AIME 2025: 23.0
DeepSeek R1
Open Source

DeepSeek's advanced model with strong coding and reasoning capabilities.

Overall Score: 44.1
MMLU: -
MMMU: 76.0
GPQA: 71.5
Coding: 25.2
TAU-Bench: -
Multilingual: -
AIME 2025: 79.8
Llama 4 405B
Open Source

Meta's next-generation open-source model with state-of-the-art capabilities.

Overall Score: 41.3
MMLU: 85.5
MMMU: 73.4
GPQA: 69.8
Coding: -
TAU-Bench: -
Multilingual: 84.6
AIME 2025: -
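The rankings above are sorted by overall score, but a single benchmark often tells a different story. As a hypothetical illustration (this tooling is not part of the leaderboard; the Entry class and rank_by helper are invented for the sketch, and only the names and scores are copied from the listing), the entries can be loaded into a small data structure and re-ranked by any one benchmark, treating unreported "-" scores as missing:

```python
# Hypothetical sketch: re-ranking a few leaderboard entries by one benchmark.
# Entry and rank_by are invented for illustration; names and scores are
# copied from the listing above.
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    overall: float
    scores: dict  # benchmark name -> score; None where the listing shows "-"

entries = [
    Entry("ChatGPT 5", 97.3, {"MMLU": 84.6, "GPQA": 85.7, "Coding": 56.5}),
    Entry("Gemini 2.5 Pro", 95.7, {"MMLU": 89.8, "GPQA": 88.4, "Coding": 52.3}),
    Entry("Claude 4.1 Opus", 95.5, {"MMLU": 88.8, "GPQA": 79.6, "Coding": 58.9}),
]

def rank_by(entries, benchmark):
    """Sort descending by one benchmark, pushing unreported scores last."""
    return sorted(
        entries,
        key=lambda e: (e.scores.get(benchmark) is None,
                       -(e.scores.get(benchmark) or 0)),
    )

print([e.name for e in rank_by(entries, "Coding")])
```

Re-ranking by Coding puts Claude 4.1 Opus first (58.9), consistent with its "Best for Coding" badge, even though it sits third by overall score; ranking by MMLU instead puts Gemini 2.5 Pro on top.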