Meta’s vanilla Maverick AI model ranks below rivals on a popular chat benchmark

Earlier this week, Meta faced backlash for using an experimental version of its Llama 4 Maverick model to secure a high score on the LM Arena benchmark. The controversy led LM Arena's maintainers to apologize, revise their policies, and score the unmodified model. That standard version, 'Llama-4-Maverick-17B-128E-Instruct,' ranked below older models such as OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro, landing in 32nd place on the LM Arena leaderboard.
Meta explained that the experimental Maverick model was 'optimized for conversationality,' which suited LM Arena's evaluation system, in which human raters choose which of two outputs they prefer. While that strategy worked on LM Arena, it highlights the pitfalls of tailoring AI models to a specific benchmark: the results can mislead developers and users about how a model performs in varied real-world applications. Despite the controversy, Meta remains optimistic, stating that the open-source release of Llama 4 will let developers customize it for diverse use cases, and that it anticipates valuable feedback and innovations from the community.
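Arena-style leaderboards such as LM Arena's compute Elo-style ratings from these pairwise human votes. The minimal sketch below is a simplified illustration rather than LM Arena's actual scoring code; the model names and the assumed 70% preference rate are hypothetical. It shows why a variant that raters prefer in side-by-side chats climbs such a leaderboard even if it is no better on any other benchmark.

```python
import random

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under a standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one human vote between model A and model B."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Hypothetical simulation: a chat-tuned variant that raters prefer 70% of the time
# steadily pulls ahead of a baseline, regardless of performance on other tasks.
random.seed(0)
ratings = {"maverick-experimental": 1000.0, "baseline": 1000.0}
for _ in range(500):
    a_won = random.random() < 0.7  # raters prefer the more conversational output
    ratings["maverick-experimental"], ratings["baseline"] = update_elo(
        ratings["maverick-experimental"], ratings["baseline"], a_won
    )
print(ratings)  # the preferred variant ends up with a clearly higher rating
```

The takeaway is that an arena score measures only what human raters prefer in head-to-head comparisons, so tuning a model for that preference inflates its ranking without saying anything about accuracy, reasoning, or other real-world capabilities.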
RATING
The article provides a well-rounded and largely accurate account of the incident involving Meta's use of an experimental AI model on the LM Arena benchmark. It effectively conveys the main events and their implications, supported by credible sources and statements. The article's clarity and structure make it accessible, though it could benefit from additional perspectives and greater transparency regarding certain details, such as the specific ranking of the Maverick model. While the topic is timely and of interest to those in the tech industry, its focus on technical aspects may limit its broader appeal. Overall, the article successfully informs readers about a relevant issue in AI development, highlighting the need for transparency and fairness in performance evaluations.
RATING DETAILS
The story is largely accurate, with its main claims supported by available evidence and sources. The claim that Meta used an experimental version of its Llama 4 Maverick model to achieve a high score on LM Arena is substantiated by Meta's own acknowledgment and the subsequent response from LM Arena maintainers. The article accurately reports the apology and policy changes from LM Arena, and it correctly identifies the poor performance of the unmodified Maverick model compared to others like GPT-4o and Gemini 1.5 Pro. However, the specific ranking of the vanilla Maverick model at 32nd place requires verification, as it is mentioned in a tweet rather than corroborated by a direct source. The explanation of the experimental Maverick's optimization for conversationality is consistent with Meta's statement, adding credibility to the story.
The article presents a balanced view of the situation, incorporating perspectives from both Meta and LM Arena. It includes Meta's explanation for using an experimental version of the model and the company's plans for its open-source release, which helps provide a fuller picture of the company's intentions and actions. However, the article could have benefited from additional perspectives, such as insights from AI experts or developers who use these models, to provide a more comprehensive understanding of the implications of Meta's actions. The focus on Meta and LM Arena's responses might overshadow the broader context of AI benchmarking practices, which could have been explored further.
The article is generally clear and well-structured, with a logical flow that guides the reader through the events and their implications. The language is straightforward and accessible, making it easy for readers to grasp the main points. The use of specific examples, such as the names of competing models and the experimental Maverick's optimization, helps to clarify the narrative. However, some technical terms, like 'optimized for conversationality,' could be better explained for a general audience unfamiliar with AI terminology.
The article relies on statements from Meta and LM Arena, which are primary sources for the events described. These sources are credible, given their direct involvement in the incident. However, the article could have improved its source quality by including independent verification of the claims, such as expert opinions or third-party analyses of the benchmarking results. The reliance on a tweet for specific ranking details highlights a potential weakness in source quality, as social media posts may lack the rigor and reliability of more formal sources.
The article provides a reasonable level of transparency regarding the basis of its claims, citing Meta's statements and the actions of LM Arena. However, it falls short in explaining the methodology behind LM Arena's benchmarking process and the specific changes made to their policies. Greater transparency in these areas would enhance the reader's understanding of the situation and the implications of the changes. The article could also clarify the basis for the specific ranking of the Maverick model, as this detail is crucial to the story's narrative.
Sources
- https://bestofai.com/article/meta-llama-4-benchmarking-confusion-how-good-are-the-new-ai-models
- https://techcrunch.com/2025/04/11/chatgpt-everything-to-know-about-the-ai-chatbot/
- https://www.techtimes.com/articles/309909/20250407/meta-faces-backlash-over-experimental-maverick-ai-version-used-benchmark-rankings-why.htm
- https://beamstart.com/news/law-professors-side-with-authors-17444057545570
- https://techcrunch.com/2025/04/11/metas-vanilla-maverick-ai-model-ranks-below-rivals-on-a-popular-chat-benchmark/