Gemini 3 Tops All Kaggle Leaderboards as Game Arena Adds Poker and Werewolf


TL;DR

  • Platform Update: Google DeepMind expanded its Game Arena benchmarking platform with Poker and Werewolf alongside the existing chess competition.
  • Gemini Dominance: Gemini 3 Pro and Gemini 3 Flash hold the top Elo ratings across all three Game Arena leaderboards.
  • New Capabilities: The new games test social deception and risk management, skills that traditional AI benchmarks do not measure.
  • Industry Context: A RAND Corporation analysis found that current AI evaluations focus on reasoning and code generation, leaving key capabilities unmeasured.

Google DeepMind expanded its Game Arena platform with Werewolf and Poker this week, adding tests for social deception and risk management to the existing chess competition. Gemini 3 Pro and Gemini 3 Flash hold the top Elo ratings across all three games.
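Game Arena's exact rating methodology is not detailed here, but leaderboards of this kind typically rank models with the standard Elo update rule: each game shifts both players' ratings toward the result, weighted by how surprising that result was. A minimal sketch, assuming the textbook Elo formula with a K-factor of 32 (both are illustrative choices, not the platform's published parameters):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return both players' updated ratings after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two evenly rated models: a win moves the winner up by k/2.
# elo_update(1000, 1000, 1.0) → (1016.0, 984.0)
```

Because the expected score is symmetric, rating points are zero-sum between the two players; an upset win against a much higher-rated opponent moves the ratings far more than a win by the favorite.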

What Is Game Arena

Game Arena builds on a foundation laid in 2025, when Google DeepMind partnered with Kaggle to launch it as an independent, public benchmarking platform where AI models compete in strategic games.

It now spans three game types: chess for logical thinking, Werewolf for social deduction, and Poker for calculated risk. Real-world decisions rarely rely on the kind of perfect information found on a chessboard, the company noted, making games with hidden roles and uncertain outcomes better proxies for practical AI performance.

Testing Social Skills and Risk Management

Werewolf is Game Arena’s first team-based game played through natural language, requiring models to navigate imperfect information through dialogue. Players must identify hidden threats by reading conversational cues and detecting inconsistencies.

Poker tests risk management under uncertainty in a Heads-Up No-Limit Texas Hold'em format, where two players compete head-to-head with no betting cap. Models must weigh probabilities, manage their chip stacks, and decide when to bluff or fold with incomplete data.
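The core arithmetic behind "weighing probabilities with incomplete data" is the standard pot-odds calculation, not anything specific to the Game Arena harness. A minimal sketch: given the current pot, the amount required to call, and an estimated probability of winning the hand, calling is profitable in expectation only when that probability clears the pot-odds breakeven point (the function names are illustrative):

```python
def call_ev(pot: float, to_call: float, win_prob: float) -> float:
    """Expected chips gained by calling: win the pot, or lose the call amount."""
    return win_prob * pot - (1.0 - win_prob) * to_call

def breakeven_prob(pot: float, to_call: float) -> float:
    """Minimum win probability at which a call breaks even (the pot odds)."""
    return to_call / (pot + to_call)

def should_call(pot: float, to_call: float, win_prob: float) -> bool:
    """True when a call has positive expected value."""
    return call_ev(pot, to_call, win_prob) > 0.0

# Facing a 50-chip bet into a 100-chip pot, the breakeven point is
# 50 / 150 ≈ 33%: a 40% hand should call, a 30% hand should fold.
```

In No-Limit play the hard part is estimating `win_prob` against an opponent's hidden range, which is exactly the incomplete-information skill the benchmark aims to expose; the arithmetic above is the easy half.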

*Image: Kaggle Game Arena Poker*

Taken together, these games serve as both a competitive benchmark and a safety research tool. Werewolf tests manipulation detection, providing a controlled environment to study how models handle deceptive agents without real-world consequences, while Poker exposes how models quantify risk under pressure.

Yet that dual purpose has drawn scrutiny. Ethics researchers have raised concerns that testing deception could teach AI agents manipulative habits, though Google embeds red-teaming guidelines within every evaluation harness as a safeguard.