New Mamba-3 AI Model Beats Transformers by 4%, Runs 7x Faster


TL;DR

  • New Release: Together.ai has released Mamba-3, an open-source state space model that outperforms the Transformer baseline by nearly 4% on language modeling benchmarks.
  • Speed Gains: Mamba-3 completes long-sequence tasks up to 7x faster than Transformer models on identical H100 GPU hardware.
  • Industry Adoption: NVIDIA and IBM have already shipped hybrid Mamba-Transformer models for enterprise deployments, validating the architecture beyond research.
  • Open Access: The model is released under Apache 2.0 with code on GitHub and a paper accepted at ICLR 2026.

AI research lab Together.ai has released Mamba-3, an open-source state space model that outperforms the Transformer baseline by nearly 4% on language modeling benchmarks while running up to 7x faster at long sequences.

Released under Apache 2.0 on March 17, Mamba-3 arrives at a moment when agentic workflows have made inference efficiency the primary bottleneck in AI deployment. Tools like Codex and Claude Code generate sustained token streams that strain Transformer deployments, where attention cost scales quadratically with sequence length. By beating the Transformer baseline by nearly 4% while completing long-sequence tasks up to 7x faster, Mamba-3 offers a concrete alternative to those scaling costs.

Inference Over Training

Previous generations of Mamba prioritized training speed. Mamba-2, released in mid-2024, delivered 2-8x faster training than Mamba-1 but did not address the decode bottleneck that dominates production costs. With Mamba-3, researchers reversed that focus entirely, targeting an inference-first design that addresses what they call the “cold GPU” problem: hardware sitting idle during decode while waiting for memory movement rather than performing computation.

Speed gains are substantial. At a 16,384-token sequence length on an H100 GPU, Mamba-3 completes prefill and decode in 140.61 seconds, compared to 976.50 seconds for Meta’s Llama-3.2-1B running on vLLM. Measured differently, Mamba-3 doubles inference throughput on identical hardware by halving state size compared to Mamba-2 while matching its perplexity. For organizations running AI coding assistants, customer service agents, or multi-step research workflows, that efficiency gain translates directly into lower compute bills.
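A quick sanity check on the headline figure: dividing the reported end-to-end times gives a speedup just under 7x, consistent with the "up to 7x" claim.

```python
# Reported end-to-end (prefill + decode) times at a 16,384-token sequence
# length on an H100, taken from the benchmark figures cited above.
mamba3_seconds = 140.61
llama_vllm_seconds = 976.50  # Llama-3.2-1B on vLLM

speedup = llama_vllm_seconds / mamba3_seconds
print(f"Speedup: {speedup:.2f}x")  # ~6.94x, i.e. "up to 7x"
```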

Unlike Transformers, which must recompute attention across all prior tokens at each step, state space models compress context into a fixed-size state, so compute grows only linearly with sequence length, making them inherently better suited to sustained generation.
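To make the contrast concrete, here is a minimal sketch of a diagonal linear state space recurrence. This is an illustrative assumption, not Mamba-3's actual parameterization; the values of `a` and `b` are made up. The point is structural: the state stays the same size no matter how many tokens are processed, so per-token decode cost and memory are constant, whereas an attention KV cache grows with every token.

```python
import random

d_state = 16                        # fixed state size (assumed for illustration)
a = [0.9] * d_state                 # per-channel decay coefficients (assumed)
b = [0.1] * d_state                 # per-channel input gains (assumed)

h = [0.0] * d_state                 # the entire context lives in this state
random.seed(0)
for t in range(10_000):             # a long sequence...
    x_t = random.gauss(0.0, 1.0)    # stand-in for a token's input feature
    # One recurrence step: O(d_state) work, independent of t.
    h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]

print(len(h))  # state is still 16 numbers after 10,000 tokens
```

A Transformer at the same step would be attending over a cache of 10,000 keys and values; the recurrence above touches only 16 numbers regardless of position, which is the property that keeps the GPU busy computing rather than waiting on memory during decode.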