TL;DR
- New Release: Together.ai has released Mamba-3, an open-source state space model that outperforms the Transformer baseline by nearly 4% on language modeling benchmarks.
- Speed Gains: Mamba-3 completes long-sequence tasks up to 7x faster than Transformer models on identical H100 GPU hardware.
- Industry Adoption: NVIDIA and IBM have already shipped hybrid Mamba-Transformer models for enterprise deployments, validating the architecture beyond research.
- Open Access: The model is released under Apache 2.0 with code on GitHub and a paper accepted at ICLR 2026.
AI research lab Together.ai has released Mamba-3, an open-source state space model that outperforms the Transformer baseline by nearly 4% on language modeling benchmarks while running up to 7x faster at long sequences.
Released under Apache 2.0 on March 17, Mamba-3 arrives at a moment when agentic workflows have made inference efficiency the primary bottleneck in AI deployment. Tools like Codex and Claude Code generate sustained token streams that strain Transformer deployments, where attention scales quadratically with sequence length. By pairing a 2.2-percentage-point accuracy gain over the Transformer baseline with long-sequence completion nearly 7x faster, Mamba-3 offers a concrete alternative to those scaling costs.
Inference Over Training
Previous generations of Mamba prioritized training speed. Mamba-2, released in mid-2024, delivered 2-8x faster training than Mamba-1 but did not address the decode bottleneck that dominates production costs. With Mamba-3, researchers reversed that focus entirely, targeting an inference-first design that addresses what they call the “cold GPU” problem: hardware sitting idle during decode while waiting for memory movement rather than performing computation.
Speed gains are substantial. At 16,384-token sequence length on an H100 GPU, Mamba-3 completes prefill and decode in 140.61 seconds compared to 976.50 seconds for Meta’s Llama-3.2-1B running on vLLM. Measured differently, Mamba-3 doubles inference throughput on identical hardware by halving state size relative to Mamba-2 while matching its perplexity. For organizations running AI coding assistants, customer service agents, or multi-step research workflows, that efficiency gain translates directly into lower compute bills.
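The quoted timings imply the headline speedup directly; a quick sanity check using the figures above:

```python
# Reported end-to-end (prefill + decode) latencies at 16,384 tokens on an H100:
mamba3_s = 140.61   # Mamba-3, seconds
llama_s = 976.50    # Llama-3.2-1B on vLLM, seconds

speedup = llama_s / mamba3_s
print(f"{speedup:.2f}x")  # → 6.94x, i.e. "nearly 7x"
```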
Unlike Transformers, which must recompute attention across all prior tokens at each step, state space models compress context into a fixed-size state, so compute and memory grow only linearly with sequence length, making them inherently better suited to sustained generation.
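The contrast can be sketched in a few lines. This toy decode loop (all names and shapes are illustrative, not from the Mamba-3 code) shows that each new token touches only a fixed-size state, so per-step cost is constant no matter how long the sequence gets:

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_model = 16, 8            # illustrative sizes, not Mamba-3's

A = 0.9 * np.eye(d_state)           # state transition (decay)
B = rng.standard_normal((d_state, d_model))
C = rng.standard_normal((d_model, d_state))

h = np.zeros(d_state)               # the entire "context" lives here
for t in range(10_000):             # cost per step is O(1) in sequence length
    x_t = rng.standard_normal(d_model)
    h = A @ h + B @ x_t             # update the fixed-size state
    y_t = C @ h                     # emit next-token features

assert h.shape == (d_state,)        # the state never grows with t
```

A Transformer decoding the same 10,000 tokens would attend over an ever-growing key/value cache at each step; here, memory traffic per token is constant.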
Three Technical Innovations
Three core changes to state space model architecture drive Mamba-3’s improvement. First, exponential-trapezoidal discretization replaces the first-order Exponential-Euler method used in earlier iterations with a second-order accurate approximation. By itself, this change removed the short causal convolution that had been a fixture of recurrent architectures since Mamba-1 and RWKV-4, simplifying the model while improving accuracy. Eliminating that convolution also reduces computational overhead during both training and inference, since one fewer operation must execute at each time step.
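To see why a second-order rule matters, here is a small numerical sketch (my own illustration, not the paper's formulation) comparing an exponential-Euler step with an exponential-trapezoidal step on the scalar ODE dh/dt = a·h + u(t). Halving the step size roughly halves the first-order method's error but quarters the second-order method's:

```python
import math

a = -1.0
u = math.sin                                 # toy input signal

def exact(t):                                # closed-form solution with h(0) = 0
    return 0.5 * (math.sin(t) - math.cos(t) + math.exp(-t))

def integrate(step, n, T=2.0):
    dt, h = T / n, 0.0
    for i in range(n):
        h = step(h, i * dt, dt)
    return abs(h - exact(T))                 # global error at t = T

def euler(h, t, dt):                         # first-order exponential-Euler
    e = math.exp(a * dt)
    return e * h + (e - 1) / a * u(t)

def trap(h, t, dt):                          # second-order exponential-trapezoidal
    e = math.exp(a * dt)
    return e * h + 0.5 * dt * (e * u(t) + u(t + dt))

ratios = {}
for step in (euler, trap):
    ratios[step.__name__] = integrate(step, 20) / integrate(step, 40)
    print(step.__name__, round(ratios[step.__name__], 2))  # ~2 for euler, ~4 for trap
```

The error-ratio difference is the practical meaning of "second-order accurate": the trapezoidal rule tracks the continuous dynamics much more tightly at the same step size.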
Second, complex-valued SSMs via what researchers call the RoPE trick enable Mamba-3 to solve synthetic reasoning tasks that were impossible for Mamba-2. Where earlier SSM variants struggled with tasks requiring precise positional awareness, complex-valued states encode rotational position information directly into the recurrence, closing a key gap with Transformer attention. Positional encoding had long been considered a weakness of recurrent models, and this approach borrows the same rotary embedding technique that improved Transformer performance in models like Llama.
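The idea behind the "RoPE trick" can be illustrated with complex numbers (a toy sketch, not the paper's actual parameterization): multiplying by a unit-magnitude complex factor at each position rotates the state, so interactions depend only on relative position, exactly the property rotary embeddings give Transformer attention:

```python
import cmath

theta = 0.3                       # illustrative rotation frequency
q, k = 0.8 + 0.2j, 0.5 - 0.7j     # toy query/key features

def score(m, n):
    """Interaction between a query at position m and a key at position n,
    after rotating each by e^{i*theta*position}."""
    qm = q * cmath.exp(1j * theta * m)
    kn = k * cmath.exp(1j * theta * n)
    return qm * kn.conjugate()    # = q * conj(k) * e^{i*theta*(m-n)}

# The same offset (m - n = 3) gives the same score at any absolute position:
assert abs(score(5, 2) - score(10, 7)) < 1e-12
```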
A MIMO (multiple-input multiple-output) formulation rounds out the trio, performing up to 4x more parallel mathematical operations per step by switching from outer-product to matrix-multiplication-based state updates. Because modern GPUs have far more compute capacity than memory bandwidth, MIMO exploits idle arithmetic units during decode without increasing latency. In practice, MIMO matches Mamba-2’s decode speed while delivering substantially stronger accuracy across all tested benchmarks.
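The SISO-to-MIMO shift can be sketched with state updates (shapes illustrative, not Mamba-3's actual dimensions): a rank-1 outer-product update becomes a rank-r matrix multiply, which is algebraically r stacked rank-1 updates but executes as one dense matmul that GPU tensor cores handle far more efficiently:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 32, 16, 4               # state dim, model dim, MIMO rank (illustrative)

S = rng.standard_normal((n, d))   # fixed-size SSM state
a = 0.95                          # scalar decay, for simplicity

# SISO-style step: one rank-1 outer-product update per token
b, x = rng.standard_normal(n), rng.standard_normal(d)
S_siso = a * S + np.outer(b, x)

# MIMO-style step: a rank-r update, expressed as a single matmul
B, X = rng.standard_normal((n, r)), rng.standard_normal((r, d))
S_mimo = a * S + B @ X

# The matmul equals the sum of r rank-1 updates -- same state size,
# r times more useful arithmetic per step:
stacked = a * S + sum(np.outer(B[:, j], X[j]) for j in range(r))
assert np.allclose(S_mimo, stacked)
```

Because decode is memory-bound, the extra arithmetic rides along on bandwidth the GPU was already spending, which is why decode latency stays flat.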
Mamba co-creator Albert Gu expressed satisfaction with the design, noting that the three methodological changes were inspired by “elegant math and methods.”
Benchmark Results and Industry Adoption
Architectural changes translate into measurable gains. At 1.5-billion-parameter scale, Mamba-3’s MIMO variant achieves 57.6% average accuracy across downstream benchmarks. According to the research team’s arXiv paper, base Mamba-3 improves on Gated DeltaNet, previously the strongest non-Transformer model, by 0.6 percentage points; the MIMO formulation contributes another 1.2 points, for a total 1.8-percentage-point gain over Gated DeltaNet. At 1-billion parameters, MIMO boosts accuracy by over 1 percentage point versus standard Mamba-3.
MIMO achieves these gains without sacrificing decode speed. According to Together.ai, training costs increase with additional parallel operations, but decode latency stays flat as idle GPU cores absorb the extra work. For production deployments where inference dominates cost, MIMO offers strictly better quality at equivalent serving expense.
Beyond MIMO, Mamba-3 SISO, the simpler single-input single-output variant, also improves over its predecessor. It matches Mamba-2’s architecture shapes exactly while delivering better accuracy, and achieves comparable perplexity with half the state size. For developers who need a drop-in replacement for Mamba-2 without adopting the full MIMO formulation, SISO provides an immediate upgrade path with no architectural changes required.
Benchmarks were conducted on a single H100-SXM 80GB GPU with batch size 128, a configuration accessible to many research labs and mid-size companies. Reproducing these results does not require multi-node clusters, which lowers the barrier for independent verification and adoption.
[Charts: prefill latency and prefill+decode latency comparisons]
SSM architectures do carry a known trade-off, however. Pure SSM models compress all prior context into a fixed-size state, which means they underperform Transformers on retrieval-heavy tasks that require precise recall of specific earlier tokens. A question-answering system that must locate a single fact buried in a 100,000-token document, for example, would still benefit from full attention over the input. Researchers acknowledge this limitation in their paper, which is one reason they predict hybrid architectures combining SSM and attention layers will dominate production deployments.
Enterprise Momentum
Hybrid prediction is already playing out in practice. NVIDIA released its Nemotron 3 Super on March 11, just one week before Mamba-3. Built as a 120B parameter hybrid Mamba-Transformer MoE, Nemotron 3 Super delegates the majority of sequence processing to Mamba-2 layers with Transformer attention layers interleaved at regular intervals. According to NVIDIA, this hybrid design achieves a 1-million-token context window through SSM efficiency while maintaining strong retrieval performance via the interleaved attention layers.
Nemotron 3 Super’s architecture illustrates how enterprises are already leveraging Mamba’s strengths. Only 12 billion of its 120 billion total parameters are active at inference time, courtesy of the mixture-of-experts design. Combined with SSM layers that process sequences in linear time rather than recomputing quadratic attention, the model can handle extended multi-agent conversations that generate up to 15x the tokens of standard chat sessions without proportional cost increases.
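A hybrid stack of this kind is straightforward to express. The sketch below builds a layer list with attention interleaved at a fixed interval; the interval and layer count are hypothetical, not Nemotron's actual configuration:

```python
def hybrid_stack(n_layers: int, attn_every: int) -> list[str]:
    """Mostly SSM layers, with full attention interleaved at a fixed interval."""
    return [
        "attention" if (i + 1) % attn_every == 0 else "mamba"
        for i in range(n_layers)
    ]

layers = hybrid_stack(n_layers=12, attn_every=4)   # hypothetical sizes
# ['mamba', 'mamba', 'mamba', 'attention', ...] -- SSM layers handle most
# of the sequence cheaply; the attention layers preserve precise retrieval.
assert layers.count("attention") == 3
```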
IBM followed a similar path in October 2025 with its Granite 4.0 models, which also adopted a hybrid Mamba-Transformer architecture to reduce serving costs for enterprise customers running large-scale deployments. With two of the largest AI infrastructure companies now shipping Mamba-based production models, the architecture has moved well beyond the early research prototype stage and into active commercial deployment.
Looking Ahead
Given that enterprise momentum, Mamba-3’s research team predicts that linear layers will predominantly be used alongside global self-attention layers for language modeling, noting that hybrid models outperform pure Transformers on retrieval tasks while maintaining efficiency. Pure SSM models excel at sustained generation and long-context processing, while attention layers handle precise information retrieval, making the combination stronger than either approach alone. With the paper accepted at ICLR 2026 and code published on GitHub, practitioners can begin integrating Mamba-3 immediately.
Student researchers Aakash Lahoti and Kevin Y. Li led development of Mamba-3, building on the original architecture created in 2023 by Albert Gu of Carnegie Mellon and Tri Dao of Princeton. Gu, who describes himself on X as “leading the SSM revolution,” has now overseen three major iterations of the architecture in just over two years. Apache 2.0 licensing means commercial deployment requires no special permissions, removing a barrier that has slowed adoption of other open-weight models with more restrictive terms. Whether Mamba-3 displaces Transformers outright or finds its strongest role inside hybrid architectures, its release marks a milestone: for the first time, a pure state space model has beaten the Transformer baseline on standard language benchmarks.

