AWS Unveils 3nm Trainium3 AI Chips and Pivots to Support Nvidia NVLink


Conceding that Nvidia’s ecosystem dominance is currently insurmountable, Amazon Web Services (AWS) executed a major strategic pivot Tuesday. While unveiling its powerful 3-nanometer Trainium3 chips, the cloud giant revealed that its future hardware will support Nvidia’s proprietary NVLink Fusion interconnect.

Targeting the compute-heavy “Age of Inference,” the new Trainium3 UltraServers promise four times the performance of their predecessors. By moving to a 3nm process, AWS aims to offer a cost-effective alternative to Nvidia’s expensive Blackwell GPUs for large-scale reasoning models.

This dual strategy lets Amazon serve massive clusters for partners like Anthropic while retaining data center rack share among customers committed to Nvidia’s hardware standard.

Trainium3: The 3nm Leap for the ‘Age of Inference’

Marking a significant manufacturing milestone, AWS has moved its silicon to a 3-nanometer process node for the first time. Engineers prioritized density and power efficiency rather than just raw clock speed to achieve these gains.

Metrics indicate a generational leap, with the new chip delivering 4x the performance of its Trainium2 predecessor.

Driving this expansion is not merely a desire for speed but a pressing need to cap the spiraling energy costs of the AI boom. Energy efficiency has become the primary constraint for data centers, and Trainium3’s official specs claim a 4x improvement in performance per watt. Taken together with the 4x performance figure, that implies per-chip power draw stays roughly flat across the generation.

Such optimization targets the “Age of Inference,” Ilya Sutskever’s term for the shift away from the Age of Scaling, in which running models at scale costs significantly more than training them.

AWS is positioning its new silicon as the premier choice for these workloads. Official documentation states that “Trainium3 is the fastest accelerator, delivering up to 3× faster performance than Trainium2 and 3× better power efficiency than any other accelerator on the service.”

Memory bandwidth, a critical bottleneck for large language models (LLMs), sees a 1.7x increase over the previous generation. The greater throughput keeps data flowing to the compute cores, reducing the idle time that often plagues high-performance clusters.

The UltraServer Architecture: Scaling to 144 Chips

Moving beyond the individual chip, AWS also introduced the Trn3 UltraServer as its new unit of compute. Integrating 64 to 144 Trainium3 chips into a single cohesive system creates a density previously unavailable in standard racks.

Connectivity is handled by the proprietary NeuronLink interconnect, designed to minimize latency between chips. According to the official AWS technical specifications:

“Each AWS Trainium3 chip provides 2.52 petaflops (PFLOPs) of FP8 compute, increases the memory capacity by 1.5x and bandwidth by 1.7x over Trainium2 to 144 GB of HBM3e memory, and 4.9 TB/s of memory bandwidth.”

Clusters built on the new architecture can scale to 1 million chips, a scale previously theoretical for most enterprises. Such capacity is required for “frontier-scale” models that exceed a trillion parameters.
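For a sense of scale, here is a back-of-envelope calculation using only the figures quoted above (the Trainium2 bandwidth baseline is inferred from the 1.7x multiplier, not stated directly by AWS):

```python
# Back-of-envelope totals from the quoted Trainium3 figures.
CHIP_FP8_PFLOPS = 2.52   # PFLOPs of FP8 compute per chip
CHIP_HBM_GB = 144        # GB of HBM3e per chip
CHIP_BW_TBS = 4.9        # TB/s of memory bandwidth per chip

# One Trn3 UltraServer spans 64 to 144 chips.
for chips in (64, 144):
    print(f"{chips:>3}-chip UltraServer: "
          f"{chips * CHIP_FP8_PFLOPS:,.0f} PFLOPs FP8, "
          f"{chips * CHIP_HBM_GB / 1024:.1f} TB of HBM3e")

# AWS says clusters can scale to 1 million chips.
print(f"1M-chip cluster: ~{1_000_000 * CHIP_FP8_PFLOPS / 1000:,.0f} EFLOPs FP8")

# The 1.7x bandwidth multiplier implies a Trainium2 baseline of roughly:
print(f"Implied Trainium2 bandwidth: ~{CHIP_BW_TBS / 1.7:.1f} TB/s per chip")
```

The math works out to roughly 363 PFLOPs of FP8 compute and 20 TB of HBM3e in a fully populated 144-chip UltraServer, and an implied Trainium2 baseline of about 2.9 TB/s per chip.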

By densifying the rack, AWS reduces the physical footprint required for the same amount of compute. Native support for advanced data types like MXFP8 optimizes for the lower precision requirements of AI inference.
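MXFP8 is one of the OCP “microscaling” (MX) formats: 8-bit floating-point values stored in small blocks (32 elements in the OCP spec) that share a single power-of-two scale. The sketch below illustrates that block-scaling idea only; the E4M3 rounding is crudely emulated, and nothing here reflects Trainium3’s actual datapath.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3
BLOCK = 32         # elements per shared scale in the OCP MX spec

def round_e4m3ish(x):
    """Crude stand-in for E4M3 rounding: keep 1 implicit + 3 mantissa bits."""
    m, e = np.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    return np.ldexp(np.round(m * 16) / 16, e)

def mx_quantize(x):
    """Toy MXFP8-style quantization: one power-of-two scale per block."""
    x = x.reshape(-1, BLOCK)
    amax = np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-30)
    # Smallest power-of-two scale mapping each block's max into E4M3 range,
    # mimicking the shared E8M0 exponent of the MX formats.
    scale = 2.0 ** np.ceil(np.log2(amax / E4M3_MAX))
    q = round_e4m3ish(np.clip(x / scale, -E4M3_MAX, E4M3_MAX))
    return q, scale

vals = (np.random.randn(4 * BLOCK) * 100).astype(np.float32)
q, scale = mx_quantize(vals)
err = np.abs((q * scale).reshape(-1) - vals)
print("max abs quantization error:", float(err.max()))
```

The appeal for inference is that each value costs 8 bits plus a small shared scale per block, roughly halving memory traffic versus 16-bit formats while the per-block scaling preserves dynamic range.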


Strategic Pivot: Why AWS Is Embracing Nvidia’s NVLink

In a surprise roadmap reveal, AWS announced that the future Trainium4 will support Nvidia’s NVLink Fusion. Such interoperability departs from the “walled garden” approach typical of hyperscaler custom silicon.

Far from a simple hardware upgrade, the introduction of NVLink support signals a strategic shift in Amazon’s competitive philosophy. Observers interpret the move as a tacit admission that Nvidia’s ecosystem dominance is too strong to displace entirely.

By supporting NVLink, AWS allows customers to mix and match Nvidia GPUs with AWS infrastructure in the same cluster. This “hybrid” approach contrasts sharply with Google’s Ironwood TPU strategy, which relies entirely on its own vertical stack and optical switching.

Amazon aims to capture the “rack share” of the data center even if the “chip share” remains with Nvidia. Reporting from Data Center Dynamics outlines the forward-looking capabilities:

“Trainium4, AWS says, will have at least 3x the FP8 processing power and 4x more memory bandwidth than Trainium3. In addition, Trainium4 will be designed to support Nvidia NVLink Fusion chip interconnect technology.”
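Taken at face value (reading “4x more” as a straight 4x multiple), those multipliers put a floor under Trainium4’s per-chip numbers when applied to the Trainium3 figures quoted earlier:

```python
TRN3_FP8_PFLOPS = 2.52   # per chip, from the quoted Trainium3 specs
TRN3_BW_TBS = 4.9        # per chip, TB/s

# "At least 3x the FP8 processing power and 4x more memory bandwidth"
print(f"Trainium4 FP8 compute: >= {3 * TRN3_FP8_PFLOPS:.2f} PFLOPs per chip")
print(f"Trainium4 bandwidth:   >= {4 * TRN3_BW_TBS:.1f} TB/s per chip")
```

That works out to at least 7.56 PFLOPs of FP8 compute and roughly 19.6 TB/s of memory bandwidth per chip, if the multipliers hold.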

Independent analysts have not yet publicly verified these performance claims against Nvidia’s Blackwell architecture. However, market watchers view this as a “co-existence” strategy, acknowledging that Nvidia’s CUDA moat remains formidable.

Project Rainier: The Anthropic Factor

As AWS’s key AI partner, Anthropic stands to benefit most from this scale-out architecture. “Project Rainier” provides a large-scale compute cluster powered by hundreds of thousands of Trainium chips.

Engineers describe the cluster as five times more powerful than Anthropic’s current systems. Such power is critical for training Claude 4 and future iterations to compete with GPT-5.

AWS’s $8 billion investment in Anthropic is effectively being “round-tripped” back into AWS compute revenue. Anthropic’s adoption also validates the hardware, proving it can handle the most demanding workloads in the industry.

Unlike Microsoft’s reliance on OpenAI (and thus Nvidia), AWS is building a sovereign stack for its model partner.

The Economics of Reasoning: Efficiency vs. Brute Force

Broadly, the sector is shifting from simple token prediction to complex “System 2” reasoning tasks. Reasoning models, like OpenAI’s GPT-5.1 or Google’s Gemini 3 Pro, require exponentially more compute at runtime.

Jevons Paradox suggests that as inference becomes cheaper, demand for “thinking” time will explode, keeping total costs high.

AWS is positioning Trainium3 not as the absolute fastest chip, but as the most “economic” choice for these long-running queries. The company claims that “in large-scale serving tests (e.g., GPT-OSS), Trn3 delivers over 5× higher output tokens per megawatt than Trn2 at similar latency per user.”
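To see why tokens per megawatt is the metric that matters, consider a rough sketch that reads the figure as sustained throughput per unit of power (i.e., tokens per megawatt-hour): a 5x gain cuts the energy bill per token fivefold. The electricity price and baseline throughput below are illustrative assumptions, not AWS figures.

```python
# Illustrative economics of the claimed 5x tokens-per-megawatt gain.
# Everything except the 5x multiplier is an assumption for the example.
PRICE_PER_MWH = 80.0           # assumed electricity cost, $/MWh
TRN2_TOKENS_PER_MWH = 50e6     # hypothetical baseline throughput

trn3_tokens_per_mwh = 5 * TRN2_TOKENS_PER_MWH   # the quoted 5x claim

for name, tpm in [("Trn2", TRN2_TOKENS_PER_MWH), ("Trn3", trn3_tokens_per_mwh)]:
    cost = PRICE_PER_MWH / (tpm / 1e6)   # dollars per million output tokens
    print(f"{name}: ${cost:.2f} in energy per 1M output tokens")
```

Under these assumed inputs, the energy bill per million output tokens falls from $1.60 to $0.32; whatever the real baseline, the ratio is what matters for long-running reasoning workloads.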

Such cost-focused positioning contrasts with Google’s 1,000x capacity mandate, which relies on brute-force vertical integration. For enterprise CIOs, the choice between Nvidia (flexibility) and Trainium (cost) is becoming the defining architectural decision of 2026.


