Microsoft Adds Managed Memory to Foundry Agent Service, Ending ‘Stateless’ AI Era


TL;DR

  • The gist: Microsoft launched a managed memory feature for its Foundry Agent Service in November, enabling AI agents to retain long-term user context across sessions.
  • Key details: The service is free during preview, with limits of 10,000 memory items per scope and 1,000 requests per minute.
  • Why it matters: This eliminates the need for developers to build custom vector databases, treating memory as a core infrastructure primitive for enterprise AI.
  • Context: This move follows similar releases from AWS and Google, signaling a shift from stateless chatbots to persistent, context-aware agentic systems.

Microsoft is attempting to solve the “goldfish memory” problem plaguing enterprise AI by integrating long-term state management directly into its Foundry Agent Service. A new managed memory feature, now in public preview, allows developers to build agents that retain user context across sessions without maintaining complex custom databases.

By treating memory as a core infrastructure primitive rather than an add-on, Microsoft aims to eliminate the heavy lifting of vector storage and retrieval-augmented generation (RAG) pipelines. The service automatically handles the extraction, consolidation, and retrieval of facts, enabling agents to recall user preferences and conversation history natively.

This move positions Microsoft to compete directly with similar managed offerings from AWS and Google, signaling a broader industry shift away from stateless chatbots toward persistent, context-aware agentic systems.


A Three-Phase Approach to Context

Microsoft’s Foundry Agent Service now includes a managed memory store, addressing the limitation where agents reset context after every session. Previously, developers had to build custom solutions to maintain continuity, often relying on external vector databases and complex retrieval logic.

Operationally, the service manages the information lifecycle through three distinct phases: Extraction, Consolidation, and Retrieval. During the Extraction phase, the service identifies key facts and user preferences from ongoing conversations, such as dietary restrictions or project details.

Following extraction, the Consolidation phase uses LLMs to merge duplicate facts and resolve conflicts. For example, if a user updates their preference from “coffee” to “tea,” the system detects the conflict and updates the record to ensure accuracy.

Retrieval employs hybrid search to surface relevant memories at the start of a session, ensuring agents have immediate context without requiring the user to repeat information. By automating these steps, the architecture replaces the manual “plumbing” developers previously built using vector databases and custom embedding logic.
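The three-phase lifecycle can be illustrated with a minimal sketch. This is not the Foundry SDK — the class, its methods, and the simple string parsing are all assumptions standing in for the LLM-driven extraction and hybrid search the service actually performs:

```python
# Illustrative sketch of the Extraction -> Consolidation -> Retrieval lifecycle.
# All names here are hypothetical; real extraction and retrieval use LLMs and
# hybrid (keyword + vector) search, not the toy logic below.
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    facts: dict = field(default_factory=dict)  # key -> value, e.g. "drink" -> "tea"

    def extract(self, turn: str) -> dict:
        """Extraction: pull key facts from a conversation turn.
        Here we parse a simple 'key: value; key: value' convention."""
        extracted = {}
        for part in turn.split(";"):
            if ":" in part:
                key, value = part.split(":", 1)
                extracted[key.strip()] = value.strip()
        return extracted

    def consolidate(self, new_facts: dict) -> None:
        """Consolidation: merge new facts, letting newer values win conflicts
        (e.g. a preference updated from 'coffee' to 'tea')."""
        self.facts.update(new_facts)

    def retrieve(self, query: str) -> dict:
        """Retrieval: surface facts relevant to a query (naive keyword
        matching stands in for hybrid search)."""
        return {k: v for k, v in self.facts.items() if k in query}


store = MemoryStore()
store.consolidate(store.extract("drink: coffee; diet: vegetarian"))
store.consolidate(store.extract("drink: tea"))  # conflict resolved in favor of the update
print(store.retrieve("what drink does the user like?"))  # {'drink': 'tea'}
```

The key design point the sketch captures is that consolidation is last-write-wins on conflicting facts, so an agent always retrieves the user's current preference rather than a stale one.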

The service scans conversation histories and preserves two kinds of data: specific facts about the user's profile and broader summaries of previous chat interactions.

Operational Limits and Security

To ensure production viability, the preview service imposes strict operational limits. Each memory scope, which can be defined per user or session, is capped at 10,000 discrete memory items. Throughput is throttled to 1,000 requests per minute to maintain system stability during the preview phase.

Security and isolation are handled via scoping mechanisms, allowing developers to partition memory by Entra ID or custom UUIDs. Such partitioning ensures that user data remains isolated and accessible only within the appropriate context.

On pricing, the memory capabilities themselves are free during the public preview; users incur only the standard consumption charges for the underlying chat and embedding models that power extraction, consolidation, and retrieval. This lowers the barrier to experimentation, encouraging developers to integrate stateful capabilities into their applications.

Vivan Amim, Director of AI Research at Microsoft, contextualized the strategic importance of this architectural change:

“Memory is quickly becoming the ‘state layer’ for agentic systems. Foundry is turning that from a demo feature into an enterprise primitive.”

Competing for the State Layer

Microsoft’s move follows similar releases from major cloud competitors, signaling a commoditization of agent memory. AWS Bedrock Agents introduced memory retention in July 2024, offering a default 30-day retention period configurable up to a year.

Google’s Vertex AI Memory Bank also provides managed long-term memory, distinguishing between short-term working memory and long-term storage. Collectively, these services reflect a growing consensus that state management is a platform-level concern rather than an application-level problem.

For developers, the trade-off shifts from “build vs. buy” to selecting a platform ecosystem that handles state management most effectively. While custom solutions offer infinite flexibility, managed services reduce the maintenance burden of vector stores and retrieval logic.

This development builds on the Azure AI Foundry platform, which unified Microsoft’s AI tools in November 2024, and the subsequent launch of the Azure AI Agent Service in February 2025. It also aligns with the broader Agent Framework launch from October 2025, which aims to standardize agent development across the industry.


