LinkedIn’s AI-first strategy, which includes building agents that support professional success, has recently advanced through work to enable agentic reinforcement learning (RL) training for the GPT-OSS model. Agentic RL optimizes an entire decision-making process by letting the model interact directly with an environment and collecting on-policy data over multi-step trajectories. Validating GPT-OSS for agentic RL training addresses a critical gap, as previous work primarily focused on fine-tuning without tool-calling capabilities. Successfully integrating GPT-OSS into agentic RL frameworks promises more scalable, reliable, and adaptable AI systems for complex professional tasks such as information retrieval, query refinement, and multi-step workflow execution.

KEY DEVELOPMENTS

  • Agentic reinforcement learning (RL) extends traditional LLM training by optimizing multi-step decision processes through direct environmental interaction.
  • GPT-OSS, a model comparable to OpenAI’s o3-mini and o4-mini, has been successfully adapted for agentic RL training using the verl framework.
  • Initial training runs, particularly with GPT-OSS-20B, encountered significant issues, including exploding KL divergence and entropy and rewards that failed to increase.
  • Key fixes involved resolving a log-probability mismatch in the Mixture of Experts (MoE) architecture and implementing attention sink support within FlashAttention v3.
  • Memory efficiency during training was improved by patching the Hugging Face Transformers implementation to avoid repeated materialization of MoE experts under FSDP.

WHAT HAPPENED

Researchers set out to validate GPT-OSS, particularly the 20B and 120B variants, for agentic reinforcement learning training. The aim was to move beyond single-turn response optimization towards systems capable of multi-step reasoning, tool invocation, and adaptive behavior in dynamic environments. The verl framework, a popular open-source tool for agentic RL, was chosen for these experiments, with GSM8K (math reasoning), ReTool (multi-turn tool use), and VerifyIf (verifiable instruction following) used for evaluation.

Early training attempts with GPT-OSS-20B revealed substantial instability, characterized by exploding KL divergence, increasing entropy, and a failure to improve rewards. A core problem was a log-probability mismatch in Proximal Policy Optimization (PPO) caused by the Mixture of Experts (MoE) architecture: expert routing could differ between the two forward passes that compute the old and current log-probabilities, so the importance ratio deviated from one even when training was on-policy. This was addressed by programmatically enforcing the ratio to one when on-policy conditions were met.
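The effect of enforcing the ratio can be illustrated with a minimal, framework-free sketch of the clipped PPO surrogate for a single token. The function name `ppo_token_loss` and its parameters are hypothetical and are not taken from verl. In an autograd framework the old log-probability would typically be a detached copy of the new one, so gradients still flow through the ratio; plain floats are used here only to show the numbers.

```python
import math

def ppo_token_loss(new_logp, old_logp, advantage, clip_eps=0.2, on_policy=False):
    """Clipped PPO surrogate loss for one token (a quantity to minimize).

    Hypothetical helper for illustration only; not verl's implementation.
    """
    if on_policy:
        # Assumption: rollout and update share the same weights, so the ratio
        # is 1 by definition. Reusing new_logp removes the spurious gap that
        # differing MoE expert routing introduces between the two passes.
        old_logp = new_logp
    ratio = math.exp(new_logp - old_logp)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    loss = -min(ratio * advantage, clipped * advantage)
    return loss, ratio

# Same policy, but the two passes disagree by 0.4 nats because of routing.
loss_mismatch, ratio_mismatch = ppo_token_loss(-1.0, -1.4, advantage=1.0)
loss_fixed, ratio_fixed = ppo_token_loss(-1.0, -1.4, advantage=1.0, on_policy=True)
print(ratio_mismatch, loss_mismatch)
print(ratio_fixed, loss_fixed)
```

Without the on-policy override, the mismatched ratio exceeds the clip range and the token is clipped even though no policy update has happened yet; with the override the ratio is exactly one and the token contributes its full advantage.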

Further investigation uncovered a training-inference mismatch, where optimizations in inference engines like vLLM and SGLang clashed with the numerical stability priorities of FSDP during training. A significant finding was the lack of attention sink support in the FlashAttention v2 implementation used by verl’s FSDP worker, and incomplete backward pass support in FlashAttention v3. Researchers adapted the forward pass from vLLM’s FlashAttention fork and implemented the necessary backward pass for attention sink parameters, leading to stable and improved reward curves across various tasks.
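To make the attention sink concrete, here is a small sketch for a single query row. It assumes the sink is a learned scalar logit that joins the softmax denominator but has no value vector of its own, so the remaining attention weights sum to less than one. The function names are hypothetical, and this is plain Python rather than the FlashAttention kernel code; it only illustrates why a backward pass needs a gradient for the sink parameter.

```python
import math

def softmax_with_sink(scores, sink_logit):
    """Attention weights for one query row with an extra sink logit.

    The sink takes part in normalization only; its probability mass is dropped.
    """
    m = max(max(scores), sink_logit)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    sink = math.exp(sink_logit - m)
    denom = sum(exps) + sink
    return [e / denom for e in exps]

def sink_logit_grad(scores, sink_logit, upstream):
    """Gradient of the loss with respect to the sink logit.

    upstream[i] is d(loss)/d(weights[i]). Since d(w_i)/d(sink) = -w_i * p_sink,
    the gradient is -p_sink * sum_i(w_i * upstream[i]).
    """
    weights = softmax_with_sink(scores, sink_logit)
    p_sink = 1.0 - sum(weights)  # probability mass absorbed by the sink
    return -p_sink * sum(w * g for w, g in zip(weights, upstream))

weights = softmax_with_sink([0.0, 0.0], 0.0)
print(weights, sum(weights))
print(sink_logit_grad([0.0, 0.0], 0.0, [1.0, 2.0]))
```

If the training kernel omits the sink while the inference kernel includes it, the two stacks produce different attention weights for the same inputs, which is one way the mismatch described above can arise.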

WHY IT MATTERS

The successful integration of GPT-OSS into agentic RL training signifies a substantial step forward for developing more capable and reliable AI systems. Agentic RL allows models to learn complex, multi-step decision-making processes directly from interaction with environments, moving beyond static dataset limitations. This is particularly relevant for applications requiring nuanced reasoning, tool coordination, and adaptation to evolving user intents, such as those found in professional services, recruitment, and knowledge management.

Overcoming the technical hurdles associated with GPT-OSS’s MoE architecture and attention mechanisms ensures that open-source models can participate in this advanced training paradigm. The stability and improved convergence demonstrated after the fixes indicate that GPT-OSS can serve as a robust backbone for building scalable agentic applications. This validation enhances the potential for broader adoption of agentic AI by making advanced training techniques accessible for powerful open-source models.

INDUSTRY IMPACT

The ability to train models like GPT-OSS with agentic RL has broad implications across various industries. For companies building AI agents for customer support, professional services, or complex workflow automation, this advancement means more sophisticated and adaptable systems. Agents can learn to refine queries, coordinate multiple tools, and execute multi-step tasks with greater efficacy, directly translating to enhanced user experience and operational efficiency.

In sectors like recruitment, education, and research, where AI agents assist with information retrieval and knowledge seeking, the improved stability and performance of agentic GPT-OSS models could lead to more intelligent and reliable assistants. The fixes for memory efficiency also make large-scale agentic training more feasible, potentially lowering the computational barrier for developing advanced AI capabilities.

ANALYSIS

The journey to enable agentic RL training for GPT-OSS highlights the intricate challenges inherent in adapting large, complex models, particularly those with Mixture of Experts (MoE) architectures, to advanced training paradigms. The initial instability observed (exploding gradients and non-improving rewards) underscored fundamental discrepancies between theoretical on-policy assumptions and practical implementation details in MoE models. The resolution, which involved enforcing the importance ratio to one, was not merely a patch but a critical re-alignment with the mathematical foundations of PPO, ensuring that policy updates were indeed based on the current policy’s data.

Beyond the MoE-specific issues, the deeper problem of training-inference mismatch points to a broader challenge in the AI development lifecycle. The divergence in how models are executed during high-throughput inference versus numerically stable training environments can inadvertently convert on-policy learning into an off-policy problem. The targeted fix for FlashAttention v3’s attention sink mechanism, coupled with the memory efficiency improvements, demonstrates the necessity of deep-level architectural understanding and modification to unlock the full potential of large language models in interactive learning settings. This work sets a precedent for how future open-source models with similar architectures might be adapted for agentic capabilities, emphasizing precision in low-level kernel implementation and distributed training strategies.
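One way to check whether a run has drifted off-policy in this sense is to compare the per-token log-probabilities reported by the inference engine with those recomputed by the training stack for the same sampled tokens. The sketch below is a hypothetical diagnostic, not part of verl, vLLM, or SGLang; the function name, the returned keys, and the default threshold of 0.2 are illustrative assumptions.

```python
import math

def mismatch_report(rollout_logps, train_logps, clip_eps=0.2):
    """Summarize disagreement between inference-time and training-time log-probs.

    Both lists hold log-probabilities of the same sampled tokens, in order.
    """
    if not rollout_logps or len(rollout_logps) != len(train_logps):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [t - r for r, t in zip(rollout_logps, train_logps)]
    ratios = [math.exp(d) for d in diffs]  # implied importance ratios
    n = len(diffs)
    return {
        "mean_abs_logp_diff": sum(abs(d) for d in diffs) / n,
        "max_ratio": max(ratios),
        # Share of tokens whose ratio already falls outside the clip range.
        "clip_fraction": sum(1 for r in ratios if abs(r - 1.0) > clip_eps) / n,
    }

report = mismatch_report([-1.0, -2.0, -0.5, -3.0], [-1.0, -2.0, -0.5, -2.5])
print(report)
```

With identical kernels all three numbers would sit at their ideal values (0, 1, and 0); persistent departures before any gradient step point to a training-inference discrepancy rather than to the RL algorithm itself.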

FUTURE IMPLICATIONS

Near-term (3-6 months), the refined training methodologies for GPT-OSS could lead to a proliferation of more stable and effective open-source agentic AI solutions. Developers will likely see increased adoption of GPT-OSS in frameworks requiring multi-step reasoning and tool use. Medium-term (1-2 years), these advancements are expected to influence the design of next-generation LLM architectures, pushing for better integration of attention mechanisms and MoE layers with RL training from the outset. Long-term (3-5 years), the lessons learned from debugging and optimizing agentic RL for GPT-OSS could contribute to a future where AI agents can autonomously tackle highly complex, real-world problems with minimal human intervention, fundamentally altering how professionals interact with digital tools and information.

ACTIONABLE INSIGHTS

  • Prioritize validation of core RL assumptions, such as the importance sampling ratio, especially when working with MoE architectures.
  • Investigate potential training-inference mismatches when encountering instability in RL training, as kernel differences can have significant impacts.
  • Ensure full compatibility and correct implementation of specialized architectural features, like attention sinks, across both forward and backward passes in distributed training.
  • Optimize memory usage in FSDP setups for large MoE models by carefully reviewing and patching how MoE experts are materialized during the forward passes used for log-probability computation.
  • Actively contribute to or monitor open-source frameworks like verl for updates and community-driven solutions to common RL training challenges.

What is agentic reinforcement learning (RL)?

Agentic RL extends traditional LLM training by optimizing an entire decision-making process through direct interaction with an environment, rather than just a single-turn response. It trains policies by collecting on-policy data as the agent plans, invokes tools, and adapts behavior over multi-step trajectories.
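A toy rollout loop can make the idea of a multi-step trajectory concrete. Everything in this sketch is a stand-in: `calculator` plays the role of a tool, `scripted_policy` replaces an LLM policy with fixed rules, and the reward is a simple correctness check given only at the end. It shows the shape of the data an agentic RL trainer would collect, not how any particular framework implements it.

```python
def calculator(a, op, b):
    """Hypothetical tool: adds or multiplies two integers."""
    return a + b if op == "+" else a * b

def scripted_policy(history):
    """Stand-in for an LLM policy: chooses the next action from the history."""
    if len(history) == 0:
        return ("tool", (3, "*", 4))
    if len(history) == 1:
        previous = history[-1]["observation"]
        return ("tool", (previous, "+", 5))
    return ("answer", history[-1]["observation"])

def rollout(policy, target, max_steps=5):
    """Collect one trajectory of (action, observation, reward) steps."""
    history = []
    for _ in range(max_steps):
        kind, payload = policy(history)
        if kind == "answer":
            # Reward arrives only at the end of the episode.
            reward = 1.0 if payload == target else 0.0
            history.append({"action": (kind, payload), "observation": None, "reward": reward})
            break
        observation = calculator(*payload)
        history.append({"action": (kind, payload), "observation": observation, "reward": 0.0})
    return history

trajectory = rollout(scripted_policy, target=17)
for step in trajectory:
    print(step)
```

The trainer would then assign credit across all steps of the trajectory, including the tool calls, rather than scoring a single response in isolation.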

What challenges did GPT-OSS face in agentic RL training?

Initial challenges included exploding KL divergence and entropy, non-increasing rewards, a log-probability mismatch in the MoE architecture that pushed the PPO importance ratio away from one and triggered clipping even when on-policy, and a training-inference mismatch due to differing attention kernel implementations between the inference and training stacks.

How were the log-probability mismatch and attention sink issues resolved?

The log-probability mismatch was resolved by enforcing the importance sampling ratio to one when on-policy. Attention sink support was implemented by adapting the forward pass from vLLM’s FlashAttention fork and developing a custom backward pass for FlashAttention v3.

What was the impact of the memory efficiency fix for GPT-OSS?

The memory efficiency fix addressed excessive memory allocation during FSDP forward passes by patching the Hugging Face Transformers implementation. This avoided repeated materialization of MoE experts, preventing out-of-memory errors and making large-scale training more feasible.

What kind of tasks did GPT-OSS show improved performance on after the fixes?

After applying the fixes, GPT-OSS-20B showed substantially faster convergence and steady reward improvement across single-turn RL on math reasoning (GSM8K), instruction following (VerifyIf), and multi-turn agentic RL with tool use (ReTool).

KEY TAKEAWAYS

  • Agentic RL training for GPT-OSS has been successfully enabled, moving beyond single-turn optimization to multi-step decision processes.
  • Critical debugging efforts addressed issues like log-probability mismatches in MoE architectures and training-inference inconsistencies related to attention mechanisms.
  • Specific fixes included enforcing an importance sampling ratio of one for PPO and implementing full attention sink support in FlashAttention v3.
  • Memory efficiency for GPT-OSS training under FSDP was significantly improved by optimizing MoE expert materialization.
  • The validated GPT-OSS-20B model now exhibits stable training and faster convergence on tasks ranging from math reasoning to multi-turn agentic tool use.