RESEARCH PAPER

MARL-MAPS: Dynamic Multi-Agent Reinforcement Learning for Optimized RAG

Kumar Priyam | National Institute of Technology Delhi | 2025

Multi-Agent RL | RAG Optimization | Dec-POMDP | LLM Systems | Information Retrieval

Abstract

Traditional Retrieval-Augmented Generation (RAG) pipelines suffer from a "Context Tax": LLM inputs are bloated with noisy, irrelevant documents, which raises latency and the risk of hallucination. Sequential one-way pipelines also prevent adaptive backtracking when early retrieval steps fail. This work formalizes the RAG process as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) driven by a learnable Orchestrator policy. We implement a Shared Global Working Memory (SGWM) to prevent context drift and establish a "Confidence as Currency" bi-directional negotiation protocol. Our approach reduces the over-search rate by 91% (from 27% to 2.3%), removing most unnecessary retrieval rounds, and achieves 42% faster average inference time while improving exact match and F1 scores.

Built Upon

MAPS: Multi-Agent Reinforcement Learning-based Portfolio Management System

Jinho Lee, Raehyun Kim, Seok-Won Yi, Jaewoo Kang

IJCAI 2020


Methodology

Dec-POMDP Formulation

Formalized the entire RAG process as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) in which each RAG module (query rewriter, retriever, selector, generator) operates as an autonomous agent that acts only on its own local observations.
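For reference, a Dec-POMDP is conventionally written as the tuple below. This is the standard textbook definition. Reading the agent set as the four RAG modules is an illustrative assumption, not notation taken from the report.

```latex
\mathcal{M} = \left\langle \mathcal{I},\ \mathcal{S},\ \{\mathcal{A}_i\}_{i \in \mathcal{I}},\ T,\ R,\ \{\Omega_i\}_{i \in \mathcal{I}},\ O,\ \gamma \right\rangle

% \mathcal{I}: set of agents (assumed here: query rewriter, retriever, selector, generator)
% \mathcal{S}: global states, never fully observed by any single agent
% \mathcal{A}_i, \Omega_i: action and observation sets of agent i
% T(s' \mid s, \mathbf{a}): transition probability under the joint action \mathbf{a} = (a_i)_{i \in \mathcal{I}}
% O(\mathbf{o} \mid s', \mathbf{a}): probability of the joint observation \mathbf{o} = (o_i)_{i \in \mathcal{I}}
% R(s, \mathbf{a}): reward shared by all agents; \gamma \in [0, 1): discount factor

\max_{\pi_1, \dots, \pi_{|\mathcal{I}|}} \ \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, \mathbf{a}_t) \right]
```

Each policy \pi_i conditions only on agent i's own action-observation history, which is what makes the decision-making decentralized.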

Learnable Orchestrator Policy

Implemented a trained RL policy that makes high-level coordination decisions: when to spawn sub-agents, which tools to delegate to, and how to aggregate outputs from multiple retrieval agents for optimal context assembly.
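One common way to parameterize such a policy is a softmax over a small set of discrete actions. The sketch below assumes that form. The action names, the two-element feature vector, and the linear scoring are all hypothetical, and the report may use a different architecture.

```python
import math
import random

# Hypothetical high-level actions, named after the decisions described above.
ACTIONS = ("spawn_sub_agent", "delegate_to_tool", "aggregate_outputs")


def action_probabilities(features, weights):
    """Softmax over one linear score per action (assumed policy form)."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(features, weights, rng):
    """Sample one action from the policy's distribution."""
    probs = action_probabilities(features, weights)
    return rng.choices(ACTIONS, weights=probs, k=1)[0]


# Illustrative inputs only: two made-up state features and untrained (zero) weights,
# so every action is equally likely until training adjusts the weights.
features = [0.2, 0.9]
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(action_probabilities(features, weights))
print(select_action(features, weights, random.Random(0)))
```

In a trained system the weights would be updated from rewards. That training loop is outside the scope of this sketch.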

Shared Global Working Memory (SGWM)

Centralized state representation allowing all agents to access a common pool of information, preventing redundant information gathering, eliminating context drift, and facilitating real-time coordination across the pipeline.
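A minimal sketch of the shared-memory idea, assuming documents are keyed by an identifier so that a second agent fetching the same document is detected and skipped. The class name, fields, and method are hypothetical and are not taken from the report.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SharedWorkingMemory:
    """Hypothetical common store that every agent reads from and writes to."""
    query: str
    documents: Dict[str, str] = field(default_factory=dict)    # doc_id -> text
    log: List[Tuple[str, str]] = field(default_factory=list)   # (agent, note)

    def add_document(self, agent: str, doc_id: str, text: str) -> bool:
        """Store a document once; return False if it was already present."""
        if doc_id in self.documents:
            self.log.append((agent, f"skipped duplicate {doc_id}"))
            return False
        self.documents[doc_id] = text
        self.log.append((agent, f"added {doc_id}"))
        return True


memory = SharedWorkingMemory(query="example question")
memory.add_document("retriever", "doc-1", "first passage")
memory.add_document("selector", "doc-1", "first passage")  # duplicate, not stored again
print(len(memory.documents), memory.log)
```

Because every agent sees the same query and document pool, a later agent works from the original question and does not depend on a paraphrase handed down the pipeline.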

"Confidence as Currency" Protocol

Bi-directional negotiation mechanism in which agents trade and spend confidence scores, weighting each contribution by the agent's certainty about the retrieved information. This enables adaptive backtracking when confidence drops below a threshold, preventing hallucination propagation.
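The two behaviours described here, confidence-weighted contributions and threshold-triggered backtracking, can be sketched as follows. The function names, the 0.5 default threshold, and the simple normalization rule are assumptions for illustration. The report's actual negotiation rules are not reproduced here.

```python
def normalize_weights(confidences):
    """Turn raw per-agent confidence scores into weights that sum to 1."""
    total = sum(confidences.values())
    if total <= 0:
        # No agent is confident: fall back to equal weights.
        return {name: 1.0 / len(confidences) for name in confidences}
    return {name: score / total for name, score in confidences.items()}


def next_action(stage_confidence, threshold=0.5):
    """Backtrack to an earlier stage when confidence falls below the threshold."""
    return "backtrack" if stage_confidence < threshold else "proceed"


# Illustrative scores only.
weights = normalize_weights({"retriever": 0.6, "selector": 0.2})
print(weights)
print(next_action(0.3), next_action(0.8))
```

The threshold acts as the gate between the forward pass and a retry. A low score sends control back to an earlier agent before the generator consumes weak evidence.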

RAG-DDR Integration

Differentiable Data Rewards optimize the RAG pipeline end-to-end using rollout-based reward collection and Direct Preference Optimization (DPO), allowing the system to learn from outcome quality rather than intermediate metrics.
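The sketch below shows the standard DPO loss for a single preference pair, together with one assumed way of building that pair from rollouts: take the highest-reward and lowest-reward outputs. The helper names, the `beta` default, and the pair-selection rule are illustrative assumptions. Log-probabilities are passed in as plain floats where a real system would take them from a model.

```python
import math


def make_preference_pair(rollouts):
    """Pick the highest- and lowest-reward outputs from (output, reward) rollouts."""
    chosen = max(rollouts, key=lambda r: r[1])
    rejected = min(rollouts, key=lambda r: r[1])
    return chosen[0], rejected[0]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Illustrative rollouts: (candidate output, final-answer reward).
rollouts = [("answer A", 0.2), ("answer B", 0.9), ("answer C", 0.5)]
print(make_preference_pair(rollouts))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the policy assigns relatively more probability to the chosen output than the frozen reference model does, which is how outcome quality, not an intermediate metric, drives the update.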

Key Results

91%

Over-Search Reduction

Over-search rate dropped from 27% to 2.3%, removing most unnecessary retrieval rounds
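The 91% headline is consistent with the two reported rates when read as a relative reduction. That reading is an assumption, since the page does not state it explicitly:

```latex
\frac{27\% - 2.3\%}{27\%} = \frac{24.7}{27} \approx 0.915 \quad \text{(about 91\%)}
```

In absolute terms the same change is a drop of 24.7 percentage points.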

42%

Inference Speedup

Faster average inference time through intelligent context pruning and confidence-based early stopping

Significant ↑

Exact Match

Marked improvement in exact match scores on benchmark datasets

Significant ↑

F1 Score

Substantial improvement in F1 scores demonstrating better precision-recall balance

Documents & Resources

Full Research Report (Google Docs)

Presentation Slides (Google Slides)

Base Paper, IJCAI 2020 (Google Drive)