MARL-MAPS: Dynamic Multi-Agent Reinforcement Learning for Optimized RAG
Abstract
Traditional Retrieval-Augmented Generation (RAG) pipelines suffer from a "Context Tax": they bloat LLM inputs with noisy, irrelevant documents, which raises latency and hallucination risk. Sequential one-way pipelines also prevent adaptive backtracking when early retrieval steps fail. This work formalizes the RAG process as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) driven by a learnable Orchestrator policy. We implement a Shared Global Working Memory (SGWM) to prevent context drift and establish a "Confidence as Currency" bi-directional negotiation protocol. Our approach cuts the over-search rate by 91% (from 27% to 2.3%), removing most unnecessary retrieval rounds, and achieves 42% faster average inference time while improving exact match and F1 scores.
MAPS: Multi-Agent Reinforcement Learning-based Portfolio Management System
Jinho Lee, Raehyun Kim, Seok-Won Yi, Jaewoo Kang
IJCAI 2020
Methodology
Dec-POMDP Formulation
Formalized the entire RAG process as a Decentralized Partially Observable Markov Decision Process where each RAG module (query rewriter, retriever, selector, generator) operates as an autonomous agent under partial observability, enabling decentralized decision-making with local observations.
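For orientation, the standard textbook Dec-POMDP tuple can be written out for the four modules named above. The source does not give its exact state, observation, or reward definitions, so the symbols below are the generic ones and their mapping onto the pipeline is an assumption.

```latex
% Generic Dec-POMDP tuple; the agent set is taken from the description above.
\[
\mathcal{M} = \langle \mathcal{I}, \mathcal{S}, \{\mathcal{A}_i\}_{i \in \mathcal{I}}, T, R, \{\Omega_i\}_{i \in \mathcal{I}}, O, \gamma \rangle,
\qquad
\mathcal{I} = \{\text{query rewriter}, \text{retriever}, \text{selector}, \text{generator}\}
\]
% Joint transition, joint observation, and shared team reward.
\[
T(s' \mid s, \mathbf{a}), \qquad O(\mathbf{o} \mid s', \mathbf{a}), \qquad R(s, \mathbf{a}) \in \mathbb{R}
\]
% Each agent acts on its own local observation history only.
\[
\max_{\pi_1, \dots, \pi_{|\mathcal{I}|}} \;
\mathbb{E}\left[ \sum_{t=0}^{H-1} \gamma^{t} R(s_t, \mathbf{a}_t) \right],
\qquad
a_{i,t} \sim \pi_i(\cdot \mid o_{i,0}, \dots, o_{i,t})
\]
```

Here $\mathbf{a}$ and $\mathbf{o}$ are the joint action and joint observation, $H$ is the horizon, and the single reward $R$ is shared by all agents, which is what makes the problem cooperative.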
Learnable Orchestrator Policy
Implemented a trained RL policy that makes high-level coordination decisions — when to spawn sub-agents, which tools to delegate to, and how to aggregate outputs from multiple retrieval agents for optimal context assembly.
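A minimal sketch of what such a policy could look like, assuming a linear softmax over a small discrete action set trained with a REINFORCE-style update. The action names, the feature vector, and the choice of REINFORCE are all illustrative assumptions; the source does not state the policy architecture or training algorithm.

```python
import math
import random
from typing import List, Sequence

# Hypothetical high-level actions; the real action space is not given in the source.
ACTIONS = ["spawn_retriever", "delegate_rewriter", "aggregate_and_generate"]


class OrchestratorPolicy:
    """Linear softmax policy over high-level coordination actions (illustrative)."""

    def __init__(self, n_features: int, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        # One weight row per action, initialised to zero (uniform policy).
        self.weights = [[0.0] * n_features for _ in ACTIONS]

    def probabilities(self, features: Sequence[float]) -> List[float]:
        logits = [sum(w * x for w, x in zip(row, features)) for row in self.weights]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def act(self, features: Sequence[float]) -> int:
        """Sample an action index from the current policy."""
        probs = self.probabilities(features)
        return self.rng.choices(range(len(ACTIONS)), weights=probs, k=1)[0]

    def reinforce_update(self, features: Sequence[float], action: int,
                         ret: float, lr: float = 0.1) -> None:
        """One REINFORCE step: w += lr * return * grad log pi(action | features)."""
        probs = self.probabilities(features)
        for a, row in enumerate(self.weights):
            indicator = 1.0 if a == action else 0.0
            for j, x in enumerate(features):
                row[j] += lr * ret * (indicator - probs[a]) * x
```

In use, `features` would summarise the current pipeline state (for example, how many retrieval rounds have run and the current confidence), and `ret` would be the episode return. A positive return makes the sampled action more likely in similar states.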
Shared Global Working Memory (SGWM)
Centralized state representation allowing all agents to access a common pool of information, preventing redundant information gathering, eliminating context drift, and facilitating real-time coordination across the pipeline.
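One way to picture the shared memory is as a store keyed by normalised query, which every agent reads before gathering information. The sketch below is hypothetical: the class and field names are invented for illustration, and the source does not describe the actual SGWM data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemoryEntry:
    author: str        # agent that wrote the entry (e.g. "retriever")
    content: str       # retrieved passage or intermediate result
    confidence: float  # the author's confidence in this entry


@dataclass
class SharedWorkingMemory:
    """Common pool of information visible to every agent (illustrative)."""

    original_question: str  # kept verbatim so agents can check for drift
    entries: Dict[str, List[MemoryEntry]] = field(default_factory=dict)

    @staticmethod
    def _key(query: str) -> str:
        # Normalise case and whitespace so trivially different queries collide.
        return " ".join(query.lower().split())

    def write(self, query: str, author: str, content: str, confidence: float) -> None:
        entry = MemoryEntry(author, content, confidence)
        self.entries.setdefault(self._key(query), []).append(entry)

    def read(self, query: str) -> List[MemoryEntry]:
        return list(self.entries.get(self._key(query), []))

    def needs_retrieval(self, query: str) -> bool:
        """True only if no agent has already gathered evidence for this query."""
        return self._key(query) not in self.entries
```

Checking `needs_retrieval` before every search is what avoids redundant gathering in this sketch. Keeping `original_question` alongside the entries gives each agent a fixed reference point against which a rewritten query can be compared.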
"Confidence as Currency" Protocol
Bi-directional negotiation mechanism where agents trade and spend confidence scores, weighting contributions by their certainty about the retrieved information. This enables adaptive backtracking when confidence drops below a threshold, preventing hallucination propagation.
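The two behaviours described above, confidence-weighted contributions and threshold-triggered backtracking, can be sketched as follows. This is a simplified illustration under assumed names (`Bid`, `negotiate`) and an assumed threshold of 0.5. It does not model the trading or spending of confidence, which the source does not specify in detail.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Bid:
    agent: str         # which agent is contributing
    payload: str       # the contribution (e.g. a retrieved passage)
    confidence: float  # the agent's certainty, in [0, 1]


def negotiate(bids: List[Bid], threshold: float = 0.5) -> Tuple[str, List[Tuple[str, float]]]:
    """Weight confident contributions, or signal a backtrack if none qualify.

    Returns ("accept", [(payload, weight), ...]) with weights summing to 1,
    or ("backtrack", []) when every bid falls below the threshold.
    """
    accepted = [b for b in bids if b.confidence >= threshold]
    if not accepted:
        # Nothing is trustworthy enough: ask upstream agents to retry.
        return "backtrack", []
    total = sum(b.confidence for b in accepted)
    return "accept", [(b.payload, b.confidence / total) for b in accepted]
```

A caller that receives `"backtrack"` would return control to an earlier stage, such as query rewriting, instead of passing weak evidence to the generator.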
RAG-DDR Integration
Differentiable Data Rewards optimize the RAG pipeline end-to-end using rollout-based reward collection and Direct Preference Optimization (DPO), allowing the system to learn from outcome quality rather than intermediate metrics.
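For reference, the standard per-example DPO loss is the negative log-sigmoid of a scaled log-probability margin between a preferred and a dispreferred output, each measured relative to a frozen reference model. The sketch below computes that loss and forms one preference pair from scored rollouts. Choosing the highest- and lowest-reward rollouts as the pair, and the value `beta=0.1`, are assumptions rather than details confirmed by the source.

```python
import math
from typing import List, Tuple


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pick_preference_pair(rollouts: List[Tuple[str, float]]) -> Tuple[str, str]:
    """Return (chosen, rejected) as the highest- and lowest-reward rollout outputs.

    Each rollout is (output_text, reward); the pairing rule is an assumption.
    """
    ranked = sorted(rollouts, key=lambda r: r[1])
    return ranked[-1][0], ranked[0][0]
```

The loss equals log 2 when the policy and reference agree, and falls below that once the policy assigns relatively more probability to the chosen output than the reference does.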
Key Results
Over-search rate dropped from 27% to 2.3% (a 91% relative reduction), removing most unnecessary retrieval rounds
42% faster average inference time through context pruning and confidence-based early stopping
Marked improvement in exact match scores on benchmark datasets
Substantial improvement in F1 scores demonstrating better precision-recall balance