AI Research

MARL-MAPS: Dynamic Multi-Agent RL for Optimized RAG

A decentralized reinforcement learning policy architecture that eliminates the RAG "Context Tax".

Python | Multi-Agent RL | RAG-DDR | Dec-POMDP | LangChain
// PROBLEM

Traditional Retrieval-Augmented Generation (RAG) pipelines suffer from a "Context Tax": noisy retrieved documents bloat the LLM input, which increases latency and raises the risk of hallucination. In addition, strict one-way sequential pipelines cannot backtrack adaptively when an early retrieval step fails.

// APPROACH

Formalized the RAG process as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) driven by a learnable Orchestrator policy. Implemented a Shared Global Working Memory (SGWM) to prevent context drift and established a "Confidence as Currency" bi-directional negotiation protocol.
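The control flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: `SharedMemory`, `run_pipeline`, the toy agents, and the fixed confidence threshold are all hypothetical names and choices, and the hard-coded "step back one agent" rule stands in for the learnable Orchestrator policy.

```python
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Hypothetical stand-in for the Shared Global Working Memory (SGWM)."""
    query: str
    documents: list = field(default_factory=list)
    trace: list = field(default_factory=list)  # (agent name, confidence) pairs


def run_pipeline(memory, agents, threshold=0.5, max_steps=10):
    """Run agents in order, stepping back one agent when confidence is low.

    Each agent is a (name, callable) pair. The callable reads and updates the
    shared memory and returns a confidence in [0, 1]. A fixed threshold is an
    assumption here; a learned policy would make this decision instead.
    """
    i = 0
    for _ in range(max_steps):
        if i >= len(agents):
            break
        name, act = agents[i]
        confidence = act(memory)
        memory.trace.append((name, confidence))
        if confidence < threshold and i > 0:
            i -= 1  # low confidence: hand control back to the upstream agent
        else:
            i += 1  # confident enough (or nothing upstream): move forward
    return memory


# Toy agents, purely illustrative.
def rewriter(mem):
    mem.query += " refined"
    return 0.9


def retriever(mem):
    # Pretend retrieval only succeeds once the query was refined twice.
    if mem.query.count("refined") < 2:
        return 0.2
    mem.documents.append("doc-1")
    return 0.9


def generator(mem):
    return 0.8


result = run_pipeline(
    SharedMemory(query="what is a Dec-POMDP?"),
    [("rewriter", rewriter), ("retriever", retriever), ("generator", generator)],
)
print([name for name, _ in result.trace])
# ['rewriter', 'retriever', 'rewriter', 'retriever', 'generator']
```

In this toy run the retriever's low confidence sends control back to the rewriter once, which is the kind of backtracking a one-way pipeline cannot perform. Because every agent reads and writes the same memory object, the second retrieval sees the refined query rather than a stale copy.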

// OUTCOME

Cut the over-search rate by 91% in relative terms (from 27% to 2.3%), removing most unnecessary retrieval rounds. Exact match and F1 scores improved significantly, and average inference time was 42% faster.
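The 91% figure is a relative reduction, not a drop in percentage points. A quick check of the arithmetic, using only the two rates quoted above (`relative_reduction` is a hypothetical helper name):

```python
def relative_reduction(before, after):
    """Fraction by which a rate fell, relative to its starting value."""
    return (before - after) / before


reduction = relative_reduction(27.0, 2.3)
print(f"{reduction:.1%}")  # 91.5%
```

In absolute terms the over-search rate fell by 24.7 percentage points.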

Key Technical Highlights

Formalized RAG as a Dec-POMDP: each module (query rewriter, retriever, selector, generator) acts as an autonomous agent under partial observability

Shared Global Working Memory (SGWM) provides a centralized state representation, preventing redundant information gathering and context drift

"Confidence as Currency" bi-directional negotiation protocol enables adaptive backtracking when confidence is low

RAG-DDR integration with Differentiable Data Rewards and DPO (Direct Preference Optimization) for end-to-end pipeline optimization

Over-search rate reduced by 91% (27% → 2.3%)

42% faster average inference time

Significant improvement in exact match and F1 scores
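For readers unfamiliar with the DPO step named in the highlights, the standard DPO objective for a single preference pair can be written in a few lines. This is a sketch of the published loss only. How this project builds its preference pairs from pipeline rewards is not described here, and the function name, argument names, and `beta=0.1` default are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model.
    """
    # How much more the policy prefers the chosen response than the
    # reference model does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: zero margin, so the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy favours the chosen response more than the reference: lower loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(baseline > improved)  # True
```

The loss falls as the policy widens its preference for the chosen output relative to the reference model, which is what lets each agent be tuned toward outputs that help the final answer.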

Kumar Priyam | Data Engineering & Full-Stack Developer