Helpful context:


Single-agent reinforcement learning rests on a clean assumption: the environment is stationary. Given a fixed MDP, Bellman’s equations have a unique solution, value iteration converges, and policy gradient methods can be shown to reach local optima. The agent learns; the world does not.

In a multi-agent system, this assumption collapses. Other agents are part of the environment - and they are learning too. From the perspective of any single agent, the transition and reward distributions shift as others update their policies. The environment is non-stationary by construction. Standard convergence guarantees that rely on a fixed MDP simply do not apply. This is not a technicality; it is a fundamental change in the nature of the problem.

The right framework is no longer the MDP but the stochastic game (also called a Markov game). The central questions become: what does it mean for agents to play optimally when optimality depends on what others do? How should cooperating agents share credit for a joint outcome? How should agents communicate, and what is worth communicating? These questions sit at the intersection of decision theory, game theory, and learning - and most of them remain partially open.


From Single Agent to Multi-Agent

A stochastic game is a tuple $(S, \{A_i\}{i=1}^n, P, \{R_i\}{i=1}^n, \gamma)$:

  • $S$ - shared state space.
  • $A_i$ - action space of agent $i$. The joint action space is $A = A_1 \times \cdots \times A_n$. An element $\mathbf{a} = (a_1, \ldots, a_n) \in A$ is a joint action.
  • $P: S \times A \times S \to [0,1]$ - transition function, now conditioned on the joint action: $P(s' \mid s, \mathbf{a})$.
  • $R_i: S \times A \to \mathbb{R}$ - reward function of agent $i$. Each agent has its own reward, in general.
  • $\gamma \in [0,1)$ - shared discount factor.

Each agent $i$ has a policy $\pi_i: S \to \Delta(A_i)$ mapping states to distributions over its own actions. The joint policy is $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)$. When $n = 1$, the stochastic game recovers the standard MDP exactly.

Non-stationarity. From agent $i$’s perspective, the effective environment is determined by the joint transition

$$P^{\boldsymbol{\pi}}(s' \mid s, a_i) = \sum_{a_{-i}} \left(\prod_{j \neq i} \pi_j(a_j \mid s)\right) P(s' \mid s, a_i, a_{-i}),$$

where $a_{-i} = (a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n)$. As other agents update their policies, $P^{\boldsymbol{\pi}}$ changes - so the MDP that agent $i$ is trying to solve is a moving target. This violates the stationarity requirement of standard Q-learning convergence proofs and most policy gradient analyses.


Cooperative vs Competitive vs Mixed

The structure of the reward functions determines the fundamental character of the interaction.

Fully cooperative (team problems). All agents share a single reward: $R_1 = R_2 = \cdots = R_n = R_{\text{joint}}$. There is no conflict of interest - the group optimizes a common objective. This might seem simple, but the central difficulty is credit assignment: given a team reward, how much did each agent contribute? Real examples include multi-robot warehouse logistics, networked sensor systems, and cooperative navigation.

Zero-sum. Rewards sum to zero at every step: $\sum_{i=1}^n R_i(s, \mathbf{a}) = 0$ for all $(s, \mathbf{a})$. What one agent gains, the others collectively lose. In two-player zero-sum games, minimax optimal policies exist and the minimax theorem guarantees a unique game value. Chess, Go, and heads-up poker are canonical examples. Zero-sum is the best-understood regime theoretically.

General-sum. Rewards are arbitrary - neither shared nor zero-sum. Agents may have partially aligned, partially conflicting interests. This describes most real-world multi-agent settings: drivers in traffic, firms in a market, negotiating parties. The general-sum case is the hardest: neither cooperative nor adversarial reasoning alone suffices, and the solution concept (Nash equilibrium) carries no guarantee of efficiency.


Nash Equilibrium

In single-agent RL, the optimal policy is defined by the Bellman optimality equations - it is the policy that maximizes the agent’s own expected return against a fixed environment. In multi-agent settings, what constitutes “optimal” depends on what others do. The standard solution concept is the Nash equilibrium.

A joint policy $\boldsymbol{\pi}^* = (\pi_1^, \ldots, \pi_n^)$ is a Nash equilibrium if no agent can increase its expected return by unilaterally changing its policy:

$$V_i(\pi_i^, \boldsymbol{\pi}_{-i}^) \geq V_i(\pi_i, \boldsymbol{\pi}_{-i}^*) \quad \forall \pi_i,\ \forall i,$$

where $\boldsymbol{\pi}{-i}^* = (\pi_1^*, \ldots, \pi{i-1}^, \pi_{i+1}^, \ldots, \pi_n^)$ denotes the policies of all agents other than $i$, and $V_i(\pi_i, \boldsymbol{\pi}_{-i}^)$ is agent $i$’s expected discounted return when it plays $\pi_i$ and everyone else plays $\boldsymbol{\pi}_{-i}^*$.

Nash’s theorem (1950): every finite normal-form game has at least one Nash equilibrium, possibly in mixed strategies. The proof uses Kakutani’s fixed-point theorem applied to the best-response correspondence.

Nash equilibria have significant pathologies in practice:

  1. Pareto suboptimality. The prisoner’s dilemma is the canonical example: the unique Nash equilibrium (mutual defection) is strictly worse for both players than mutual cooperation. Rational self-interest produces a collectively bad outcome.
  2. Multiplicity. Games can have many Nash equilibria, with no principled way to select among them. Equilibrium selection is an open problem.
  3. Computational hardness. Finding a Nash equilibrium in a general-sum game is PPAD-complete. There is no known polynomial-time algorithm, and the problem is believed to be computationally intractable in the worst case.

Convergence in MARL. Independent learners - agents each running their own RL algorithm while treating others as part of the environment - do not converge to Nash equilibria in general. Convergence is guaranteed in specific cases: two-player zero-sum games (where minimax-Q converges), and potential games, where the incentives of all agents are aligned with a single global potential function $\Phi: S \times A \to \mathbb{R}$ such that $R_i(s, \mathbf{a}) - R_i(s, \mathbf{a}') = \Phi(s, \mathbf{a}) - \Phi(s, \mathbf{a}')$ for any unilateral deviation by agent $i$.


Credit Assignment

Credit assignment is the central challenge of cooperative multi-agent RL. When agents share a reward, that reward is a function of every agent’s action; attributing it to individuals is non-trivial.

Global reward. Set $R_i = R_{\text{joint}}$ for all $i$. Simple to implement, but provides no signal differentiating useful actions from useless ones. In a team of $n$ agents, a single agent acting well cannot distinguish its contribution from the noise of others.

Difference reward. Define

$$D_i(s, \mathbf{a}) = R_{\text{joint}}(s, \mathbf{a}) - R_{\text{joint}}(s, \mathbf{a}_{-i}^0),$$

where $\mathbf{a}_{-i}^0$ is the joint action with agent $i$ replaced by a default (e.g., do-nothing) action. This measures how much agent $i$’s action changed the outcome. It is more informative than the global reward but requires evaluating a counterfactual, which may be expensive or ambiguous depending on the choice of default.

Shapley value. The Shapley value from cooperative game theory provides a uniquely axiomatically justified attribution:

$$\phi_i = \sum_{C \subseteq N \setminus \{i\}} \frac{|C|!(n - |C| - 1)!}{n!} \left[ v(C \cup \{i\}) - v(C) \right],$$

where $N = \{1, \ldots, n\}$ and $v(C)$ is the value (expected reward) achievable by coalition $C$. The Shapley value is the unique attribution satisfying efficiency, symmetry, linearity, and null-player axioms. Computing it exactly requires summing over all $2^n$ subsets, making it tractable only for small $n$.

Counterfactual baselines (COMA). Rather than replacing agent $i$ with a default, COMA constructs a policy-gradient update using a counterfactual baseline that marginalizes over agent $i$’s actions while holding others fixed:

$$A_i(s, \mathbf{a}) = Q(s, \mathbf{a}) - \sum_{a_i'} \pi_i(a_i' \mid s) Q(s, (a_i', \mathbf{a}_{-i})).$$

This is the advantage of agent $i$’s actual action relative to its average action, given what others are doing - a tractable approximation to the difference reward that can be computed with a centralized critic.


Communication in Multi-Agent Systems

When agents can only observe local information, coordination requires either implicit (behavioral) coordination or explicit communication.

Without communication. Each agent observes a local observation $o_i \subset s$ rather than the full state. Agents must infer the intentions, beliefs, and actions of others from their own limited view. In partially observable settings, this forces agents to develop implicit conventions through experience - they cannot simply tell each other what they know.

With communication. Agents exchange messages $m_i \in \mathcal{M}_i$ before or during decision-making. This converts a partially observable problem closer to a fully observable one. But communication raises its own questions: what should agents say? When should they say it? How compact must messages be?

Emergent communication. Rather than engineering a communication protocol, emergent communication approaches (DIAL, CommNet, TarMAC) treat the message space as a learned latent space. Agents learn both what to transmit and how to interpret received messages, jointly optimized for task performance. The resulting protocols are compact and task-relevant, but often uninterpretable to humans and brittle to changes in the environment.

The bandwidth bottleneck. In large systems, communicating full observations is infeasible. Learned communication must decide what is worth sending. Attention-based communication architectures address this by learning a soft weighting over potential messages, transmitting a weighted sum of available information rather than all of it.

Centralized training, decentralized execution (CTDE). This is the dominant paradigm in cooperative MARL. During training, a centralized module has access to the global state $s$, all agents' observations, and all agents' actions. This allows training value functions and critics that condition on joint information. During execution, each agent acts using only its local observation $o_i$. CTDE exploits the full information available in simulation without requiring a centralized controller at deployment - the key constraint in real distributed systems. QMIX, COMA, and MADDPG all instantiate this paradigm.


Scalability and the Curse of Dimensionality

The most immediate obstacle in multi-agent RL is scale. The joint action space has size $|A_1| \times \cdots \times |A_n|$, which grows exponentially in the number of agents. A team of 20 agents each with 5 discrete actions has a joint action space of $5^{20} \approx 10^{14}$. Representing and optimizing over this space directly is infeasible.

Factored value functions. VDN (Value Decomposition Networks) decomposes the joint Q-function additively:

$$Q_{\text{joint}}(s, \mathbf{a}) = \sum_{i=1}^n Q_i(o_i, a_i).$$

This is a strong structural assumption, but enables decentralized execution: each agent maximizes its own $Q_i$ independently. QMIX relaxes this to a monotone mixing:

$$Q_{\text{joint}}(s, \mathbf{a}) = f_{\text{mix}}\left(Q_1(o_1, a_1), \ldots, Q_n(o_n, a_n);\ s\right),$$

where $f_{\text{mix}}$ is a monotone function of the individual Q-values (weights constrained to be non-negative). This preserves the IGM (Individual-Global Max) property: $\operatorname{argmax}{\mathbf{a}} Q{\text{joint}} = (\operatorname{argmax}{a_1} Q_1, \ldots, \operatorname{argmax}{a_n} Q_n)$, so decentralized greedy actions are globally optimal under the factored approximation.

Mean field approximation. When agents are numerous and interact weakly, each agent’s influence on any single other agent is small. Mean field MARL replaces the effect of all other agents with a single aggregate quantity - the mean action $\bar{a} = \frac{1}{n-1}\sum_{j \neq i} a_j$ - reducing the $n$-agent interaction to a two-body problem between agent $i$ and the mean field. This reduces complexity from exponential to linear but requires the mean field assumption to hold approximately.

Attention-based methods. Transformer-style attention allows each agent to selectively aggregate information from a variable-size set of neighbors, weighting contributions by relevance. This handles heterogeneous agent populations and partial connectivity without assuming a fixed interaction structure.


Price of Anarchy

When agents act selfishly, the resulting Nash equilibrium may be far from the social optimum. The price of anarchy (PoA) quantifies this gap:

$$\text{PoA} = \frac{\text{worst Nash equilibrium cost}}{\text{optimal social cost}}$$

(Or the inverse ratio for maximization problems.) A PoA of 1 means selfish behavior is socially optimal. A PoA of 2 means the worst-case Nash equilibrium is twice as costly as the coordinated optimum.

Routing and Braess’s paradox. Selfish routing is the canonical example. Consider a network where agents choose routes to minimize their own travel time, and congestion increases travel time. The social optimum may require some agents to take longer routes so others can go faster. Self-interest prevents this coordination.

More striking: Braess’s paradox shows that adding capacity to a network can make everyone worse off. Consider two routes (up-down) from S to T, each with one high-latency fixed-cost link and one low-latency load-dependent link. All agents use a mix of routes: 30 min each. Now add a zero-cost express link between the midpoints. Every agent now has a dominant strategy to deviate to the faster path through the new link - ending up with 40 min each. The new link made everyone worse off. Several cities have removed road links and seen traffic improve: Seoul’s Cheonggyecheon highway removal, Stuttgart’s B14 closure.

PoA for load balancing. Jobs are assigned to machines selfishly (each job picks the least-loaded machine). With $m$ machines and $n$ jobs, the Nash equilibrium makespan (max load on any machine) is at most $4/3 - 1/(3m)$ times optimal. The bound of $4/3$ is tight. This means selfish scheduling is nearly as efficient as optimal centralized scheduling for this problem.

Smoothness framework. A cost-minimization game is $(\lambda, \mu)$-smooth if for any strategy profiles $s$ and $s^*$:

$$\sum_i C_i(s_i^*, s_{-i}) \leq \lambda \cdot OPT + \mu \cdot C(s)$$

Then $\text{PoA} \leq \lambda / (1 - \mu)$ for all coarse correlated equilibria (a broad solution concept containing Nash). Smoothness is a powerful technique because it gives PoA bounds that hold under uncertainty and in dynamic settings.

Price of stability. The PoA considers the worst Nash equilibrium. The price of stability (PoS) considers the best Nash equilibrium - the one a benevolent coordinator would guide agents toward. For network formation games, PoS is often O(log n) while PoA is unbounded: there exist good Nash equilibria even when bad ones also exist.


Summary

Concept Definition
Stochastic game MDP generalized to $n$ agents with individual rewards
Nash equilibrium No agent benefits from unilateral deviation
Cooperative MARL Shared reward; credit assignment is the key challenge
Zero-sum $\sum_i R_i = 0$; minimax optimal policies exist
CTDE Centralized training, decentralized execution
Credit assignment Attributing joint reward to individual agents
Emergent communication Agents learn messaging protocols end-to-end
Price of anarchy Ratio of worst Nash equilibrium cost to optimal social cost

Read next: