
Decentralized partially observable Markov decision process

From Wikipedia, the free encyclopedia
Model for coordination and decision-making among multiple agents

The decentralized partially observable Markov decision process (Dec-POMDP)[1][2] is a model for coordination and decision-making among multiple agents. It is a probabilistic model that can account for uncertainty in outcomes, sensors, and communication (i.e., costly, delayed, noisy or nonexistent communication).

It is a generalization of a Markov decision process (MDP) and a partially observable Markov decision process (POMDP) to consider multiple decentralized agents.[3]

Definition

Formal definition

A Dec-POMDP is a 7-tuple $(S,\{A_{i}\},T,R,\{\Omega _{i}\},O,\gamma)$, where

  • $S$ is a set of states,
  • $A_{i}$ is a set of actions for agent $i$, with $A=\times _{i}A_{i}$ the set of joint actions,
  • $T$ is a set of conditional transition probabilities between states, $T(s,a,s')=P(s'\mid s,a)$,
  • $R:S\times A\to \mathbb {R}$ is the reward function,
  • $\Omega _{i}$ is a set of observations for agent $i$, with $\Omega =\times _{i}\Omega _{i}$ the set of joint observations,
  • $O$ is a set of conditional observation probabilities $O(s',a,o)=P(o\mid s',a)$, and
  • $\gamma \in [0,1]$ is the discount factor.

At each time step, each agent takes an action $a_{i}\in A_{i}$, the state updates according to the transition function $T(s,a,s')$ (using the current state and the joint action), each agent receives an observation according to the observation function $O(s',a,o)$ (using the next state and the joint action), and a reward $R(s,a)$ is generated for the whole team. This process repeats until some given horizon (the finite-horizon case) or forever (the infinite-horizon case). The goal is to maximize the expected cumulative reward over these steps; in the infinite-horizon case, the discount factor $\gamma \in [0,1)$ keeps this sum finite.
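
These dynamics can be illustrated by simulating one episode under fixed decentralized policies. The sketch below assumes the DecPOMDP container from the previous example and hypothetical policy functions that map each agent's own observation history to an action; it is not drawn from the cited references.

    # Illustrative rollout of a Dec-POMDP under fixed per-agent policies.
    # Assumes the DecPOMDP container sketched above; policies[i] maps agent i's
    # observation history (a tuple) to an action. Names are hypothetical.
    import random
    from typing import Dict, Hashable

    def sample(dist: Dict[Hashable, float]) -> Hashable:
        """Draw one outcome from an {outcome: probability} dictionary."""
        outcomes, weights = zip(*dist.items())
        return random.choices(outcomes, weights=weights, k=1)[0]

    def rollout(model, policies, initial_state, horizon):
        """Return the discounted team reward of one simulated episode."""
        state = initial_state
        histories = [() for _ in policies]  # each agent sees only its own observations
        total, weight = 0.0, 1.0
        for _ in range(horizon):
            joint_action = tuple(pi(h) for pi, h in zip(policies, histories))
            total += weight * model.reward(state, joint_action)              # team reward R(s, a)
            next_state = sample(model.transition(state, joint_action))       # s' ~ T(s, a, .)
            joint_obs = sample(model.observation(next_state, joint_action))  # o ~ O(s', a, .)
            histories = [h + (o,) for h, o in zip(histories, joint_obs)]
            weight *= model.discount  # gamma^t weighting of future rewards
            state = next_state
        return total

Averaging such rollouts over many episodes estimates the expected cumulative reward that the agents' joint policy is meant to maximize.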

References

  1. ^ Bernstein, Daniel S.; Givan, Robert; Immerman, Neil; Zilberstein, Shlomo (November 2002). "The Complexity of Decentralized Control of Markov Decision Processes". Mathematics of Operations Research. 27 (4): 819–840. arXiv:1301.3836. doi:10.1287/moor.27.4.819.297. ISSN 0364-765X. S2CID 1195261.
  2. ^ Oliehoek, Frans A.; Amato, Christopher (2016). A Concise Introduction to Decentralized POMDPs (PDF). SpringerBriefs in Intelligent Systems. doi:10.1007/978-3-319-28929-8. ISBN 978-3-319-28927-4. S2CID 3263887.
  3. ^ Oliehoek, Frans A.; Amato, Christopher (3 June 2016). A Concise Introduction to Decentralized POMDPs. Springer. ISBN 978-3-319-28929-8.