
Decentralized partially observable Markov decision process

From Wikipedia, the free encyclopedia
Model for coordination and decision-making among multiple agents

The decentralized partially observable Markov decision process (Dec-POMDP)[1][2] is a model for coordination and decision-making among multiple agents. It is a probabilistic model that can account for uncertainty in outcomes, sensors, and communication (i.e., costly, delayed, noisy or nonexistent communication).

It is a generalization of a Markov decision process (MDP) and a partially observable Markov decision process (POMDP) to consider multiple decentralized agents.[3]

Definition

Formal definition

A Dec-POMDP is a 7-tuple $(S,\{A_{i}\},T,R,\{\Omega _{i}\},O,\gamma)$, where

  • $S$ is a set of states,
  • $A_{i}$ is a set of actions for agent $i$, with $A=\times _{i}A_{i}$ the set of joint actions,
  • $T$ is a set of conditional transition probabilities between states, $T(s,a,s')=P(s'\mid s,a)$,
  • $R:S\times A\to \mathbb {R}$ is the reward function,
  • $\Omega _{i}$ is a set of observations for agent $i$, with $\Omega =\times _{i}\Omega _{i}$ the set of joint observations,
  • $O$ is a set of conditional observation probabilities $O(s',a,o)=P(o\mid s',a)$, and
  • $\gamma \in [0,1]$ is the discount factor.

At each time step, each agent takes an action $a_{i}\in A_{i}$, the state updates according to the transition function $T(s,a,s')$ (using the current state and the joint action), each agent receives an observation according to the observation function $O(s',a,o)$ (using the next state and the joint action), and a reward $R(s,a)$ is generated for the whole team. This process repeats until some given horizon (the finite-horizon case) or forever (the infinite-horizon case). The goal is to maximize the expected cumulative reward over these steps; in the infinite-horizon case, the discount factor $\gamma \in [0,1)$ keeps this sum finite.
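
These dynamics can be illustrated by simulating one episode under fixed decentralized policies. The sketch below assumes the DecPOMDP container from the previous example and hypothetical policy functions that map each agent's own observation history to an action; it is not drawn from the cited references.

    # Illustrative rollout of a Dec-POMDP under fixed per-agent policies.
    # Assumes the DecPOMDP container sketched above; policies[i] maps agent i's
    # observation history (a tuple) to an action. Names are hypothetical.
    import random
    from typing import Dict, Hashable

    def sample(dist: Dict[Hashable, float]) -> Hashable:
        """Draw one outcome from an {outcome: probability} dictionary."""
        outcomes, weights = zip(*dist.items())
        return random.choices(outcomes, weights=weights, k=1)[0]

    def rollout(model, policies, initial_state, horizon):
        """Return the discounted team reward of one simulated episode."""
        state = initial_state
        histories = [() for _ in policies]  # each agent sees only its own observations
        total, weight = 0.0, 1.0
        for _ in range(horizon):
            joint_action = tuple(pi(h) for pi, h in zip(policies, histories))
            total += weight * model.reward(state, joint_action)              # team reward R(s, a)
            next_state = sample(model.transition(state, joint_action))       # s' ~ T(s, a, .)
            joint_obs = sample(model.observation(next_state, joint_action))  # o ~ O(s', a, .)
            histories = [h + (o,) for h, o in zip(histories, joint_obs)]
            weight *= model.discount  # gamma^t weighting of future rewards
            state = next_state
        return total

Averaging such rollouts over many episodes estimates the expected cumulative reward that the agents' joint policy is meant to maximize.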

References

  1. ^ Bernstein, Daniel S.; Givan, Robert; Immerman, Neil; Zilberstein, Shlomo (November 2002). "The Complexity of Decentralized Control of Markov Decision Processes". Mathematics of Operations Research. 27 (4): 819–840. arXiv:1301.3836. doi:10.1287/moor.27.4.819.297. ISSN 0364-765X. S2CID 1195261.
  2. ^ Oliehoek, Frans A.; Amato, Christopher (2016). A Concise Introduction to Decentralized POMDPs (PDF). SpringerBriefs in Intelligent Systems. doi:10.1007/978-3-319-28929-8. ISBN 978-3-319-28927-4. S2CID 3263887.
  3. ^ Oliehoek, Frans A.; Amato, Christopher (3 June 2016). A Concise Introduction to Decentralized POMDPs. Springer. ISBN 978-3-319-28929-8.