Actor-critic algorithm

From Wikipedia, the free encyclopedia
Reinforcement learning algorithms that combine policy and value estimation

The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy-based RL methods, such as policy gradient methods, with value-based RL methods, such as value iteration, Q-learning, SARSA, and TD learning.[1]

An AC algorithm consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function.[2] Some AC algorithms are on-policy and some are off-policy; some apply to discrete action spaces, some to continuous action spaces, and some to both.

Overview

Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, obtained by introducing a learned baseline (the critic) that reduces the variance of the policy gradient estimate.

Actor

The actor uses a policy function $\pi(a|s)$, while the critic estimates either the value function $V(s)$, the action-value function (Q-function) $Q(s,a)$, the advantage function $A(s,a)$, or some combination thereof.

The actor is a parameterized function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot\,|\,s)$.

If the action space is discrete, then $\sum_a \pi_\theta(a|s) = 1$. If the action space is continuous, then $\int_a \pi_\theta(a|s)\,da = 1$.
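
For a discrete action space, the actor can be implemented as any parameterized map from states to a normalized distribution over actions. Below is a minimal sketch assuming a linear-softmax parameterization; the feature dimension, number of actions, and random data are illustrative choices, not taken from the cited sources.

```python
# Minimal sketch of a discrete actor: a linear-softmax policy pi_theta(.|s).
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = rng.normal(scale=0.1, size=(n_actions, n_features))  # actor parameters

def policy(theta, s):
    """Return the distribution pi_theta(.|s) over the discrete actions."""
    logits = theta @ s
    logits -= logits.max()              # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()          # entries are nonnegative and sum to 1

s = rng.normal(size=n_features)          # an example state
probs = policy(theta, s)
action = rng.choice(n_actions, p=probs)  # sample an action A ~ pi_theta(.|s)
```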

The goal of policy optimization is to improve the actor, that is, to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$
where $\gamma$ is the discount factor, $r_t$ is the reward at step $t$, and $T$ is the time horizon (which may be infinite).
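
As a concrete reading of the objective, the discounted return of one sampled trajectory is $\sum_{t=0}^{T} \gamma^t r_t$, and $J(\theta)$ can be estimated by averaging this quantity over episodes sampled from $\pi_\theta$. A minimal sketch, with placeholder reward sequences standing in for sampled episodes:

```python
# Monte Carlo estimate of J(theta) from sampled episode rewards (placeholder data).
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

episodes = [[1.0, 0.0, 0.5], [0.0, 1.0, 1.0, 0.5]]  # rewards r_0..r_T per episode
gamma = 0.99
J_estimate = np.mean([discounted_return(rs, gamma) for rs in episodes])
```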

The goal of a policy gradient method is to optimize $J(\theta)$ by gradient ascent, following estimates of the policy gradient $\nabla_\theta J(\theta)$.

As detailed on the policy gradient method page, there are many unbiased estimators of the policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \le j \le T} \nabla_\theta \ln \pi_\theta(A_j|S_j) \cdot \Psi_j \,\Big|\, S_0 = s_0\right]$$
where $\Psi_j$ is a linear combination of the following terms:

  • $\sum_{0 \le i \le T} \gamma^i R_i$.
  • $\gamma^j \sum_{j \le i \le T} \gamma^{i-j} R_i$: the REINFORCE algorithm.
  • $\gamma^j \sum_{j \le i \le T} \gamma^{i-j} R_i - b(S_j)$: the REINFORCE with baseline algorithm. Here $b$ is an arbitrary function of the state.
  • $\gamma^j \left(R_j + \gamma V^{\pi_\theta}(S_{j+1}) - V^{\pi_\theta}(S_j)\right)$: TD(1) learning.
  • $\gamma^j Q^{\pi_\theta}(S_j, A_j)$.
  • $\gamma^j A^{\pi_\theta}(S_j, A_j)$: Advantage Actor-Critic (A2C);[3] a sketch of an actor update with this estimator follows this list.
  • $\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}(S_{j+2}) - V^{\pi_\theta}(S_j)\right)$: TD(2) learning.
  • $\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j)\right)$: TD(n) learning.
  • $\gamma^j \sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j)\right)$: TD(λ) learning, also known as GAE (generalized advantage estimation).[4] This is obtained as an exponentially decaying average of the TD(n) learning terms.
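
A minimal sketch of a gradient-ascent actor update using the advantage-based estimator $\gamma^j \nabla_\theta \ln \pi_\theta(A_j|S_j)\, A^{\pi_\theta}(S_j, A_j)$, written here with PyTorch. The network size, learning rate, and placeholder trajectory data are illustrative assumptions; the advantage values would normally come from the critic described in the next section.

```python
# Sketch of an A2C-style actor update: ascend sum_j gamma^j * log pi(A_j|S_j) * Psi_j.
import torch

n_features, n_actions, T = 4, 3, 5
actor = torch.nn.Linear(n_features, n_actions)             # logits of pi_theta(.|s)
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(T, n_features)                        # S_0..S_{T-1} (placeholder)
actions = torch.randint(n_actions, (T,))                    # A_0..A_{T-1} (placeholder)
advantages = torch.randn(T)                                 # Psi_j, e.g. from a critic
gamma = 0.99
discounts = gamma ** torch.arange(T, dtype=torch.float32)   # gamma^j factors

log_probs = torch.distributions.Categorical(logits=actor(states)).log_prob(actions)
loss = -(discounts * log_probs * advantages.detach()).sum()  # minus sign: gradient ascent
optimizer.zero_grad()
loss.backward()
optimizer.step()
```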

Critic

In the unbiased estimators given above, certain functions such as $V^{\pi_\theta}$, $Q^{\pi_\theta}$, and $A^{\pi_\theta}$ appear. These are approximated by the critic. Since these functions all depend on the actor, the critic must be learned alongside the actor. The critic is learned by value-based RL algorithms.

For example, if the critic is estimating the state-value function $V^{\pi_\theta}(s)$, then it can be learned by any value function approximation method. Let the critic be a function approximator $V_\phi(s)$ with parameters $\phi$.

The simplest example is TD(1) learning, which trains the critic to minimize the TD(1) error
$$\delta_i = R_i + \gamma V_\phi(S_{i+1}) - V_\phi(S_i).$$
The critic parameters are updated by gradient descent on the squared TD error:
$$\phi \leftarrow \phi - \alpha \nabla_\phi (\delta_i)^2 = \phi + \alpha \delta_i \nabla_\phi V_\phi(S_i)$$
where $\alpha$ is the learning rate. Note that the gradient is taken with respect to the $\phi$ in $V_\phi(S_i)$ only, since the $\phi$ in $\gamma V_\phi(S_{i+1})$ constitutes a moving target, and the gradient is not taken with respect to it. This is a common source of error in implementations that use automatic differentiation, and requires "stopping the gradient" at that point.
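
A minimal sketch of this semi-gradient update in PyTorch, for a single transition $(S_i, R_i, S_{i+1})$. The value network, learning rate, and transition data are illustrative; the essential point is the detach (stop-gradient) on the target $R_i + \gamma V_\phi(S_{i+1})$.

```python
# Sketch of the TD critic update with a stopped gradient on the bootstrap target.
import torch

n_features = 4
critic = torch.nn.Linear(n_features, 1)                   # V_phi(s)
optimizer = torch.optim.SGD(critic.parameters(), lr=1e-2)
gamma = 0.99

s = torch.randn(n_features)                               # S_i     (placeholder)
s_next = torch.randn(n_features)                          # S_{i+1} (placeholder)
r = torch.tensor(1.0)                                     # R_i     (placeholder)

td_target = r + gamma * critic(s_next).squeeze().detach()  # moving target, no gradient
td_error = td_target - critic(s).squeeze()
loss = td_error ** 2                                      # gradient descent on delta^2
optimizer.zero_grad()
loss.backward()                                           # gradient flows only through V_phi(S_i)
optimizer.step()
```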

Similarly, if the critic is estimating the action-value function $Q^{\pi_\theta}$, then it can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function, parameterized by $\phi$ and denoted $Q_\phi(s,a)$. The temporal difference error is calculated as
$$\delta_i = R_i + \gamma Q_\phi(S_{i+1}, A_{i+1}) - Q_\phi(S_i, A_i),$$
and the critic is then updated by
$$\phi \leftarrow \phi + \alpha \delta_i \nabla_\phi Q_\phi(S_i, A_i).$$

An advantage critic can be trained by learning both a Q-function $Q_\phi(s,a)$ and a state-value function $V_\phi(s)$ and then setting $A_\phi(s,a) = Q_\phi(s,a) - V_\phi(s)$. It is more common, however, to train only a state-value function $V_\phi(s)$ and estimate the advantage by[3]
$$A_\phi(S_i, A_i) \approx \sum_{j=0}^{n-1} \gamma^j R_{i+j} + \gamma^n V_\phi(S_{i+n}) - V_\phi(S_i).$$
Here, $n$ is a positive integer. The larger $n$ is, the lower the bias in the advantage estimate, but at the price of higher variance.
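
A minimal sketch of this n-step advantage estimate, assuming the reward segment $R_i, \dots, R_{i+n-1}$ and the value estimates $V_\phi(S_i)$ and $V_\phi(S_{i+n})$ are already available; the numbers below are placeholders.

```python
# n-step advantage: sum_j gamma^j R_{i+j} + gamma^n V(S_{i+n}) - V(S_i).
def n_step_advantage(rewards, v_start, v_end, gamma):
    """rewards = [R_i, ..., R_{i+n-1}], v_start = V_phi(S_i), v_end = V_phi(S_{i+n})."""
    n = len(rewards)
    n_step_return = sum(gamma**j * r for j, r in enumerate(rewards))
    return n_step_return + gamma**n * v_end - v_start

adv = n_step_advantage(rewards=[1.0, 0.0, 0.5], v_start=2.1, v_end=1.8, gamma=0.99)
```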

The Generalized Advantage Estimation (GAE) introduces a hyperparameter $\lambda$ that smoothly interpolates between Monte Carlo returns ($\lambda = 1$: high variance, no bias) and 1-step TD learning ($\lambda = 0$: low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns, with $\lambda$ being the decay strength.[4]
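
In practice, GAE for a trajectory is usually computed by a backward recursion over the one-step TD errors, $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$ with $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$. A minimal sketch with placeholder rewards and value estimates (the values list holds $V_\phi$ for each visited state plus the state reached after the last step):

```python
# Generalized Advantage Estimation via the backward recursion over TD errors.
def gae(rewards, values, gamma=0.99, lam=0.95):
    """lam=1 recovers Monte Carlo advantages; lam=0 recovers one-step TD errors."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

adv = gae(rewards=[1.0, 0.0, 0.5], values=[0.9, 0.8, 0.7, 0.0])
```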

Variants

  • Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.[3]
  • Soft Actor-Critic (SAC): Incorporates entropy maximization for improved exploration.[5]
  • Deep Deterministic Policy Gradient (DDPG): Specialized for continuous action spaces.[6]

References

  1. ^ Arulkumaran, Kai; Deisenroth, Marc Peter; Brundage, Miles; Bharath, Anil Anthony (November 2017). "Deep Reinforcement Learning: A Brief Survey". IEEE Signal Processing Magazine. 34 (6): 26–38. arXiv:1708.05866. Bibcode:2017ISPM...34...26A. doi:10.1109/MSP.2017.2743240. ISSN 1053-5888.
  2. ^ Konda, Vijay; Tsitsiklis, John (1999). "Actor-Critic Algorithms". Advances in Neural Information Processing Systems. 12. MIT Press.
  3. ^ a b c Mnih, Volodymyr; Badia, Adrià Puigdomènech; Mirza, Mehdi; Graves, Alex; Lillicrap, Timothy P.; Harley, Tim; Silver, David; Kavukcuoglu, Koray (16 June 2016), Asynchronous Methods for Deep Reinforcement Learning, arXiv:1602.01783
  4. ^ a b Schulman, John; Moritz, Philipp; Levine, Sergey; Jordan, Michael; Abbeel, Pieter (20 October 2018), High-Dimensional Continuous Control Using Generalized Advantage Estimation, arXiv:1506.02438
  5. ^ Haarnoja, Tuomas; Zhou, Aurick; Hartikainen, Kristian; Tucker, George; Ha, Sehoon; Tan, Jie; Kumar, Vikash; Zhu, Henry; Gupta, Abhishek (29 January 2019), Soft Actor-Critic Algorithms and Applications, arXiv:1812.05905
  6. ^ Lillicrap, Timothy P.; Hunt, Jonathan J.; Pritzel, Alexander; Heess, Nicolas; Erez, Tom; Tassa, Yuval; Silver, David; Wierstra, Daan (5 July 2019), Continuous control with deep reinforcement learning, arXiv:1509.02971