
Actor-critic algorithm

From Wikipedia, the free encyclopedia
Reinforcement learning algorithms that combine policy and value estimation

The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy-based RL algorithms, such as policy gradient methods, with value-based RL algorithms, such as value iteration, Q-learning, SARSA, and TD learning.[1]

An AC algorithm consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function.[2] Some AC algorithms are on-policy, while others are off-policy. Some apply only to discrete action spaces, some only to continuous ones, and some to both.

Overview

Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, obtained by introducing a learned baseline (the critic) that reduces the variance of the gradient estimate.

Actor

The actor uses a policy function $\pi(a|s)$, while the critic estimates either the value function $V(s)$, the action-value function (Q-function) $Q(s,a)$, the advantage function $A(s,a)$, or any combination thereof.

The actor is a parameterized function $\pi_{\theta}$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_{\theta}(\cdot|s)$.

If the action space is discrete, then $\sum_{a}\pi_{\theta}(a|s)=1$. If the action space is continuous, then $\int_{a}\pi_{\theta}(a|s)\,da=1$.
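
Below is a minimal sketch of such a parameterized actor for a discrete action space, written in PyTorch; the network architecture and the state/action dimensions are illustrative assumptions, not part of the algorithm.

```python
import torch
import torch.nn as nn

class DiscreteActor(nn.Module):
    """pi_theta: maps a state to a probability distribution over discrete actions."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(state)
        # The softmax inside Categorical guarantees sum_a pi_theta(a|s) = 1.
        return torch.distributions.Categorical(logits=logits)

actor = DiscreteActor(state_dim=4, n_actions=2)
dist = actor(torch.zeros(4))        # pi_theta(. | s) for a dummy state s
action = dist.sample()              # a ~ pi_theta(. | s)
log_prob = dist.log_prob(action)    # ln pi_theta(a | s), used later by the policy gradient
```

For a continuous action space, the final layer would instead parameterize a density, for example the mean and standard deviation of a Gaussian distribution.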

The goal of policy optimization is to improve the actor, that is, to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:

$$J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T}\gamma^{t}r_{t}\right]$$

where $\gamma$ is the discount factor, $r_{t}$ is the reward at step $t$, and $T$ is the time horizon (which can be infinite).
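
As a concrete illustration of the objective, the return of a single sampled episode can be computed as in the sketch below; averaging this quantity over many episodes gives a Monte Carlo estimate of $J(\theta)$. The reward values and $\gamma$ used here are arbitrary placeholders.

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{T} gamma^t * r_t for one episode."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# Example: an episode with rewards r_0=1, r_1=0, r_2=2 and gamma=0.99.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.99))  # 1.0 + 0.0 + 0.9801*2
```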

The goal of a policy gradient method is to optimize $J(\theta)$ by gradient ascent on the policy gradient $\nabla_{\theta}J(\theta)$.

As detailed on the policy gradient method page, there are many unbiased estimators of the policy gradient:

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\sum_{0\le j\le T}\nabla_{\theta}\ln\pi_{\theta}(A_{j}|S_{j})\cdot\Psi_{j}\,\Big|\,S_{0}=s_{0}\right]$$

where $\Psi_{j}$ is a linear sum of the following terms (a sketch of the resulting actor update follows the list):

  • $\sum_{0\le i\le T}\gamma^{i}R_{i}$.
  • $\gamma^{j}\sum_{j\le i\le T}\gamma^{i-j}R_{i}$: the REINFORCE algorithm.
  • $\gamma^{j}\sum_{j\le i\le T}\gamma^{i-j}R_{i}-b(S_{j})$: the REINFORCE with baseline algorithm. Here $b$ is an arbitrary function.
  • $\gamma^{j}\left(R_{j}+\gamma V^{\pi_{\theta}}(S_{j+1})-V^{\pi_{\theta}}(S_{j})\right)$: TD(1) learning.
  • $\gamma^{j}Q^{\pi_{\theta}}(S_{j},A_{j})$.
  • $\gamma^{j}A^{\pi_{\theta}}(S_{j},A_{j})$: Advantage Actor-Critic (A2C).[3]
  • $\gamma^{j}\left(R_{j}+\gamma R_{j+1}+\gamma^{2}V^{\pi_{\theta}}(S_{j+2})-V^{\pi_{\theta}}(S_{j})\right)$: TD(2) learning.
  • $\gamma^{j}\left(\sum_{k=0}^{n-1}\gamma^{k}R_{j+k}+\gamma^{n}V^{\pi_{\theta}}(S_{j+n})-V^{\pi_{\theta}}(S_{j})\right)$: TD(n) learning.
  • $\gamma^{j}\sum_{n=1}^{\infty}(1-\lambda)\lambda^{n-1}\left(\sum_{k=0}^{n-1}\gamma^{k}R_{j+k}+\gamma^{n}V^{\pi_{\theta}}(S_{j+n})-V^{\pi_{\theta}}(S_{j})\right)$: TD(λ) learning, also known as GAE (generalized advantage estimation).[4] This is obtained by an exponentially decaying average of the TD(n) learning terms.
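
Whichever form of $\Psi_{j}$ is chosen, the resulting actor update is the same: take a gradient-ascent step on $\ln\pi_{\theta}(A_{j}|S_{j})\cdot\Psi_{j}$ averaged over sampled transitions. The sketch below shows one such step in PyTorch, reusing the hypothetical DiscreteActor class from the earlier sketch and treating the $\Psi_{j}$ values as precomputed, fixed weights (random placeholders here); it is illustrative, not a reference implementation.

```python
import torch

actor = DiscreteActor(state_dim=4, n_actions=2)           # from the earlier sketch
optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)

# Illustrative batch of transitions; in practice these come from rollouts of pi_theta.
states = torch.randn(32, 4)            # S_j
actions = torch.randint(0, 2, (32,))   # A_j sampled from pi_theta(. | S_j)
psi = torch.randn(32)                  # Psi_j, e.g. advantage estimates from the critic

log_probs = actor(states).log_prob(actions)     # ln pi_theta(A_j | S_j)
loss = -(log_probs * psi.detach()).mean()       # minimizing -J(theta) performs gradient ascent on J
optimizer.zero_grad()
loss.backward()
optimizer.step()
```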

Critic

In the unbiased estimators given above, certain functions such as $V^{\pi_{\theta}},Q^{\pi_{\theta}},A^{\pi_{\theta}}$ appear. These are approximated by the critic. Since these functions all depend on the actor, the critic must be learned alongside the actor. The critic is learned by value-based RL algorithms.

For example, if the critic is estimating the state-value function $V^{\pi_{\theta}}(s)$, then it can be learned by any value function approximation method. Let the critic be a function approximator $V_{\phi}(s)$ with parameters $\phi$.

The simplest example is TD(1) learning, which trains the critic to minimize the TD(1) error:

$$\delta_{i}=R_{i}+\gamma V_{\phi}(S_{i+1})-V_{\phi}(S_{i})$$

The critic parameters are updated by gradient descent on the squared TD error:

$$\phi\leftarrow\phi-\alpha\nabla_{\phi}(\delta_{i})^{2}=\phi+\alpha\delta_{i}\nabla_{\phi}V_{\phi}(S_{i})$$

where $\alpha$ is the learning rate. Note that the gradient is taken with respect to the $\phi$ in $V_{\phi}(S_{i})$ only, since the $\phi$ in $\gamma V_{\phi}(S_{i+1})$ constitutes a moving target and the gradient is not taken with respect to it. This is a common source of error in implementations that use automatic differentiation, and requires "stopping the gradient" at that point.
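
The following is a minimal sketch of this update in PyTorch, assuming a small value network for $V_{\phi}$ and a single placeholder transition $(S_{i},R_{i},S_{i+1})$; torch.no_grad() plays the role of the "stop gradient" on the bootstrap target.

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # V_phi (illustrative)
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 0.99

s, r, s_next = torch.randn(4), torch.tensor(1.0), torch.randn(4)  # one transition (placeholder)

with torch.no_grad():                            # moving target: no gradient through V_phi(S_{i+1})
    target = r + gamma * value_net(s_next).squeeze()
delta = target - value_net(s).squeeze()          # TD(1) error delta_i
loss = delta.pow(2)                              # squared TD error
optimizer.zero_grad()
loss.backward()                                  # gradient flows only through V_phi(S_i)
optimizer.step()
```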

Similarly, if the critic is estimating the action-value function $Q^{\pi_{\theta}}$, then it can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function, parameterized by $\phi$ and denoted $Q_{\phi}(s,a)$. The temporal difference error is then calculated as

$$\delta_{i}=R_{i}+\gamma Q_{\phi}(S_{i+1},A_{i+1})-Q_{\phi}(S_{i},A_{i})$$

and the critic is updated by

$$\phi\leftarrow\phi+\alpha\delta_{i}\nabla_{\phi}Q_{\phi}(S_{i},A_{i})$$

An advantage critic can be obtained by training both a Q-function $Q_{\phi}(s,a)$ and a state-value function $V_{\phi}(s)$, then setting $A_{\phi}(s,a)=Q_{\phi}(s,a)-V_{\phi}(s)$. However, it is more common to train only a state-value function $V_{\phi}(s)$ and estimate the advantage by[3]

$$A_{\phi}(S_{i},A_{i})\approx\sum_{j=0}^{n-1}\gamma^{j}R_{i+j}+\gamma^{n}V_{\phi}(S_{i+n})-V_{\phi}(S_{i})$$

Here, $n$ is a positive integer. The higher $n$ is, the lower the bias in the advantage estimate, but at the price of higher variance.
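
A small sketch of this n-step advantage estimate follows, with the rewards and value predictions passed in as plain placeholder numbers; in practice the values would come from the critic $V_{\phi}$.

```python
def n_step_advantage(rewards, v_start, v_bootstrap, gamma):
    """rewards = [R_i, ..., R_{i+n-1}], v_start = V_phi(S_i), v_bootstrap = V_phi(S_{i+n})."""
    n = len(rewards)
    n_step_return = sum(gamma**j * r for j, r in enumerate(rewards))
    return n_step_return + gamma**n * v_bootstrap - v_start

# Example with n = 3 and placeholder numbers.
print(n_step_advantage([1.0, 0.0, 1.0], v_start=0.5, v_bootstrap=0.7, gamma=0.99))
```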

Generalized advantage estimation (GAE) introduces a hyperparameter $\lambda$ that smoothly interpolates between Monte Carlo returns ($\lambda=1$, high variance, no bias) and 1-step TD learning ($\lambda=0$, low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns, with $\lambda$ being the decay strength.[4]
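
Below is a sketch of GAE over one trajectory, using the standard backward recursion $A_{t}=\delta_{t}+\gamma\lambda A_{t+1}$ with $\delta_{t}=R_{t}+\gamma V(S_{t+1})-V(S_{t})$; the inputs are plain lists of placeholder numbers, and the list of values must include one extra bootstrap entry for the final state.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """values has length len(rewards) + 1; the last entry is the bootstrap value V(S_T)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # 1-step TD error
        running = delta + gamma * lam * running                  # exponentially decaying sum
        advantages[t] = running
    return advantages

# lam = 0 recovers the 1-step TD error; lam = 1 recovers the Monte Carlo return minus the baseline.
print(gae_advantages([1.0, 0.0, 1.0], values=[0.5, 0.4, 0.6, 0.2]))
```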

Variants

  • Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.[3]
  • Soft Actor-Critic (SAC): Incorporates entropy maximization for improved exploration.[5]
  • Deep Deterministic Policy Gradient (DDPG): Specialized for continuous action spaces.[6]

References

  1. ^ Arulkumaran, Kai; Deisenroth, Marc Peter; Brundage, Miles; Bharath, Anil Anthony (November 2017). "Deep Reinforcement Learning: A Brief Survey". IEEE Signal Processing Magazine. 34 (6): 26–38. arXiv:1708.05866. Bibcode:2017ISPM...34...26A. doi:10.1109/MSP.2017.2743240. ISSN 1053-5888.
  2. ^ Konda, Vijay; Tsitsiklis, John (1999). "Actor-Critic Algorithms". Advances in Neural Information Processing Systems. 12. MIT Press.
  3. ^ a b c Mnih, Volodymyr; Badia, Adrià Puigdomènech; Mirza, Mehdi; Graves, Alex; Lillicrap, Timothy P.; Harley, Tim; Silver, David; Kavukcuoglu, Koray (16 June 2016), Asynchronous Methods for Deep Reinforcement Learning, arXiv:1602.01783
  4. ^ a b Schulman, John; Moritz, Philipp; Levine, Sergey; Jordan, Michael; Abbeel, Pieter (20 October 2018), High-Dimensional Continuous Control Using Generalized Advantage Estimation, arXiv:1506.02438
  5. ^ Haarnoja, Tuomas; Zhou, Aurick; Hartikainen, Kristian; Tucker, George; Ha, Sehoon; Tan, Jie; Kumar, Vikash; Zhu, Henry; Gupta, Abhishek (29 January 2019), Soft Actor-Critic Algorithms and Applications, arXiv:1812.05905
  6. ^ Lillicrap, Timothy P.; Hunt, Jonathan J.; Pritzel, Alexander; Heess, Nicolas; Erez, Tom; Tassa, Yuval; Silver, David; Wierstra, Daan (5 July 2019), Continuous control with deep reinforcement learning, arXiv:1509.02971