Actor-critic algorithm
The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy-based RL algorithms such as policy gradient methods, and value-based RL algorithms such as value iteration, Q-learning, SARSA, and TD learning.[1]
An AC algorithm consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function.[2] Some AC algorithms are on-policy and others are off-policy; some apply only to discrete action spaces, some only to continuous action spaces, and some to both.
Overview
Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, obtained by introducing a baseline.
Actor
The actor uses a policy function {\displaystyle \pi (a|s)}, while the critic estimates either the value function {\displaystyle V(s)}, the action-value function (Q-function) {\displaystyle Q(s,a)}, the advantage function {\displaystyle A(s,a)}, or some combination of these.
The actor is a parameterized function {\displaystyle \pi _{\theta }}, where {\displaystyle \theta } are the parameters of the actor. The actor takes as argument the state of the environment {\displaystyle s} and produces a probability distribution {\displaystyle \pi _{\theta }(\cdot |s)}.
If the action space is discrete, then {\displaystyle \sum _{a}\pi _{\theta }(a|s)=1}. If the action space is continuous, then {\displaystyle \int _{a}\pi _{\theta }(a|s)da=1}.
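For a discrete action space, the actor is typically implemented as a neural network whose outputs are normalized into a probability distribution over actions. The following is a minimal sketch in Python using PyTorch; the architecture, layer sizes, and names are illustrative assumptions rather than part of any particular reference implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Parameterized policy pi_theta(a|s) for a discrete action space."""

    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # The softmax inside Categorical(logits=...) normalizes the outputs,
        # so the action probabilities sum to 1 as required above.
        return torch.distributions.Categorical(logits=self.net(state))

# Usage: dist = actor(state); action = dist.sample(); logp = dist.log_prob(action)
```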
The goal of policy optimization is to improve the actor, that is, to find some {\displaystyle \theta } that maximizes the expected episodic reward {\displaystyle J(\theta )}:
{\displaystyle J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\gamma ^{t}r_{t}\right]}
where {\displaystyle \gamma } is the discount factor, {\displaystyle r_{t}} is the reward at step {\displaystyle t}, and {\displaystyle T} is the time horizon (which may be infinite).
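In practice, the expectation defining {\displaystyle J(\theta )} is estimated by sampling episodes under the current policy and averaging their discounted returns. A minimal sketch of the per-episode quantity inside the expectation (assuming a plain Python list of rewards):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one sampled episode."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# Example: rewards [1, 1, 1] with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
```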
The goal of a policy gradient method is to optimize {\displaystyle J(\theta )} by gradient ascent on the policy gradient {\displaystyle \nabla _{\theta }J(\theta )}.
As detailed on the policy gradient method page, there are many unbiased estimators of the policy gradient:
{\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{0\leq j\leq T}\nabla _{\theta }\ln \pi _{\theta }(A_{j}|S_{j})\cdot \Psi _{j}{\Big |}S_{0}=s_{0}\right]}
where {\textstyle \Psi _{j}} is a linear sum of the following:
- {\textstyle \sum _{0\leq i\leq T}(\gamma ^{i}R_{i})}: the total discounted reward of the episode.
- {\textstyle \gamma ^{j}\sum _{j\leq i\leq T}(\gamma ^{i-j}R_{i})}: the REINFORCE algorithm.
- {\textstyle \gamma ^{j}\sum _{j\leq i\leq T}(\gamma ^{i-j}R_{i})-b(S_{j})}: the REINFORCE with baseline algorithm. Here {\displaystyle b} is an arbitrary function.
- {\textstyle \gamma ^{j}\left(R_{j}+\gamma V^{\pi _{\theta }}(S_{j+1})-V^{\pi _{\theta }}(S_{j})\right)}: TD(1) learning.
- {\textstyle \gamma ^{j}Q^{\pi _{\theta }}(S_{j},A_{j})}.
- {\textstyle \gamma ^{j}A^{\pi _{\theta }}(S_{j},A_{j})}: Advantage Actor-Critic (A2C).[3]
- {\textstyle \gamma ^{j}\left(R_{j}+\gamma R_{j+1}+\gamma ^{2}V^{\pi _{\theta }}(S_{j+2})-V^{\pi _{\theta }}(S_{j})\right)}: TD(2) learning.
- {\textstyle \gamma ^{j}\left(\sum _{k=0}^{n-1}\gamma ^{k}R_{j+k}+\gamma ^{n}V^{\pi _{\theta }}(S_{j+n})-V^{\pi _{\theta }}(S_{j})\right)}: TD(n) learning.
- {\textstyle \gamma ^{j}\sum _{n=1}^{\infty }(1-\lambda )\lambda ^{n-1}\left(\sum _{k=0}^{n-1}\gamma ^{k}R_{j+k}+\gamma ^{n}V^{\pi _{\theta }}(S_{j+n})-V^{\pi _{\theta }}(S_{j})\right)}: TD(λ) learning, also known as GAE (generalized advantage estimation).[4] This is an exponentially decaying average of the TD(n) learning terms, with weights {\textstyle (1-\lambda )\lambda ^{n-1}} that sum to one.
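For example, using the advantage form of {\textstyle \Psi _{j}} listed above (A2C), the actor is updated by stochastic gradient ascent on {\textstyle \sum _{j}\ln \pi _{\theta }(A_{j}|S_{j})\cdot \Psi _{j}}; the leading {\textstyle \gamma ^{j}} factor is often dropped in practice. A minimal sketch in PyTorch, assuming a policy like the Actor sketch above and a batch of precomputed advantage estimates (the optimizer and variable names are illustrative):

```python
import torch

def actor_update(actor, optimizer, states, actions, advantages):
    """One policy-gradient step; `advantages` is a tensor of Psi_j values."""
    dist = actor(states)           # pi_theta(.|s) for a batch of states
    logp = dist.log_prob(actions)  # ln pi_theta(A_j|S_j)
    # Gradient ascent on E[logp * Psi] is done as descent on its negation.
    # The advantages are treated as constants: no gradient flows through them.
    loss = -(logp * advantages.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```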
Critic
In the unbiased estimators given above, certain functions such as {\displaystyle V^{\pi _{\theta }},Q^{\pi _{\theta }},A^{\pi _{\theta }}} appear. These are approximated by the critic. Since these functions all depend on the actor, the critic must learn alongside the actor; it is trained by value-based RL algorithms.
For example, if the critic is estimating the state-value function {\displaystyle V^{\pi _{\theta }}(s)}, then it can be learned by any value function approximation method. Let the critic be a function approximator {\displaystyle V_{\phi }(s)} with parameters {\displaystyle \phi }.
The simplest example is TD(1) learning, which trains the critic to minimize the TD(1) error:
{\displaystyle \delta _{i}=R_{i}+\gamma V_{\phi }(S_{i+1})-V_{\phi }(S_{i})}
The critic parameters are updated by gradient descent on the squared TD error (absorbing a factor of 2 into the learning rate):
{\displaystyle \phi \leftarrow \phi -\alpha \nabla _{\phi }(\delta _{i})^{2}=\phi +\alpha \delta _{i}\nabla _{\phi }V_{\phi }(S_{i})}
where {\displaystyle \alpha } is the learning rate. Note that the gradient is taken with respect to the {\displaystyle \phi } in {\displaystyle V_{\phi }(S_{i})} only, since the {\displaystyle \phi } in {\displaystyle \gamma V_{\phi }(S_{i+1})} constitutes a moving target, and no gradient is taken with respect to it. This is a common source of error in implementations that use automatic differentiation, and it requires "stopping the gradient" at that term.
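A minimal sketch of this update for a neural-network critic, in PyTorch; the torch.no_grad() block implements the "stop gradient" on the bootstrapped target discussed above (function and variable names are illustrative):

```python
import torch

def critic_update(critic, optimizer, s, r, s_next, gamma=0.99):
    """One TD(1) step: gradient descent on the squared TD error."""
    with torch.no_grad():
        # Bootstrapped target R + gamma * V_phi(s'); gradients are stopped here,
        # since this term is the moving target described above.
        # (At a terminal transition the bootstrap term would be omitted.)
        target = r + gamma * critic(s_next)
    td_error = target - critic(s)  # delta_i, with gradient through V_phi(s) only
    loss = td_error.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```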
Similarly, if the critic is estimating the action-value function {\displaystyle Q^{\pi _{\theta }}}, then it can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function, parameterized by {\displaystyle \phi } and denoted {\displaystyle Q_{\phi }(s,a)}. The temporal difference error is calculated as
{\displaystyle \delta _{i}=R_{i}+\gamma Q_{\phi }(S_{i+1},A_{i+1})-Q_{\phi }(S_{i},A_{i})}
and the critic is then updated by
{\displaystyle \phi \leftarrow \phi +\alpha \delta _{i}\nabla _{\phi }Q_{\phi }(S_{i},A_{i})}
An advantage critic can be obtained by training both a Q-function {\displaystyle Q_{\phi }(s,a)} and a state-value function {\displaystyle V_{\phi }(s)}, then setting {\displaystyle A_{\phi }(s,a)=Q_{\phi }(s,a)-V_{\phi }(s)}. However, it is more common to train just a state-value function {\displaystyle V_{\phi }(s)} and to estimate the advantage by[3]
{\displaystyle A_{\phi }(S_{i},A_{i})\approx \sum _{j=0}^{n-1}\gamma ^{j}R_{i+j}+\gamma ^{n}V_{\phi }(S_{i+n})-V_{\phi }(S_{i})}
Here, {\displaystyle n} is a positive integer. The higher {\displaystyle n} is, the lower the bias of the advantage estimate, at the price of higher variance.
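A minimal sketch of this {\displaystyle n}-step advantage estimate, assuming lists of rewards and critic values recorded along one trajectory (a hypothetical helper; it assumes the trajectory extends at least {\displaystyle n} steps beyond index {\displaystyle i}):

```python
def n_step_advantage(rewards, values, i, n, gamma=0.99):
    """Estimate A(S_i, A_i) from n rewards and a bootstrapped value.

    rewards[k] holds R_k and values[k] holds V_phi(S_k) along one trajectory.
    """
    n_step_return = sum(gamma**j * rewards[i + j] for j in range(n))
    return n_step_return + gamma**n * values[i + n] - values[i]
```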
The Generalized Advantage Estimation (GAE) introduces a hyperparameter {\displaystyle \lambda } that smoothly interpolates between Monte Carlo returns ({\displaystyle \lambda =1}, high variance, no bias) and 1-step TD learning ({\displaystyle \lambda =0}, low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns with {\displaystyle \lambda } being the decay strength.[4]
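Equivalently, the GAE advantages satisfy the backward recursion {\displaystyle {\hat {A}}_{t}=\delta _{t}+\gamma \lambda {\hat {A}}_{t+1}}, where {\displaystyle \delta _{t}=R_{t}+\gamma V_{\phi }(S_{t+1})-V_{\phi }(S_{t})} is the one-step TD error.[4] A minimal sketch that computes the estimates backwards over one finite trajectory (variable names are illustrative):

```python
def generalized_advantage_estimates(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory; values has one extra entry for the final bootstrap."""
    T = len(rewards)
    advantages = [0.0] * T
    next_advantage = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = R_t + gamma * V(S_{t+1}) - V(S_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages
```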
Variants
- Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.[3]
- Soft Actor-Critic (SAC): Incorporates entropy maximization for improved exploration.[5]
- Deep Deterministic Policy Gradient (DDPG): Specialized for continuous action spaces.[6]
References
[edit ]- ^ Arulkumaran, Kai; Deisenroth, Marc Peter; Brundage, Miles; Bharath, Anil Anthony (November 2017). "Deep Reinforcement Learning: A Brief Survey". IEEE Signal Processing Magazine. 34 (6): 26–38. arXiv:1708.05866 . Bibcode:2017ISPM...34...26A. doi:10.1109/MSP.2017.2743240. ISSN 1053-5888.
- ^ Konda, Vijay; Tsitsiklis, John (1999). "Actor-Critic Algorithms". Advances in Neural Information Processing Systems. 12. MIT Press.
- ^ a b c Mnih, Volodymyr; Badia, Adrià Puigdomènech; Mirza, Mehdi; Graves, Alex; Lillicrap, Timothy P.; Harley, Tim; Silver, David; Kavukcuoglu, Koray (16 June 2016), Asynchronous Methods for Deep Reinforcement Learning, arXiv:1602.01783
- ^ a b Schulman, John; Moritz, Philipp; Levine, Sergey; Jordan, Michael; Abbeel, Pieter (20 October 2018), High-Dimensional Continuous Control Using Generalized Advantage Estimation, arXiv:1506.02438
- ^ Haarnoja, Tuomas; Zhou, Aurick; Hartikainen, Kristian; Tucker, George; Ha, Sehoon; Tan, Jie; Kumar, Vikash; Zhu, Henry; Gupta, Abhishek (29 January 2019), Soft Actor-Critic Algorithms and Applications, arXiv:1812.05905
- ^ Lillicrap, Timothy P.; Hunt, Jonathan J.; Pritzel, Alexander; Heess, Nicolas; Erez, Tom; Tassa, Yuval; Silver, David; Wierstra, Daan (5 July 2019), Continuous control with deep reinforcement learning, arXiv:1509.02971
- Konda, Vijay R.; Tsitsiklis, John N. (January 2003). "On Actor-Critic Algorithms" . SIAM Journal on Control and Optimization. 42 (4): 1143–1166. doi:10.1137/S0363012901385691. ISSN 0363-0129.
- Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement learning: an introduction. Adaptive computation and machine learning series (2 ed.). Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03924-6.
- Bertsekas, Dimitri P. (2019). Reinforcement learning and optimal control (2 ed.). Belmont, Massachusetts: Athena Scientific. ISBN 978-1-886529-39-7.
- Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning (1 ed.). Cham: Springer International Publishing. ISBN 978-3-031-00423-0.
- Grondman, Ivo; Busoniu, Lucian; Lopes, Gabriel A. D.; Babuska, Robert (November 2012). "A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients". IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews. 42 (6): 1291–1307. Bibcode:2012ITHMS..42.1291G. doi:10.1109/TSMCC.2012.2218595. ISSN 1094-6977.