|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Q Learning 介绍\n", |
| 8 | + "在增强学习中,有一种很有名的算法,叫做 q-learning,我们下面会从原理入手,然后通过一个简单的小例子讲一讲 q-learning。\n", |
| 9 | + "\n", |
| 10 | + "## q-learning 的原理\n", |
| 11 | + "我们使用一个简单的例子来导入 q-learning,假设一个屋子有 5 个房间,某一些房间之间相连,我们希望能够走出这个房间,示意图如下\n", |
| 12 | + "\n", |
| 13 | + "" |
| 14 | + ] |
| 15 | + }, |
| 16 | + { |
| 17 | + "cell_type": "markdown", |
| 18 | + "metadata": {}, |
| 19 | + "source": [ |
| 20 | + "那么我们可以将其简化成一些节点和图的形式,每个房间作为一个节点,两个房间有门相连,就在两个节点之间连接一条线,可以得到下面的图片\n", |
| 21 | + "\n", |
| 22 | + "" |
| 23 | + ] |
| 24 | + }, |
| 25 | + { |
| 26 | + "cell_type": "markdown", |
| 27 | + "metadata": {}, |
| 28 | + "source": [ |
| 29 | + "为了模拟整个过程,我们放置一个智能体在任何一个房间,希望它能够走出这个房间,也就是说希望其能够走到了 5 号节点。为了能够让智能体知道 5 号节点是目标房间,我们需要设置一些奖励,对于每一条边,我们都关联一个奖励值:直接连到目标房间的边的奖励值设置为 100,其他的边可以设置为 0,注意 5 号房间有一个指向自己的箭头,奖励值也设置为 100,其他直接指向 5 号房间的也设置为 100,这样当智能体到达 5 号房间之后,他就会选择一只待在 5 号房间,这也称为吸收目标,效果如下\n", |
| 30 | + "\n", |
| 31 | + "" |
| 32 | + ] |
| 33 | + }, |
| 34 | + { |
| 35 | + "cell_type": "markdown", |
| 36 | + "metadata": {}, |
| 37 | + "source": [ |
| 38 | + "想想一下智能体可以不断学习,每次我们将其放在其中一个房间,然后它可以不断探索,根据奖励值走到 5 号房间,也就是走出这个屋子。比如现在这个智能体在 2 号房间,我们希望其能够不断探索走到 5 号房间。\n", |
| 39 | + "\n", |
| 40 | + "### 状态和动作\n", |
| 41 | + "q-learning 中有两个重要的概念,一个是状态,一个是动作,我们将每一个房间都称为一个状态,而智能体从一个房间走到另外一个房间称为一个动作,对应于上面的图就是每个节点是一个状态,每一个箭头都是一种行动。假如智能体处在状态 4,从状态 4 其可以选择走到状态 0,或者状态 3 或者状态 5,如果其走到了状态 3,也可以选择走到状态 2 或者状态 1 或者 状态 4。\n", |
| 42 | + "\n", |
| 43 | + "我们可以根据状态和动作得到的奖励来建立一个奖励表,用 -1 表示相应节点之间没有边相连,而没有到达终点的边奖励都记为 0,如下\n", |
| 44 | + "\n", |
| 45 | + "" |
| 46 | + ] |
| 47 | + }, |
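| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "To make the reward table concrete, here is one possible way to write it down in NumPy, based only on the connectivity described above (rows are states, columns are the room an action moves to). The name `R_rooms` and the exact layout are choices made for this sketch, not part of the original text."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "import numpy as np\n",
| | + "\n",
| | + "# reward matrix for the room example: R_rooms[s, a] is the reward for moving\n",
| | + "# from room s to room a; -1 marks pairs of rooms with no connecting door\n",
| | + "R_rooms = np.array([[ -1, -1, -1, -1,  0,  -1],\n",
| | + "                    [ -1, -1, -1,  0, -1, 100],\n",
| | + "                    [ -1, -1, -1,  0, -1,  -1],\n",
| | + "                    [ -1,  0,  0, -1,  0,  -1],\n",
| | + "                    [  0, -1, -1,  0, -1, 100],\n",
| | + "                    [ -1,  0, -1, -1,  0, 100]])"
| | + ]
| | + },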
| 48 | + { |
| 49 | + "cell_type": "markdown", |
| 50 | + "metadata": {}, |
| 51 | + "source": [ |
| 52 | + "类似的,我们可以让智能体通过和环境的交互来不断学习环境中的知识,让智能体根据每个状态来估计每种行动可能得到的收益,这个矩阵被称为 Q 表,每一行表示状态,每一列表示不同的动作,对于状态未知的情景,我们可以随机让智能体从任何的位置出发,然后去探索新的环境来尽可能的得到所有的状态。刚开始智能体对于环境一无所知,所以数值全部初始化为 0,如下\n", |
| 53 | + "\n", |
| 54 | + "\n", |
| 55 | + "\n", |
| 56 | + "我们的智能体通过不断地学习来更新 Q 表中的结果,最后依据 Q 表中的值来做决策。" |
| 57 | + ] |
| 58 | + }, |
| 59 | + { |
| 60 | + "cell_type": "markdown", |
| 61 | + "metadata": {}, |
| 62 | + "source": [ |
| 63 | + "### Q-learning 算法\n", |
| 64 | + "有了奖励表和 Q 表,我们需要知道智能体是如何通过学习来更新 Q 表,以便最后能够根据 Q 表进行决策,这个时候就需要讲一讲 Q-learning 的算法。\n", |
| 65 | + "\n", |
| 66 | + "Q-learning 的算法特别简单,状态转移公式如下\n", |
| 67 | + "\n", |
| 68 | + "$$Q(s, a) = R(s, a) + \\gamma \\mathop{max}_{\\tilde{a}}\\{ Q(\\tilde{s}, \\tilde{a}) \\}$$\n", |
| 69 | + "\n", |
| 70 | + "其中 s, a 表示当前的状态和行动,$\\tilde{s}, \\tilde{a}$ 分别表示 s 采取 a 的动作之后的下一个状态和该状态对应所有的行动,参数 $\\gamma$ 是一个常数,0ドル \\leq \\gamma \\le 1 $表示对未来奖励的一个衰减程度,形象地比喻就是一个人对于未来的远见程度。\n", |
| 71 | + "\n", |
| 72 | + "解释一下就是智能体通过经验进行自主学习,不断从一个状态转移到另外一个状态进行探索,并在这个过程中不断更新 Q 表,直到到达目标位置,Q 表就像智能体的大脑,更新越多就越强。我们称智能体的每一次探索为 episode,每个 episode 都表示智能体从任意初始状态到达目标状态,当智能体到达一个目标状态,那么当前的 episode 结束,进入下一个 episode。" |
| 73 | + ] |
| 74 | + }, |
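| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "Below is a minimal sketch of this update rule as a plain Python function. The name `q_update` is made up for this sketch, and it takes the max over the whole next-state row of Q, which is fine in the room example because entries for invalid actions are never updated and stay at 0."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "import numpy as np\n",
| | + "\n",
| | + "def q_update(Q, R, s, a, s_next, gamma=0.8):\n",
| | + "    # one application of Q(s, a) = R(s, a) + gamma * max_a' Q(s_next, a')\n",
| | + "    Q[s, a] = R[s, a] + gamma * np.max(Q[s_next])\n",
| | + "    return Q"
| | + ]
| | + },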
| 75 | + { |
| 76 | + "cell_type": "markdown", |
| 77 | + "metadata": {}, |
| 78 | + "source": [ |
| 79 | + "下面给出 q-learning 的整个算法流程\n", |
| 80 | + "- step1 给定参数 $\\gamma$ 和奖励矩阵 R\n", |
| 81 | + "- step2 令 Q:= 0\n", |
| 82 | + "- step3 For each episode:\n", |
| 83 | + " - 3.1 随机选择一个初始状态 s\n", |
| 84 | + " - 3.2 若未到达目标状态,则执行以下几步\n", |
| 85 | + " - (1)在当前状态 s 的所有可能行动中选取一个行为 a\n", |
| 86 | + " - (2)利用选定的行为 a,得到下一个状态 $\\tilde{s}$\n", |
| 87 | + " - (3)按照前面的转移公式计算 Q(s, a)\n", |
| 88 | + " - (4)令 $s: = \\tilde{s}$" |
| 89 | + ] |
| 90 | + }, |
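| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "The list above maps almost line for line onto code. The sketch below runs the loop on the room example; it assumes the `R_rooms` matrix and the `q_update` helper from the earlier sketches have been run, and the 100 episodes and the uniformly random action choice are arbitrary choices for illustration."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "import numpy as np\n",
| | + "\n",
| | + "n_states = 6\n",
| | + "goal = 5\n",
| | + "gamma = 0.8\n",
| | + "Q = np.zeros((n_states, n_states))  # step 2: Q := 0\n",
| | + "\n",
| | + "for episode in range(100):  # step 3: loop over episodes\n",
| | + "    s = np.random.randint(n_states)  # 3.1 random initial state\n",
| | + "    while s != goal:  # 3.2 repeat until the goal is reached\n",
| | + "        valid = np.where(R_rooms[s] >= 0)[0]  # (1) actions with a door from s\n",
| | + "        a = np.random.choice(valid)  # pick one at random (pure exploration)\n",
| | + "        s_next = a  # (2) in this encoding the action index is the room we move to\n",
| | + "        Q = q_update(Q, R_rooms, s, a, s_next, gamma)  # (3) apply the update rule\n",
| | + "        s = s_next  # (4) move on"
| | + ]
| | + },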
| 91 | + { |
| 92 | + "cell_type": "markdown", |
| 93 | + "metadata": {}, |
| 94 | + "source": [ |
| 95 | + "### 单步演示\n", |
| 96 | + "为了更好地理解 q-learning,我们可以示例其中一步。\n", |
| 97 | + "\n", |
| 98 | + "首先选择 $\\gamma = 0.8,ドル初始状态为 1,Q 初始化为零矩阵\n", |
| 99 | + "\n", |
| 100 | + "\n" |
| 101 | + ] |
| 102 | + }, |
| 103 | + { |
| 104 | + "cell_type": "markdown", |
| 105 | + "metadata": {}, |
| 106 | + "source": [ |
| 107 | + "\n", |
| 108 | + "\n", |
| 109 | + "因为是状态 1,所以我们观察 R 矩阵的第二行,负数表示非法行为,所以下一个状态只有两种可能,走到状态 3 或者走到状态 5,随机地,我们可以选择走到状态 5。\n", |
| 110 | + "\n", |
| 111 | + "当我们走到状态 5 之后,会发生什么事情呢?观察 R 矩阵的第 6 行可以发现,其对应于三个可能采取的动作:转至状态 1,4 或者 5,根据上面的转移公式,我们有\n", |
| 112 | + "\n", |
| 113 | + "$$Q(1, 5) = R(1, 5) + 0.8 * max\\{Q(5, 1), Q(5, 4), Q(5, 5)\\} = 100 + 0.8 * max\\{0, 0, 0\\} = 100$$\n", |
| 114 | + "\n", |
| 115 | + "所以现在 Q 矩阵进行了更新,变为了\n", |
| 116 | + "\n", |
| 117 | + "\n", |
| 118 | + "\n", |
| 119 | + "现在我们的状态由 1 变成了 5,因为 5 是最终的目标状态,所以一次 episode 便完成了,进入下一个 episode。\n", |
| 120 | + "\n", |
| 121 | + "在下一个 episode 中又随机选择一个初始状态开始,不断更新 Q 矩阵,在经过了很多个 episode 之后,矩阵 Q 接近收敛,那么我们的智能体就学会了从任意状态转移到目标状态的最优路径。" |
| 122 | + ] |
| 123 | + }, |
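| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "As a quick sanity check, the hand computation above can be reproduced in code. This sketch again assumes the `R_rooms` matrix and the `q_update` helper from the earlier sketches have been run."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "import numpy as np\n",
| | + "\n",
| | + "Q = np.zeros((6, 6))  # Q initialised to the zero matrix\n",
| | + "Q = q_update(Q, R_rooms, s=1, a=5, s_next=5, gamma=0.8)\n",
| | + "print(Q[1, 5])  # expected: 100.0, i.e. 100 + 0.8 * max{0, 0, 0}"
| | + ]
| | + },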
| 124 | + { |
| 125 | + "cell_type": "markdown", |
| 126 | + "metadata": {}, |
| 127 | + "source": [ |
| 128 | + "从上面的原理,我们知道了 q-learning 最重要的状态转移公式,这个公式也叫做 Bellman Equation,通过这个公式我们能够不断地进行更新 Q 矩阵,最后得到一个收敛的 Q 矩阵。\n", |
| 129 | + "\n", |
| 130 | + "下面我们通过代码来实现这个过程\n", |
| 131 | + "\n", |
| 132 | + "我们定义一个简单的走迷宫过程,也就是\n", |
| 133 | + "\n", |
| 134 | + "" |
| 135 | + ] |
| 136 | + }, |
| 137 | + { |
| 138 | + "cell_type": "markdown", |
| 139 | + "metadata": {}, |
| 140 | + "source": [ |
| 141 | + "初始位置随机在 state 0, state 1 和 state 2 上,然后希望智能体能够走到 state 3 获得宝藏,上面可行的行动路线已经用箭头标注了" |
| 142 | + ] |
| 143 | + }, |
| 144 | + { |
| 145 | + "cell_type": "code", |
| 146 | + "execution_count": 1, |
| 147 | + "metadata": { |
| 148 | + "collapsed": true |
| 149 | + }, |
| 150 | + "outputs": [], |
| 151 | + "source": [ |
| 152 | + "import numpy as np\n", |
| 153 | + "import random" |
| 154 | + ] |
| 155 | + }, |
| 156 | + { |
| 157 | + "cell_type": "markdown", |
| 158 | + "metadata": {}, |
| 159 | + "source": [ |
| 160 | + "下面定义奖励矩阵,一共是 4 行,5 列,每一行分别表示 state 0 到 state 3 这四个状态,每一列分别表示上下左右和静止 5 种状态,奖励矩阵中的 0 表示不可行的路线,比如第一个行,上走和左走都是不可行的路线,都用 0 表示,向下走会走到陷阱,所以使用 -10 表示奖励,向右走和静止都给与 -1 的奖励,因为既没有触发陷阱,也没有到达宝藏,但是过程中浪费了时间。" |
| 161 | + ] |
| 162 | + }, |
| 163 | + { |
| 164 | + "cell_type": "code", |
| 165 | + "execution_count": 2, |
| 166 | + "metadata": { |
| 167 | + "collapsed": true |
| 168 | + }, |
| 169 | + "outputs": [], |
| 170 | + "source": [ |
| 171 | + "reward = np.array([[0, -10, 0, -1, -1],\n", |
| 172 | + " [0, 10, -1, 0, -1],\n", |
| 173 | + " [-1, 0, 0, 10, -10],\n", |
| 174 | + " [-1, 0, -10, 0, 10]])" |
| 175 | + ] |
| 176 | + }, |
| 177 | + { |
| 178 | + "cell_type": "markdown", |
| 179 | + "metadata": {}, |
| 180 | + "source": [ |
| 181 | + "接下来定义一个初始化为 0 的 q 矩阵" |
| 182 | + ] |
| 183 | + }, |
| 184 | + { |
| 185 | + "cell_type": "code", |
| 186 | + "execution_count": 3, |
| 187 | + "metadata": { |
| 188 | + "collapsed": true |
| 189 | + }, |
| 190 | + "outputs": [], |
| 191 | + "source": [ |
| 192 | + "q_matrix = np.zeros((4, 5))" |
| 193 | + ] |
| 194 | + }, |
| 195 | + { |
| 196 | + "cell_type": "markdown", |
| 197 | + "metadata": {}, |
| 198 | + "source": [ |
| 199 | + "然后定义一个转移矩阵,也就是从一个状态,采取一个可行的动作之后到达的状态,因为这里的状态和动作都是有限的,所以我们可以将他们存下来,比如第一行表示 state 0,向上和向左都是不可行的路线,所以给 -1 的值表示,向下走到达了 state 2,所以第二个值为 2,向右走到达了 state 1,所以第四个值是 1,保持不同还是在 state 0,所以最后一个标注为 0,另外几行类似。" |
| 200 | + ] |
| 201 | + }, |
| 202 | + { |
| 203 | + "cell_type": "code", |
| 204 | + "execution_count": 7, |
| 205 | + "metadata": { |
| 206 | + "collapsed": true |
| 207 | + }, |
| 208 | + "outputs": [], |
| 209 | + "source": [ |
| 210 | + "transition_matrix = np.array([[-1, 2, -1, 1, 0],\n", |
| 211 | + " [-1, 3, 0, -1, 1],\n", |
| 212 | + " [0, -1, -1, 3, 2],\n", |
| 213 | + " [1, -1, 2, -1, 3]])" |
| 214 | + ] |
| 215 | + }, |
| 216 | + { |
| 217 | + "cell_type": "markdown", |
| 218 | + "metadata": {}, |
| 219 | + "source": [ |
| 220 | + "最后定义每个状态的有效行动,比如 state 0 的有效行动就是下、右和静止,对应于 1,3 和 4" |
| 221 | + ] |
| 222 | + }, |
| 223 | + { |
| 224 | + "cell_type": "code", |
| 225 | + "execution_count": 8, |
| 226 | + "metadata": { |
| 227 | + "collapsed": true |
| 228 | + }, |
| 229 | + "outputs": [], |
| 230 | + "source": [ |
| 231 | + "valid_actions = np.array([[1, 3, 4],\n", |
| 232 | + " [1, 2, 4],\n", |
| 233 | + " [0, 3, 4],\n", |
| 234 | + " [0, 2, 4]])" |
| 235 | + ] |
| 236 | + }, |
| 237 | + { |
| 238 | + "cell_type": "code", |
| 239 | + "execution_count": 9, |
| 240 | + "metadata": { |
| 241 | + "collapsed": true |
| 242 | + }, |
| 243 | + "outputs": [], |
| 244 | + "source": [ |
| 245 | + "# 定义 bellman equation 中的 gamma\n", |
| 246 | + "gamma = 0.8" |
| 247 | + ] |
| 248 | + }, |
| 249 | + { |
| 250 | + "cell_type": "markdown", |
| 251 | + "metadata": {}, |
| 252 | + "source": [ |
| 253 | + "最后开始让智能体与环境交互,不断地使用 bellman 方程来更新 q 矩阵,我们跑 10 个 episode" |
| 254 | + ] |
| 255 | + }, |
| 256 | + { |
| 257 | + "cell_type": "code", |
| 258 | + "execution_count": 10, |
| 259 | + "metadata": {}, |
| 260 | + "outputs": [ |
| 261 | + { |
| 262 | + "name": "stdout", |
| 263 | + "output_type": "stream", |
| 264 | + "text": [ |
| 265 | + "episode: 0, q matrix: \n", |
| 266 | + "[[ 0. 0. 0. -1. -1.]\n", |
| 267 | + " [ 0. 10. -1. 0. -1.]\n", |
| 268 | + " [ 0. 0. 0. 0. 0.]\n", |
| 269 | + " [ 0. 0. 0. 0. 0.]]\n", |
| 270 | + "\n", |
| 271 | + "episode: 1, q matrix: \n", |
| 272 | + "[[ 0. 0. 0. -1. -1.]\n", |
| 273 | + " [ 0. 10. -1. 0. -1.]\n", |
| 274 | + " [ 0. 0. 0. 10. 0.]\n", |
| 275 | + " [ 0. 0. 0. 0. 0.]]\n", |
| 276 | + "\n", |
| 277 | + "episode: 2, q matrix: \n", |
| 278 | + "[[ 0. -2. 0. 7. 4.6]\n", |
| 279 | + " [ 0. 10. 4.6 0. 7. ]\n", |
| 280 | + " [ -1.8 0. 0. 10. -2. ]\n", |
| 281 | + " [ 0. 0. 0. 0. 0. ]]\n", |
| 282 | + "\n", |
| 283 | + "episode: 3, q matrix: \n", |
| 284 | + "[[ 0. -2. 0. 7. 4.6]\n", |
| 285 | + " [ 0. 10. 4.6 0. 7. ]\n", |
| 286 | + " [ 4.6 0. 0. 10. -2. ]\n", |
| 287 | + " [ 0. 0. 0. 0. 0. ]]\n", |
| 288 | + "\n", |
| 289 | + "episode: 4, q matrix: \n", |
| 290 | + "[[ 0. -2. 0. 7. 4.6]\n", |
| 291 | + " [ 0. 10. 4.6 0. 7. ]\n", |
| 292 | + " [ 4.6 0. 0. 10. -2. ]\n", |
| 293 | + " [ 0. 0. 0. 0. 0. ]]\n", |
| 294 | + "\n", |
| 295 | + "episode: 5, q matrix: \n", |
| 296 | + "[[ 0. -2. 0. 7. 4.6]\n", |
| 297 | + " [ 0. 10. 4.6 0. 7. ]\n", |
| 298 | + " [ 4.6 0. 0. 10. -2. ]\n", |
| 299 | + " [ 0. 0. 0. 0. 0. ]]\n", |
| 300 | + "\n", |
| 301 | + "episode: 6, q matrix: \n", |
| 302 | + "[[ 0. -2. 0. 7. 4.6]\n", |
| 303 | + " [ 0. 10. 4.6 0. 7. ]\n", |
| 304 | + " [ 4.6 0. 0. 10. -2. ]\n", |
| 305 | + " [ 0. 0. 0. 0. 0. ]]\n", |
| 306 | + "\n", |
| 307 | + "episode: 7, q matrix: \n", |
| 308 | + "[[ 0. -2. 0. 7. 4.6]\n", |
| 309 | + " [ 0. 10. 4.6 0. 7. ]\n", |
| 310 | + " [ 4.6 0. 0. 10. -2. ]\n", |
| 311 | + " [ 0. 0. 0. 0. 0. ]]\n", |
| 312 | + "\n", |
| 313 | + "episode: 8, q matrix: \n", |
| 314 | + "[[ 0. -2. 0. 7. 4.6]\n", |
| 315 | + " [ 0. 10. 4.6 0. 7. ]\n", |
| 316 | + " [ 4.6 0. 0. 10. -2. ]\n", |
| 317 | + " [ 0. 0. 0. 0. 0. ]]\n", |
| 318 | + "\n", |
| 319 | + "episode: 9, q matrix: \n", |
| 320 | + "[[ 0. -2. 0. 7. 4.6]\n", |
| 321 | + " [ 0. 10. 4.6 0. 7. ]\n", |
| 322 | + " [ 4.6 0. 0. 10. -2. ]\n", |
| 323 | + " [ 0. 0. 0. 0. 0. ]]\n", |
| 324 | + "\n" |
| 325 | + ] |
| 326 | + } |
| 327 | + ], |
| 328 | + "source": [ |
| 329 | + "for i in range(10):\n", |
| 330 | + " start_state = np.random.choice([0, 1, 2], size=1)[0] # 随机初始起点\n", |
| 331 | + " current_state = start_state\n", |
| 332 | + " while current_state != 3: # 判断是否到达终点\n", |
| 333 | + " action = random.choice(valid_actions[current_state]) # greedy 随机选择当前状态下的有效动作\n", |
| 334 | + " next_state = transition_matrix[current_state][action] # 通过选择的动作得到下一个状态\n", |
| 335 | + " future_rewards = []\n", |
| 336 | + " for action_nxt in valid_actions[next_state]:\n", |
| 337 | + " future_rewards.append(q_matrix[next_state][action_nxt]) # 得到下一个状态所有可能动作的奖励\n", |
| 338 | + " q_state = reward[current_state][action] + gamma * max(future_rewards) # bellman equation\n", |
| 339 | + " q_matrix[current_state][action] = q_state # 更新 q 矩阵\n", |
| 340 | + " current_state = next_state # 将下一个状态变成当前状态\n", |
| 341 | + " \n", |
| 342 | + " print('episode: {}, q matrix: \\n{}'.format(i, q_matrix))\n", |
| 343 | + " print()" |
| 344 | + ] |
| 345 | + }, |
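| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "Once training has finished, the agent acts by picking, in each state, the valid action with the largest q value. The sketch below reads that greedy policy off the learned `q_matrix`; the `action_names` list is only added here for readability and is not part of the original code."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "# read the greedy policy off the learned q matrix\n",
| | + "action_names = ['up', 'down', 'left', 'right', 'stay']  # column order used above\n",
| | + "for state in range(3):  # state 3 is the terminal treasure state\n",
| | + "    qs = [q_matrix[state][a] for a in valid_actions[state]]\n",
| | + "    best = valid_actions[state][int(np.argmax(qs))]\n",
| | + "    print('state {}: best action is {}'.format(state, action_names[best]))"
| | + ]
| | + },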
| 346 | + { |
| 347 | + "cell_type": "markdown", |
| 348 | + "metadata": { |
| 349 | + "collapsed": true |
| 350 | + }, |
| 351 | + "source": [ |
| 352 | + "可以看到在第一次 episode 之后,智能体就学会了在 state 2 的时候向下走能够得到奖励,通过不断地学习,在 10 个 episode 之后,智能体知道,在 state 0,向右走能得到奖励,在 state 1 向下走能够得到奖励,在 state 3 向右 走能得到奖励,这样在这个环境中任何一个状态智能体都能够知道如何才能够最快地到达宝藏的位置\n", |
| 353 | + "\n", |
| 354 | + "从上面的例子我们简单的演示了 q-learning,可以看出自己来构建整个环境是非常麻烦的,所以我们可以通过一些第三方库来帮我们搭建强化学习的环境,其中最有名的就是 open-ai 的 gym 模块,下一章我们将介绍一下 gym。" |
| 355 | + ] |
| 356 | + } |
| 357 | + ], |
| 358 | + "metadata": { |
| 359 | + "kernelspec": { |
| 360 | + "display_name": "Python 3", |
| 361 | + "language": "python", |
| 362 | + "name": "python3" |
| 363 | + }, |
| 364 | + "language_info": { |
| 365 | + "codemirror_mode": { |
| 366 | + "name": "ipython", |
| 367 | + "version": 3 |
| 368 | + }, |
| 369 | + "file_extension": ".py", |
| 370 | + "mimetype": "text/x-python", |
| 371 | + "name": "python", |
| 372 | + "nbconvert_exporter": "python", |
| 373 | + "pygments_lexer": "ipython3", |
| 374 | + "version": "3.6.3" |
| 375 | + } |
| 376 | + }, |
| 377 | + "nbformat": 4, |
| 378 | + "nbformat_minor": 2 |
| 379 | +} |