Chainer RL Agents
This repository contains a few RL agents as samples to give you hints about how to implement a more involved MARLO agent (note that using ChainerRL is not mandatory to participate in the contest, though).
Chainer "is a Python-based deep learning framework aiming at flexibility". It has a powerful high-level API for training deep learning networks, which makes it very useful in an RL context. ChainerRL is a deep reinforcement learning library that implements various state-of-the-art deep reinforcement learning algorithms in Python using Chainer.
The framework offers a wide range of algorithms and deep learning tools that make for a quick start, so it is ideal for first drafts. ChainerRL works directly with OpenAI's Gym framework, which takes much of the structural work off your hands as a competitor and lets you focus on your agent's behaviour.
Please refer to ChainerRL's official GitHub for installation instructions and further documentation. Alternatively, you can install ChainerRL as a package from PyPI with the following command: pip install chainerrl. After that, you can test it out by following the steps laid out in ChainerRL's official getting started guide.
As described previously, ChainerRL interacts with Marlo seamlessly. Let us take the implementation of a PPO agent in Minecraft using ChainerRL and Marlo and break it down into simpler steps.
First it is necessary to start up a Minecraft client on port 10000 such that Marlo can use it. This is very easy to do if you have followed the previous steps and have therefore installed a repacked version of Malmo, since it comes already packed with a Minecraft launcher. Simply navigate to your Malmo folder with your favorite CLI and then:
On Windows:
    cd Minecraft
    .\launchClient.bat
or, on Linux and macOS:
    cd Minecraft
    ./launchClient.sh
If this is your first time running Malmo, please be patient as building the Gradle client takes a while. Don't worry if the console shows 95% completion as long as the game runs: the server is running and agents will work.
If you have installed Malmo as a PyPI wheel, you can use the following command instead:
python3 -c 'import malmo.minecraftbootstrap; malmo.minecraftbootstrap.launch_minecraft()'
You are now ready to start putting together your ChainerRL PPO!
Let us first import PPO, its subsidiary A3C, as well as other Chainer and ChainerRL-related classes. We will also import numpy and marlo as these will be used later.
    from chainerrl.agents import a3c
    from chainerrl.agents import PPO
    from chainerrl import experiments
    from chainerrl import links
    from chainerrl import misc
    from chainerrl.optimizers.nonbias_weight_decay import NonbiasWeightDecay
    from chainerrl import policies
    import chainer
    import logging
    import sys
    import gym
    import numpy as np
    import marlo
    import time

    # Tweakable parameters, can be turned into args if needed
    gpu = 1
    steps = 10 ** 6
    eval_n_runs = 10
    eval_interval = 10000
    update_interval = 2048
    outdir = 'results'
    lr = 3e-4
    bound_mean = False
    normalize_obs = False
We will use the A3C feedforward softmax policy, and this will be implemented in a standard fashion as below:
    class A3CFFSoftmax(chainer.ChainList, a3c.A3CModel):
        """An example of A3C feedforward softmax policy."""

        def __init__(self, ndim_obs, n_actions, hidden_sizes=(200, 200)):
            self.pi = policies.SoftmaxPolicy(
                model=links.MLP(ndim_obs, n_actions, hidden_sizes))
            self.v = links.MLP(ndim_obs, 1, hidden_sizes=hidden_sizes)
            super().__init__(self.pi, self.v)

        def pi_and_v(self, state):
            return self.pi(state), self.v(state)
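To give an intuition for what a softmax policy does with the network's outputs, here is a minimal sketch in plain Python. It is not ChainerRL's implementation: the helper names softmax and sample_action are made up for illustration, and the three example scores are arbitrary.

```python
import math
import random


def softmax(logits):
    """Turn raw action scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Draw one action index according to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical scores for three discrete actions
probs = softmax([2.0, 1.0, 0.1])
action = sample_action([2.0, 1.0, 0.1], random.Random(0))
print('probabilities:', probs)
print('sampled action:', action)
```

Higher scores get higher probabilities, but every action keeps a non-zero chance of being picked, which is what lets the agent keep exploring.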
Next, let us create a phi function that casts observations to float32 (since ChainerRL uses float32, but Gym uses float64!)
    def phi(obs):
        return obs.astype(np.float32)
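As a quick sanity check, the cast can be tried on a dummy array. The array shape below is an arbitrary placeholder and not the shape of a real Marlo observation.

```python
import numpy as np


def phi(obs):
    # Cast the observation to the float32 dtype used by the network
    return obs.astype(np.float32)


# Placeholder observation; real observations come from env.reset() / env.step()
raw = np.zeros((2, 3), dtype=np.float64)
converted = phi(raw)
print(raw.dtype, '->', converted.dtype)
```

Note that astype returns a new array, so the original observation is left untouched.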
With that out of the way, let us create the environment in a typical Gym fashion.
    # Ensure that you have a minecraft-client running with : marlo-server --port 10000
    env = gym.make('MinecraftCliffWalking1-v0')
    env.init(
        allowContinuousMovement=["move", "turn"],
        videoResolution=[800, 600]
    )
Marlo environments support a wide range of initialization parameters, as seen here. You can use any of these in the env.init() call.
Currently, the number of available environments is limited and their string titles can all be found here. Feel free to swap any of these in the gym.make("") call at the beginning of the file in order to select a different mission to train on.
Finally, let us render the environment and print out some helpful statistics.
    obs = env.reset()
    env.render()
    print('initial observation:', obs)

    action = env.action_space.sample()
    obs, r, done, info = env.step(action)
    print('next observation:', obs)
    print('reward:', r)
    print('done:', done)
    print('info:', info)
    print('actions:', str(env.action_space))
The print statements are there solely for debugging. They tend to be rather helpful when something goes wrong while trying to kick an environment off.
In order to create a PPO agent, we must initialize it. ChainerRL's PPO agent class requires a model parameter, which is represented here by our chosen softmax policy. Therefore, we need to instantiate our policy for use in the agent:
    timestep_limit = env.spec.tags.get(
        'wrapper_config.TimeLimit.max_episode_steps'
    )
    obs_space = env.observation_space
    action_space = env.action_space

    model = A3CFFSoftmax(obs_space.low.size, action_space.n)
We should also use an optimizer for the policy. In this case we're using the Adam algorithm:
    opt = chainer.optimizers.Adam(alpha=lr, eps=1e-5)
    opt.setup(model)
Finally, we initialize PPO with the policy, optimizer and pre-set variables as declared at the top of the file.
    # Initialize the agent
    agent = PPO(
        model, opt,
        gpu=gpu,
        phi=phi,
        update_interval=update_interval,
        minibatch_size=64,
        epochs=10,
        clip_eps_vf=None,
        entropy_coef=0.0,
    )
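For readers new to PPO, the role of the clipping parameter can be illustrated with the standard clipped surrogate objective for a single sample. This is a sketch of the textbook formula, not ChainerRL's code; the function name clipped_surrogate is hypothetical, and the default of 0.2 is simply the starting value used for the clipping parameter in this walkthrough.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- new policy probability divided by old policy probability
    advantage -- estimated advantage of the action that was taken
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + eps) * A
capped = clipped_surrogate(1.5, 1.0)
# A small ratio with a negative advantage is held at (1 - eps) * A
held = clipped_surrogate(0.5, -1.0)
# A ratio inside the clipping range is left unchanged
unchanged = clipped_surrogate(1.1, 2.0)
print(capped, held, unchanged)
```

The clipping removes any incentive to move the new policy far away from the old one within a single update, which is what keeps PPO's updates conservative.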
The PPO setup used here also assumes that the learning rate decays linearly to zero over the course of training:
    # Linearly decay the learning rate to zero
    def lr_setter(env, agent, value):
        agent.optimizer.alpha = value

    lr_decay_hook = experiments.LinearInterpolationHook(
        steps, 3e-4, 0, lr_setter)
The clipping parameter is likewise decayed linearly toward zero:
    # Linearly decay the clipping parameter to zero
    def clip_eps_setter(env, agent, value):
        agent.clip_eps = value

    clip_eps_decay_hook = experiments.LinearInterpolationHook(
        steps, 0.2, 0, clip_eps_setter)
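To make the schedule concrete, here is a small stand-alone sketch of linear interpolation between a start and a stop value. It only approximates what such a hook computes: the function name linear_interpolation is hypothetical, and the assumption that the schedule runs from step 1 to the final step may not match ChainerRL's exact step indexing.

```python
def linear_interpolation(step, total_steps, start_value, stop_value):
    """Value on a straight line from start_value (at step 1) to stop_value
    (at step total_steps); steps outside that range are clamped."""
    if total_steps <= 1:
        return stop_value
    fraction = (step - 1) / (total_steps - 1)
    fraction = min(max(fraction, 0.0), 1.0)
    return start_value + fraction * (stop_value - start_value)


total = 10 ** 6
lr_start = linear_interpolation(1, total, 3e-4, 0)
lr_end = linear_interpolation(total, total, 3e-4, 0)
clip_mid = linear_interpolation(total // 2, total, 0.2, 0)
print('learning rate at first step:', lr_start)
print('learning rate at last step:', lr_end)
print('clipping parameter near the midpoint:', clip_mid)
```

With these settings both the learning rate and the clipping parameter shrink steadily, so updates become smaller and more cautious toward the end of training.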
To train, we would have to loop over the episodes and timesteps ourselves, calling the agent's act() method as we go, which can be rather cumbersome. Fortunately, ChainerRL provides an easy way to do this via its experiments module. Let us call the train_agent_with_evaluation() function on our PPO agent:
    # Start training/evaluation
    experiments.train_agent_with_evaluation(
        agent=agent,
        env=env,
        eval_env=env,
        outdir=outdir,
        steps=steps,
        eval_n_runs=eval_n_runs,
        eval_interval=eval_interval,
        max_episode_len=timestep_limit,
        step_hooks=[
            lr_decay_hook,
            clip_eps_decay_hook,
        ],
    )
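For comparison, the kind of hand-written loop that this helper saves you from looks roughly like the sketch below. It runs without Minecraft or ChainerRL: RandomAgent and CountdownEnv are hypothetical stand-ins that only mimic Gym's reset()/step() convention, and a real agent would also learn from the rewards, which is omitted here.

```python
import random


class RandomAgent:
    """Hypothetical stand-in for an agent; it just picks random actions."""

    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def act(self, obs):
        return self.rng.randrange(self.n_actions)


class CountdownEnv:
    """Hypothetical toy environment: reward 1 per step, done after N steps."""

    def __init__(self, episode_len=5):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return self.t, 1.0, done, {}


def run(agent, env, steps, max_episode_len):
    """Step through episodes until `steps` total timesteps have been taken."""
    total_steps = 0
    episode_returns = []
    while total_steps < steps:
        obs, done = env.reset(), False
        episode_return, episode_len = 0.0, 0
        while not done and episode_len < max_episode_len and total_steps < steps:
            action = agent.act(obs)
            obs, reward, done, info = env.step(action)
            episode_return += reward
            episode_len += 1
            total_steps += 1
        episode_returns.append(episode_return)
    return episode_returns


returns = run(RandomAgent(4), CountdownEnv(5), steps=12, max_episode_len=10)
print('episode returns:', returns)
```

On top of this bookkeeping you would still need periodic evaluation runs, step hooks and result logging, which is why the helper is the more convenient option.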
Et voila! Your agent is now ready to start aggressively walking towards walls for weeks on end as it finds its way through the complex jungle that Minecraft gameplay is!