This project implements the Soft Actor-Critic algorithm in PyTorch, together with a series of advanced features. It can be used to train Gym, PyBullet, and Unity (ML-Agents) environments.
- N-step
- V-trace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures)
- Prioritized Experience Replay (100% NumPy sum tree; see the sketch after this list)
- *Episode Experience Replay
- R2D2 (Recurrent Experience Replay In Distributed Reinforcement Learning)
- *Representation model, Q function and policy structures
- *Recurrent Prediction Model
- Noisy networks (Noisy Networks for Exploration)
- Distributed training (Distributed Prioritized Experience Replay)
- Discrete action (Soft Actor-Critic for Discrete Action Settings)
- Curiosity mechanism (Curiosity-driven Exploration by Self-supervised Prediction)
- Large-scale Distributed Evolutionary Reinforcement Learning (The distributed training module is suspended for maintenance)
- Siamese self-supervised representation learning (ATC, BYOL)
* denotes features that we implemented ourselves.
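Since the prioritized replay above is described as a pure NumPy sum tree, here is a minimal sum-tree sketch showing how priorities are stored and sampled proportionally. It is illustrative only; the class name and methods are not this project's actual API.

```python
import numpy as np

class SumTree:
    """Minimal NumPy sum tree: leaves hold priorities, internal nodes hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # binary tree stored in a flat array
        self.write = 0                          # next leaf slot to overwrite

    def add(self, priority):
        leaf = self.write + self.capacity - 1   # leaves occupy the last `capacity` entries
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity
        return leaf

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                        # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def sample(self, value):
        """Walk down the tree to the leaf whose cumulative priority covers `value`."""
        idx = 0
        while idx < self.capacity - 1:          # while not at a leaf
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx]

    def total(self):
        return self.tree[0]
```

Sampling a batch of size k then typically means drawing one value uniformly from each of k equal segments of [0, total()) and mapping each returned leaf index back to a buffer slot (leaf - capacity + 1).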
This framework supports Gym, PyBullet, and Unity environments built with ML-Agents.
Observations can be any combination of vectors and images, which means an agent can have multiple sensors and each image can have a different resolution.
Action spaces can be continuous, discrete, or a combination of both.
Multi-agent environments are not supported.
All neural network models should be defined in a single .py file (default nn.py), and all training configurations should be specified in config.yaml.
Both the neural network model file and the training configuration should be placed in the same folder under envs.
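As a purely illustrative sketch of what such a model file could contain (the class names, constructor signatures, and interfaces below are hypothetical, not the framework's actual base classes), here is a PyTorch example of a representation model that fuses a vector sensor with an image sensor, together with a Q function and a Gaussian policy:

```python
import torch
from torch import nn

# Hypothetical sketch only; the real nn.py must follow the framework's
# expected class names and signatures, which are not shown here.

class ModelRep(nn.Module):
    """Encodes a multi-sensor observation (a vector plus an image) into a state."""
    def __init__(self, vector_dim=8, image_channels=3, state_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(image_channels, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.dense = nn.Sequential(nn.Linear(32 + vector_dim, state_dim), nn.ReLU())

    def forward(self, vector_obs, image_obs):
        return self.dense(torch.cat([self.conv(image_obs), vector_obs], dim=-1))

class ModelQ(nn.Module):
    """Q(s, a) for a continuous action."""
    def __init__(self, state_dim=64, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class ModelPolicy(nn.Module):
    """Outputs a Gaussian policy over continuous actions."""
    def __init__(self, state_dim=64, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mean = nn.Linear(64, action_dim)
        self.log_std = nn.Linear(64, action_dim)

    def forward(self, state):
        h = self.net(state)
        return torch.distributions.Normal(self.mean(h), self.log_std(h).exp())
```

The exact interfaces that nn.py must implement are defined by the framework itself; this sketch only illustrates the overall structure (representation, Q, policy) and the multi-sensor observation idea.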
All default training configurations are listed below. They can also be found in algorithm/default_config.yaml
```yaml
base_config:
  env_type: UNITY # UNITY | GYM | DM_CONTROL
  env_name: env_name # The environment name
  env_args: null
  unity_args: # Only for Unity environments
    no_graphics: true # If an env does not need pixel input, set true
    build_path: # Unity executable path
      win32: path_win32
      linux: path_linux
    port: 5005

  name: "{time}" # Training name. Placeholder "{time}" will be replaced with the time that training begins
  n_envs: 1 # Number of environments running in parallel
  max_iter: -1 # Max iteration
  max_step: -1 # Max step. Training terminates when max_iter or max_step is reached
  max_step_each_iter: -1 # Max step in each iteration
  reset_on_iteration: true # Whether to force-reset agents when an episode terminates
  reset_config: null # Reset parameters sent to Unity

nn_config:
  rep: null
  policy: null

replay_config:
  capacity: 524288
  alpha: 0.9 # [0~1] Converts the importance of TD error to priority. If 0, PER reduces to a vanilla replay buffer
  beta: 0.4 # Importance-sampling exponent, increasing from the initial value to 1
  beta_increment_per_sampling: 0.001 # Increment step
  td_error_min: 0.01 # Small amount to avoid zero priority
  td_error_max: 1. # Clipped abs error

sac_config:
  nn: nn # Neural network models file
  seed: null # Random seed
  write_summary_per_step: 1000 # Write TensorBoard summaries every N steps
  save_model_per_step: 5000 # Save the model every N steps
  use_replay_buffer: true # Whether using the prioritized replay buffer
  use_priority: true # Whether using the PER importance ratio
  ensemble_q_num: 2 # Number of Qs
  ensemble_q_sample: 2 # Number of min Qs
  burn_in_step: 0 # Burn-in steps in R2D2
  n_step: 1 # Update the Q function with N-step returns
  seq_encoder: null # RNN | ATTN
  batch_size: 256 # Batch size for training
  tau: 0.005 # Coefficient for updating the target network
  update_target_per_step: 1 # Update the target network every N steps
  init_log_alpha: -2.3 # The initial log_alpha
  use_auto_alpha: true # Whether using automatic entropy adjustment
  learning_rate: 0.0003 # Learning rate of all optimizers
  gamma: 0.99 # Discount factor
  v_lambda: 1.0 # Discount factor for V-trace
  v_rho: 1.0 # Rho for V-trace
  v_c: 1.0 # C for V-trace
  clip_epsilon: 0.2 # Epsilon for Q clipping
  discrete_dqn_like: false # Whether to use only the Q network (DQN-like) instead of a policy when the action space contains discrete actions
  use_n_step_is: true # Whether using importance sampling
  siamese: null # ATC | BYOL
  siamese_use_q: false # Whether using contrastive Q
  siamese_use_adaptive: false # Whether using adaptive weights
  use_prediction: false # Whether training a transition model
  transition_kl: 0.8 # The coefficient of the KL divergence between the transition and a standard normal
  use_extra_data: true # Whether using extra data to train the prediction model
  curiosity: null # FORWARD | INVERSE
  curiosity_strength: 1 # Curiosity strength if using curiosity
  use_rnd: false # Whether using RND
  rnd_n_sample: 10 # RND sample times
  use_normalization: false # Whether using observation normalization
  action_noise: null # [noise_min, noise_max]

ma_config: null
```
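To make the V-trace-related hyperparameters (gamma, v_lambda, v_rho, v_c) more concrete, below is a schematic NumPy sketch of how V-trace targets are computed in the IMPALA paper. It is illustrative only and not this project's actual implementation; the function name and arguments are hypothetical.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, is_ratios,
                   gamma=0.99, v_lambda=1.0, v_rho=1.0, v_c=1.0):
    """Schematic V-trace targets (IMPALA, Espeholt et al. 2018).

    rewards, values, is_ratios: arrays of shape [T]
    bootstrap_value: value estimate V(x_T) after the last step
    is_ratios: importance ratios pi(a|x) / mu(a|x) of target over behavior policy
    """
    T = len(rewards)
    rhos = np.minimum(is_ratios, v_rho)          # clipped rho_t
    cs = v_lambda * np.minimum(is_ratios, v_c)   # clipped c_t, scaled by lambda

    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)  # rho_t * TD errors

    # Accumulate corrections backwards: acc_t = delta_t + gamma * c_t * acc_{t+1}
    acc = 0.0
    corrections = np.zeros(T)
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc

    return values + corrections  # v_s = V(x_s) + discounted off-policy corrections
```

With on-policy data (all importance ratios equal to 1) and v_lambda = 1, these targets reduce to ordinary n-step bootstrapped returns.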
All default distributed training configurations are listed below. They can also be found in ds/default_config.yaml
```yaml
base_config:
  env_type: UNITY # UNITY | GYM
  env_name: env_name # The environment name
  env_args: null
  unity_args: # Only for Unity environments
    no_graphics: true # If an env does not need pixel input, set true
    build_path: # Unity executable path
      win32: path_win32
      linux: path_linux
    port: 5005

  name: "{time}" # Training name. Placeholder "{time}" will be replaced with the time that training begins
  update_sac_bak_per_step: 200 # Update sac_bak every N steps
  n_envs: 1 # Number of environments running in parallel
  max_step_each_iter: -1 # Max step in each iteration
  reset_on_iteration: true # Whether to force-reset agents when an episode terminates

  evolver_enabled: true
  evolver_cem_length: 50 # Start CEM when every learner has been evaluated evolver_cem_length times
  evolver_cem_best: 0.3 # The ratio of best learners
  evolver_cem_min_length: 2 # Start CEM when every learner has been evaluated `evolver_cem_min_length` times
                            # and more than `evolver_cem_time` minutes have passed since the last update
  evolver_cem_time: 3
  evolver_remove_worst: 4

  max_actors_each_learner: -1 # The max number of actors per learner; -1 indicates no limit
  noise_increasing_rate: 0 # Noise = N * number of actors
  noise_max: 0.1 # Max noise for actors

  max_episode_length: 500
  episode_queue_size: 5
  episode_sender_process_num: 5
  batch_queue_size: 5
  batch_generator_process_num: 5

net_config:
  learner_host: null
  learner_port: 61001

reset_config: null # Reset parameters sent to Unity

nn_config:
  rep: null
  policy: null

sac_config:
  nn: nn # Neural network models file
  seed: null # Random seed
  write_summary_per_step: 1000 # Write TensorBoard summaries every N steps
  save_model_per_step: 100000 # Save the model every N steps
  ensemble_q_num: 2 # Number of Qs
  ensemble_q_sample: 2 # Number of min Qs
  burn_in_step: 0 # Burn-in steps in R2D2
  n_step: 1 # Update the Q function with N-step returns
  seq_encoder: null # RNN | ATTN
  batch_size: 256
  tau: 0.005 # Coefficient for updating the target network
  update_target_per_step: 1 # Update the target network every N steps
  init_log_alpha: -2.3 # The initial log_alpha
  use_auto_alpha: true # Whether using automatic entropy adjustment
  learning_rate: 0.0003 # Learning rate of all optimizers
  gamma: 0.99 # Discount factor
  v_lambda: 1.0 # Discount factor for V-trace
  v_rho: 1.0 # Rho for V-trace
  v_c: 1.0 # C for V-trace
  clip_epsilon: 0.2 # Epsilon for Q clipping
  discrete_dqn_like: false # Whether to use only the Q network (DQN-like) instead of a policy when the action space contains discrete actions
  siamese: null # ATC | BYOL
  siamese_use_q: false # Whether using contrastive Q
  siamese_use_adaptive: false # Whether using adaptive weights
  use_prediction: false # Whether training a transition model
  transition_kl: 0.8 # The coefficient of the KL divergence between the transition and a standard normal
  use_extra_data: true # Whether using extra data to train the prediction model
  curiosity: null # FORWARD | INVERSE
  curiosity_strength: 1 # Curiosity strength if using curiosity
  use_rnd: false # Whether using RND
  rnd_n_sample: 10 # RND sample times
  use_normalization: false # Whether using observation normalization
  action_noise: null # [noise_min, noise_max]

  # random_params:
  #   param_name:
  #     in: [n1, n2, n3]
  #     truncated: [n1, n2]
  #     std: n

ma_config: null
```
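The evolver_cem_* options above control a cross-entropy-method (CEM) style evolution over learners. As a rough, generic sketch of the CEM idea only (the function names and exact parameterization below are hypothetical, not this project's API):

```python
import numpy as np

def cem_update(param_samples, rewards, best_ratio=0.3):
    """Generic cross-entropy-method step.

    param_samples: array [n_learners, n_params], the parameters each learner used
    rewards: array [n_learners], the evaluated performance of each learner
    best_ratio: fraction of top learners used to refit the sampling distribution
                (cf. evolver_cem_best above)
    """
    n_best = max(1, int(len(rewards) * best_ratio))
    best_idx = np.argsort(rewards)[-n_best:]  # indices of the best learners
    elite = param_samples[best_idx]

    mean = elite.mean(axis=0)                 # refit a Gaussian to the elite set
    std = elite.std(axis=0) + 1e-6            # small floor to keep exploring
    return mean, std

def sample_params(mean, std, n_learners, rng=None):
    """Draw new parameter sets for the next round of learners."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(mean, std, size=(n_learners, len(mean)))
```

Roughly speaking, each round the top evolver_cem_best fraction of learners (by evaluated reward) is used to refit the distribution from which the next round's randomized parameters are drawn.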
usage: main.py [-h] [--config CONFIG] [--run] [--logger_in_file] [--render] [--env_args ENV_ARGS] [--agents AGENTS]
[--max_iter MAX_ITER] [--port PORT] [--editor] [--name NAME] [--disable_sample] [--use_env_nn]
[--device DEVICE] [--ckpt CKPT] [--nn NN] [--repeat REPEAT]
env
positional arguments:
env
optional arguments:
-h, --help show this help message and exit
--config CONFIG, -c CONFIG
config file
--run inference mode
--logger_in_file logging into a file
--render render
--env_args ENV_ARGS additional args for environments
--agents AGENTS number of agents
--max_iter MAX_ITER max iteration
--port PORT, -p PORT UNITY: communication port
--editor UNITY: running in Unity Editor
--name NAME, -n NAME training name
--disable_sample disable sampling when choosing actions
--use_env_nn always use nn.py from the env folder instead of the saved nn_models.py (which is otherwise used if it exists)
--device DEVICE cpu or gpu
--ckpt CKPT checkpoint to restore
--nn NN neural network model
--repeat REPEAT number of repeated experiments
examples:
# Train the Gym environment mountain_car with name "test_{time}", 10 agents, and the experiment repeated twice
python main.py gym/mountain_car -n "test_{time}" --agents=10 --repeat=2
# Train the Unity environment roller with the vanilla config and port 5006
python main.py roller -c vanilla -p 5006
# Run inference on the Unity environment roller with model "nowall_202003251644192jWy"
python main.py roller -c vanilla -n nowall_202003251644192jWy --run --agents=1
usage: main_ds.py [-h] [--config CONFIG] [--run] [--logger_in_file] [--learner_host LEARNER_HOST]
[--learner_port LEARNER_PORT] [--render] [--env_args ENV_ARGS] [--agents AGENTS]
[--unity_port UNITY_PORT] [--editor] [--name NAME] [--device DEVICE] [--ckpt CKPT] [--nn NN]
env {learner,l,actor,a}
positional arguments:
env
{learner,l,actor,a}
optional arguments:
-h, --help show this help message and exit
--config CONFIG, -c CONFIG
config file
--run inference mode
--logger_in_file logging into a file
--learner_host LEARNER_HOST
learner host
--learner_port LEARNER_PORT
learner port
--render render
--env_args ENV_ARGS additional args for environments
--agents AGENTS number of agents
--unity_port UNITY_PORT, -p UNITY_PORT
UNITY: communication port
--editor UNITY: running in Unity Editor
--name NAME, -n NAME training name
--device DEVICE cpu or gpu
--ckpt CKPT checkpoint to restore
--nn NN neural network model
examples:
python main_ds.py test learner --logger_in_file
python main_ds.py test actor --logger_in_file