1. Environment adjustments
- Data collection: RecordEpisodeStatistics
- Skip the first n frames at the start: baseSkipFrame
- Mark the end of a single life as done: EpisodicLifeEnv
- Clip the score to 0 or 1: ClipRewardEnv
- Frame stacking: FrameStack

These are the basic preprocessing steps for image-based environments; they make it easier for the CNN to capture the agent's movement.

Vector-env reset fix for gym.vector.SyncVectorEnv: in the original code the reset is random. The subclass spSyncVectorEnv overrides step_wait so that every sub-environment in the vector can be reset with a consistent seed, which helps when training under one fixed seed.

```python
from copy import deepcopy
from typing import Any, Callable, Iterable, Tuple

import numpy as np
from numpy.typing import NDArray
import gym
from gym import Env, Space
from gym.vector.utils import concatenate


class spSyncVectorEnv(gym.vector.SyncVectorEnv):
    """SyncVectorEnv whose step_wait resets a terminated sub-env with a controllable seed."""

    def __init__(
        self,
        env_fns: Iterable[Callable[[], Env]],
        observation_space: Space = None,
        action_space: Space = None,
        copy: bool = True,
        random_reset: bool = False,
        seed: int = None,
    ):
        super().__init__(env_fns, observation_space, action_space, copy)
        self.random_reset = random_reset
        self.seed = seed

    def step_wait(self) -> Tuple[Any, NDArray[Any], NDArray[Any], NDArray[Any], dict]:
        """Steps through each of the environments, returning the batched results."""
        observations, infos = [], {}
        for i, (env, action) in enumerate(zip(self.envs, self._actions)):
            (
                observation,
                self._rewards[i],
                self._terminateds[i],
                self._truncateds[i],
                info,
            ) = env.step(action)
            if self._terminateds[i]:
                old_observation, old_info = observation, info
                if self.random_reset:
                    observation, info = env.reset(seed=np.random.randint(0, 999999))
                else:
                    # Reset with the fixed seed (or no seed) instead of a random one.
                    observation, info = env.reset() if self.seed is None else env.reset(seed=self.seed)
                info["final_observation"] = old_observation
                info["final_info"] = old_info
            observations.append(observation)
            infos = self._add_info(infos, info, i)
        self.observations = concatenate(
            self.single_observation_space, observations, self.observations
        )
        return (
            deepcopy(self.observations) if self.copy else self.observations,
            np.copy(self._rewards),
            np.copy(self._terminateds),
            np.copy(self._truncateds),
            infos,
        )
```
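For reference, a minimal sketch of the kind of wrapper stack that make_atari_env assembles and how it plugs into spSyncVectorEnv. This is only an illustration built from standard gym wrappers; the repo's baseSkipFrame / EpisodicLifeEnv / ClipRewardEnv are custom wrappers, and make_env_sketch is a hypothetical stand-in, not the repo's function.

```python
# Rough stand-in for make_atari_env using built-in wrappers only.
# Assumptions: gym>=0.26 with ale-py installed; the real repo uses its own
# baseSkipFrame / EpisodicLifeEnv / ClipRewardEnv wrappers instead.
import gym
import numpy as np

def make_env_sketch(env_name: str = "ALE/Breakout-v5", skip: int = 4, stack: int = 4):
    def thunk():
        env = gym.make(env_name, frameskip=1)                 # let AtariPreprocessing handle skipping
        env = gym.wrappers.RecordEpisodeStatistics(env)       # data collection: episodic return/length
        env = gym.wrappers.AtariPreprocessing(env, frame_skip=skip, grayscale_obs=True)
        env = gym.wrappers.TransformReward(env, np.sign)      # clip the reward to {0, 1} for Breakout
        env = gym.wrappers.FrameStack(env, stack)             # stack frames so the CNN can see motion
        return env
    return thunk

envs = spSyncVectorEnv([make_env_sketch() for _ in range(4)], random_reset=False, seed=202404)
obs, infos = envs.reset(seed=202404)
```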
2. PyTorch in practice

2.1 Building and training the agent

For details, see GitHub: test_ppo_atari.Breakout_v5_ppo2_test

After adjusting the vector environment's reset:
- The actor and critic can share one CNN feature extractor (PPOSharedCNN); a minimal sketch of this layout follows the list below.
- eps was reduced to 0.165 so that each policy update stays within a smaller range.
- Learning-rate annealing was turned off.
- Different ent_coef values were tried (a slightly larger value increases the agent's exploration):
  - ent_coef=0.015, batch_size 256→128: sharp drop, slow recovery
  - ent_coef=0.025, batch_size=256: sharp drop then recovery, final reward 311 √
  - ent_coef=0.05, batch_size=256: best run (PPO2__AtariEnv instance__20241029__2217), final reward 416
  - ent_coef=0.05, batch_size 256→128
  - ent_coef=0.1, batch_size=256: improvement too flat
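For orientation, here is a minimal sketch of a shared-CNN actor-critic in the spirit of PPOSharedCNN. The layer sizes follow the common CleanRL-style Atari CNN, which the config selects via clean_rl_cnn: True; the repo's actual class may differ, and SharedCNNActorCritic is just an illustrative name.

```python
import torch
import torch.nn as nn

class SharedCNNActorCritic(nn.Module):
    """One CNN trunk feeds both the policy head and the value head."""

    def __init__(self, n_actions: int):
        super().__init__()
        # Shared feature extractor over stacked 4x84x84 grayscale frames.
        self.cnn = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.actor = nn.Linear(512, n_actions)   # policy logits
        self.critic = nn.Linear(512, 1)          # state value

    def forward(self, obs: torch.Tensor):
        features = self.cnn(obs / 255.0)          # scale pixels to [0, 1]
        return self.actor(features), self.critic(features)
```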
```python
env_name = 'ALE/Breakout-v5'
env_name_str = env_name.replace('/', '-')
gym_env_desc(env_name)
print("gym.__version__ = ", gym.__version__)
path_ = os.path.dirname(__file__)
num_envs = 12
episod_life = True
clip_reward = True
resize_inner_area = True
env_pool_flag = False
seed = 202404
# 12 synchronized Atari envs, reset with a fixed seed
envs = spSyncVectorEnv(
    [make_atari_env(env_name, skip=4, episod_life=episod_life, clip_reward=clip_reward,
                    ppo_train=True, max_no_reward_count=120, resize_inner_area=resize_inner_area)
     for _ in range(num_envs)],
    random_reset=False,
    seed=202404
)
dist_type = 'norm'
cfg = Config(
    envs,
    save_path=os.path.join(path_, "test_models", f'PPO2_{env_name_str}-2'),
    seed=202404,
    num_envs=num_envs,
    episod_life=episod_life,
    clip_reward=clip_reward,
    resize_inner_area=resize_inner_area,
    env_pool_flag=env_pool_flag,
    # network parameters: Atari-CNN + MLP
    actor_hidden_layers_dim=[512, 256],
    critic_hidden_layers_dim=[512, 128],
    # agent parameters
    actor_lr=4.5e-4,
    gamma=0.99,
    # training parameters
    num_episode=3600,
    off_buffer_size=128,
    max_episode_steps=128,
    PPO_kwargs={
        'cnn_flag': True,
        'clean_rl_cnn': True,
        'share_cnn_flag': True,
        'continue_action_flag': False,
        'lmbda': 0.95,
        'eps': 0.165,
        'k_epochs': 4,            # update_epochs
        'sgd_batch_size': 512,
        'minibatch_size': 256,
        'act_type': 'relu',
        'dist_type': dist_type,
        'critic_coef': 1.0,
        'ent_coef': 0.05,
        'max_grad_norm': 0.5,
        'clip_vloss': True,
        'mini_adv_norm': True,
        'anneal_lr': False,
        'num_episode': 3600,
    }
)
minibatch_size = cfg.PPO_kwargs['minibatch_size']
max_grad_norm = cfg.PPO_kwargs['max_grad_norm']
cfg.trail_desc = f"actor_lr={cfg.actor_lr},minibatch_size={minibatch_size},max_grad_norm={max_grad_norm},hidden_layers={cfg.actor_hidden_layers_dim},"
agent = PPO2(
    state_dim=cfg.state_dim,
    actor_hidden_layers_dim=cfg.actor_hidden_layers_dim,
    critic_hidden_layers_dim=cfg.critic_hidden_layers_dim,
    action_dim=cfg.action_dim,
    actor_lr=cfg.actor_lr,
    critic_lr=cfg.critic_lr,
    gamma=cfg.gamma,
    PPO_kwargs=cfg.PPO_kwargs,
    device=cfg.device,
    reward_func=None
)
agent.train()
ppo2_train(
    envs, agent, cfg,
    wandb_flag=True,
    wandb_project_name=f"PPO2-{env_name_str}-NEW",
    train_without_seed=False,
    test_ep_freq=cfg.off_buffer_size * 10,
    online_collect_nums=cfg.off_buffer_size,
    test_episode_count=10,
    add_max_step_reward_flag=False,
    play_func=ppo2_play,
    ply_env=ply_env
)
```
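To make the role of eps, ent_coef, critic_coef and clip_vloss in PPO_kwargs concrete, here is a minimal sketch of a PPO-style minibatch loss. It is a generic illustration of the standard clipped objective, not the repo's exact PPO2 implementation; ratio, advantages, returns, old_values, new_values and entropy are assumed to be tensors computed elsewhere.

```python
import torch

def ppo_minibatch_loss(ratio, advantages, returns, old_values, new_values, entropy,
                       eps=0.165, critic_coef=1.0, ent_coef=0.05, clip_vloss=True):
    """Standard PPO clipped surrogate loss; a generic sketch, not the repo's exact code."""
    # Policy loss: eps bounds how far the new policy may move from the old one.
    pg_loss = -torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages,
    ).mean()

    # Value loss, optionally clipped around the old value estimates (clip_vloss).
    if clip_vloss:
        v_clipped = old_values + torch.clamp(new_values - old_values, -eps, eps)
        v_loss = 0.5 * torch.max(
            (new_values - returns) ** 2,
            (v_clipped - returns) ** 2,
        ).mean()
    else:
        v_loss = 0.5 * ((new_values - returns) ** 2).mean()

    # A larger ent_coef keeps the policy more stochastic, i.e. more exploration.
    return pg_loss + critic_coef * v_loss - ent_coef * entropy.mean()
```

After this loss is backpropagated, the gradient is typically clipped to max_grad_norm (0.5 in the config above) before the optimizer step.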
2.2 Observing the trained agent
Finally, take the best network from training and watch it play:

```python
env = make_atari_env(env_name, skip=4, episod_life=episod_life, clip_reward=clip_reward,
                     ppo_train=True, max_no_reward_count=120, resize_inner_area=resize_inner_area,
                     render_mode='human')()
ppo2_play(env, agent, cfg, episode_count=2, play_without_seed=False, render=True, ppo_train=True)
```
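If you want to watch the policy without the repo's ppo2_play helper, a bare-bones rollout looks roughly like the sketch below; policy_action is a hypothetical stand-in for however the trained agent maps an observation to an action (e.g. argmax over the actor's logits).

```python
import numpy as np

def watch(env, policy_action, episodes: int = 2, seed: int = 202404):
    """Render a few episodes with a given policy; a generic sketch, not the repo's ppo2_play."""
    for ep in range(episodes):
        obs, info = env.reset(seed=seed + ep)
        done, total_reward = False, 0.0
        while not done:
            action = policy_action(np.asarray(obs))      # hypothetical: obs -> action
            obs, reward, terminated, truncated, info = env.step(action)
            total_reward += reward
            done = terminated or truncated
        print(f"episode {ep}: reward = {total_reward}")
    env.close()
```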