Figure 4. The key components of DRL-based controller design, derived from the classification of the most relevant publications. Tables 1 and 2 in the Appendix provide a complete summary.
conferences in the machine learning field.
3.5. State, action, reward, and others
State, action, and reward are integral components in training a controller, and their design directly affects the controller's performance. However, there is no fully unified standard or method for their specific design.
For the design of the state space, on the one hand, considering too few observations can lead to a partially observable controller. On the other hand, providing all available readings results in a brittle controller that is overfitted to the simulation environment. Both choices degrade the performance of the controller on the real machine, so researchers can only make trade-offs based on the practical problem. In current research, for simple tasks (walking or turning on flat ground, etc.), proprioception alone (base orientation, angular velocity, joint position and velocity, etc.) is sufficient to solve the problem [10,39,41]. For more complex tasks (walking on uneven ground, climbing stairs or hillsides, avoiding obstacles, etc.), exteroception, such as visual information, needs to be introduced [8,13,42]. Adding such additional sensors alleviates the partial observability issue to some extent.
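As a rough illustration of this trade-off (the sensor accessors and dimensions below are assumptions for a generic quadruped, not taken from any particular cited work), a proprioceptive observation vector might be assembled as follows, with exteroceptive terms appended only when the task requires them:

    import numpy as np

    def build_observation(robot, use_exteroception=False):
        """Assemble the policy observation from onboard sensing.

        Proprioception alone is often enough for flat-ground locomotion;
        exteroception (e.g., a local terrain height map) is appended for
        rough terrain or stairs.  All accessor names are hypothetical.
        """
        obs = [
            robot.base_orientation(),       # e.g., projected gravity or quaternion
            robot.base_angular_velocity(),  # gyroscope reading (3 dims)
            robot.joint_positions(),        # 12 dims for a quadruped
            robot.joint_velocities(),       # 12 dims
            robot.previous_action(),        # commonly included for smoothness
        ]
        if use_exteroception:
            obs.append(robot.local_height_map().ravel())  # terrain samples around the feet
        return np.concatenate(obs)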
Most researchers use desired joint positions (residuals) as the action space and then compute the torques through a PD controller to drive the robot's locomotion. Early studies [43] experimentally demonstrated that controllers with such an action space can achieve better performance. However, recent studies also attempt to use lower-level control commands, bypassing the PD controller and commanding torques directly, in order to obtain highly dynamic motion behavior [44]. Although current DRL-based controllers have achieved outstanding performance [6–8], their stability is still not as good as that of conventional control methods such as MPC [45]. The hybrid force–position control scheme adopted by MPC is worth drawing on and merits further research. Furthermore, in some studies based on hierarchical DRL, latent commands serve as the action space of the high-level policy to guide the behavior of the low-level policies [46,47].
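As a minimal sketch of the most common choice above, the policy output can be interpreted as joint-position residuals around a nominal pose and converted to torques by a PD law; the gains and nominal pose here are illustrative assumptions, not values from the cited works:

    import numpy as np

    # Illustrative PD gains and nominal standing pose for a 12-joint quadruped.
    KP = 40.0                 # proportional gain (N*m/rad), assumed value
    KD = 1.0                  # derivative gain (N*m*s/rad), assumed value
    Q_NOMINAL = np.zeros(12)  # nominal joint angles; robot-specific in practice

    def action_to_torque(action, q, dq):
        """Map a policy action (joint-position residuals) to motor torques.

        q_des = nominal pose + residual; a PD loop, typically running at a
        higher rate than the policy, tracks q_des.
        """
        q_des = Q_NOMINAL + action
        tau = KP * (q_des - q) - KD * dq
        return tau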
In general, the design of the reward function is fairly laborious, especially for complex systems such as robots.
Small changes in the reward function hyperparameters can have a large impact on the final performance of the controller. In order for the robot to complete more complex tasks, the reward function must be designed with sufficient detail [6–8,48].
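As a rough illustration only (the terms and weights below are generic assumptions, not the reward of any cited controller), such a reward is typically a weighted sum of task-tracking terms and regularization penalties:

    import numpy as np

    # Illustrative weights; in practice these hyperparameters require careful tuning.
    W_VEL, W_ORI, W_TORQUE, W_SMOOTH = 1.0, 0.5, 2e-4, 1e-2

    def reward(v_xy, v_cmd, gravity_proj, tau, action, prev_action):
        """Generic locomotion reward: track the commanded base velocity while
        penalizing poor base orientation, large torques, and jerky actions."""
        r_vel = np.exp(-np.sum((v_xy - v_cmd) ** 2))      # velocity tracking
        r_ori = -np.sum(gravity_proj[:2] ** 2)            # keep the base level
        r_tau = -np.sum(tau ** 2)                          # energy / torque penalty
        r_smooth = -np.sum((action - prev_action) ** 2)    # action smoothness
        return W_VEL * r_vel + W_ORI * r_ori + W_TORQUE * r_tau + W_SMOOTH * r_smooth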
angular velocity, base linear velocity, joint position and velocity, foot contact states, policy output, and motor