

               rithm can deal with sparse-reward tasks, which greatly reduces the difficulty of designing reward functions. It
               also relieves researchers of the heavy time burden of tuning reward-function parameters.

               4.1.2. Generalization and adaptation
               Generalization is another fundamental problem of the DRL algorithm. Current algorithms perform well in
               single-task and static environments, but they struggle with multi-task and dynamically unstructured envi-
               ronments. That is, it is difficult for robots to acquire novel skills and quickly adapt to unseen environments
               or tasks. Generalization or adaptation to new scenarios remains a long-standing unsolved problem in the
               DRL community. In general, there are two broad categories of problems in robotics tasks: the observational
               generalization (adaptation) problem and the dynamics generalization (adaptation) problem. The former concerns
               learning from high-dimensional state spaces, such as raw visual sensor observations.
               High-dimensional observations may incorporate redundant, task-irrelevant information that may impair the
               generalization ability of robot learning. Currently, there are many related studies published on physical manip-
               ulation [67–71]  but only a few cutting-edge works on quadrupedal locomotion tasks [8,11,13] . The latter mainly
               takes into account the dynamic changes of the environment (e.g., robot body mass and ground friction coeffi-
               cient) [72–74]. Such changes alter the transition probabilities of the environment: the robot may take the same
               action in the same state yet transition to a different next state.
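
               As a concrete illustration of the dynamics generalization problem, the sketch below (a toy example of ours,
               not code from the cited works) simulates a one-dimensional point mass whose mass and friction coefficient are
               resampled at every reset; the same action applied from the same state then yields a different next state.

import numpy as np

class PointMassEnv:
    """Toy 1D point mass; the parameter names and ranges are illustrative only."""

    def __init__(self, mass=1.0, friction=0.1, dt=0.02):
        self.mass, self.friction, self.dt = mass, friction, dt
        self.pos, self.vel = 0.0, 1.0

    def reset(self, rng):
        # Resampling physical parameters changes the transition function p(s' | s, a).
        self.mass = rng.uniform(0.5, 2.0)      # e.g., varying body mass
        self.friction = rng.uniform(0.0, 0.5)  # e.g., varying ground friction coefficient
        self.pos, self.vel = 0.0, 1.0          # identical initial state every episode
        return np.array([self.pos, self.vel])

    def step(self, force):
        accel = (force - self.friction * self.vel) / self.mass
        self.vel += accel * self.dt
        self.pos += self.vel * self.dt
        return np.array([self.pos, self.vel])

rng = np.random.default_rng(0)
env = PointMassEnv()
for _ in range(3):
    env.reset(rng)
    next_state = env.step(force=1.0)           # same state, same action ...
    print(env.mass, env.friction, next_state)  # ... different next state under new dynamics

               A policy trained under a single, fixed set of parameters has no incentive to be robust to such variation,
               which is why dynamics adaptation is treated as a problem in its own right.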


               4.1.3. Partial observation
               Simulators can significantly reduce the training difficulty of DRL algorithms because we have access to the
               ground-truth state of the robots. However, due to the limitations of the onboard sensors of real robots, the
               policies are limited to partial observations that are often noisy and delayed. For example, it is difficult to accu-
               rately measure the root translation and body height of a legged robot. This problem is more pronounced when
               faced with locomotion or navigation tasks in complex and unstructured environments. Several approaches
               have been proposed to alleviate this problem, such as applying system identification [75] , removing inaccessi-
               ble states during training [39] , adding more sensors [8,11,13] , and learning to infer privileged information [7,76] .
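
               The sketch below (our own illustration, not taken from the cited works; all names are hypothetical) shows
               one common way the gap between simulated and onboard sensing is emulated during training: states that real
               sensors cannot measure are dropped, and the remaining signals are delayed and corrupted with noise.

from collections import deque

import numpy as np

class PartialObservationWrapper:
    """Turns a simulator's ground-truth state into a noisy, delayed partial observation."""

    def __init__(self, visible_indices, noise_std=0.01, delay_steps=2, seed=0):
        self.visible_indices = visible_indices       # e.g., joint angles/velocities, IMU readings
        self.noise_std = noise_std
        self.buffer = deque(maxlen=delay_steps + 1)  # emulates sensing/communication latency
        self.rng = np.random.default_rng(seed)

    def observe(self, ground_truth_state):
        # Drop quantities onboard sensors cannot measure (e.g., root translation, body height).
        visible = ground_truth_state[self.visible_indices]
        noisy = visible + self.rng.normal(0.0, self.noise_std, size=visible.shape)
        self.buffer.append(noisy)
        return self.buffer[0]  # the policy sees the oldest buffered (i.e., delayed) observation

state = np.random.default_rng(1).normal(size=12)                     # ground-truth simulator state
wrapper = PartialObservationWrapper(visible_indices=np.arange(2, 12))
partial_obs = wrapper.observe(state)                                 # what the deployed policy receives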

               4.1.4. Reality gap
               This problem is caused by differences between the simulation and real-world physics [16] . There are many
               sources of this discrepancy, including incorrect physical parameters, unmodeled dynamics, and the stochasticity
               of real-world environments. Furthermore, there is no general consensus on which of these sources plays the most
               important role. A straightforward approach is domain randomization, a class of methods that uses a wide range
               of environmental parameters and sensor noises to learn robust robot behaviors [39,77,78] . Since this method is
               simple and effective, most studies on quadrupedal locomotion have used it to alleviate the reality gap problem.
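
               A minimal sketch of how domain randomization is typically set up is given below, assuming a hypothetical
               simulator handle `sim` with a `set_parameters()` method; the parameter names and ranges are illustrative,
               not values taken from the cited studies.

import numpy as np

# Illustrative randomization ranges; real studies tune these per robot and task.
RANDOMIZATION_RANGES = {
    "body_mass_scale":       (0.8, 1.2),   # multiplier on nominal link masses
    "ground_friction":       (0.4, 1.0),
    "motor_strength_scale":  (0.9, 1.1),
    "observation_noise_std": (0.0, 0.05),
}

def randomize_domain(sim, rng):
    """Resample physics and sensing parameters at every episode reset so the policy
    is trained over a distribution of dynamics rather than a single nominal model."""
    sampled = {name: float(rng.uniform(low, high))
               for name, (low, high) in RANDOMIZATION_RANGES.items()}
    sim.set_parameters(**sampled)  # hypothetical simulator API; adapt to the simulator in use
    return sampled

               No randomization is applied at deployment time; the hope is that the real system's dynamics fall within
               the distribution seen during training.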

               4.2. Future prospects
               4.2.1. Accelerate learning via model-based planning
               For sequential decision making problems, model-based planning is a powerful approach to improve sample
               efficiency and has achieved great success in applied domains such as game playing [79–81]  and continuous con-
               trol [82,83]. These methods, however, are costly when planning over long horizons and struggle to obtain accurate
               world models. More recently, the strengths of model-free and model-based methods have been combined to achieve
               superior sample efficiency and asymptotic performance on continuous control tasks [84] , especially on fairly
               challenging, high-dimensional humanoid and dog tasks [85] . How to use model-based planning in DRL-based
               quadrupedal locomotion research is an issue worthy of further exploration.
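
               The sketch below illustrates the basic planning loop behind such hybrids: candidate action sequences are
               rolled through a learned world model over a short horizon, and only the first action of the best sequence is
               executed (receding-horizon control). The linear "model" here is a stand-in for a learned one and is purely
               illustrative.

import numpy as np

def plan(state, dynamics_fn, reward_fn, horizon=10, n_candidates=256, action_dim=2, rng=None):
    """Random-shooting planner over a (learned) dynamics model."""
    rng = rng or np.random.default_rng()
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    states = np.repeat(state[None, :], n_candidates, axis=0)
    for t in range(horizon):
        returns += reward_fn(states, actions[:, t])
        states = dynamics_fn(states, actions[:, t])      # model-predicted next states
    best = np.argmax(returns)
    return actions[best, 0]                              # execute only the first action

# Toy stand-ins for a learned world model and reward model (illustrative only).
A = np.eye(4) * 0.99
B = np.full((4, 2), 0.05)
dynamics_fn = lambda s, a: s @ A.T + a @ B.T
reward_fn = lambda s, a: -np.sum(s ** 2, axis=-1)        # drive the state toward the origin

first_action = plan(np.ones(4), dynamics_fn, reward_fn)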


               4.2.2. Reuse of motion priors data
               Current vanilla DRL algorithms have difficulty producing life-like natural behaviors for legged robots. Further-
               more, reward functions capable of accomplishing complex tasks often require a tedious and labor-intensive
               tuning process. Robots also struggle to generalize or adapt to other environments or tasks. To alleviate this