

               rithm can deal with sparse-reward tasks, which greatly reduces the difficulty of designing reward functions. It
               also relieves researchers of the heavy time burden of tuning reward-function parameters.

               4.1.2. Generalization and adaptation
               Generalization is another fundamental problem of the DRL algorithm. Current algorithms perform well in
               single-task and static environments, but they struggle with multi-task and dynamically unstructured envi-
               ronments. That is, it is difficult for robots to acquire novel skills and quickly adapt to unseen environments
               or tasks. Generalization or adaptation to new scenarios remains a long-standing unsolved problem in the
               DRL community. In general, there are two broad categories of problems in robotics tasks: the observational
               generalization (adaptation) problem and the dynamics generalization (adaptation) problem. The former concerns
               learning from high-dimensional state spaces, such as raw visual sensor observations.
               High-dimensional observations may incorporate redundant, task-irrelevant information that may impair the
               generalization ability of robot learning. Currently, there are many related studies published on physical manip-
               ulation [67–71]  but only a few cutting-edge works on quadrupedal locomotion tasks [8,11,13] . The latter mainly
               takes into account the dynamic changes of the environment (e.g., robot body mass and ground friction coeffi-
               cient) [72–74]. Such changes alter the transition probabilities of the environment: the robot may take the same
               action in the same state yet transition to a different next state.
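
               As a concrete illustration of the dynamics generalization problem, the sketch below (a toy example of ours,
               not code from the cited works) simulates a one-dimensional point mass whose mass and friction coefficient are
               resampled at every reset; the same action applied from the same state then yields a different next state.

import numpy as np

class PointMassEnv:
    """Toy 1D point mass; the parameter names and ranges are illustrative only."""

    def __init__(self, mass=1.0, friction=0.1, dt=0.02):
        self.mass, self.friction, self.dt = mass, friction, dt
        self.pos, self.vel = 0.0, 1.0

    def reset(self, rng):
        # Resampling physical parameters changes the transition function p(s' | s, a).
        self.mass = rng.uniform(0.5, 2.0)      # e.g., varying body mass
        self.friction = rng.uniform(0.0, 0.5)  # e.g., varying ground friction coefficient
        self.pos, self.vel = 0.0, 1.0          # identical initial state every episode
        return np.array([self.pos, self.vel])

    def step(self, force):
        accel = (force - self.friction * self.vel) / self.mass
        self.vel += accel * self.dt
        self.pos += self.vel * self.dt
        return np.array([self.pos, self.vel])

rng = np.random.default_rng(0)
env = PointMassEnv()
for _ in range(3):
    env.reset(rng)
    next_state = env.step(force=1.0)           # same state, same action ...
    print(env.mass, env.friction, next_state)  # ... different next state under new dynamics

               A policy trained under a single, fixed set of parameters has no incentive to be robust to such variation,
               which is why dynamics adaptation is treated as a problem in its own right.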


               4.1.3. Partial observation
               Simulators can significantly reduce the training difficulty of DRL algorithms because we have access to the
               ground-truth state of the robots. However, due to the limitations of the onboard sensors of real robots, the
               policies are limited to partial observations that are often noisy and delayed. For example, it is difficult to accu-
               rately measure the root translation and body height of a legged robot. This problem is more pronounced when
               faced with locomotion or navigation tasks in complex and unstructured environments. Several approaches
               have been proposed to alleviate this problem, such as applying system identification [75] , removing inaccessi-
               ble states during training [39] , adding more sensors [8,11,13] , and learning to infer privileged information [7,76] .
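
               The sketch below (our own illustration, not taken from the cited works; all names are hypothetical) shows
               one common way the gap between simulated and onboard sensing is emulated during training: states that real
               sensors cannot measure are dropped, and the remaining signals are delayed and corrupted with noise.

from collections import deque

import numpy as np

class PartialObservationWrapper:
    """Turns a simulator's ground-truth state into a noisy, delayed partial observation."""

    def __init__(self, visible_indices, noise_std=0.01, delay_steps=2, seed=0):
        self.visible_indices = visible_indices       # e.g., joint angles/velocities, IMU readings
        self.noise_std = noise_std
        self.buffer = deque(maxlen=delay_steps + 1)  # emulates sensing/communication latency
        self.rng = np.random.default_rng(seed)

    def observe(self, ground_truth_state):
        # Drop quantities onboard sensors cannot measure (e.g., root translation, body height).
        visible = ground_truth_state[self.visible_indices]
        noisy = visible + self.rng.normal(0.0, self.noise_std, size=visible.shape)
        self.buffer.append(noisy)
        return self.buffer[0]  # the policy sees the oldest buffered (i.e., delayed) observation

state = np.random.default_rng(1).normal(size=12)                     # ground-truth simulator state
wrapper = PartialObservationWrapper(visible_indices=np.arange(2, 12))
partial_obs = wrapper.observe(state)                                 # what the deployed policy receives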

               4.1.4. Reality gap
               This problem is caused by differences between the simulation and real-world physics [16] . There are many
               sources of this discrepancy, including incorrect physical parameters, unmodeled dynamics, and the stochasticity
               of real-world environments. Furthermore, there is no general consensus on which of these sources plays the most
               important role. A straightforward approach is domain randomization, a class of methods that uses a wide range
               of environmental parameters and sensor noises to learn robust robot behaviors [39,77,78] . Since this method is
               simple and effective, most studies on quadrupedal locomotion have used it to alleviate the reality gap problem.
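
               A minimal sketch of how domain randomization is typically set up is given below, assuming a hypothetical
               simulator handle `sim` with a `set_parameters()` method; the parameter names and ranges are illustrative,
               not values taken from the cited studies.

import numpy as np

# Illustrative randomization ranges; real studies tune these per robot and task.
RANDOMIZATION_RANGES = {
    "body_mass_scale":       (0.8, 1.2),   # multiplier on nominal link masses
    "ground_friction":       (0.4, 1.0),
    "motor_strength_scale":  (0.9, 1.1),
    "observation_noise_std": (0.0, 0.05),
}

def randomize_domain(sim, rng):
    """Resample physics and sensing parameters at every episode reset so the policy
    is trained over a distribution of dynamics rather than a single nominal model."""
    sampled = {name: float(rng.uniform(low, high))
               for name, (low, high) in RANDOMIZATION_RANGES.items()}
    sim.set_parameters(**sampled)  # hypothetical simulator API; adapt to the simulator in use
    return sampled

               No randomization is applied at deployment time; the hope is that the real system's dynamics fall within
               the distribution seen during training.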

               4.2. Future prospects
               4.2.1. Accelerate learning via model-based planning
               For sequential decision making problems, model-based planning is a powerful approach to improve sample
               efficiency and has achieved great success in applied domains such as game playing [79–81]  and continuous con-
               trol [82,83]. These methods, however, are costly when planning over long horizons and struggle to obtain accurate
               world models. More recently, the strengths of model-free and model-based methods have been combined to achieve
               superior sample efficiency and asymptotic performance on continuous control tasks [84] , especially on fairly
               challenging, high-dimensional humanoid and dog tasks [85] . How to use model-based planning in DRL-based
               quadrupedal locomotion research is an issue worthy of further exploration.
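
               The sketch below illustrates the basic planning loop behind such hybrids: candidate action sequences are
               rolled through a learned world model over a short horizon, and only the first action of the best sequence is
               executed (receding-horizon control). The linear "model" here is a stand-in for a learned one and is purely
               illustrative.

import numpy as np

def plan(state, dynamics_fn, reward_fn, horizon=10, n_candidates=256, action_dim=2, rng=None):
    """Random-shooting planner over a (learned) dynamics model."""
    rng = rng or np.random.default_rng()
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    states = np.repeat(state[None, :], n_candidates, axis=0)
    for t in range(horizon):
        returns += reward_fn(states, actions[:, t])
        states = dynamics_fn(states, actions[:, t])      # model-predicted next states
    best = np.argmax(returns)
    return actions[best, 0]                              # execute only the first action

# Toy stand-ins for a learned world model and reward model (illustrative only).
A = np.eye(4) * 0.99
B = np.full((4, 2), 0.05)
dynamics_fn = lambda s, a: s @ A.T + a @ B.T
reward_fn = lambda s, a: -np.sum(s ** 2, axis=-1)        # drive the state toward the origin

first_action = plan(np.ones(4), dynamics_fn, reward_fn)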


               4.2.2. Reuse of motion priors data
               Current vanilla DRL algorithms have difficulty producing life-like natural behaviors for legged robots. Further-
               more, reward functions capable of accomplishing complex tasks often require a tedious and labor-intensive
               tuning process. Robots also struggle to generalize or adapt to other environments or tasks. To alleviate this