
[Figure 5 diagram: open issues (sample efficiency, generalization and adaptation, partial observation, reality gap) linked to future research directions (accelerate learning via model-based planning with model-based DRL, reuse of motion priors data, large-scale pretraining of DRL models).]

Figure 5. In the DRL-based real-world quadrupedal locomotion field, open problems mainly include sample efficiency, generalization and
adaptation, partial observation, and the reality gap. Future research directions are highlighted around these open problems. Based on
the current state of research on quadrupedal locomotion, we discuss future research prospects from multiple perspectives. In particular,
world models, skill data, and pre-trained models require significant attention, as these directions will play an integral role in realizing
legged robot intelligence.


               torque.


Many studies have also considered additional information, such as trajectory generators [46,49–51], control methods [52–54], motion data [10,12,55,56], etc. Trajectory generators and control methods mainly introduce prior
knowledge in the action space, narrowing the search range of DRL control policies, which greatly improves
the sample efficiency under a simple reward function. Motion data are often generated by other suboptimal
controllers or obtained from public datasets. Through imitation learning based on the motion data, the robot can
master behaviors and skills such as walking and turning. In both simulations and real-world deployment, the
robot eventually manages to generate natural and agile movement patterns and completes the assigned tasks
according to the external reward function.
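To make the action-space prior concrete, the following is a minimal sketch (in Python with NumPy) of how a learned policy can act as a bounded residual on top of a trajectory generator; the gait pattern, residual scale, and function names are illustrative assumptions rather than the exact formulation used in the cited works.

import numpy as np

def trajectory_generator(phase, amplitude=0.3, frequency=1.5):
    # Hypothetical open-loop foot-height pattern for four legs (the prior).
    # Diagonal leg pairs are offset by half a period, giving a trot-like gait.
    offsets = np.array([0.0, 0.5, 0.5, 0.0]) * np.pi
    return amplitude * np.sin(2.0 * np.pi * frequency * phase + offsets)

def compose_action(policy_output, phase, residual_scale=0.1):
    # The DRL policy only outputs a small, bounded residual around the prior,
    # which narrows the search range of the control policy.
    prior = trajectory_generator(phase)
    return prior + residual_scale * policy_output

# Usage: at each control step the policy observes the robot state together with
# the generator phase, and its output merely perturbs the prior trajectory.
phase = 0.25
policy_output = np.random.uniform(-1.0, 1.0, size=4)  # placeholder for pi(s)
action = compose_action(policy_output, phase)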

3.6. Solution to reality gap
Under the current mainstream learning paradigm, the reality gap is an unavoidable problem that must be
addressed. The domain randomization method is used by most researchers due to its simplicity and effectiveness. The differences between simulation and the real environment are mainly reflected in physical parameters
and sensors. Therefore, researchers mainly randomize physical parameters (mass, inertia, motor strength, latency, ground friction, etc.), add Gaussian noise to observations, and apply disturbance forces [35,48,50,57,58].
However, domain randomization methods trade optimality for robustness, which can lead to conservative
controllers [59]. Some studies have also used domain adaptation methods, that is, using real data to identify the
environment [60,61] or to obtain accurate physical parameters [62]. Furthermore, these methods can improve the
generalization (adaptation) performance of robots in challenging environments. For more solutions to the
reality gap, please refer to the relevant review paper [63].
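As an illustration of how such randomization is typically wired into training, the sketch below draws one set of physical parameters per episode, corrupts observations with Gaussian noise, and samples an occasional disturbance force; the parameter ranges and function names are assumptions made for exposition, not the values used in the cited studies.

import numpy as np

def randomize_physics(rng):
    # One set of simulator parameters sampled at every episode reset.
    return {
        "base_mass_kg":    rng.uniform(9.0, 14.0),   # payload / mass variation
        "motor_strength":  rng.uniform(0.8, 1.2),    # torque scaling factor
        "ground_friction": rng.uniform(0.4, 1.25),
        "latency_s":       rng.uniform(0.0, 0.04),   # actuation/observation delay
    }

def corrupt_observation(obs, rng, noise_std=0.01):
    # Gaussian noise added to simulated sensor readings.
    return obs + rng.normal(0.0, noise_std, size=obs.shape)

def random_push(rng, max_force_n=30.0):
    # External disturbance force occasionally applied to the robot base.
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    return rng.uniform(0.0, max_force_n) * direction

rng = np.random.default_rng(0)
sim_params = randomize_physics(rng)            # passed to the simulator at reset
noisy_obs = corrupt_observation(np.zeros(36), rng)
push_force = random_push(rng)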



4. OPEN PROBLEMS AND FUTURE PROSPECTS
In this section, we discuss the long-standing open problems in the DRL-based quadrupedal locomotion field
and the promising future research directions that revolve around them, as shown in Figure 5. Solutions to
these open problems are described in Section 3.


4.1. Open problems
               4.1.1. Sample efficiency
In many popular DRL algorithms, millions or billions of gradient descent steps are required to train policies
that can accomplish the assigned task [64–66]. Such a learning process therefore requires a vast number of
environment interactions, which is infeasible for real robotic tasks. In the face of increasingly complex robotic
tasks, without improvement in the sample efficiency of algorithms, the number of training samples needed
will only increase with model size and complexity. Furthermore, a sample-efficient DRL algo-