
[Figure 5 diagram: open issues (sample efficiency, generalization and adaptation, partial observation, reality gap) linked to future research directions (accelerate learning via model-based planning with model-based DRL, reuse of motion priors data, large-scale pretraining of DRL models).]

Figure 5. In the DRL-based real-world quadrupedal locomotion field, open problems mainly include sample efficiency, generalization and
adaptation, partial observation, and the reality gap. Future research directions are highlighted around these open problems. Based on
the current state of research on quadrupedal locomotion, we discuss future research prospects from multiple perspectives. In particular,
world models, skill data, and pre-trained models require significant attention, as these directions will play an integral role in realizing
legged robot intelligence.


               torque.


Many studies have also considered additional information, such as trajectory generators [46,49–51], control methods [52–54], motion data [10,12,55,56], etc. Trajectory generators and control methods mainly introduce prior
knowledge in the action space, narrowing the search range of DRL control policies, which greatly improves
the sample efficiency under a simple reward function. Motion data are often generated by other suboptimal
controllers or obtained from public datasets. Through imitation learning based on the motion data, the robot can
master behaviors and skills such as walking and turning. In both simulations and real-world deployment, the
robot eventually manages to generate natural and agile movement patterns and completes the assigned tasks
according to the external reward function.
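To make the action-space prior concrete, the following is a minimal sketch (in Python with NumPy) of how a learned policy can act as a bounded residual on top of a trajectory generator; the gait pattern, residual scale, and function names are illustrative assumptions rather than the exact formulation used in the cited works.

import numpy as np

def trajectory_generator(phase, amplitude=0.3, frequency=1.5):
    # Hypothetical open-loop foot-height pattern for four legs (the prior).
    # Diagonal leg pairs are offset by half a period, giving a trot-like gait.
    offsets = np.array([0.0, 0.5, 0.5, 0.0]) * np.pi
    return amplitude * np.sin(2.0 * np.pi * frequency * phase + offsets)

def compose_action(policy_output, phase, residual_scale=0.1):
    # The DRL policy only outputs a small, bounded residual around the prior,
    # which narrows the search range of the control policy.
    prior = trajectory_generator(phase)
    return prior + residual_scale * policy_output

# Usage: at each control step the policy observes the robot state together with
# the generator phase, and its output merely perturbs the prior trajectory.
phase = 0.25
policy_output = np.random.uniform(-1.0, 1.0, size=4)  # placeholder for pi(s)
action = compose_action(policy_output, phase)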

3.6. Solution to reality gap
Under the current mainstream learning paradigm, the reality gap is an unavoidable problem that must be
addressed. The domain randomization method is used by most researchers due to its simplicity and effectiveness. The differences between simulation and the real environment are mainly reflected in physical parameters
and sensors. Therefore, researchers mainly randomize physical parameters (mass, inertia, motor strength, latency, ground friction, etc.), add Gaussian noise to observations, and apply disturbance forces [35,48,50,57,58].
However, domain randomization methods trade optimality for robustness, which can lead to conservative
controllers [59]. Some studies have also used domain adaptation methods, that is, using real data to identify the
environment [60,61] or to obtain accurate physical parameters [62]. Furthermore, these methods can improve the
generalization (adaptation) performance of robots in challenging environments. For more solutions to the
reality gap, please refer to the relevant review paper [63].
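As an illustration of how such randomization is typically wired into training, the sketch below draws one set of physical parameters per episode, corrupts observations with Gaussian noise, and samples an occasional disturbance force; the parameter ranges and function names are assumptions made for exposition, not the values used in the cited studies.

import numpy as np

def randomize_physics(rng):
    # One set of simulator parameters sampled at every episode reset.
    return {
        "base_mass_kg":    rng.uniform(9.0, 14.0),   # payload / mass variation
        "motor_strength":  rng.uniform(0.8, 1.2),    # torque scaling factor
        "ground_friction": rng.uniform(0.4, 1.25),
        "latency_s":       rng.uniform(0.0, 0.04),   # actuation/observation delay
    }

def corrupt_observation(obs, rng, noise_std=0.01):
    # Gaussian noise added to simulated sensor readings.
    return obs + rng.normal(0.0, noise_std, size=obs.shape)

def random_push(rng, max_force_n=30.0):
    # External disturbance force occasionally applied to the robot base.
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    return rng.uniform(0.0, max_force_n) * direction

rng = np.random.default_rng(0)
sim_params = randomize_physics(rng)            # passed to the simulator at reset
noisy_obs = corrupt_observation(np.zeros(36), rng)
push_force = random_push(rng)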



4. OPEN PROBLEMS AND FUTURE PROSPECTS
In this section, we discuss the long-standing open problems in the DRL-based quadrupedal locomotion field
and the promising future research directions that revolve around them, as shown in Figure 5. Solutions to
these open problems are described in Section 3.


4.1. Open problems
               4.1.1. Sample efficiency
In many popular DRL algorithms, millions or billions of gradient descent steps are required to train policies
that can accomplish the assigned task [64–66]. Such a learning process therefore requires a vast number of
environment interactions, which is infeasible for real robotic tasks. In the face of increasingly complex robotic
tasks, without improvement in the sample efficiency of algorithms, the number of training samples needed
will only increase with model size and complexity. Furthermore, a sample-efficient DRL algo-