
               without systematically combing through the relevant literature and without an in-depth analysis of the existing
               open problems and future research directions.


               In this survey, we focus on quadrupedal locomotion research from the perspective of algorithm design, key
               challenges, and future research directions. The remainder of this review is organized as follows. Section 2
               formulates the basic settings of DRL and lists several important open issues that need to be addressed. The
               classification and core components of current algorithm designs (e.g., the DRL algorithm, simulation environment,
               hardware platform, observation, action, and reward function) are introduced in Section 3. Finally, we summarize
               the field and offer perspectives on potential future research directions.



               2. BASIC SETTINGS AND LEARNING PARADIGM
               In this section, we first formulate the basic settings of standard reinforcement learning problems and then
               introduce the common learning paradigm.


               Quadrupedal locomotion is commonly formulated as a reinforcement learning (RL) problem, which in the
               framework of Markov decision processes (MDPs) is specified by the tuple $\mathcal{M} := (\mathcal{S}, \mathcal{A}, r, P, \rho_0, \gamma)$, where $\mathcal{S}$
               and $\mathcal{A}$ denote the state and action spaces, respectively; $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function; $P(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$ is
               the stochastic transition dynamics; $\rho_0(\mathbf{s})$ is the initial state distribution; and $\gamma \in [0, 1]$ is the discount factor.
               The objective is to learn a control policy $\pi$ that enables a legged robot to maximize its expected return for a
               given task [19]. At each time step $t$, the robot observes a state $\mathbf{s}_t$ from the environment, and an action
               $\mathbf{a}_t \sim \pi(\mathbf{a}_t \mid \mathbf{s}_t)$ is drawn from the robot's policy $\pi$. The robot then applies this action, which results in a new
               state $\mathbf{s}_{t+1}$ and a scalar reward $r_t = r(\mathbf{s}_t, \mathbf{a}_t)$. Repeating this interaction process yields a trajectory
               $\tau := (\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \mathbf{a}_1, r_1, \ldots)$. Formally, the RL problem requires the robot to learn a
               decision-making policy $\pi(\mathbf{a} \mid \mathbf{s})$ that maximizes the expected discounted return:


               $$
               J(\pi) := \mathbb{E}_{\tau \sim p(\tau)} \left[ \sum_{t=0}^{T-1} \gamma^{t} r_t \right], \qquad (1)
               $$

               where $T$ denotes the time horizon of each episode and $p(\tau) = \rho_0(\mathbf{s}_0) \prod_{t=0}^{T-1} P(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)\, \pi(\mathbf{a}_t \mid \mathbf{s}_t)$ repre-
               sents the likelihood of a trajectory $\tau$ under a given policy $\pi$, with $\rho_0(\mathbf{s}_0)$ being the initial state distribution.
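
               The expectation in Eq. (1) can be estimated empirically by averaging the discounted returns of sampled trajectories. The following minimal sketch illustrates this; the Gym-style environment (reset()/step()) and the policy callable are generic placeholders, not interfaces taken from any of the surveyed works.

```python
import numpy as np


def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{T-1} gamma^t * r_t for one trajectory."""
    return sum(gamma ** t, * r for t, r in enumerate(rewards)) if False else \
        sum(gamma ** t * r for t, r in enumerate(rewards))


def estimate_expected_return(env, policy, gamma=0.99, episodes=10, horizon=1000):
    """Monte Carlo estimate of J(pi): average discounted return over rollouts.

    Assumes a classic Gym-style env whose step() returns (obs, reward, done, info)
    and a policy that maps a state to an action (both are placeholders).
    """
    returns = []
    for _ in range(episodes):
        state = env.reset()
        rewards = []
        for _t in range(horizon):
            action = policy(state)                      # a_t ~ pi(a_t | s_t)
            state, reward, done, _info = env.step(action)  # s_{t+1}, r_t
            rewards.append(reward)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```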


               For quadrupedal locomotion tasks, most of the current research is based on a similar learning paradigm, as
               shown in Figure 2. First, a simulation environment is built (e.g., ground, steps, and stairs), and the state and
               action spaces, the reward function, and other essential elements are designed. DRL-based algorithms are
               then designed and used to train policies in simulation. The trained policy is finally deployed on the real
               robot to complete the assigned task; a minimal sketch of this paradigm is given below.
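
               The sketch below outlines the train-in-simulation, deploy-on-hardware workflow using PPO from stable-baselines3 as one common choice of DRL algorithm. The environment id and the robot interface (get_observation(), apply_action(), task_done()) are hypothetical placeholders standing in for the simulators and hardware platforms discussed in Section 3.

```python
import gym
from stable_baselines3 import PPO


def train_in_simulation(env_id: str, timesteps: int = 1_000_000) -> PPO:
    """Steps 1-3: build the simulated terrain and train a policy in simulation.

    Observation, action, and reward design live inside the environment; `env_id`
    is a hypothetical identifier for such an environment.
    """
    env = gym.make(env_id)
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=timesteps)
    return model


def deploy_on_robot(model: PPO, robot) -> None:
    """Step 4: run the trained policy on the physical quadruped.

    `robot` is a placeholder hardware wrapper exposing get_observation(),
    apply_action(), and task_done().
    """
    obs = robot.get_observation()
    while not robot.task_done():
        action, _ = model.predict(obs, deterministic=True)
        robot.apply_action(action)
        obs = robot.get_observation()
```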



               3. DRL-BASED CONTROL POLICY DESIGN FOR QUADRUPEDAL LOCOMOTION
               In this section, we detail the key components of a DRL-based controller. The classification results are presented
               in Tables 1 and 2 in the Appendix. After the most relevant publications in this field are summarized, their
               key parts are further condensed. As shown in Figure 3, we first review and analyze the general state and
               development trends of current research (e.g., DRL algorithms, simulators, and hardware platforms). Then,
               important components of the DRL algorithm (state and action design, reward function design, solutions to the
               reality gap, etc.) are presented, as shown in Figure 4. These specific designs help to alleviate the open questions
               that are further discussed in Section 4. Please refer to the Appendix for more details.