without systematically combing through the relevant literature and without an in-depth analysis of the existing
open problems and future research directions.
In this survey, we focus on quadrupedal locomotion research from the perspective of algorithm design, key
challenges, and future research directions. The remainder of this review is organized as follows. Section 2 formulates the basic settings of DRL and lists several important open issues that need to be addressed. The classification
and core components of the current algorithm design (e.g., the DRL algorithm, simulation environment, hard-
ware platform, observation, action, and reward function) are introduced in Section 3. Finally, we summarize
and offer perspectives on potential future research directions in this field.
2. BASIC SETTINGS AND LEARNING PARADIGM
In this section, we first formulate the basic settings of standard reinforcement learning problems and then
introduce the common learning paradigm.
Quadrupedal locomotion is commonly formulated as a reinforcement learning (RL) problem, which in the framework of Markov decision processes (MDPs) is specified by the tuple $\mathcal{M} := (\mathcal{S}, \mathcal{A}, r, P, \rho_0, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively; $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function; $P(s' \mid s, a)$ is the stochastic transition dynamics; $\rho_0(s)$ is the initial state distribution; and $\gamma \in [0, 1]$ is the discount factor.
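To make the correspondence between the MDP tuple and a typical robot-learning codebase concrete, the following minimal sketch (our illustration, not code from any surveyed work) maps each element of $(\mathcal{S}, \mathcal{A}, r, P, \rho_0, \gamma)$ onto a simple environment interface; the class name, method names, and dimensions are assumptions chosen for clarity.

```python
import numpy as np

class QuadrupedMDP:
    """Illustrative environment sketch: each method corresponds to one element
    of the MDP tuple (S, A, r, P, rho_0, gamma). Dimensions are placeholders."""

    def __init__(self, state_dim=48, action_dim=12, gamma=0.99):
        self.state_dim = state_dim    # S: e.g., joint angles/velocities, body orientation
        self.action_dim = action_dim  # A: e.g., target positions for 12 joint actuators
        self.gamma = gamma            # discount factor gamma in [0, 1]

    def reset(self):
        # Sample an initial state s_0 ~ rho_0(s); a dummy Gaussian draw for illustration.
        return np.random.randn(self.state_dim)

    def step(self, state, action):
        # Stochastic transition dynamics P(s'|s, a) and reward r(s, a);
        # a real implementation would query a physics simulator here.
        next_state = state + 0.01 * np.random.randn(self.state_dim)
        reward = -float(np.linalg.norm(action))  # placeholder reward
        return next_state, reward
```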
The objective is to learn a control policy that enables a legged robot to maximize its expected return for a given task [19]. A state $s_t$ is observed by the robot from the environment at each time step $t$, and an action $a_t \sim \pi(a_t \mid s_t)$ is drawn from the robot's policy $\pi$. The robot then applies this action, which results in a new state $s_{t+1}$ and a scalar reward $r_t = r(s_t, a_t)$. Repeating this interaction process yields a trajectory $\tau := (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$. Formally, the RL problem requires the robot to learn a decision-making policy $\pi(a \mid s)$ that maximizes the expected discounted return:
$$
J(\pi) := \mathbb{E}_{\tau \sim p(\tau)} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right], \qquad (1)
$$
where $T$ denotes the time horizon of each episode and $p(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$ represents the likelihood of a trajectory $\tau$ under a given policy $\pi$, with $\rho_0(s_0)$ being the initial state distribution.
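As a concrete reading of Eq. (1), the sketch below rolls out trajectories by repeatedly sampling $a_t \sim \pi(a_t \mid s_t)$, accumulates the discounted rewards $\sum_t \gamma^t r_t$, and averages over trajectories to obtain a Monte Carlo estimate of $J(\pi)$. The `env` object is assumed to follow the illustrative interface sketched above, and `policy` is any callable that samples an action from $\pi(\cdot \mid s)$.

```python
import numpy as np

def rollout_return(env, policy, horizon, gamma):
    """Collect one trajectory tau = (s_0, a_0, r_0, s_1, ...) and return its
    discounted return sum_{t=0}^{T-1} gamma^t * r_t."""
    state = env.reset()                           # s_0 ~ rho_0(s)
    discounted_return = 0.0
    for t in range(horizon):
        action = policy(state)                    # a_t ~ pi(a_t | s_t)
        state, reward = env.step(state, action)   # s_{t+1} ~ P(.|s_t, a_t), r_t = r(s_t, a_t)
        discounted_return += (gamma ** t) * reward
    return discounted_return

def estimate_objective(env, policy, horizon, gamma, num_trajectories=100):
    """Monte Carlo estimate of J(pi) = E_{tau ~ p(tau)}[ sum_t gamma^t r_t ]."""
    returns = [rollout_return(env, policy, horizon, gamma)
               for _ in range(num_trajectories)]
    return float(np.mean(returns))
```

In practice, policy-gradient methods increase this objective (or a variance-reduced surrogate of it) by adjusting the policy parameters rather than evaluating it by plain Monte Carlo averaging.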
For quadrupedal locomotion tasks, most of the current research is based on a similar learning paradigm, as
shown in Figure 2. First, we build a simulation environment (e.g., ground, steps, and stairs) and then design the state and action spaces, the reward function, and other essential elements. DRL-based algorithms are
further designed and used to train policies in the simulation. The trained policy is finally deployed on the real
robot to complete the assigned task.
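The paradigm can be summarized as a simple pipeline skeleton. The sketch below is purely schematic: every function is an illustrative stub, not the API of any particular simulator, DRL library, or hardware driver, and a real pipeline would replace each stub with the corresponding component.

```python
# Self-contained skeleton of the common learning paradigm (cf. Figure 2).
# All names are illustrative stubs, not real framework APIs.

def build_simulation(terrain):
    """Stub: construct the simulated environment (ground, steps, stairs, ...)."""
    return {"terrain": terrain, "state_dim": 48, "action_dim": 12}

def train_policy_in_simulation(env, iterations):
    """Stub: optimize the policy with a DRL algorithm in simulation."""
    policy = {"weights": None}
    for _ in range(iterations):
        pass  # collect rollouts in `env` and update `policy` to maximize J(pi)
    return policy

def deploy_on_robot(policy):
    """Stub: run the trained policy on the physical quadruped."""
    print("Deploying policy trained in simulation:", policy)

if __name__ == "__main__":
    env = build_simulation(terrain="stairs")
    policy = train_policy_in_simulation(env, iterations=1000)
    deploy_on_robot(policy)
```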
3. DRL-BASED CONTROL POLICY DESIGN FOR QUADRUPEDAL LOCOMOTION
In this section, we detail the key components of a DRL-based controller. The classification results are presented in Tables 1 and 2 in the Appendix. After the most relevant publications in this field are summarized, their key parts are further condensed. As shown in Figure 3, we first review and analyze the general state and development trends of current research (e.g., DRL algorithms, simulators, and hardware platforms). Then, the important components of DRL algorithm design (state and action design, reward function design, solutions to the reality gap, etc.) are presented, as shown in Figure 4. These specific designs help to alleviate the open problems, which are further discussed in Section 4. Please refer to the Appendix for more details.
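To illustrate what "reward function design" typically involves in this literature, the following sketch combines a velocity-tracking term with energy and stability penalties. The specific terms and weights are illustrative assumptions for a generic quadruped, not the reward of any particular surveyed paper.

```python
import numpy as np

def locomotion_reward(base_lin_vel, cmd_vel, joint_torques, base_roll_pitch,
                      w_vel=1.0, w_energy=0.005, w_stability=0.5):
    """Illustrative quadrupedal locomotion reward: track a commanded velocity
    while penalizing energy use and large body tilt. Terms and weights are examples."""
    # Velocity-tracking term: close to 1 when base velocity matches the command.
    vel_error = np.sum((base_lin_vel - cmd_vel) ** 2)
    r_track = np.exp(-vel_error)

    # Energy penalty: discourage large joint torques.
    r_energy = -np.sum(joint_torques ** 2)

    # Stability penalty: discourage large roll and pitch of the base.
    r_stability = -np.sum(base_roll_pitch ** 2)

    return w_vel * r_track + w_energy * r_energy + w_stability * r_stability
```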