



Table 6. DRL for robotic manipulation categorized by state space, action space, algorithm type, and reward design

Levine et al. [176] (2015)
    State space: Joint angles and velocities.
    Action space: Joint torques.
    Algorithm type: Trajectory optimization algorithm.
    Reward shaping: A penalty term shaped as the sum of a quadratic term and a Lorentzian ρ-function; the first term encourages speed while the second encourages precision. In addition, a quadratic penalty is applied to joint velocities and torques to smooth and control the motions (see the first sketch below the table).

Andrychowicz et al. [180] (2017)
    State space: Joint angles and velocities, plus the objects' positions, rotations, and velocities.
    Action space: 4D action space; the first three dimensions are position related, and the last specifies the desired distance.
    Algorithm type: HER combined with any off-policy RL algorithm, such as DDPG.
    Reward shaping: Binary and sparse rewards (see the second sketch below the table).

Vecerik et al. [178] (2018)
    State space: Joint positions and velocities, joint torques, and the global poses of the socket and plug.
    Action space: Joint velocities.
    Algorithm type: An off-policy RL algorithm, called DDPGfD, based on imitation learning.
    Reward shaping: The first reward function is sparse: +10 if the plug is within a small tolerance of the goal. The second reward is shaped by two terms: a reaching phase for alignment and an inserting phase to reach the goal.

Gupta et al. [181] (2021)
    State space: Depends on the agent morphology; includes joint angles, angular velocities, readings of a velocimeter, an accelerometer, and a gyroscope positioned at the head, and touch sensors attached to the limbs and head.
    Action space: Chosen via a stochastic policy determined by the parameters of a deep NN that are learned via proximal policy optimization (PPO).
    Algorithm type: DERL, a simple computational framework operating by mimicking the intertwined processes of Darwinian evolution.
    Reward shaping: Two reward components: the first relative to velocity and the second relative to the actuators' input.


DRL: Deep reinforcement learning; HER: Hindsight Experience Replay; DDPG: Deep Deterministic Policy Gradient; DDPGfD: DDPG from Demonstrations; PPO: Proximal Policy Optimization; DERL: Deep Evolutionary Reinforcement Learning.
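
To make the first reward design in Table 6 concrete, the following is a minimal sketch of a shaped cost in the spirit of Levine et al. [176], combining a quadratic distance term, a Lorentzian ρ-function of the form log(d² + α), and quadratic penalties on joint velocities and torques. The function name, the weights, and the constant α are illustrative placeholders, not values taken from the original paper.

```python
import numpy as np

def shaped_cost(dist, joint_vel, torque,
                w_quad=1.0, w_lorentz=1.0, alpha=1e-5,
                w_vel=1e-3, w_torque=1e-3):
    """Illustrative shaped cost in the spirit of Levine et al. [176].

    dist      : end-effector-to-target displacement (array-like)
    joint_vel : joint velocity vector
    torque    : joint torque vector

    The quadratic term dominates far from the target (encourages speed),
    while the Lorentzian rho-function log(d^2 + alpha) retains a strong
    gradient near zero distance (encourages precision). Quadratic
    penalties on velocities and torques smooth and control the motion.
    All weights are placeholder values, not those of the original paper.
    """
    d2 = np.sum(np.square(dist))
    task_term = w_quad * d2 + w_lorentz * np.log(d2 + alpha)
    smooth_term = (w_vel * np.sum(np.square(joint_vel))
                   + w_torque * np.sum(np.square(torque)))
    return task_term + smooth_term
```

A lower cost is better here; negating this quantity yields a reward in the usual RL convention.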
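
By contrast, the binary, sparse reward of Andrychowicz et al. [180] only becomes learnable through hindsight relabeling. Below is a minimal sketch of HER's "future" relabeling strategy, assuming goal-conditioned transitions stored as dictionaries; the field names and the success tolerance are assumptions for illustration. The relabeled transitions can feed the replay buffer of any off-policy algorithm, such as DDPG.

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, tol=0.05):
    # Binary, sparse reward: 0 on success, -1 otherwise
    # (the tolerance value is an assumption for illustration).
    return 0.0 if np.linalg.norm(np.asarray(achieved_goal)
                                 - np.asarray(desired_goal)) <= tol else -1.0

def her_relabel(episode, k=4, rng=None):
    """Sketch of HER's 'future' goal-relabeling strategy.

    episode: list of dicts with keys 'obs', 'action', 'achieved_goal',
             and 'desired_goal' (field names are illustrative).
    Returns additional relabeled transitions for the replay buffer of
    any off-policy RL algorithm, such as DDPG.
    """
    rng = rng or np.random.default_rng()
    relabeled = []
    for t, tr in enumerate(episode):
        # Sample k goals that were actually achieved later in the episode.
        for i in rng.integers(t, len(episode), size=k):
            new_goal = episode[i]['achieved_goal']
            relabeled.append({
                'obs': tr['obs'],
                'action': tr['action'],
                'desired_goal': new_goal,
                'reward': sparse_reward(tr['achieved_goal'], new_goal),
            })
    return relabeled
```

Because each relabeled transition is treated as if the achieved outcome had been the goal all along, failed episodes still produce successful (reward 0) experience, which is what makes the sparse binary reward workable.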



                          techniques, or to merge the analytical and AI approaches such that some functions are done analytically and the remainder are performed using AI techniques.



We then briefly presented RL and DRL before surveying previous work that applies both techniques to robot manipulation specifically. From this overview, it is clear that RL and DRL are not yet ready to offer off-the-shelf solutions for robotics. Although both techniques have evolved rapidly over the past few years, with a wide range of applications, a large gap remains between theory and practice. We believe that one of the core difficulties plaguing the RL/DRL research community is the discrepancy between what we intend to solve and what we actually solve in practice, together with the challenge of accurately explaining that difference and how it affects the solution.