
We cluster the trajectory data in the replay buffer in the representation space. For the per-step representations $e^i_h$ and $e^j_h$ ($h = 0, \ldots, H-1$) of two trajectories $\tau^i$ and $\tau^j$, we use the average Euclidean distance to measure the distance between them:

$$d_{i,j} = \frac{1}{H} \sum_{h=0}^{H-1} \left\lVert e^i_h - e^j_h \right\rVert \tag{6}$$
For all trajectories in the replay buffer, we can calculate the representation distance matrix by Equation (6). For trajectory representations of different lengths, truncation can be applied, or dynamic time warping (DTW) can be used instead of the Euclidean distance.
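As a concrete illustration, the distance matrix of Equation (6) can be computed as in the following sketch. It assumes each trajectory has already been encoded into an array of per-step representations truncated to a common length; the array names and shapes are our assumptions, not part of the paper.

```python
import numpy as np

def distance_matrix(reps):
    """Pairwise average Euclidean distance between trajectory representations.

    `reps` is assumed to have shape (N, H, d): N trajectories, each encoded
    as H per-step representations of dimension d.
    """
    n = reps.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Equation (6): mean of the step-wise Euclidean distances.
            d = np.linalg.norm(reps[i] - reps[j], axis=-1).mean()
            dist[i, j] = dist[j, i] = d
    return dist
```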

Since the number of opponent policies is unknown, clustering methods that require the number of clusters in advance, such as K-means, are not suitable. Instead, we use agglomerative clustering, as implemented in the standard algorithm library scikit-learn, to distinguish the trajectory representations in the replay buffer, with the clustering threshold set to the average distance over all trajectory representations. The labels of the trajectories collected while interacting with the opponents are then obtained in a self-supervised manner.
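A minimal sketch of this clustering step with scikit-learn is given below. The `average` linkage and the use of a precomputed distance matrix are our assumptions, since the paper only specifies agglomerative clustering with the mean pairwise distance as the threshold (older scikit-learn versions use `affinity=` instead of `metric=`).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_trajectories(dist):
    """Label trajectories from the precomputed distance matrix of Equation (6)."""
    # Threshold = mean distance over all pairs of trajectory representations.
    threshold = dist[np.triu_indices_from(dist, k=1)].mean()
    clustering = AgglomerativeClustering(
        n_clusters=None,               # number of opponent policies is unknown
        distance_threshold=threshold,  # average pairwise distance
        metric="precomputed",          # feed the matrix from Equation (6)
        linkage="average",             # assumed linkage; not stated in the paper
    )
    return clustering.fit_predict(dist)
```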

To balance the proportion of different types of data in the replay buffer, we no longer pop the oldest data when the replay buffer is full; instead, we pop the oldest data from the largest class according to the clustering results. This ensures a dynamic balance among the various types of data to a certain extent: even if a certain type of opponent policy appears with very low probability during some period, the data collected against it can maintain a certain proportion of the replay buffer, thereby avoiding policy forgetting. However, this approach lets some useless old data remain in the replay buffer for a long time, which reduces the training effect of reinforcement learning. We therefore introduce a probability threshold: the replay buffer pops the oldest data from the largest class with this probability and pops the oldest data from the entire replay buffer with the complementary probability, which allows data that hinder training to be popped. In this paper, we set the threshold to 0.9.
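The eviction rule can be sketched as follows. The buffer layout (a list of trajectories ordered oldest-first with a parallel list of cluster labels), the function name, and the parameter name `p` are hypothetical placeholders for the paper's notation.

```python
import random

def evict_one(buffer, labels, p=0.9):
    """Pop one trajectory when the replay buffer is full (illustrative sketch).

    With probability p, remove the oldest trajectory of the largest cluster;
    otherwise fall back to plain FIFO over the whole buffer.
    """
    if random.random() < p:
        # Largest class = most frequent cluster label currently in the buffer.
        largest = max(set(labels), key=labels.count)
        # First (i.e., oldest) trajectory belonging to that class.
        idx = next(i for i, lbl in enumerate(labels) if lbl == largest)
    else:
        idx = 0  # oldest trajectory in the entire buffer
    buffer.pop(idx)
    labels.pop(idx)
```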

3.4. Combining with reinforcement learning
This section describes the overall algorithm flow in combination with the classic reinforcement learning algorithm soft actor-critic (SAC), whose optimization goal is:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi} \big[\, r(\mathbf{s}_t, \mathbf{a}_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid \mathbf{s}_t)\big) \,\big] \tag{7}$$

where $\mathcal{H}(\pi(\cdot \mid \mathbf{s}_t))$ is the additional policy entropy added to encourage exploration and $\alpha$ is the temperature parameter that determines the relative importance of the entropy term. However, our method can be combined with any off-policy reinforcement learning algorithm.
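For reference, in a standard SAC implementation the entropy term of Equation (7) enters the critic's TD target. The following is a generic sketch of that computation, not the paper's exact code; the network interfaces (`policy.sample`, twin target critics) and hyperparameter values are assumptions.

```python
import torch

def soft_q_target(reward, next_obs, done, policy, target_q1, target_q2,
                  alpha=0.2, gamma=0.99):
    """Entropy-regularized TD target used by SAC critics (generic sketch)."""
    with torch.no_grad():
        next_action, next_log_prob = policy.sample(next_obs)
        # Clipped double-Q value plus the entropy bonus -alpha * log pi.
        q_next = torch.min(target_q1(next_obs, next_action),
                           target_q2(next_obs, next_action))
        soft_v = q_next - alpha * next_log_prob
        return reward + gamma * (1.0 - done) * soft_v
```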

Since representation learning trains much faster than reinforcement learning, we set a training frequency for it to balance their learning rates. In addition, this reflects the fact that, in the short term, the flow of data through the replay buffer does not change the data distribution in it. Considering that the clustering requires a large amount of extra computation, and that data newly added to the replay buffer will not be popped in the short term, we update the labels of the trajectory representations by clustering only once every fixed number of episodes.
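The resulting update schedule can be sketched as below. The constants and object interfaces are hypothetical placeholders, since the paper's symbols for the representation-training frequency and the clustering interval were not preserved in this extraction.

```python
# Hypothetical schedule constants (the paper's symbols and values differ).
REPR_TRAIN_EVERY = 10   # train the encoder once per this many RL updates
RECLUSTER_EVERY = 50    # refresh cluster labels every this many episodes

def update(step, episode, agent, encoder, buffer):
    """Interleave RL updates, representation learning, and re-clustering."""
    agent.update(buffer.sample())                         # off-policy RL (e.g., SAC)
    if step % REPR_TRAIN_EVERY == 0:
        encoder.update(buffer.sample_trajectories())      # representation learning
    if episode % RECLUSTER_EVERY == 0:
        buffer.recluster(encoder)                         # relabel trajectories by clustering
```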

The complete algorithm is described in Algorithm 1. The representation learning and reinforcement learning processes are trained alternately. Since the FIFO rule is still followed within each class, our method does not have much influence on the training of reinforcement learning; at the same time, the diversity of the data in the replay buffer is preserved as much as possible, so that the policy forgetting caused by the non-stationarity of the opponent policy is avoided.