We cluster the trajectory data in the replay buffer in the representation space. For the representations $e_i$ and $e_j$ of the trajectories $\tau_i$ and $\tau_j$, we use the average Euclidean distance to measure the distance between them:

$$d_{i,j} = \frac{1}{H} \sum_{h=0}^{H-1} \left\| e_{i,h} - e_{j,h} \right\| \tag{6}$$

where $H$ is the trajectory length and $e_{i,h}$ denotes the representation of trajectory $\tau_i$ at step $h$.
For all trajectories in the replay buffer, we can calculate the representation distance matrix with Equation (6). In addition, the truncation method can be used for trajectory representations of different lengths, or dynamic time warping (DTW) can be used in place of the Euclidean distance.
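As a concrete illustration, the following sketch computes the representation distance matrix of Equation (6); it assumes the per-step representations of each trajectory are stored as a NumPy array of shape (length, dim) and handles unequal lengths by simple truncation (both the data layout and the truncation rule are illustrative assumptions, not specifications from this paper):

```python
import numpy as np

def representation_distance_matrix(reps):
    """Pairwise average Euclidean distance between trajectory representations
    (Equation (6)). `reps` is a list of arrays of shape (length, dim); unequal
    lengths are handled here by truncating to the shorter trajectory."""
    n = len(reps)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            h = min(len(reps[i]), len(reps[j]))  # truncation for different lengths
            d = np.linalg.norm(reps[i][:h] - reps[j][:h], axis=1).mean()
            D[i, j] = D[j, i] = d
    return D
```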
Since the number of opponent policies is unknown, clustering methods that require the number of clusters to be specified in advance, such as K-means, are not suitable. We instead use agglomerative clustering, as implemented in the standard algorithm library scikit-learn, to distinguish the trajectory representations in the replay buffer, with the clustering threshold set to the average distance over all trajectory representations. The labels of the trajectories collected against the opponents are then obtained in a self-supervised manner.
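A possible implementation with scikit-learn's AgglomerativeClustering is sketched below; the use of a precomputed distance matrix and average linkage are our assumptions, while the threshold follows the rule above (the average pairwise distance):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_trajectories(D):
    """Cluster trajectories from their distance matrix D (Equation (6)) without
    knowing the number of opponent policies in advance."""
    threshold = D[np.triu_indices_from(D, k=1)].mean()  # average pairwise distance
    clustering = AgglomerativeClustering(
        n_clusters=None,                 # let the threshold decide the cluster count
        distance_threshold=threshold,
        metric="precomputed",            # named `affinity` in scikit-learn < 1.2
        linkage="average",
    )
    return clustering.fit_predict(D)     # one self-supervised label per trajectory
```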
To balance the proportions of the different types of data in the replay buffer, we no longer pop the oldest data when the replay buffer is full; instead, we pop the oldest data from the largest class according to the clustering results. This ensures, to a certain extent, a dynamic balance among the various types of data. Even if a certain type of opponent policy appears with a very low probability during some period, the data collected against it maintain a certain proportion in the replay buffer, thereby avoiding policy forgetting. However, this approach can leave useless old data in the replay buffer for a long time, which degrades reinforcement learning training. We therefore introduce a probability threshold $p$: the replay buffer pops the oldest data from the largest class with probability $p$ and pops the oldest data from the entire replay buffer with probability $1 - p$. This allows data that hinder training to eventually be popped. In this paper, we set $p = 0.9$.
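The eviction rule can be sketched as follows, assuming each buffer entry carries the cluster label assigned above (the data layout and function name are hypothetical, for illustration only):

```python
import random
from collections import Counter

def evict(buffer, p=0.9):
    """Pop one entry from a full replay buffer.

    `buffer` is a list of (trajectory, cluster_label) pairs kept in insertion
    order (oldest first). With probability p the oldest entry of the largest
    cluster is popped; otherwise plain FIFO is applied to the whole buffer."""
    if random.random() < p:
        counts = Counter(label for _, label in buffer)
        largest = counts.most_common(1)[0][0]
        idx = next(i for i, (_, label) in enumerate(buffer) if label == largest)
        return buffer.pop(idx)
    return buffer.pop(0)
```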
3.4. Combine with reinforcement learning
This section describes the overall algorithm flow in combination with the classic reinforcement learning algorithm soft actor–critic (SAC), whose optimization objective is:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi} \left[ r(\mathbf{s}_t, \mathbf{a}_t) + \alpha \, \mathcal{H}\!\left( \pi\left( \cdot \mid \mathbf{s}_t \right) \right) \right] \tag{7}$$

where $\mathcal{H}(\pi(\cdot \mid \mathbf{s}_t))$ is the additional policy entropy added to encourage exploration and $\alpha$ is the temperature parameter that determines the relative importance of the entropy term. However, our method can be combined with any off-policy reinforcement learning algorithm.
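For reference, a minimal PyTorch-style sketch of how the entropy term in Equation (7) enters the actor update is given below; the policy and critic interfaces are placeholders, and this is the standard SAC formulation rather than code from this paper:

```python
import torch

def sac_actor_loss(policy, q1, q2, states, alpha):
    """Standard entropy-regularized SAC actor loss: minimize
    E[ alpha * log pi(a|s) - min(Q1, Q2)(s, a) ], which maximizes
    Q(s, a) + alpha * H(pi(.|s))."""
    actions, log_probs = policy.sample(states)   # reparameterized sampling (assumed API)
    q = torch.min(q1(states, actions), q2(states, actions))
    return (alpha * log_probs - q).mean()
```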
Since the training speed of representation learning is much faster than that of reinforcement learning, we set a training frequency for it to balance their learning rates. This also reflects the fact that, in the short term, the flow of data through the replay buffer does not change its data distribution. Considering that clustering requires a large amount of extra computation, and that data newly added to the replay buffer will not be popped in the short term, we update the labels of the trajectory representations by clustering once every episode.
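Putting the pieces together, the schedule can be illustrated with the hypothetical loop below, which reuses the earlier sketches (representation_distance_matrix, cluster_trajectories); all environment, agent, encoder, and buffer interfaces are placeholders, not the paper's API:

```python
def train(env, agent, encoder, buffer, num_episodes,
          steps_per_episode, batch_size, repr_update_freq=4):
    """Illustrative outer loop: RL updates every step, representation learning
    at a reduced frequency, and re-clustering of the buffer once per episode."""
    for episode in range(num_episodes):
        trajectory = collect_episode(env, agent)       # interact with the current opponent
        buffer.add(trajectory)                         # eviction rule from the previous subsection

        for step in range(steps_per_episode):
            agent.update(buffer.sample(batch_size))    # off-policy RL update (e.g., SAC)
            if step % repr_update_freq == 0:           # slow down representation learning
                encoder.update(buffer.sample_trajectories(batch_size))

        # Refresh cluster labels once per episode (Equation (6) + agglomerative clustering).
        reps = [encoder.encode(t) for t in buffer.trajectories()]
        buffer.set_labels(cluster_trajectories(representation_distance_matrix(reps)))
```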
The complete algorithm is described in Algorithm 1. The training of representation learning and reinforcement learning proceeds alternately. Since the FIFO rule is still followed within each class, our method does not have too much influence on the training of reinforcement learning; at the same time, the diversity of the data in the replay buffer is preserved as much as possible, so that the policy forgetting caused by the non-stationarity of the opponent policy is avoided.