Figure 1. Overview of contrastive predictive coding (CPC), a representation extraction algorithm that contrasts positive and negative samples. The context $c_t$ and the subsequent state embeddings $\{z_{t+1}, z_{t+2}, \ldots, z_{H-1}\}$ are regarded as positive samples when they come from the same trajectory; otherwise, they are regarded as negative samples. By increasing the similarity between positive samples and reducing the similarity between negative samples, we obtain trajectory representations that distinguish different opponent policies.
maximize $f(z^{j}_{t+k}, c^{i}_{t})$ when $i = j$ and minimize $f(z^{j}_{t+k}, c^{i}_{t})$ when $i \neq j$. The InfoNCE loss is:
$$
\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{M(H-k-1)} \sum_{i=1}^{M} \sum_{t=1}^{H-k-1} \log \frac{\exp\big(f(z^{i}_{t+k}, c^{i}_{t})\big)}{\sum_{j=1}^{M} \exp\big(f(z^{j}_{t+k}, c^{i}_{t})\big)} \tag{5}
$$
where $k$ is randomly sampled within a suitable range, $M$ is the size of the trajectory set (batch size), and $H$ is the horizon. Optimizing this loss extracts a unique representation of each trajectory that distinguishes it from the others.
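For concreteness, a minimal PyTorch sketch of this loss is given below, assuming the encoder produces future-step embeddings `z` of shape (M, H, d), the autoregressive model produces contexts `c` of the same shape, and the score function $f$ is a bilinear product parameterized by a matrix `W`; the tensor layout, the bilinear score, and the single sampled offset `k` are illustrative assumptions rather than the exact implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, c, W, k):
    """Contrastive (InfoNCE) loss over a batch of M trajectories, cf. Equation (5).

    z: (M, H, d) future-step embeddings, one row per trajectory
    c: (M, H, d) context vectors from the autoregressive model
    W: (d, d)    bilinear weight defining the score f(z, c) = z^T W c (an assumption)
    k: int       prediction offset, sampled within a suitable range
    """
    M, H, d = z.shape
    loss = 0.0
    for t in range(H - k - 1):
        # scores[j, i] = f(z_{t+k}^j, c_t^i); positives lie on the diagonal (same trajectory, i == j).
        scores = z[:, t + k, :] @ W @ c[:, t, :].T
        labels = torch.arange(M, device=z.device)
        # Cross-entropy over the M candidate embeddings per context gives the -log softmax term in Eq. (5).
        loss = loss + F.cross_entropy(scores.T, labels)
    return loss / (H - k - 1)
```

The offset can be drawn once per update, e.g. `k = torch.randint(1, H - 1, (1,)).item()`, so that at least one prediction step remains inside the horizon.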
As described above, we extract policy representations from trajectories in a self-supervised manner through contrastive learning, so that different opponent policies can be discriminated in the representation space. In particular, the contrast between positive and negative samples makes the representation highlight the differences between trajectories, which benefits the subsequent clustering operations.
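As a simple illustration of how these representations could be used in a subsequent clustering step, the sketch below groups per-trajectory representations with k-means; pooling each trajectory into a single vector, the use of scikit-learn's KMeans, and a known number of opponent policies are assumptions made for this example only, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_opponent_policies(trajectory_reprs, n_policies):
    """Assign each trajectory representation to a presumed opponent-policy cluster.

    trajectory_reprs: (M, d) array with one learned representation per trajectory
    n_policies:       assumed number of distinct opponent policies
    """
    trajectory_reprs = np.asarray(trajectory_reprs)
    kmeans = KMeans(n_clusters=n_policies, n_init=10, random_state=0)
    labels = kmeans.fit_predict(trajectory_reprs)  # cluster id per trajectory
    return labels, kmeans.cluster_centers_
```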
3.3. Experience replay module
In Section 3.2, we introduced how to extract representations of opponent policies from trajectories collected by interacting with opponents. Unlike previous approaches that directly use the representations to assist training, we focus on another aspect: the impact of non-stationary opponents on experience replay.
Experience replay is a commonly used technique in reinforcement learning whose purpose is to improve sample efficiency. When the replay buffer is full, data are usually evicted in a first-in, first-out (FIFO) manner. When the opponent uses a fixed policy, the environment can be treated as a stationary MDP, and FIFO is feasible. When the opponent is non-stationary, however, the replay buffer will contain data collected against different types of opponent policies. A decrease in the proportion of a certain type of data reduces effectiveness against that kind of opponent, and the loss of old data may lead to forgetting of previously learned strategies. Therefore, we design a new data in-and-out mechanism that keeps as many types of trajectory data as possible in the replay buffer.
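To make the idea concrete, the following is a minimal sketch of one possible in-and-out rule under this constraint: the buffer keeps a bounded FIFO sub-buffer per opponent-policy type, so data of one type is only evicted by newer data of the same type. The equal per-type capacity split and the `policy_id` argument (e.g., the cluster label of the trajectory representation) are illustrative assumptions, not the exact mechanism proposed in the paper.

```python
import random
from collections import defaultdict, deque

class ClusterAwareReplayBuffer:
    """Replay buffer that retains trajectories from every opponent-policy type.

    Instead of one global FIFO queue, each policy type gets its own bounded
    FIFO sub-buffer, so rarely seen opponents are not crowded out of the buffer.
    """

    def __init__(self, capacity, n_policy_types):
        self.per_type = max(1, capacity // n_policy_types)  # equal split: a design assumption
        self.buffers = defaultdict(lambda: deque(maxlen=self.per_type))

    def add(self, trajectory, policy_id):
        # policy_id could come from clustering the trajectory's learned representation.
        self.buffers[policy_id].append(trajectory)

    def sample(self, batch_size):
        # Sample uniformly across all stored trajectories, regardless of policy type.
        pool = [traj for buf in self.buffers.values() for traj in buf]
        return random.sample(pool, min(batch_size, len(pool)))
```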