Figure 1. Overview of contrastive predictive coding (CPC), a representation extraction algorithm that contrasts positive and negative samples. The context $c_t$ and the subsequent state embeddings $\{z_{t+1}, z_{t+2}, \ldots, z_{H-1}\}$ are regarded as positive samples when they come from the same trajectory; otherwise, they are regarded as negative samples. By increasing the similarity between positive samples and reducing the similarity between negative samples, we obtain trajectory representations that distinguish different opponent policies.


                                                                 
                                  
maximize $f_k(z_{t+k}^{j}, c_t^{i})$ when $i = j$ and minimize $f_k(z_{t+k}^{j}, c_t^{i})$ when $i \neq j$. The InfoNCE loss is:

$$
\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{M(H-t-1)} \sum_{i=1}^{M} \sum_{k=1}^{H-t-1} \log \frac{\exp\left(f_k\left(z_{t+k}^{i}, c_t^{i}\right)\right)}{\sum_{j=1}^{M} \exp\left(f_k\left(z_{t+k}^{j}, c_t^{i}\right)\right)} \tag{5}
$$
where $t$ is randomly sampled within a suitable range, $M$ is the size of the trajectory set (batch size), and $H$ is the horizon. Optimizing this loss extracts a representation of each trajectory that is distinct from those of the other trajectories.
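To make Eq. (5) concrete, the following is a minimal sketch of the loss, assuming a PyTorch setup with a bilinear score function $f_k(z, c) = c^{\top} W_k z$ and batched context and future embeddings; the class and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BilinearScore(nn.Module):
    """Score function f_k(z, c) = c^T W_k z (an assumed form; any learnable
    similarity could be substituted)."""

    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)

    def forward(self, z, c):
        # z, c: (M, d) -> logits of shape (M, M) with logits[i, j] = f(z_j, c_i),
        # so positive pairs (same trajectory) sit on the diagonal i == j
        return c @ self.W @ z.t()


def info_nce_loss(context, future_z, scores):
    """context:  (M, d)    context vectors c_t^i, one per trajectory
    future_z: (K, M, d) embeddings z_{t+k}^i for k = 1..K, with K = H - t - 1
    scores:   list of K BilinearScore modules, one per prediction step k
    """
    K, M, _ = future_z.shape
    targets = torch.arange(M, device=context.device)
    loss = 0.0
    for k in range(K):
        logits = scores[k](future_z[k], context)  # (M, M)
        # Row-wise cross-entropy with the diagonal as target reproduces the
        # log-softmax term of Eq. (5); it already averages over the M rows,
        # so dividing by K below yields the 1 / (M (H - t - 1)) factor.
        loss = loss + F.cross_entropy(logits, targets)
    return loss / K
```

In each row of `logits`, the entry from the same trajectory is the positive and the other $M-1$ entries in the batch serve as negatives, matching the positive/negative construction in Figure 1.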
As described above, we extract policy representations from trajectories in a self-supervised manner through contrastive learning, which makes different opponent policies distinguishable in representation space. In particular, the contrast between positive and negative samples makes the representation highlight the differences between trajectories, which is beneficial for the subsequent clustering operations.
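As a simple illustration of such a downstream clustering step, the sketch below groups trajectory embeddings with k-means; `encode_trajectory` stands in for the trained encoder and `n_policies` for the (estimated) number of opponent policies, both of which are assumptions rather than the paper's stated interface.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_trajectories(trajectories, encode_trajectory, n_policies):
    # encode_trajectory: hypothetical handle to the trained CPC encoder,
    # mapping one trajectory to a fixed-size embedding vector
    embeddings = np.stack([encode_trajectory(tau) for tau in trajectories])  # (N, d)
    # k-means over the embedding space; n_policies is the assumed number of
    # distinct opponent policies
    labels = KMeans(n_clusters=n_policies, n_init=10).fit_predict(embeddings)
    return labels  # labels[i] = inferred opponent-policy cluster of trajectory i
```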


3.3. Experience replay module
In Section 3.2, we introduced how to extract representations of opponent policies from the trajectories collected by interacting with opponents. Unlike previous approaches that directly use the representations to assist training, we focus on another aspect: the impact of non-stationary opponents on experience replay. Experience replay is a commonly used technique in reinforcement learning whose purpose is to improve sample efficiency. When the replay buffer is full, the data are usually evicted in a first-in, first-out (FIFO) manner. When the opponent uses a fixed policy, the environment can be treated as a stationary MDP, and FIFO is feasible. When the opponent is non-stationary, however, the replay buffer holds data collected against different types of opponent policies. A decrease in the proportion of a certain type of data weakens performance against that type of opponent, and the loss of old data may lead to forgetting of previously learned strategies. Therefore, we design a new data in-and-out mechanism that keeps as many types of trajectory data as possible in the replay buffer.
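One plausible realization of such a mechanism is sketched below, assuming each trajectory is tagged with the cluster id of its inferred opponent policy (Section 3.2): the buffer is partitioned per cluster and, when full, evicts the oldest item from the largest partition, so that data for rare opponent types is retained. The eviction and sampling rules shown here are assumptions, not the paper's exact design.

```python
import random
from collections import defaultdict


class PartitionedReplayBuffer:
    """Replay buffer partitioned by inferred opponent-policy cluster."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.partitions = defaultdict(list)  # cluster id -> list of trajectories
        self.size = 0

    def add(self, trajectory, cluster_id):
        if self.size >= self.capacity:
            # Evict the oldest trajectory of the most represented opponent type,
            # rather than a global FIFO that could wipe out rare types.
            largest = max(self.partitions, key=lambda c: len(self.partitions[c]))
            self.partitions[largest].pop(0)
            if not self.partitions[largest]:
                del self.partitions[largest]
            self.size -= 1
        self.partitions[cluster_id].append(trajectory)
        self.size += 1

    def sample(self, batch_size):
        # Sample a cluster uniformly first, then a trajectory within it, to keep
        # training exposure balanced across opponent types.
        clusters = [c for c, trajs in self.partitions.items() if trajs]
        return [random.choice(self.partitions[random.choice(clusters)])
                for _ in range(batch_size)]
```

Sampling clusters uniformly before sampling trajectories keeps training exposure balanced across opponent types; a proportional or prioritized scheme could equally be substituted.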