

               3. METHOD
               3.1. Problem formulation
We describe the problem as a partially-observable stochastic game (POSG) [32] composed of a finite set $I = \{1, 2, \ldots, N\}$ of agents, a state space $S$, the joint action space $A = A^1 \times \cdots \times A^N$, the joint observation space $O = O^1 \times \cdots \times O^N$, a transition function $P : S \times A \times S \to [0, 1]$ denoting the transition probabilities between two states when given a joint action, and a reward function $R^i : S \times A \times S \to \mathbb{R}$ for each agent $i \in I$. Since we only focus on the performance of the agent under our control, we denote this agent by $1$, and the other agents are denoted by $-1$, with joint observation $o^{-1}$ and joint action $a^{-1}$.
We design a set of $n$ fixed policies for agent $-1$, $\Pi = \{\pi^{-1,j} \mid j = 1, \ldots, n\}$, which can be rule-based (hand-crafted) or network-based (pre-trained). Assume that an opponent policy $\mu$ is a probability distribution over $\Pi$, $\mu \in \Delta(\Pi)$, which means that at the beginning of each episode, agent $-1$ samples a policy from $\Pi$ according to $\mu$ and executes this policy throughout the episode. Different from the setting in previous work, where the opponent only switches between fixed policies, we allow more complex changes of the opponent policy, making the setting looser and more general.
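To illustrate this sampling scheme, here is a minimal Python sketch; the environment interface, the three stand-in policies, and the distribution values are assumptions for illustration, not details from the paper:

    import random

    # Stand-in opponent policy set Pi = {pi^{-1,j}}; each policy maps an
    # observation to an action. These concrete policies are purely illustrative.
    opponent_policies = [
        lambda obs: 0,   # e.g., a rule-based (hand-crafted) policy
        lambda obs: 1,   # e.g., a pre-trained network policy
        lambda obs: 2,   # e.g., another pre-trained network policy
    ]
    mu = [0.5, 0.3, 0.2]  # opponent distribution over Pi, unknown to agent 1

    def run_episode(env, agent_policy):
        # Agent -1 samples one policy from Pi according to mu at episode start
        # and executes it for the whole episode.
        opponent = random.choices(opponent_policies, weights=mu, k=1)[0]
        (obs_1, obs_opp), done, ep_return = env.reset(), False, 0.0
        while not done:
            a_1 = agent_policy(obs_1)    # action of the controlled agent 1
            a_opp = opponent(obs_opp)    # opponent action from the sampled policy
            (obs_1, obs_opp), r_1, done = env.step(a_1, a_opp)
            ep_return += r_1
        return ep_return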


Ideally, our goal is to find a general response policy $\pi_\theta$, parameterized by $\theta$, that maximizes the reward of agent $1$ against each opponent policy $\mu$. However, in practice, we maximize the minimum reward of $\pi_\theta$ against $\mu$:

$$\max_{\theta}\ \min_{\mu \in \Delta(\Pi)} \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^{t} r_{t}^{1}\right] \tag{1}$$

where $r_t^1$ is the reward of agent $1$ at step $t$ after performing the action $a_t^1$ determined by $\pi_\theta$, $T$ is the horizon of the episode, and $\gamma \in (0, 1)$ is a discount factor. It is also important to note that at any time, agent $1$ knows neither the policy type $\pi^{-1,j}$ of the opponent nor the distribution $\mu$. This gives us the motivation to infer the type of the opponent policy.
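To make the objective in Eq. (1) concrete, the following sketch estimates the worst-case discounted return of a candidate policy by Monte-Carlo rollouts; rollout is an assumed helper, and because the expectation is linear in $\mu$, the inner minimum over $\Delta(\Pi)$ is attained at one of the fixed policies in $\Pi$:

    def discounted_return(rewards, gamma=0.99):
        # sum_{t=0}^{T-1} gamma^t * r_t^1 for one episode of agent 1's rewards
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    def worst_case_value(agent_policy, opponent_policies, rollout, episodes=50):
        # rollout(agent_policy, opponent) is an assumed helper that plays one
        # episode against a fixed opponent and returns agent 1's reward sequence.
        # Since the expectation is linear in mu, the minimum over Delta(Pi) is
        # attained at a vertex, i.e., at one of the fixed policies in Pi.
        per_opponent = []
        for opponent in opponent_policies:
            returns = [discounted_return(rollout(agent_policy, opponent))
                       for _ in range(episodes)]
            per_opponent.append(sum(returns) / episodes)
        return min(per_opponent)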
               3.2. Representation extraction module
Different from previous opponent modeling methods that model the opponent's policies, we use a contrastive learning approach to distinguish, in a self-supervised manner, the trajectories generated against different opponent policies, so that we only use the opponent's observations. We denote a trajectory as $\tau = \{o_t^1, o_t^{-1}\}_{t=0}^{T-1}$, where $o_t^1$ and $o_t^{-1}$ are the observations of agent $1$ and agent $-1$ at step $t$, respectively. Given a set of trajectories $\mathcal{T} = \{\tau_1, \tau_2, \ldots, \tau_m\}$, the representation of each trajectory is extracted in a self-supervised way by the CPC [21] algorithm.
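In practice, each trajectory can be stored as an array of per-step observation pairs, so that the whole set $\mathcal{T}$ becomes one batch for the encoder described below; all shapes in this sketch are illustrative assumptions rather than values from the paper:

    import numpy as np

    T_steps, obs1_dim, obs_opp_dim = 64, 16, 16       # illustrative sizes
    m = 128                                           # number of trajectories in T

    # One trajectory tau: the pair (o_t^1, o_t^-1) is kept for every step t.
    tau = np.zeros((T_steps, obs1_dim + obs_opp_dim), dtype=np.float32)

    # Trajectory set T = {tau_1, ..., tau_m}: trajectories of equal length are
    # stacked into an (m, T, obs_dim) array that the CPC encoder consumes batch-wise.
    trajectory_set = np.zeros((m, T_steps, obs1_dim + obs_opp_dim), dtype=np.float32)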
Figure 1 shows the architecture of the contrastive predictive coding algorithm. First, we encode the observations with a multi-layer perceptron (MLP) to obtain a sequence of latent representations $z_t$:

$$z_t = \mathrm{MLP}(o_t^1, o_t^{-1}) \tag{2}$$
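A minimal PyTorch sketch of this encoder; the layer sizes are assumptions, since the paper does not give them here:

    import torch
    import torch.nn as nn

    class ObsEncoder(nn.Module):
        # z_t = MLP(o_t^1, o_t^-1): both observations are concatenated and mapped
        # to a latent vector z_t. All layer sizes are illustrative guesses.
        def __init__(self, obs1_dim=16, obs_opp_dim=16, z_dim=64):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(obs1_dim + obs_opp_dim, 128),
                nn.ReLU(),
                nn.Linear(128, z_dim),
            )

        def forward(self, obs_1, obs_opp):
            # obs_1: (batch, T, obs1_dim); obs_opp: (batch, T, obs_opp_dim)
            return self.mlp(torch.cat([obs_1, obs_opp], dim=-1))  # (batch, T, z_dim)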
                                                                 
                                                                    
Then, a gated recurrent unit (GRU) is used to extract the context information $c_t$ of the first $t$ steps:

$$c_t = \mathrm{GRU}(z_{0:t}) \tag{3}$$
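Continuing the same sketch, the context can be produced by a single-layer GRU run over the latent sequence (hidden sizes again assumed):

    import torch.nn as nn

    class ContextGRU(nn.Module):
        # c_t = GRU(z_{0:t}): each output step summarizes the latents up to step t.
        def __init__(self, z_dim=64, c_dim=64):
            super().__init__()
            self.gru = nn.GRU(input_size=z_dim, hidden_size=c_dim, batch_first=True)

        def forward(self, z_seq):
            # z_seq: (batch, T, z_dim) -> c_seq: (batch, T, c_dim);
            # with batch_first=True, c_seq[:, t] depends only on z_seq[:, :t+1].
            c_seq, _ = self.gru(z_seq)
            return c_seq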


In addition, we also need to define the similarity function $f_k$. To unify the dimensions, we use a bilinear product function:

$$f_k(z_{t+k}, c_t) = z_{t+k}^{\top} W_k c_t \tag{4}$$

where $k = 1, \ldots, T - t - 1$ and $W_k$ is different for each $k$. For a given set of trajectories $\mathcal{T} = \{\tau_1, \tau_2, \ldots, \tau_m\}$, $c_t$ and $z_{t+k}$ are calculated from $\tau_i$. Since representations extracted from the same trajectory have similarities, we treat $z_{t+k}$ and $c_t$ computed from the same trajectory as a positive pair, while representations taken from other trajectories serve as negative samples.
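A corresponding sketch of the bilinear score, with one assumed weight matrix $W_k$ per prediction step $k$, realized as a bias-free linear layer so that the score remains a pure bilinear form as in Eq. (4):

    import torch.nn as nn

    class BilinearScore(nn.Module):
        # f_k(z_{t+k}, c_t) = z_{t+k}^T W_k c_t, with a separate matrix W_k per k.
        def __init__(self, z_dim=64, c_dim=64, max_k=8):
            super().__init__()
            # W_k is a bias-free linear map from the context space to the latent
            # space, so the resulting score is a pure bilinear form.
            self.W = nn.ModuleList(
                [nn.Linear(c_dim, z_dim, bias=False) for _ in range(max_k)]
            )

        def forward(self, z_future, c_t, k):
            # z_future: (batch, z_dim) holding z_{t+k}; c_t: (batch, c_dim);
            # returns one scalar score per element of the batch.
            return (z_future * self.W[k - 1](c_t)).sum(dim=-1)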