

               3. METHOD
               3.1. Problem formulation
We describe the problem as a partially-observable stochastic game (POSG) [32] composed of a finite set $I = \{1, 2, \ldots, N\}$ of agents, a state space $S$, the joint action space $A = A^1 \times \cdots \times A^N$, the joint observation space $O = O^1 \times \cdots \times O^N$, a transition function $P : S \times A \times S \to [0, 1]$ denoting the transition probabilities between two states when given a joint action, and a reward function $R^i : S \times A \times S \to \mathbb{R}$ for each agent $i \in I$. Since we only focus on the performance of the agent under our control, we denote this agent by $1$, and the other agents are denoted by $-1$, with joint observation $o^{-1}$ and joint action $a^{-1}$.
We design a set of $n$ fixed policies for agent $-1$, $\Pi = \{\pi^{-1,j} \mid j = 1, \ldots, n\}$, which can be rule-based (hand-crafted) or network-based (pre-trained). Assume that an opponent policy $\mu$ is a probability distribution over $\Pi$, $\mu \in \Delta(\Pi)$, which means that at the beginning of each episode, agent $-1$ samples a policy from $\Pi$ according to $\mu$ and executes this policy throughout the episode. Different from the setting in previous work, where the opponent only switches between fixed policies, we allow more complex changes of the opponent policy, making the setting looser and more general.
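To illustrate this sampling scheme, here is a minimal Python sketch; the environment interface, the three stand-in policies, and the distribution values are assumptions for illustration, not details from the paper:

    import random

    # Stand-in opponent policy set Pi = {pi^{-1,j}}; each policy maps an
    # observation to an action. These concrete policies are purely illustrative.
    opponent_policies = [
        lambda obs: 0,   # e.g., a rule-based (hand-crafted) policy
        lambda obs: 1,   # e.g., a pre-trained network policy
        lambda obs: 2,   # e.g., another pre-trained network policy
    ]
    mu = [0.5, 0.3, 0.2]  # opponent distribution over Pi, unknown to agent 1

    def run_episode(env, agent_policy):
        # Agent -1 samples one policy from Pi according to mu at episode start
        # and executes it for the whole episode.
        opponent = random.choices(opponent_policies, weights=mu, k=1)[0]
        (obs_1, obs_opp), done, ep_return = env.reset(), False, 0.0
        while not done:
            a_1 = agent_policy(obs_1)    # action of the controlled agent 1
            a_opp = opponent(obs_opp)    # opponent action from the sampled policy
            (obs_1, obs_opp), r_1, done = env.step(a_1, a_opp)
            ep_return += r_1
        return ep_return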


Ideally, our goal is to find a general response policy $\pi_\theta$, parameterized by $\theta$, that maximizes the reward of agent $1$ against each opponent policy $\mu$. However, in practice, we maximize the minimum reward of $\pi_\theta$ against $\mu$:

$$\max_{\theta}\ \min_{\mu \in \Delta(\Pi)} \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^{t} r_{t}^{1}\right] \tag{1}$$

where $r_t^1$ is the reward of agent $1$ at step $t$ after performing the action $a_t^1$ determined by $\pi_\theta$, $T$ is the horizon of the episode, and $\gamma \in (0, 1)$ is a discount factor. It is also important to note that at any time, agent $1$ knows neither the policy type $\pi^{-1,j}$ of the opponent nor the distribution $\mu$. This gives us the motivation to infer the type of the opponent policy.
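To make the objective in Eq. (1) concrete, the following sketch estimates the worst-case discounted return of a candidate policy by Monte-Carlo rollouts; rollout is an assumed helper, and because the expectation is linear in $\mu$, the inner minimum over $\Delta(\Pi)$ is attained at one of the fixed policies in $\Pi$:

    def discounted_return(rewards, gamma=0.99):
        # sum_{t=0}^{T-1} gamma^t * r_t^1 for one episode of agent 1's rewards
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    def worst_case_value(agent_policy, opponent_policies, rollout, episodes=50):
        # rollout(agent_policy, opponent) is an assumed helper that plays one
        # episode against a fixed opponent and returns agent 1's reward sequence.
        # Since the expectation is linear in mu, the minimum over Delta(Pi) is
        # attained at a vertex, i.e., at one of the fixed policies in Pi.
        per_opponent = []
        for opponent in opponent_policies:
            returns = [discounted_return(rollout(agent_policy, opponent))
                       for _ in range(episodes)]
            per_opponent.append(sum(returns) / episodes)
        return min(per_opponent)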
               3.2. Representation extraction module
Different from previous opponent modeling methods that model the opponent's policies, we use a contrastive learning approach to distinguish, in a self-supervised manner, the trajectories generated against different opponent policies, so that we only use the opponent's observations. We denote a trajectory as $\tau = \{o_t^1, o_t^{-1}\}_{t=0}^{T-1}$, where $o_t^1$ and $o_t^{-1}$ are the observations of agent $1$ and agent $-1$ at step $t$, respectively. Given a set of trajectories $\mathcal{T} = \{\tau_1, \tau_2, \ldots, \tau_m\}$, the representation of each trajectory is extracted in a self-supervised way by the CPC [21] algorithm.
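In practice, each trajectory can be stored as an array of per-step observation pairs, so that the whole set $\mathcal{T}$ becomes one batch for the encoder described below; all shapes in this sketch are illustrative assumptions rather than values from the paper:

    import numpy as np

    T_steps, obs1_dim, obs_opp_dim = 64, 16, 16       # illustrative sizes
    m = 128                                           # number of trajectories in T

    # One trajectory tau: the pair (o_t^1, o_t^-1) is kept for every step t.
    tau = np.zeros((T_steps, obs1_dim + obs_opp_dim), dtype=np.float32)

    # Trajectory set T = {tau_1, ..., tau_m}: trajectories of equal length are
    # stacked into an (m, T, obs_dim) array that the CPC encoder consumes batch-wise.
    trajectory_set = np.zeros((m, T_steps, obs1_dim + obs_opp_dim), dtype=np.float32)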
Figure 1 shows the architecture of the contrastive predictive coding algorithm. First, we encode the observations with a multi-layer perceptron (MLP) to obtain a sequence of latent representations $z_t$:

$$z_t = \mathrm{MLP}(o_t^1, o_t^{-1}) \tag{2}$$
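A minimal PyTorch sketch of this encoder; the layer sizes are assumptions, since the paper does not give them here:

    import torch
    import torch.nn as nn

    class ObsEncoder(nn.Module):
        # z_t = MLP(o_t^1, o_t^-1): both observations are concatenated and mapped
        # to a latent vector z_t. All layer sizes are illustrative guesses.
        def __init__(self, obs1_dim=16, obs_opp_dim=16, z_dim=64):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(obs1_dim + obs_opp_dim, 128),
                nn.ReLU(),
                nn.Linear(128, z_dim),
            )

        def forward(self, obs_1, obs_opp):
            # obs_1: (batch, T, obs1_dim); obs_opp: (batch, T, obs_opp_dim)
            return self.mlp(torch.cat([obs_1, obs_opp], dim=-1))  # (batch, T, z_dim)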
                                                                 
                                                                    
Then, a gated recurrent unit (GRU) is used to extract the context information $c_t$ of the first $t$ steps:

$$c_t = \mathrm{GRU}(z_{0:t}) \tag{3}$$
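Continuing the same sketch, the context can be produced by a single-layer GRU run over the latent sequence (hidden sizes again assumed):

    import torch.nn as nn

    class ContextGRU(nn.Module):
        # c_t = GRU(z_{0:t}): each output step summarizes the latents up to step t.
        def __init__(self, z_dim=64, c_dim=64):
            super().__init__()
            self.gru = nn.GRU(input_size=z_dim, hidden_size=c_dim, batch_first=True)

        def forward(self, z_seq):
            # z_seq: (batch, T, z_dim) -> c_seq: (batch, T, c_dim);
            # with batch_first=True, c_seq[:, t] depends only on z_seq[:, :t+1].
            c_seq, _ = self.gru(z_seq)
            return c_seq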


In addition, we also need to define the similarity function $f_k$. To unify the dimensions, we use a bilinear product function:

$$f_k(z_{t+k}, c_t) = z_{t+k}^{\top} W_k c_t \tag{4}$$

where $k = 1, \ldots, T - t - 1$ and $W_k$ is different for each $k$. For a given set of trajectories $\mathcal{T} = \{\tau_1, \tau_2, \ldots, \tau_m\}$, $c_t$ and $z_{t+k}$ are calculated from $\tau_i$. Since representations extracted from the same trajectory have similarities, we treat $z_{t+k}$ and $c_t$ computed from the same trajectory as a positive pair, while representations taken from other trajectories serve as negative samples.
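A corresponding sketch of the bilinear score, with one assumed weight matrix $W_k$ per prediction step $k$, realized as a bias-free linear layer so that the score remains a pure bilinear form as in Eq. (4):

    import torch.nn as nn

    class BilinearScore(nn.Module):
        # f_k(z_{t+k}, c_t) = z_{t+k}^T W_k c_t, with a separate matrix W_k per k.
        def __init__(self, z_dim=64, c_dim=64, max_k=8):
            super().__init__()
            # W_k is a bias-free linear map from the context space to the latent
            # space, so the resulting score is a pure bilinear form.
            self.W = nn.ModuleList(
                [nn.Linear(c_dim, z_dim, bias=False) for _ in range(max_k)]
            )

        def forward(self, z_future, c_t, k):
            # z_future: (batch, z_dim) holding z_{t+k}; c_t: (batch, c_dim);
            # returns one scalar score per element of the batch.
            return (z_future * self.W[k - 1](c_t)).sum(dim=-1)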