
               modeling and implicit modeling.


               2.1.1. Implicit opponent modeling
Implicit opponent modeling generally refers to extracting representations from opponent information to aid training. He et al. first used the opponent's observation together with the agent's observation as a merged input to a deep network to train the agent end-to-end, and pointed out that information such as the opponent's policy type can be used to assist RL training [13]. Subsequently, Hong et al. additionally used the opponent's actions to fit the opponent policy with a neural network, and then multiplied the hidden-layer output of the opponent policy network with the hidden-layer output of the Q network to compute the Q value [14]. Considering that the opponent may itself be learning, Foerster et al. maximized the agent's reward by estimating the parameters of the opponent policy network based on the idea of recursive reasoning [16]. Raileanu et al. treated the parameters of the opponent policy network from another perspective, using the agent's own policy to make decisions from the opponent's observation so as to infer the opponent's goal and achieve better performance [15]. Because these algorithms make different assumptions about the opponent, their performance is difficult to compare directly.
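
As an illustration of the multiplicative fusion described for Hong et al. [14], the following minimal sketch (not the cited implementation; all layer sizes and names such as `OBS_DIM` and `HID` are assumptions) shows an auxiliary head that fits the opponent policy while its hidden features modulate the Q-network features:

```python
# Illustrative sketch of an opponent-aware Q network in the spirit of Hong et al. [14].
# An auxiliary head predicts the opponent's action from the opponent's observation,
# and the two hidden layers are multiplied elementwise before computing Q values.
import torch
import torch.nn as nn

OBS_DIM, OPP_OBS_DIM, N_ACTIONS, N_OPP_ACTIONS, HID = 16, 16, 4, 4, 64  # assumed sizes

class OpponentAwareQNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.agent_enc = nn.Sequential(nn.Linear(OBS_DIM, HID), nn.ReLU())
        self.opp_enc = nn.Sequential(nn.Linear(OPP_OBS_DIM, HID), nn.ReLU())
        self.opp_policy_head = nn.Linear(HID, N_OPP_ACTIONS)  # fitted to observed opponent actions
        self.q_head = nn.Linear(HID, N_ACTIONS)

    def forward(self, obs, opp_obs):
        h_q = self.agent_enc(obs)                 # hidden features of the Q branch
        h_opp = self.opp_enc(opp_obs)             # hidden features of the opponent-policy branch
        opp_logits = self.opp_policy_head(h_opp)  # auxiliary opponent-policy prediction
        q_values = self.q_head(h_q * h_opp)       # multiplicative fusion of the two hidden layers
        return q_values, opp_logits

net = OpponentAwareQNet()
q, opp_logits = net(torch.randn(8, OBS_DIM), torch.randn(8, OPP_OBS_DIM))
```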


               2.1.2. Explicit opponent modeling
Explicit opponent modeling generally refers to explicitly modeling opponent policies, dividing opponents into types, and detecting and responding to them online during the interaction process. Rosman et al. first proposed Bayesian policy reuse (BPR) for multi-task learning: a belief over tasks is maintained via the Bayesian formula, the task type is judged, and the optimal response policy is chosen for unknown tasks [27]. Since then, Hernandez-Leal et al. have extended the setting to multi-agent systems, used MDPs to model opponents, and added a detection mechanism for unknown opponent policies [10]. Facing more complex environments, Zheng et al. used neural networks to model opponents and a rectified belief model (RBM) to make opponent detection more accurate and rapid, together with policy distillation to reduce the size of the network [11]. On this basis, Yang et al. introduced theory of mind [28] to defeat opponents that themselves use opponent modeling, by reasoning at a higher decision-making level [12].
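
The belief maintenance behind this line of work can be sketched as follows (an illustrative reading of the BPR-style update [27,10], not the cited algorithms; the performance model and observed signal are assumptions):

```python
# Illustrative Bayesian belief update over known opponent types, BPR-style.
# belief is updated from an observed signal (e.g., episode reward), and the
# response policy with the highest expected utility under the belief is chosen.
import numpy as np

def update_belief(belief, likelihoods):
    """Bayes rule: posterior is proportional to likelihood * prior, renormalised."""
    posterior = belief * likelihoods
    return posterior / posterior.sum()

def select_policy(belief, utility):
    """utility[i, j] = expected return of response policy j against opponent type i."""
    return int(np.argmax(belief @ utility))

belief = np.ones(3) / 3                     # uniform prior over 3 known opponent types (assumed)
utility = np.array([[1.0, 0.2, 0.1],
                    [0.3, 0.9, 0.2],
                    [0.1, 0.2, 0.8]])       # assumed performance model
likelihoods = np.array([0.7, 0.2, 0.1])     # P(observed signal | opponent type), assumed
belief = update_belief(belief, likelihoods)
best_policy = select_policy(belief, utility)
```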

               2.2. Contrastive learning
Contrastive learning, the most popular self-supervised learning algorithm in recent years, differs from generative encoding algorithms: it focuses on learning features shared by similar instances while distinguishing dissimilar instances. van den Oord et al. first proposed the InfoNCE loss, which encodes time-series data; by separating positive and negative samples, it extracts data-specific representations [21]. Based on a similar idea, He et al. achieved high performance in image classification by increasing the similarity between a query vector and its corresponding key vector while reducing its similarity to the key vectors of other images [23]. From the perspective of data augmentation, Chen et al. applied random cropping, flipping, grayscale conversion, and other transformations to images and extracted the transformation-invariant representation behind each image through contrastive learning [22]. A subsequent series of works [29–31] continued with further improvements, and the performance on some tasks is now close to that of supervised learning algorithms.
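
The objective shared by these methods can be written as a minimal InfoNCE sketch (illustrative only; the temperature, batch size, and embedding dimension are assumptions): for each query, the matching key is the positive and all other keys in the batch serve as negatives.

```python
# Illustrative InfoNCE loss as used in [21,23]: maximise the softmax score of each
# query's matching (positive) key against the other keys in the batch.
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.1):
    q = F.normalize(queries, dim=1)
    k = F.normalize(keys, dim=1)
    logits = q @ k.t() / temperature      # pairwise similarities, shape (B, B)
    labels = torch.arange(q.size(0))      # the positive key for query i is key i
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```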


The above works show that most previous opponent modeling approaches feed additional representations into the neural network for policy training. This paper offers another perspective: a general policy that responds to various opponents is trained by balancing the data in the replay buffer collected from interactions with different opponent policies. Using the powerful representation extraction ability of contrastive learning, we distinguish the various opponent policies at the representation level. It is worth noting that we additionally use only opponent observations, which is a looser setting than that of other work in multi-agent settings.
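
The balancing idea can be read as follows (a minimal sketch of our interpretation of the text, not the implementation used in this paper; it assumes each transition is tagged with the index of the opponent policy it was collected against):

```python
# Illustrative replay buffer that keeps one sub-buffer per opponent policy and
# samples an (approximately) equal share from each, so that no single opponent
# dominates the training data.
import random
from collections import defaultdict

class BalancedReplayBuffer:
    def __init__(self, capacity_per_opponent=10000):
        self.buffers = defaultdict(list)
        self.capacity = capacity_per_opponent

    def add(self, opponent_id, transition):
        buf = self.buffers[opponent_id]
        buf.append(transition)
        if len(buf) > self.capacity:
            buf.pop(0)  # drop the oldest transition for this opponent

    def sample(self, batch_size):
        """Draw roughly batch_size transitions, split evenly across opponent policies."""
        per_opp = max(1, batch_size // max(1, len(self.buffers)))
        batch = []
        for buf in self.buffers.values():
            batch.extend(random.sample(buf, min(per_opp, len(buf))))
        return batch
```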