
               modeling and implicit modeling.


               2.1.1. Implicit opponent modeling
Implicit opponent modeling generally refers to extracting representations from opponent information to aid training. He et al. first used the opponent's observation together with the agent's observation as a merged input to a deep network to train the agent end-to-end, and pointed out that information such as the opponent's policy type can be used to assist RL training [13]. Subsequently, Hong et al. additionally used the opponent's actions to fit the opponent policy with a neural network, and then multiplied the hidden-layer output of the opponent policy network with the hidden-layer output of the Q network to compute the Q value [14]. Considering that the opponent may itself be learning, Foerster et al. maximized the agent's reward by estimating the parameters of the opponent policy network based on the idea of recursive reasoning [16]. Raileanu et al. treated the parameters of the opponent policy network from another perspective, using the agent's own policy to make decisions from the opponent's observation so as to infer the opponent's goal and achieve better performance [15]. Because these algorithms make different assumptions about the opponent, their performance is difficult to compare directly.
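
As an illustration of the multiplicative fusion described for Hong et al. [14], the following minimal sketch (not the cited implementation; all layer sizes and names such as `OBS_DIM` and `HID` are assumptions) shows an auxiliary head that fits the opponent policy while its hidden features modulate the Q-network features:

```python
# Illustrative sketch of an opponent-aware Q network in the spirit of Hong et al. [14].
# An auxiliary head predicts the opponent's action from the opponent's observation,
# and the two hidden layers are multiplied elementwise before computing Q values.
import torch
import torch.nn as nn

OBS_DIM, OPP_OBS_DIM, N_ACTIONS, N_OPP_ACTIONS, HID = 16, 16, 4, 4, 64  # assumed sizes

class OpponentAwareQNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.agent_enc = nn.Sequential(nn.Linear(OBS_DIM, HID), nn.ReLU())
        self.opp_enc = nn.Sequential(nn.Linear(OPP_OBS_DIM, HID), nn.ReLU())
        self.opp_policy_head = nn.Linear(HID, N_OPP_ACTIONS)  # fitted to observed opponent actions
        self.q_head = nn.Linear(HID, N_ACTIONS)

    def forward(self, obs, opp_obs):
        h_q = self.agent_enc(obs)                 # hidden features of the Q branch
        h_opp = self.opp_enc(opp_obs)             # hidden features of the opponent-policy branch
        opp_logits = self.opp_policy_head(h_opp)  # auxiliary opponent-policy prediction
        q_values = self.q_head(h_q * h_opp)       # multiplicative fusion of the two hidden layers
        return q_values, opp_logits

net = OpponentAwareQNet()
q, opp_logits = net(torch.randn(8, OBS_DIM), torch.randn(8, OPP_OBS_DIM))
```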


               2.1.2. Explicit opponent modeling
Explicit opponent modeling generally refers to explicitly modeling opponent policies, dividing opponents into types, and detecting and responding to them online during the interaction process. Rosman et al. first proposed Bayesian policy reuse (BPR) for multi-task learning: a belief over tasks is maintained via the Bayesian formula, the task type is judged, and the optimal response policy is chosen for unknown tasks [27]. Since then, Hernandez-Leal et al. have extended the setting to multi-agent systems, used MDPs to model opponents, and added a detection mechanism for unknown opponent policies [10]. Facing more complex environments, Zheng et al. used neural networks to model opponents and a rectified belief model (RBM) to make opponent detection more accurate and rapid, together with policy distillation to reduce the size of the network [11]. On this basis, Yang et al. introduced theory of mind [28] to defeat opponents that themselves use opponent modeling, by reasoning at a higher decision-making level [12].
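
The belief maintenance behind this line of work can be sketched as follows (an illustrative reading of the BPR-style update [27,10], not the cited algorithms; the performance model and observed signal are assumptions):

```python
# Illustrative Bayesian belief update over known opponent types, BPR-style.
# belief is updated from an observed signal (e.g., episode reward), and the
# response policy with the highest expected utility under the belief is chosen.
import numpy as np

def update_belief(belief, likelihoods):
    """Bayes rule: posterior is proportional to likelihood * prior, renormalised."""
    posterior = belief * likelihoods
    return posterior / posterior.sum()

def select_policy(belief, utility):
    """utility[i, j] = expected return of response policy j against opponent type i."""
    return int(np.argmax(belief @ utility))

belief = np.ones(3) / 3                     # uniform prior over 3 known opponent types (assumed)
utility = np.array([[1.0, 0.2, 0.1],
                    [0.3, 0.9, 0.2],
                    [0.1, 0.2, 0.8]])       # assumed performance model
likelihoods = np.array([0.7, 0.2, 0.1])     # P(observed signal | opponent type), assumed
belief = update_belief(belief, likelihoods)
best_policy = select_policy(belief, utility)
```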

               2.2. Contrastive learning
Contrastive learning, the most popular self-supervised learning algorithm in recent years, differs from generative encoding algorithms: it focuses on learning features shared by similar instances while distinguishing dissimilar instances. van den Oord et al. first proposed the InfoNCE loss, which encodes time-series data; by separating positive and negative samples, it extracts data-specific representations [21]. Based on a similar idea, He et al. achieved high performance in image classification by increasing the similarity between a query vector and its corresponding key vector while reducing its similarity to the key vectors of other images [23]. From the perspective of data augmentation, Chen et al. applied random cropping, flipping, grayscale conversion, and other transformations to images and extracted the transformation-invariant representation behind each image through contrastive learning [22]. A subsequent series of works [29–31] continued with further improvements, and the performance on some tasks is now close to that of supervised learning algorithms.
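
The objective shared by these methods can be written as a minimal InfoNCE sketch (illustrative only; the temperature, batch size, and embedding dimension are assumptions): for each query, the matching key is the positive and all other keys in the batch serve as negatives.

```python
# Illustrative InfoNCE loss as used in [21,23]: maximise the softmax score of each
# query's matching (positive) key against the other keys in the batch.
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.1):
    q = F.normalize(queries, dim=1)
    k = F.normalize(keys, dim=1)
    logits = q @ k.t() / temperature      # pairwise similarities, shape (B, B)
    labels = torch.arange(q.size(0))      # the positive key for query i is key i
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```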


The above works show that most previous opponent modeling approaches feed additional representations into the neural network for policy training. This paper offers another perspective: a general policy that responds to various opponents is trained by balancing the data in the replay buffer collected from interactions with different opponent policies. Using the powerful representation extraction ability of contrastive learning, we distinguish the various opponent policies at the representation level. It is worth noting that we additionally use only opponent observations, which is a looser setting than that of other work in multi-agent settings.
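
The balancing idea can be read as follows (a minimal sketch of our interpretation of the text, not the implementation used in this paper; it assumes each transition is tagged with the index of the opponent policy it was collected against):

```python
# Illustrative replay buffer that keeps one sub-buffer per opponent policy and
# samples an (approximately) equal share from each, so that no single opponent
# dominates the training data.
import random
from collections import defaultdict

class BalancedReplayBuffer:
    def __init__(self, capacity_per_opponent=10000):
        self.buffers = defaultdict(list)
        self.capacity = capacity_per_opponent

    def add(self, opponent_id, transition):
        buf = self.buffers[opponent_id]
        buf.append(transition)
        if len(buf) > self.capacity:
            buf.pop(0)  # drop the oldest transition for this opponent

    def sample(self, batch_size):
        """Draw roughly batch_size transitions, split evenly across opponent policies."""
        per_opp = max(1, batch_size // max(1, len(self.buffers)))
        batch = []
        for buf in self.buffers.values():
            batch.extend(random.sample(buf, min(per_opp, len(buf))))
        return batch
```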