
learning algorithm [6–8]. A common class of ideas is to introduce additional information to aid training by
modeling other agents, i.e., opponent modeling [4,9].

Opponent modeling is a common idea in the MARL domain, and many works approach it from different points of view,
such as explicitly representing the opponent's policies through neural networks to train optimal response
policies [10–12] or implicitly learning representations of the opponent policy to assist training [13–16]. Since the
goal of the agent under our control is to maximize its local reward, the other agents are viewed collectively as an
opponent, although "opponent" does not always imply a fully competitive environment. However, existing
opponent modeling methods, whether explicit or implicit, assume that the opponent uses a fixed policy or switches
between fixed policies, which is not suitable for most real-world situations. Therefore, we further model the
opponent policy as a probability distribution, so as to learn a general policy that can deal with all
kinds of opponents, which requires additional consideration of policy forgetting.

Specifically, when the opponent policy changes, the data in the replay buffer [17] are gradually replaced by the
interaction trajectories with the new opponent policy, so that the agent's response policy converges to deal with
the new opponent policy. However, at the same time, the agent may forget the response policies it has learned
before because the previous interaction data are lost; therefore, it still needs to re-learn when some opponent
policies appear again, which greatly reduces the response efficiency.
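To make this mechanism concrete, the following is a minimal sketch (not the paper's implementation) of a standard first-in-first-out replay buffer: once the buffer is full, every trajectory collected against the new opponent evicts one collected against an earlier opponent, which is exactly how the previously learned response policies get overwritten.

```python
import random
from collections import deque

class FIFOReplayBuffer:
    """Plain first-in-first-out trajectory buffer (illustrative only)."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest trajectories drop out first

    def add(self, trajectory):
        # When full, appending silently discards the oldest trajectory --
        # typically one gathered against a previously seen opponent policy.
        self.buffer.append(trajectory)

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```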


We believe that the main reason for this type of policy forgetting is that not enough trajectories of interactions
with the various opponent policies are kept in the replay buffer. Thus, this paper uses the
idea of data balancing [18,19] to preserve, as much as possible, the diversity of trajectories interacting with various
opponent policies in the replay buffer. Data balancing is widely used in continual learning [20] to solve
catastrophic forgetting problems. However, in most continual learning settings, task IDs are given to dis-
tinguish between different tasks, whereas we do not know the types of opponent policies. Thus, to distinguish the vari-
ous trajectories, we extract policy representations from interaction trajectories in a self-supervised manner via contrastive
learning [21–24] and cluster them at the representation level. Our proposed method, trajectory representation
clustering (TRC), can be combined with any existing reinforcement learning (RL) algorithm to avoid policy
forgetting in non-stationary multi-agent environments.
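As a hedged illustration of the data-balancing idea (the names and structure below are ours, not the authors' code), the sketch keeps a separate quota for each cluster of trajectory representations, so that filling the buffer with a new opponent type cannot erase the trajectories of previously seen types. The encoder and the clustering step that produce the cluster id are not shown here.

```python
import random
from collections import defaultdict, deque

class ClusterBalancedBuffer:
    """Per-cluster trajectory storage, keyed by a cluster id obtained from
    the learned trajectory representations (encoder/clustering not shown)."""

    def __init__(self, capacity_per_cluster: int):
        self.per_cluster = defaultdict(lambda: deque(maxlen=capacity_per_cluster))

    def add(self, trajectory, cluster_id: int):
        # Eviction happens only inside the trajectory's own cluster, so new
        # opponent data cannot crowd out trajectories of rarer opponent types.
        self.per_cluster[cluster_id].append(trajectory)

    def sample(self, batch_size: int):
        # Pick a cluster uniformly, then a trajectory inside the chosen cluster,
        # keeping minority opponent types represented in every training batch.
        clusters = [c for c in self.per_cluster.values() if c]
        return [random.choice(random.choice(clusters)) for _ in range(batch_size)]
```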

The contributions of this paper can be summarized as follows: (1) Interaction trajectories are encoded in a self-supervised
manner through a contrastive learning algorithm, so that different opponent policies can be more accurately
represented and distinguished in the representation space; no additional information is required beyond the
opponent observations. (2) From the perspective of balancing data types, we deliberately retain the data types
that account for a small proportion of the replay buffer, to avoid catastrophic policy forgetting.
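For contribution (1), a generic InfoNCE-style contrastive loss over trajectory embeddings is sketched below. It assumes two encoded views z1 and z2 of each trajectory in a batch; this is a common formulation of contrastive learning, not necessarily the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """InfoNCE loss: matching views of the same trajectory are positives
    (the diagonal of the similarity matrix); all other pairs act as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature               # [batch, batch] similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```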


The rest of this paper is organized as follows. The related work on opponent modeling and contrastive learning
is discussed in Section 2. Section 3 details the network architecture, loss function, and algorithm flow.
Then, experiments based on the classic soccer environment are presented in Section 4 to verify the performance
of our method. Finally, conclusions and future work are given in Section 5.



               2. RELATED WORK
               2.1. Opponent modeling
Opponent modeling stems from the naive motivation of inferring the opponent's policy and behavior from
information about the opponent in order to obtain a higher reward. Early opponent modeling work [25,26]
mainly focused on simple game scenarios where the opponent policy is fixed. With the development of deep
reinforcement learning, scholars have begun to apply the idea of opponent modeling in more complex environ-
ments and settings. The following introduces recent opponent modeling work in terms of explicit