
learning algorithm [6–8]. A common class of ideas is to introduce additional information to aid training by
modeling other agents, i.e., opponent modeling [4,9].

Opponent modeling is a common idea in the MARL domain, and many works approach it from different points of view,
such as explicitly representing the opponent's policies through neural networks to train optimal response
policies [10–12] or implicitly learning representations of the opponent policy to assist training [13–16]. Since the
goal of the agent under our control is to maximize its local reward, the other agents are viewed collectively as an
opponent, although "opponent" does not always imply a fully competitive environment. However, existing
opponent modeling methods, whether explicit or implicit, assume that the opponent uses a fixed policy or switches
between fixed policies, which is not suitable for most real-world situations. Therefore, we further model the
opponent policy as a probability distribution, so as to learn a general policy that can deal with all
kinds of opponents, which requires additional consideration of policy forgetting.

Specifically, when the opponent policy changes, the data in the replay buffer [17] are gradually replaced by the
interaction trajectories with the new opponent policy, so that the agent's response policy converges to deal with
the new opponent policy. However, at the same time, the agent may forget the response policies it has learned
before because the previous interaction data are lost; therefore, it still needs to re-learn when some opponent
policies appear again, which greatly reduces the response efficiency.
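To make this mechanism concrete, the following is a minimal sketch (not the paper's implementation) of a standard first-in-first-out replay buffer: once the buffer is full, every trajectory collected against the new opponent evicts one collected against an earlier opponent, which is exactly how the previously learned response policies get overwritten.

```python
import random
from collections import deque

class FIFOReplayBuffer:
    """Plain first-in-first-out trajectory buffer (illustrative only)."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest trajectories drop out first

    def add(self, trajectory):
        # When full, appending silently discards the oldest trajectory --
        # typically one gathered against a previously seen opponent policy.
        self.buffer.append(trajectory)

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```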


We believe that the main reason for this type of policy forgetting is that not enough trajectories of interactions
with the various opponent policies are kept in the replay buffer. Thus, this paper uses the
idea of data balancing [18,19] to preserve, as much as possible, the diversity of trajectories interacting with various
opponent policies in the replay buffer. Data balancing is widely used in continual learning [20] to solve
catastrophic forgetting problems. However, in most continual learning settings, task IDs are given to dis-
tinguish between different tasks, whereas we do not know the types of opponent policies. Thus, to distinguish the vari-
ous trajectories, we extract policy representations from interaction trajectories in a self-supervised manner via contrastive
learning [21–24] and cluster them at the representation level. Our proposed method, trajectory representation
clustering (TRC), can be combined with any existing reinforcement learning (RL) algorithm to avoid policy
forgetting in non-stationary multi-agent environments.
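As a hedged illustration of the data-balancing idea (the names and structure below are ours, not the authors' code), the sketch keeps a separate quota for each cluster of trajectory representations, so that filling the buffer with a new opponent type cannot erase the trajectories of previously seen types. The encoder and the clustering step that produce the cluster id are not shown here.

```python
import random
from collections import defaultdict, deque

class ClusterBalancedBuffer:
    """Per-cluster trajectory storage, keyed by a cluster id obtained from
    the learned trajectory representations (encoder/clustering not shown)."""

    def __init__(self, capacity_per_cluster: int):
        self.per_cluster = defaultdict(lambda: deque(maxlen=capacity_per_cluster))

    def add(self, trajectory, cluster_id: int):
        # Eviction happens only inside the trajectory's own cluster, so new
        # opponent data cannot crowd out trajectories of rarer opponent types.
        self.per_cluster[cluster_id].append(trajectory)

    def sample(self, batch_size: int):
        # Pick a cluster uniformly, then a trajectory inside the chosen cluster,
        # keeping minority opponent types represented in every training batch.
        clusters = [c for c in self.per_cluster.values() if c]
        return [random.choice(random.choice(clusters)) for _ in range(batch_size)]
```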

The contributions of this paper can be summarized as follows: (1) Interaction trajectories are encoded in a self-supervised
manner through a contrastive learning algorithm, so that different opponent policies can be more accurately
represented and distinguished in the representation space; no additional information is required beyond the
opponent observations. (2) From the perspective of balancing data types, we deliberately retain the data types
that account for a small proportion of the replay buffer, to avoid catastrophic policy forgetting.
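For contribution (1), a generic InfoNCE-style contrastive loss over trajectory embeddings is sketched below. It assumes two encoded views z1 and z2 of each trajectory in a batch; this is a common formulation of contrastive learning, not necessarily the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """InfoNCE loss: matching views of the same trajectory are positives
    (the diagonal of the similarity matrix); all other pairs act as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature               # [batch, batch] similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```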


The rest of this paper is organized as follows. The related work on opponent modeling and contrastive learning
is discussed in Section 2. Section 3 details the network architecture, loss function, and algorithm flow.
Then, experiments based on the classic soccer environment are presented in Section 4 to verify the performance
of our method. Finally, conclusions and future work are given in Section 5.



               2. RELATED WORK
               2.1. Opponent modeling
Opponent modeling stems from the naive motivation of inferring the opponent's policy and behavior from
information about the opponent in order to obtain a higher reward. Early opponent modeling work [25,26]
mainly focused on simple game scenarios where the opponent policy is fixed. With the development of deep
reinforcement learning, scholars have begun to apply the idea of opponent modeling in more complex environ-
ments and settings. The following introduces recent opponent modeling work in terms of explicit