Lv et al. Intell Robot 2022;2(2):168-79 Intelligence & Robotics
DOI: 10.20517/ir.2022.09
Research Article Open Access
Opponent modeling with trajectory representation clustering
Yongliang Lv, Yan Zheng, Jianye Hao
College of Intelligence and Computing, Tianjin University, Tianjin 300000, China.
Correspondence to: Yongliang Lv, College of Intelligence and Computing, Tianjin University, No. 135 Yaguan Road, Jinnan District,
Tianjin 300000, China. E-mail: Lvyongliang@tju.edu.cn
How to cite this article: Lv Y, Zheng Y, Hao J. Opponent modeling with trajectory representation clustering. Intell Robot 2022;2(2):168-79. http://dx.doi.org/10.20517/ir.2022.09
Received: 9 Mar 2022 First Decision: 25 Apr 2022 Revised: 16 May 2022 Accepted: 1 Jun 2022 Published: 16 Jun 2022
Academic Editor: Simon X. Yang Copy Editor: Fanglin Lan Production Editor: Fanglin Lan
Abstract
When facing a non-stationary opponent in a multi-agent environment, traditional methods model the opponent based on its complex information to learn one or more optimal response policies. However, because the opponent's policy changes over time, the online-updated replay buffer becomes imbalanced, and response policies learned earlier are prone to catastrophic forgetting. This paper focuses on how to learn new response policies without forgetting previously learned ones while the opponent policy is constantly changing. We extract representations of opponent policies with a contrastive learning autoencoder and separate them into explicit clusters. By keeping the replay buffer balanced across these clusters, we continue training on trajectory data from every opponent policy that has appeared, thereby avoiding policy forgetting. Finally, we demonstrate the effectiveness of the method in a classical opponent modeling environment (soccer) and show the clustering effect for different opponent policies.
Keywords: Non-stationary, opponent modeling, contrastive learning, trajectory representation, data balance
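To make the approach summarized above concrete, the following is a minimal illustrative sketch (not the authors' implementation): a recurrent trajectory autoencoder trained with a reconstruction loss plus an InfoNCE-style contrastive loss, whose latent embeddings are then clustered with k-means to separate opponent-policy types. All names, shapes, and hyperparameters (GRU encoder, z_dim, n_clusters, noise augmentations) are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

class TrajAutoencoder(nn.Module):
    """Encodes a trajectory (B, T, feat_dim) into a latent vector and decodes it back."""
    def __init__(self, feat_dim=10, hidden=64, z_dim=16):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.to_z = nn.Linear(hidden, z_dim)
        self.decoder = nn.GRU(z_dim, hidden, batch_first=True)
        self.to_feat = nn.Linear(hidden, feat_dim)

    def encode(self, traj):
        _, h = self.encoder(traj)              # h: (1, B, hidden)
        return self.to_z(h.squeeze(0))         # (B, z_dim)

    def forward(self, traj):
        z = self.encode(traj)
        # Repeat the latent code at every time step and decode back to features.
        z_seq = z.unsqueeze(1).expand(-1, traj.size(1), -1)
        out, _ = self.decoder(z_seq)
        return self.to_feat(out), z

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss: two views of the same trajectory are positives,
    all other trajectories in the batch serve as negatives."""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature       # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0))
    return F.cross_entropy(logits, labels)

# Toy usage: random tensors stand in for rollouts collected against unknown opponents.
model = TrajAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
traj = torch.randn(32, 20, 10)                 # 32 trajectories, 20 steps, 10 features
for _ in range(5):
    view_a = traj + 0.01 * torch.randn_like(traj)   # simple noise augmentations
    view_b = traj + 0.01 * torch.randn_like(traj)
    rec, z_a = model(view_a)
    _, z_b = model(view_b)
    loss = F.mse_loss(rec, view_a) + info_nce(z_a, z_b)
    opt.zero_grad(); loss.backward(); opt.step()

# Cluster the learned embeddings to distinguish opponent-policy types; trajectories from
# each cluster could then be kept in balanced proportions in the replay buffer.
with torch.no_grad():
    cluster_ids = KMeans(n_clusters=3, n_init=10).fit_predict(model.encode(traj).numpy())

The clustering step is what would allow a balanced replay buffer in practice: by sampling equally from each discovered cluster, data from opponent policies seen long ago is not crowded out by data from the current one.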
1. INTRODUCTION
In the field of multi-agent reinforcement learning (MARL) [1–3] , the non-stationary problem [4,5] caused by
policy changes of other agents has always been challenging. Since the policy and behavior of other agents are
generally unknown when the policies of other agents change, the environment is no longer considered to be a
stationary Markov decision process (MDP), and it cannot be solved by simply using a single-agent reinforcement
© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0
International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you
give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate
if changes were made.