
Lv et al. Intell Robot 2022;2(2):168-79    Intelligence & Robotics
DOI: 10.20517/ir.2022.09


               Research Article                                                              Open Access



Opponent modeling with trajectory representation clustering


               Yongliang Lv, Yan Zheng, Jianye Hao
               College of Intelligence and Computing, Tianjin University, Tianjin 300000, China.


               Correspondence to: Yongliang Lv, College of Intelligence and Computing, Tianjin University, No. 135 Yaguan Road, Jinnan District,
               Tianjin 300000, China. E-mail: Lvyongliang@tju.edu.cn
How to cite this article: Lv Y, Zheng Y, Hao J. Opponent modeling with trajectory representation clustering. Intell Robot 2022;2(2):168-79. http://dx.doi.org/10.20517/ir.2022.09
Received: 9 Mar 2022  First Decision: 25 Apr 2022  Revised: 16 May 2022  Accepted: 1 Jun 2022  Published: 16 Jun 2022

               Academic Editor: Simon X. Yang Copy Editor: Fanglin Lan  Production Editor: Fanglin Lan


               Abstract
For a non-stationary opponent in a multi-agent environment, traditional methods model the opponent from its observed information to learn one or more optimal response policies. However, when the opponent policy changes non-stationarily, data imbalance in the online-updated replay buffer makes response policies learned earlier prone to catastrophic forgetting. This paper focuses on how to learn new response policies without forgetting previously learned ones while the opponent policy keeps changing. We extract representations of opponent policies with a contrastive-learning autoencoder and cluster them explicitly. By keeping the replay buffer balanced, we continue training on the trajectory data of every opponent policy that has appeared, thereby avoiding policy forgetting. Finally, we demonstrate the effectiveness of the method in a classical opponent modeling environment (soccer) and show the clustering effect for different opponent policies.

               Keywords: Non-stationary, opponent modeling, contrastive learning, trajectory representation, data balance
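
The balanced-replay idea summarized in the abstract can be illustrated with a minimal sketch (not the authors' code): a replay buffer that keeps a separate sub-buffer per clustered opponent policy and samples batches evenly across clusters, so trajectories from earlier opponent policies are not crowded out by newer data. The class and parameter names (BalancedReplayBuffer, cluster_id, max_per_cluster) are assumptions introduced for illustration only.

    # Minimal sketch of a cluster-balanced replay buffer (illustrative only).
    import random
    from collections import defaultdict, deque

    class BalancedReplayBuffer:
        def __init__(self, max_per_cluster=10000):
            # One bounded sub-buffer per opponent-policy cluster.
            self.buffers = defaultdict(lambda: deque(maxlen=max_per_cluster))

        def add(self, cluster_id, transition):
            # cluster_id would come from clustering the opponent-trajectory
            # representation (e.g., a contrastive-autoencoder embedding).
            self.buffers[cluster_id].append(transition)

        def sample(self, batch_size):
            # Draw roughly the same number of transitions from every non-empty
            # cluster so data from earlier opponent policies is retained.
            clusters = [buf for buf in self.buffers.values() if buf]
            if not clusters:
                return []
            per_cluster = max(1, batch_size // len(clusters))
            batch = []
            for buf in clusters:
                batch.extend(random.sample(list(buf), min(per_cluster, len(buf))))
            random.shuffle(batch)
            return batch[:batch_size]

In this sketch, even sampling across clusters is only one possible balancing rule; any scheme that prevents the newest opponent's data from dominating the batch would serve the same purpose.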





               1. INTRODUCTION
In the field of multi-agent reinforcement learning (MARL) [1–3], the non-stationary problem [4,5] caused by policy changes of other agents has always been challenging. Since the policies and behaviors of other agents are generally unknown, when the policies of other agents change, the environment is no longer considered to be a stationary Markov decision process (MDP), and it cannot be solved by simply using a single-agent reinforcement




                           © The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0
International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you
                give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate
                if changes were made.


