               which offers advantages over plain MARL in terms of solution quality and computation time.


Some MARL algorithms also concentrate on the communication issue of POMDPs. In the study by Sukhbaatar et al. [83], communication between the agents is performed for a number of rounds before their actions are selected, and the communication protocol is learned concurrently with the optimal policy. Foerster et al. [84] propose a deep recurrent network architecture, i.e., the deep distributed recurrent Q-network (DDRQN), to address the communication problem in a multi-agent partially observable setting. This work makes three fundamental modifications to previous algorithms. The first is last-action inputs, which let each agent access its previous action as an input for the next time-step. The second is inter-agent weight sharing, which still allows diverse behavior between agents, since the agents receive different observations and thus evolve different hidden states. The final one is disabling experience replay, because the non-stationarity of the environment renders old experiences obsolete or misleading. Foerster et al. [84] consider the communication task of fully cooperative, partially observable, sequential multi-agent decision-making problems. In their system model, each agent receives a private observation and takes actions that affect the environment; the agent can also communicate with its fellow agents via a discrete limited-bandwidth channel. Despite the partial observability and limited channel capacity, the authors show that the two agents can discover a communication protocol that enables them to coordinate their behavior, based on the approach of deep recurrent Q-networks.
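
To make the three DDRQN modifications concrete, the following is a minimal sketch of a shared-weight recurrent Q-network that takes the last action as an additional input. It is an illustrative assumption, not the exact architecture of Foerster et al. [84]: the class name RecurrentQNet, the layer sizes, and the toy rollout loop are all hypothetical.

```python
# Hedged sketch: a DDRQN-style recurrent Q-network (illustrative only).
import torch
import torch.nn as nn


class RecurrentQNet(nn.Module):
    """Shared-weight recurrent Q-network with last-action input."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        # The observation and the one-hot previous action are concatenated,
        # so the agent conditions on what it did at the last time-step.
        self.encoder = nn.Linear(obs_dim + n_actions, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)
        self.n_actions = n_actions

    def forward(self, obs, prev_action, hidden):
        a_onehot = torch.nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.relu(self.encoder(torch.cat([obs, a_onehot], dim=-1)))
        hidden = self.rnn(x, hidden)  # hidden state carries the agent's history
        return self.q_head(hidden), hidden


# One network instance is shared by all agents (inter-agent weight sharing);
# agents still behave differently because they feed in different private
# observations and therefore evolve different hidden states. Experience
# replay is disabled: updates would use only freshly collected episodes.
obs_dim, n_actions, n_agents = 8, 4, 2
shared_net = RecurrentQNet(obs_dim, n_actions)
hiddens = [torch.zeros(1, 64) for _ in range(n_agents)]
prev_actions = [torch.zeros(1, dtype=torch.long) for _ in range(n_agents)]

for agent_id in range(n_agents):
    obs = torch.randn(1, obs_dim)  # each agent's private observation
    q_values, hiddens[agent_id] = shared_net(obs, prev_actions[agent_id], hiddens[agent_id])
    prev_actions[agent_id] = q_values.argmax(dim=-1)
```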


While there are some similarities between MARL and VFRL, several important differences should be noted, i.e.,

 • VFRL and some MARL algorithms are able to address similar problems, e.g., the issues of POMDPs. However, the solution ideas of the two algorithms differ. Since VFRL is the product of applying VFL to RL, the FL component of VFRL has, from its inception, focused on the aggregation of the partial features, including states and rewards, observed by different agents. Security is also an essential issue in VFRL. In contrast, MARL arises as the most natural way of adding more than one agent to an RL system [85]. In MARL, agents not only interact with the environment but also have complex interactive relationships with other agents, which creates a great obstacle to policy optimization. Therefore, the original intentions of the two algorithms are different.
 • The two algorithms also differ slightly in structure. The agents in MARL always receive rewards, even though some of them may not have their own local actions. However, in some cases, the agents in VFRL are not able to generate a corresponding operation policy, so some agents have neither actions nor rewards [65]. Therefore, VFRL can solve a wider range of problems that MARL is not capable of solving.
 • Both algorithms involve the communication problem between agents. In MARL, information such as the states of other agents and model parameters can be directly and freely propagated among agents. During communication, some MARL methods, such as DDRQN in the work of Foerster et al. [84], consider the previous action as an input for the next time-step, and weight sharing is also allowed between agents. However, VFRL assumes that states cannot be shared among agents. Since these agents do not exchange experience and data directly, VFRL focuses more on the security and privacy issues of communication between agents, as well as on how to process the mid-products transferred by other agents and aggregate the federated models; a sketch of this kind of exchange is given after this list.
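
The sketch below illustrates the VFRL-style exchange described in the last bullet: agents never transmit raw observations, only intermediate embeddings ("mid-products"), which a federated model then aggregates. The module names (LocalEncoder, FederatedQHead), the embedding size, and the concatenation step are assumptions made for illustration, not a specification from the surveyed papers.

```python
# Hedged sketch: VFRL-style mid-product exchange (illustrative only).
import torch
import torch.nn as nn


class LocalEncoder(nn.Module):
    """Each agent keeps its raw observation private and shares only an embedding."""

    def __init__(self, obs_dim: int, embed_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, embed_dim))

    def forward(self, obs):
        return self.net(obs)


class FederatedQHead(nn.Module):
    """Aggregates the mid-products from all agents and outputs Q-values."""

    def __init__(self, embed_dim: int, n_agents: int, n_actions: int):
        super().__init__()
        self.q = nn.Linear(embed_dim * n_agents, n_actions)

    def forward(self, embeddings):
        return self.q(torch.cat(embeddings, dim=-1))


# Contrast with MARL: there, raw states or model parameters may be shared
# directly; here only low-dimensional embeddings cross the agent boundary.
obs_dims, n_actions = [6, 10], 4
encoders = [LocalEncoder(d) for d in obs_dims]
head = FederatedQHead(16, len(obs_dims), n_actions)

local_obs = [torch.randn(1, d) for d in obs_dims]              # private to each agent
mid_products = [enc(o) for enc, o in zip(encoders, local_obs)]  # only these are transmitted
q_values = head(mid_products)
```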

In summary, as a promising and notable algorithm, VFRL has several advantages as follows, i.e.,
 • Excellent privacy protection. VFRL inherits the FL algorithm's idea of data privacy protection, so for the task of multiple agents cooperating in the same environment, information interaction can be carried out confidently to enhance the learning efficiency of the RL model. In this process, no participant has to worry about any leakage of its raw real-time data.
 • Wide application scenarios. With appropriate knowledge extraction methods, including algorithm design and system modeling, VFRL can solve more real-world problems than MARL algorithms. This