
Qi et al. Intell Robot 2021;1(1):18-57 | http://dx.doi.org/10.20517/ir.2021.02


method to ensure that FL is not disrupted. The participating devices' attributes, including computing resources and trust values, are used as part of the environment in RL. In the aggregation of the global model, devices with high reputation levels have a greater chance of being selected, which reduces the effect of malicious devices mixed into FL.
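As a rough illustration of this idea, a coordinator could sample participants for each aggregation round with probability proportional to their reputation scores. The sketch below is purely illustrative and not taken from the cited work; the function name and score values are assumptions.

```python
import random

def select_devices(devices, k):
    """Sample k distinct devices for aggregation, weighted by reputation.

    `devices` maps a device id to a reputation score in (0, 1]; both the
    name and the scoring scheme are illustrative assumptions.
    """
    ids = list(devices)
    weights = [devices[d] for d in ids]
    chosen = set()
    # High-reputation devices are proportionally more likely to be
    # drawn, reducing the influence of potentially malicious ones.
    while len(chosen) < min(k, len(ids)):
        chosen.add(random.choices(ids, weights=weights)[0])
    return sorted(chosen)
```

A real system would additionally update reputations over time, e.g., from the observed quality of each device's model updates.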


               5.6.3. Lessons learned from categories of FRL
As discussed above, FRL can be divided into two main categories, i.e., HFRL and VFRL. Currently, most existing research focuses on HFRL, while little attention is devoted to VFRL. The reason is that HFRL has obvious application scenarios, in which multiple participants carry out similar decision-making tasks in individual environments, such as caching allocation [59] , offloading optimization [58] , and attack monitoring [108] . The participants and coordinator only need to train a similar model with the same state and action spaces. Consequently, the algorithm design can be implemented, and training convergence verified, relatively easily. On the other hand, even though VFRL involves greater technical difficulty at the algorithm-design level, it also has a wide range of possible applications. In a multi-agent scenario, for example, a single agent can observe only part of the environment, whereas the transition of the environment is determined by the behavior of all agents. Zhuo et al. [65] assume that agents cannot share their partial observations of the environment and that some agents are unable to receive rewards; they propose a federated Q-network aggregation algorithm between two agents for VFRL.
The paper [97] applies both HFRL and VFRL to radio access network slicing. For the same type of service, similar data samples are trained locally at the participating devices, and BSs perform horizontal aggregation to build a cooperative access model through an iterative approach. A terminal device can also optimize its selection of base stations and network slices based on the global VFRL model, which aggregates the access features generated by different types of services at an encrypted third party. This method improves a device's ability to select appropriate access points when initiating different types of service requests under privacy-protection restrictions. The feasible implementation of VFRL also provides guidance for future research.
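Because HFRL participants share the same state and action spaces, horizontal aggregation can follow a FedAvg-style weighted average of local model parameters. The sketch below is a minimal illustration under the assumption that each agent's Q-network weights are exposed as NumPy arrays; the function and argument names are not from the paper.

```python
import numpy as np

def horizontal_aggregate(local_params, sample_counts):
    """FedAvg-style aggregation of local Q-network weights.

    local_params: list of dicts mapping a layer name to an np.ndarray;
        identical state/action spaces imply identical array shapes.
    sample_counts: number of local training samples per agent, used as
        the aggregation weights. All names are illustrative.
    """
    total = sum(sample_counts)
    global_params = {}
    for name in local_params[0]:
        # Weighted average: agents with more local data contribute more.
        global_params[name] = sum(
            (n / total) * p[name]
            for p, n in zip(local_params, sample_counts)
        )
    return global_params
```

In a VFRL setting this simple averaging no longer applies directly, since participants hold different feature (observation) spaces rather than copies of the same model.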



               6. OPEN ISSUES AND FUTURE RESEARCH DIRECTIONS
As presented in the previous section, FRL plays an increasingly important role as an enabler of various applications. While FRL-based approaches possess many advantages, a number of critical open issues must be considered for future implementation. This section therefore focuses on several key challenges, including those inherited from FL, such as security and communication issues, as well as those unique to FRL. Research on tackling these issues offers interesting directions for the future.


               6.1. Learning convergence in HFRL
In realistic HFRL scenarios, although the agents perform similar tasks, the inherent dynamics of the different environments in which the agents reside are usually not identically distributed. Slight differences in the stochastic properties of the transition models across agents can cause learning-convergence issues. One possible way to address this problem is to adjust the frequency of global aggregation, i.e., after each global aggregation, a period of time is left for each agent to fine-tune its local parameters according to its own environment. Apart from the non-identical-environment problem, another interesting and important question is how to leverage FL to make RL algorithms converge better and faster. It is well known that DRL algorithms can be unstable and diverge, especially when off-policy training is combined with function approximation and bootstrapping. In FRL, the training curves of some agents may diverge while others converge, even though the agents are trained in exact replicas of the same environment. By leveraging FL, it is envisioned that we could expedite the training process as well as increase its stability. For example, we could selectively aggregate the parameters of a subset of agents with greater potential for convergence, and later transfer the converged parameters to all agents. To tackle the above problems, several solutions proposed for FL algorithms offer useful guidance. For example, server operators could account for the heterogeneity inherent
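The aggregation-frequency idea discussed above can be sketched with a toy setup in which each agent's environment pulls its parameters toward a different optimum; after every global average, a fine-tuning window lets each agent re-adapt to its own dynamics. The class, method names, and update rule are all illustrative assumptions, not from the paper.

```python
import numpy as np

class ToyAgent:
    """Minimal stand-in for an RL agent holding a parameter vector.

    `env_bias` is a proxy for this agent's environment dynamics: local
    training pulls the parameters toward it. Illustrative only.
    """
    def __init__(self, env_bias, dim=4):
        self.env_bias = env_bias
        self.params = np.zeros(dim)

    def local_update(self, lr=0.1):
        # Move parameters toward this environment's local optimum.
        self.params += lr * (self.env_bias - self.params)

def train(agents, rounds, local_steps, finetune_steps):
    """After each global aggregation, leave a fine-tuning window so each
    agent re-adapts the shared model to its own environment."""
    for _ in range(rounds):
        for a in agents:
            for _ in range(local_steps):
                a.local_update()
        # Global aggregation: plain parameter average across agents.
        shared = np.mean([a.params for a in agents], axis=0)
        for a in agents:
            a.params = shared.copy()
            # Fine-tuning period after receiving the global model.
            for _ in range(finetune_steps):
                a.local_update()
    return agents
```

Tuning `finetune_steps` against the aggregation frequency trades off personalization to each environment against the consistency of the shared model.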