


the experience of agents without considering privacy protection issues [7]. In the implementation of HFRL, further restrictions on privacy protection and communication consumption are imposed to adapt to special scenarios, such as IoT applications [59]. In addition, another point to consider is Non-IID data. To ensure convergence of the RL model, parallel RL generally assumes that the state transitions in the environment follow the same distribution, i.e., that the environments of different agents are IID. In actual scenarios, however, the situations faced by agents may differ slightly, so that the environment models of different agents are not identically distributed. Therefore, compared with parallel RL, HFRL needs to improve the generalization ability of the model to meet the challenges posed by Non-IID data.
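To make the sharing mechanism concrete, the following is a minimal sketch of one HFRL training round, in which agents keep their raw trajectories local and only exchange policy parameters that are averaged in a federated manner. The PyTorch-style state dictionaries, the weighting by local sample counts, and helper names such as train_locally are illustrative assumptions, not the exact protocol of any particular HFRL scheme surveyed here.

```python
# Minimal sketch of one HFRL round (illustrative assumptions, not a specific scheme):
# every agent trains a policy network with an identical architecture, so parameters
# can be averaged element-wise, weighted by the amount of local experience.

from typing import Dict, List
import torch


def federated_average(local_states: List[Dict[str, torch.Tensor]],
                      sample_counts: List[int]) -> Dict[str, torch.Tensor]:
    """Weighted element-wise average of the agents' local policy parameters."""
    total = float(sum(sample_counts))
    return {
        name: sum((n / total) * state[name]
                  for state, n in zip(local_states, sample_counts))
        for name in local_states[0]
    }


def hfrl_round(agents, global_policy):
    """One cooperative round: broadcast the shared model, train locally, aggregate."""
    local_states, counts = [], []
    for agent in agents:
        # Download the shared model; raw transitions never leave the agent.
        agent.policy.load_state_dict(global_policy.state_dict())
        n_samples = agent.train_locally()  # hypothetical local RL update (e.g., DQN/PPO steps)
        local_states.append(agent.policy.state_dict())
        counts.append(n_samples)
    global_policy.load_state_dict(federated_average(local_states, counts))
    return global_policy
```

Because only model parameters are exchanged, the raw interaction data that could reveal private behavior stay on each device, which is what distinguishes this setting from plain parallel RL.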

               Based on the potential issues faced by the current RL technology, the advantages of HFRL can be summarized
               as follows.

• Enhancing training speed. When multiple agents pursue a similar target task, sharing training experience gained from different environments can expedite the learning process. The local model evolves rapidly through aggregation and update algorithms to cope with as-yet-unexplored environments. Moreover, the data obtained by different agents are independent, which reduces correlations between the observed data. Furthermore, this also helps to alleviate the problem of unbalanced data caused by various restrictions.
• Improving the reliability of the model. When the dimensionality of the environment state is enormous or even uncountable, it is difficult for a single agent to train an optimal strategy for situations with extremely low occurrence probabilities. Horizontal agents explore independently while building a cooperative model, which improves the local model's performance on rare states.
• Mitigating the problem of device heterogeneity. The devices deploying RL agents in the HFRL architecture may have different computational and communication capabilities. Some devices may not meet the basic requirements for training, yet still need strategies to guide their actions. HFRL makes it possible for all agents to obtain the shared model for the target task equally.
• Addressing the issue of non-identical environments. Considering the differences in environment dynamics across agents, the assumption of IID data may be broken. Under the HFRL architecture, agents whose environment models are not identically distributed can still cooperate to learn a federated model. To address the differences in environment dynamics, a personalized update algorithm for the local model can be designed to minimize the impact of this issue, as sketched after this list.
• Increasing the flexibility of the system. An agent can decide at any time when to participate in the cooperative system, because HFRL allows asynchronous requests and aggregation of shared models. In existing HFRL-based applications, new agents can also apply for membership and benefit from downloading the shared model.
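For the non-identical environment case above, one simple form of personalized local update is to interpolate between the downloaded federated parameters and the parameters an agent has adapted to its own environment dynamics. The sketch below illustrates this idea; the mixing coefficient alpha and the function name are assumptions made here for illustration, not the algorithm of a specific surveyed work.

```python
# Illustrative personalized update for HFRL under non-identical environment dynamics.
# Each agent only partially adopts the federated model, controlled by alpha in [0, 1];
# alpha = 1 recovers the plain shared model, alpha = 0 keeps the purely local model.

from typing import Dict
import torch


def personalized_update(local_state: Dict[str, torch.Tensor],
                        global_state: Dict[str, torch.Tensor],
                        alpha: float = 0.5) -> Dict[str, torch.Tensor]:
    """Blend the shared federated parameters with the agent's locally adapted ones."""
    return {name: alpha * global_state[name] + (1.0 - alpha) * local_state[name]
            for name in local_state}


# Example usage (hypothetical agent object):
# agent.policy.load_state_dict(
#     personalized_update(agent.policy.state_dict(), global_state, alpha=0.3))
```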


               4.3. Vertical federated reinforcement learning
               In VFL, samples of multiple data sets have different feature spaces but these samples may belong to the same
groups or common users. The training data of each participant are divided vertically according to their features.
               More general and accurate models can be generated by building heterogeneous feature spaces without releas-
               ing private information. VFRL applies the methodology of VFL to RL and is suitable for POMDP scenarios
               where different RL agents are in the same environment but have different interactions with the environment.
               Specifically, different agents could have different observations that are only part of the global state. They could
               take actions from different action spaces and observe different rewards, or some agents even take no actions
               or cannot observe any rewards. Since the observation range of a single agent to the environment is limited,
               multiple agents cooperate to collect enough knowledge needed for decision making. The role of FL in VFRL
               is to aggregate the partial features observed by various agents. Especially for those agents without rewards, the
               aggregation effect of FL greatly enhances the value of such agents in their interactions with the environment,
               and ultimately helps with the strategy optimization. It is worth noting that in VFRL the issue of privacy pro-
               tection needs to be considered, i.e., private data collected by some agents do not have to be shared with others.
Instead, agents can transmit encrypted model parameters, gradients, or intermediate results directly to each other. In