

[Figure: agents A, B, ..., N each take their own local actions and observe only part of the states of the shared global environment.]

Figure 11. Illustration of vertical federated reinforcement learning.


In short, the goal of VFRL is for agents interacting with the same environment to improve both the performance of their policies and the effectiveness of learning them by sharing experiences, without compromising privacy.

More formally, we denote $\{F_i\}_{i=1}^{N}$ as the agents in VFRL, which interact with a global environment $E$. The $i$-th agent $F_i$ is located in the environment $E_i = E$, obtains the local partial observation $O_i$, and can perform the set of actions $A_i$. Different from HFRL, the state/observation and action spaces of two agents $F_i$ and $F_j$ may not be identical, but the aggregation of the state/observation spaces and action spaces of all the agents constitutes the global state and action spaces of the global environment $E$. The conditions for VFRL can be defined as



                                                                   
$$O_i \neq O_j, \quad A_i \neq A_j, \quad E_i = E_j = E, \quad \bigcup_{i=1}^{N} O_i = S, \quad \bigcup_{i=1}^{N} A_i = A, \quad \forall i, j \in \{1, 2, \ldots, N\},\; i \neq j,$$


where $S$ and $A$ denote the global state space and action space of all participating agents, respectively. It can be seen that the observations of all $N$ agents together constitute the global state space $S$ of the environment $E$. Besides, the environments $E_i$ and $E_j$ are the same environment $E$. In most cases, there is a great difference between the observations of two agents $F_i$ and $F_j$.
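
To make these conditions concrete, the following minimal Python sketch checks them for three toy agents; the agent names, feature names, and action labels are illustrative assumptions, not details from the paper.

# Hypothetical sketch: each VFRL agent owns a partial observation space O_i and
# action space A_i; the conditions require pairwise-different local spaces whose
# unions recover the global state space S and action space A.
S = {"position", "velocity", "temperature", "image"}   # global state space
A = {"steer", "throttle", "brake"}                      # global action space

O = {"A": {"position", "velocity"},                     # O_i: local observation spaces
     "B": {"temperature"},
     "N": {"image"}}
acts = {"A": {"steer"},                                 # A_i: local action spaces
        "B": {"throttle", "brake"},
        "N": set()}                                     # an agent may take no actions at all

agents = list(O)
# O_i != O_j and A_i != A_j for all i != j
assert all(O[i] != O[j] and acts[i] != acts[j]
           for i in agents for j in agents if i != j)
# The unions of the local spaces reconstruct the global spaces S and A.
assert set().union(*O.values()) == S
assert set().union(*acts.values()) == A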

Figure 11 shows the architecture of VFRL. The dataset and feature space in VFL are converted to the environment and state space, respectively. VFL divides the dataset vertically according to the features of the samples, and VFRL divides agents based on the state spaces they observe from the global environment. Generally speaking, every agent has its own local state, which can differ from those of the other agents, and the aggregation of these local partial states corresponds to the entire environment state [65]. In addition, after interacting with the environment, agents may generate their local actions, which correspond to the labels in VFL.
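
As a rough illustration of this correspondence, the sketch below lets each agent extract its own slice of a toy global state vector and then reconstructs the full state by stacking the slices; the feature layout and the simple concatenation used for aggregation are assumptions made for the example.

import numpy as np

# Hypothetical global state of the shared environment E (four features).
global_state = np.array([0.3, -1.2, 25.0, 0.7])

# Each agent observes a fixed subset of the feature indices (its local state),
# analogous to how VFL assigns each party a subset of the sample features.
obs_indices = {"A": [0, 1], "B": [2], "N": [3]}

local_states = {agent: global_state[idx] for agent, idx in obs_indices.items()}

# Aggregating the local partial states recovers the entire environment state.
aggregated = np.concatenate([local_states[a] for a in ("A", "B", "N")])
assert np.array_equal(aggregated, global_state)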

Two types of agents can be defined for VFRL, i.e., decision-oriented agents and support-oriented agents. Decision-oriented agents $\{F_i\}_{i=1}^{K}$ can interact with the environment $E$ based on their local states $\{S_i\}_{i=1}^{K}$ and actions $\{A_i\}_{i=1}^{K}$. Meanwhile, support-oriented agents $\{F_i\}_{i=K+1}^{N}$ take no actions and receive no rewards, but only obtain the observations of the environment, i.e., their local states $\{S_i\}_{i=K+1}^{N}$.
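
A minimal sketch of these two roles is given below; the class names and the placeholder random policy are illustrative assumptions rather than the paper's algorithm.

import random

class DecisionOrientedAgent:
    """Observes its local state S_i, selects a local action from A_i, and receives a reward."""
    def __init__(self, action_space):
        self.action_space = list(action_space)

    def act(self, local_state):
        # Placeholder policy: a real agent would learn a mapping from S_i to A_i.
        return random.choice(self.action_space)

class SupportOrientedAgent:
    """Takes no actions and receives no rewards; it only records its local observations S_i."""
    def __init__(self):
        self.observations = []

    def observe(self, local_state):
        self.observations.append(local_state)

# Example usage with hypothetical spaces:
driver = DecisionOrientedAgent(action_space={"steer", "throttle", "brake"})
sensor = SupportOrientedAgent()
sensor.observe({"temperature": 25.0})
action = driver.act({"position": 0.3, "velocity": -1.2})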
In general, the following six steps, as shown in Figure 12, are the basic procedure for VFRL, i.e.,