

Table 4. Performance after training across 4 random seeds. Each simulation result contains 600 time steps.

Training method   Seed 1   Seed 2   Seed 3   Seed 4   Average system reward   Standard deviation
No-FRL            -3.73    -2.89    -4.69    -3.38    -3.67                   0.66
Inter-FRLGA       -2.79    -2.81    -3.05    -2.76    -2.85                   0.11
Inter-FRLWA       -2.64    -2.88    -2.92    -2.93    -2.84                   0.12



               Tensorflow. As previously stated, each vehicle in the platoon is modelled using the CACC CTHP model de-
               scribed in Section 3. For the purposes of this study, multiple sets of DRL experiments were conducted, using
               4 random seeds (1-4) for training and a single random seed (6) across all evaluations.
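
As a concrete illustration, the seeding scheme above might be set up as follows. This is a minimal sketch assuming a TensorFlow 2 training script; the helper name set_global_seed is illustrative rather than taken from the authors' code, and only the seed values (1-4 for training, 6 for evaluation) come from the text.

```python
import random

import numpy as np
import tensorflow as tf

def set_global_seed(seed: int) -> None:
    """Seed Python, NumPy, and TensorFlow RNGs so a single run is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

TRAIN_SEEDS = [1, 2, 3, 4]  # one independent training run per seed (from the text)
EVAL_SEED = 6               # single seed shared across all evaluations
```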



3.2. Inter-FRL
In order to evaluate the effectiveness of Inter-FRL relative to the base case, where a DRL model is trained using DDPG without FRL, 4 experiments are conducted without Inter-FRL (no-FRL) and 8 with it. For each of the 12 experiments, 2 platoons with 2 vehicles each were trained using one of the four random seeds. Once training across the four seeds has completed, the cumulative reward for a single evaluation episode is computed. For the experiments using Inter-FRL, two aggregation methods are examined: first, the gradients of each model are averaged during training, and second, the model weights are averaged. The multi-platoon system trains and shares the aggregated parameters (gradients or weights) amongst vehicles with the same index across platoons. The federated server is responsible for performing the averaging, and each vehicle performs a training episode with the averaged parameters in addition to its local training episodes, in accordance with the FRL update delay and FRL cutoff ratio (see Table 3). Note that hereafter Inter-FRL with gradient aggregation is denoted Inter-FRLGA, and Inter-FRL with weight aggregation is denoted Inter-FRLWA.
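
A minimal sketch of the two aggregation schemes is given below, assuming each vehicle agent exposes its parameters (or its most recent gradients) as a list of NumPy arrays. The function names, the vehicle interface (get_weights, set_weights, last_gradients, apply_gradients), and the hyperparameter values standing in for Table 3 are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def average_parameters(param_sets):
    """Element-wise average of per-vehicle parameter (or gradient) lists.

    param_sets: one list of np.ndarray per vehicle sharing the same index
    across platoons, e.g. [lead_vehicle_platoon_1, lead_vehicle_platoon_2].
    """
    return [np.mean(np.stack(layers), axis=0) for layers in zip(*param_sets)]

def federated_round(vehicles, episode, total_episodes,
                    frl_update_delay=2, frl_cutoff_ratio=0.8,
                    mode="weights"):
    """One server-side aggregation round across same-index vehicles.

    frl_update_delay and frl_cutoff_ratio mirror the roles of the Table 3
    parameters: aggregation runs every frl_update_delay episodes and stops
    once frl_cutoff_ratio of training has elapsed (both values here are
    placeholders, not the paper's settings).
    """
    if episode % frl_update_delay != 0:
        return
    if episode > frl_cutoff_ratio * total_episodes:
        return
    if mode == "weights":   # Inter-FRLWA: average the model weights
        avg = average_parameters([v.get_weights() for v in vehicles])
        for v in vehicles:
            v.set_weights(avg)
    else:                   # Inter-FRLGA: average the training gradients
        avg = average_parameters([v.last_gradients for v in vehicles])
        for v in vehicles:
            v.apply_gradients(avg)
```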



               3.2.1. Performance across 4 random seeds
The performance of each system is calculated by averaging the cumulative reward of each vehicle in the 2-vehicle, 2-platoon system, as summarized in Table 4. For each of the 3 cases (base case, Inter-FRLGA, and Inter-FRLWA), training sessions were run using 4 random seeds. In order to determine the highest performing system overall, an average and standard deviation are computed from the results of training with the 4 random seeds. From Table 4, it is observed that both Inter-FRL scenarios, using gradient and weight aggregation respectively, provide large performance increases over the base case.
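
The summary statistics in Table 4 can be checked directly from the per-seed rewards. The short NumPy snippet below uses the values as printed in the table; np.std defaults to the population standard deviation, and small differences from the table's last column may reflect rounding of the per-seed entries.

```python
import numpy as np

rewards = {
    "No-FRL":      [-3.73, -2.89, -4.69, -3.38],
    "Inter-FRLGA": [-2.79, -2.81, -3.05, -2.76],
    "Inter-FRLWA": [-2.64, -2.88, -2.92, -2.93],
}

for method, vals in rewards.items():
    # np.std uses the population standard deviation (ddof=0) by default.
    print(f"{method}: mean={np.mean(vals):.2f}, std={np.std(vals):.2f}")
```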



               3.2.2. Convergence properties
The cumulative reward is calculated for each training episode, and a moving average is computed over 40 episodes to generate Figure 6a-6f. It can be seen that the cumulative reward for Inter-FRLWA not only converges more rapidly than both no-FRL and Inter-FRLGA, but also yields a more stable training session, as indicated by the smaller magnitude of the shaded area (the standard deviation across the four random seeds).
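
The 40-episode moving average behind the curves in Figure 6 can be computed as below. This is a sketch assuming the per-episode cumulative rewards are collected in a 1-D array; only the window size (40 episodes) is taken from the text.

```python
import numpy as np

def moving_average(rewards: np.ndarray, window: int = 40) -> np.ndarray:
    """Trailing moving average of per-episode cumulative rewards."""
    kernel = np.ones(window) / window
    # mode="valid" drops the first window-1 episodes, where an average
    # over a full window is not yet defined.
    return np.convolve(rewards, kernel, mode="valid")
```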



3.2.3. Test results for one episode
In Figure 7a and 7b, a simulation is performed over a single training episode, plotting the jerk along with the control input $u_{i,j}$, acceleration $a_{i,j}$, velocity error $e_{v_{i,j}}$, and position error $e_{p_{i,j}}$ for each platoon. There are 2