

speed, average reward and average resource consumption [19].

Despite the differences in FRL applications within the aforementioned studies, each study maintains a similar goal: to improve the performance of each agent within the system. None of the aforementioned works examine whether gradient or model weight aggregation is favourable in terms of performance, and many of the works apply FRL to distributed network or communications environments. The goal of this study is to determine whether model weight or gradient aggregation is favourable for AV platooning, as well as to be one of the first (if not the first) to apply FRL to AV platooning.
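To make the distinction concrete, the following minimal sketch (an illustrative assumption, not code from any of the cited works) contrasts the two aggregation schemes, assuming each agent reports either its current model parameters or its latest gradients to the server as a list of NumPy arrays.

import numpy as np

def aggregate_weights(agent_weights):
    # Model weight aggregation: the server averages the agents' parameters
    # layer by layer and broadcasts the averaged model back (FedAvg-style).
    return [np.mean(layer_group, axis=0) for layer_group in zip(*agent_weights)]

def aggregate_gradients(agent_grads, global_weights, lr=0.01):
    # Gradient aggregation: the server averages the agents' gradients and
    # applies a single gradient step to the shared global model.
    avg_grads = [np.mean(layer_group, axis=0) for layer_group in zip(*agent_grads)]
    return [w - lr * g for w, g in zip(global_weights, avg_grads)]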



               1.1.2. Deep reinforcement learning applied to AV platooning
In recent years, there has been a surge in autonomous vehicle (AV) research, likely due to the technology's potential for increasing road safety, traffic throughput and fuel economy [6,20]. Two areas of research are often considered when developing an AV model: supervised learning or RL [20]. Driving is considered a multi-agent interaction problem, and due to the large variability of road data, it can be quite challenging (or near impossible) to gather a data set varied enough to train a supervised model [21]. Driving data is collected from humans, which can also limit an AI's ability to that of human-level performance [6]. In contrast, RL methods are known to generalize quite well [20]. RL approaches are model-free, and a model may be inferred by the algorithm while training.

To address the limitations of vehicle-following models, DRL has been a steady area of research in the AV community, with many authors contributing works on DRL applied to CACC [8,9,22,23]. In a study by Lin et al., a DRL framework is designed to control a CACC AV platoon [22]. The DRL framework uses the deep deterministic policy gradient (DDPG) [24] algorithm and is found to have near-optimal performance [22]. In addition, Peake et al. identify limitations related to communication in platooning [23]. Through the application of a multi-agent reinforcement learning process, i.e. a policy gradient RL and an LSTM network, the performance of a platoon containing 3-5 vehicles is improved upon that of current RL applications to platooning [23]. Furthermore, Model Predictive Control (MPC) is the current state-of-the-art for real-time optimal control practices [25]. The study performed by Lin et al. applies both MPC and DRL methodologies to the AV platoon problem, observing that a DRL model trained using the DDPG algorithm produces an episodic cost merely 5.8% higher than the current state-of-the-art [25]. The work of Yan et al. proposes a hybrid approach to the AV platooning problem where the platoon is modeled as a Markov Decision Process (MDP) in order to collect two rewards from the system at each time step simultaneously [26]. This approach also incorporates jerk, the rate of change of acceleration, in the calculation of the reward for each vehicle in order to ensure passenger comfort [26]. The hybrid strategy led to increased performance over that of the base DDPG algorithm, as the proposed framework switches between classic CACC modeling and DDPG depending on the performance degradation of the DDPG algorithm [26]. In another study by Zhu et al., a DRL model is formulated and trained using DDPG to be evaluated against real-world driving data. Parameters such as time to collision, headway, and jerk were considered in the DRL model's reward function [27]. The DDPG algorithm provided favourable performance relative to the analysed human driving data, with regard to more efficient driving via reduced vehicle headways, and improved passenger comfort with lower magnitudes of jerk [27]. As Vehicle-to-Everything (V2X) communications are envisioned to have a beneficial impact on the performance of platoon controllers, the work of Lei et al. investigates the value of V2X communications for DRL-based platoon controllers. Lei et al. emphasize the trade-off between the gain of including exogenous information in the system state for reducing uncertainty and the performance erosion due to the curse of dimensionality [28].
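As an illustration of the kind of reward shaping described in refs [26,27], the sketch below shows one plausible way to combine headway error, time to collision and jerk into a single scalar reward. The weights, the 4-second TTC threshold and the functional forms are assumptions made for illustration, not the exact formulations used in those studies.

def platoon_reward(headway, target_headway, closing_speed, jerk,
                   w_gap=1.0, w_ttc=1.0, w_jerk=0.1):
    # Toy per-vehicle reward; closing_speed > 0 means the follower is
    # approaching its predecessor. All weights are illustrative assumptions.
    gap_penalty = abs(headway - target_headway)             # spacing error
    ttc = headway / closing_speed if closing_speed > 0 else float("inf")
    ttc_penalty = 1.0 / ttc if ttc < 4.0 else 0.0           # penalise short time to collision
    jerk_penalty = jerk ** 2                                 # passenger comfort term
    return -(w_gap * gap_penalty + w_ttc * ttc_penalty + w_jerk * jerk_penalty)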

When formulating the AV platooning problem as a DRL model, DDPG is prominently selected as the algorithm for training. DDPG's ability to handle continuous action spaces and complex states is perfect for the CACC