

speed, average reward and average resource consumption [19].

Despite the differences in FRL applications within the aforementioned studies, each study maintains a similar goal: to improve the performance of each agent within the system. None of the aforementioned works examine whether gradient or model weight aggregation is favourable in terms of performance, and many of the works apply FRL to distributed network or communications environments. The goal of this study is to determine whether model weight or gradient aggregation is favourable for AV platooning, as well as to be one of the first (if not the first) to apply FRL to AV platooning.
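To make the distinction concrete, the following minimal sketch (an illustrative assumption, not code from any of the cited works) contrasts the two aggregation schemes, assuming each agent reports either its current model parameters or its latest gradients to the server as a list of NumPy arrays.

import numpy as np

def aggregate_weights(agent_weights):
    # Model weight aggregation: the server averages the agents' parameters
    # layer by layer and broadcasts the averaged model back (FedAvg-style).
    return [np.mean(layer_group, axis=0) for layer_group in zip(*agent_weights)]

def aggregate_gradients(agent_grads, global_weights, lr=0.01):
    # Gradient aggregation: the server averages the agents' gradients and
    # applies a single gradient step to the shared global model.
    avg_grads = [np.mean(layer_group, axis=0) for layer_group in zip(*agent_grads)]
    return [w - lr * g for w, g in zip(global_weights, avg_grads)]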



               1.1.2. Deep reinforcement learning applied to AV platooning
In recent years, there has been a surge in autonomous vehicle (AV) research, likely due to the technology's potential for increasing road safety, traffic throughput and fuel economy [6,20]. Two areas of research are often considered when developing an AV model: supervised learning or RL [20]. Driving is considered a multi-agent interaction problem, and due to the large variability of road data, it can be quite challenging (or near impossible) to gather a data set varied enough to train a supervised model [21]. Driving data is collected from humans, which can also limit an AI's ability to that of human-level performance [6]. In contrast, RL methods are known to generalize quite well [20]. RL approaches are model-free, and a model may be inferred by the algorithm while training.

To address the limitations of vehicle-following models, DRL has been a steady area of research in the AV community, with many authors contributing works on DRL applied to CACC [8,9,22,23]. In a study by Lin et al., a DRL framework is designed to control a CACC AV platoon [22]. The DRL framework uses the deep deterministic policy gradient (DDPG) [24] algorithm and is found to have near-optimal performance [22]. In addition, Peake et al. identify limitations related to communication in platooning [23]. Through the application of a multi-agent reinforcement learning process, i.e. a policy gradient RL and an LSTM network, the performance of a platoon containing 3-5 vehicles is improved upon that of current RL applications to platooning [23]. Furthermore, Model Predictive Control (MPC) is the current state-of-the-art for real-time optimal control practices [25]. The study performed by Lin et al. applies both MPC and DRL methodologies to the AV platoon problem, observing that a DRL model trained using the DDPG algorithm produces an episodic cost merely 5.8% higher than the current state-of-the-art [25]. The work of Yan et al. proposes a hybrid approach to the AV platooning problem where the platoon is modeled as a Markov Decision Process (MDP) in order to collect two rewards from the system at each time step simultaneously [26]. This approach also incorporates jerk, the rate of change of acceleration, in the calculation of the reward for each vehicle in order to ensure passenger comfort [26]. The hybrid strategy led to increased performance over that of the base DDPG algorithm, as the proposed framework switches between classic CACC modeling and DDPG depending on the performance degradation of the DDPG algorithm [26]. In another study by Zhu et al., a DRL model is formulated and trained using DDPG to be evaluated against real-world driving data. Parameters such as time to collision, headway, and jerk were considered in the DRL model's reward function [27]. The DDPG algorithm provided favourable performance relative to the analysed human driving data, with regard to more efficient driving via reduced vehicle headways, and improved passenger comfort with lower magnitudes of jerk [27]. As Vehicle-to-Everything (V2X) communications are envisioned to have a beneficial impact on the performance of platoon controllers, the work of Lei et al. investigates the value of V2X communications for DRL-based platoon controllers. Lei et al. emphasize the trade-off between the gain of including exogenous information in the system state for reducing uncertainty and the performance erosion due to the curse of dimensionality [28].
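As an illustration of the kind of reward shaping described in refs [26,27], the sketch below shows one plausible way to combine headway error, time to collision and jerk into a single scalar reward. The weights, the 4-second TTC threshold and the functional forms are assumptions made for illustration, not the exact formulations used in those studies.

def platoon_reward(headway, target_headway, closing_speed, jerk,
                   w_gap=1.0, w_ttc=1.0, w_jerk=0.1):
    # Toy per-vehicle reward; closing_speed > 0 means the follower is
    # approaching its predecessor. All weights are illustrative assumptions.
    gap_penalty = abs(headway - target_headway)             # spacing error
    ttc = headway / closing_speed if closing_speed > 0 else float("inf")
    ttc_penalty = 1.0 / ttc if ttc < 4.0 else 0.0           # penalise short time to collision
    jerk_penalty = jerk ** 2                                 # passenger comfort term
    return -(w_gap * gap_penalty + w_ttc * ttc_penalty + w_jerk * jerk_penalty)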

When formulating the AV platooning problem as a DRL model, DDPG is prominently selected as the algorithm for training. DDPG's ability to handle continuous action spaces and complex states is perfect for the CACC