Qi et al. Intell Robot 2021;1(1):18-57 | http://dx.doi.org/10.20517/ir.2021.02 Page 44

virtual network functions (VNFs) into networks with local knowledge of resources and instantiation cost. In
addition, the authors propose a loss-weight-based mechanism for generating and exploiting reference
samples for training in replay buffers, which avoids strong correlation between samples. Simulation results
demonstrate that SSCO can significantly reduce placement errors and improve resource utilization when
placing time-variant VNFs, while also achieving desirable scalability.

Network slicing (NS) is also a form of virtual network architecture to support divergent requirements sustain-
ably. The work from Liu et al. [97] proposes a device association scheme (such as access control and handover
management) for radio access network (RAN) slicing by exploiting a hybrid federated deep reinforcement
learning (HDRL) framework. In view of the large state-action space and variety of services, HDRL is designed
with two layers of model aggregations. Horizontal aggregation deployed on BSs is used for the same type of
service. Generally, data samples collected by different devices within the same service have similar features.
The discrete-action DRL algorithm, i.e., DDQN, is employed to train the local model on individual smart
devices. The BS then aggregates the model parameters to establish a cooperative global model. Vertical
aggregation, deployed on an encrypted third party, handles services of different types. To
promote collaboration between devices with different tasks, the authors aggregate local access features to form a
global access feature, in which the data from different flows is strongly correlated since different data flows
compete with each other for radio resources. Furthermore, the Shapley value [98], which represents the
average marginal contribution of a specific feature across all possible feature combinations, is used to reduce
communication cost in vertical aggregation based on the global access feature. Simulation results show that
HDRL can improve network throughput and communication efficiency.
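
To make the Shapley value definition above concrete, the following minimal Python sketch computes exact Shapley values by enumerating every coalition of features. It is an illustration only, not the procedure used in [97]: the function `shapley_values`, the feature names (`rssi`, `load`, `delay`) and the toy value function `toy_value` are all hypothetical, and exhaustive enumeration is practical only for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley value of each feature under the coalition value function `value`."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # k = size of the coalition that f joins
            # Probability weight of a coalition of size k in the Shapley average
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f to coalition s
                total += weight * (value(s | {f}) - value(s))
        result[f] = total
    return result


# Hypothetical toy value function: additive gains plus a synergy bonus
# when both "rssi" and "load" are present (assumed numbers, for illustration).
GAINS = {"rssi": 4.0, "load": 2.0, "delay": 0.0}


def toy_value(coalition):
    base = sum(GAINS[f] for f in coalition)
    bonus = 3.0 if {"rssi", "load"} <= coalition else 0.0
    return base + bonus


phi = shapley_values(list(GAINS), toy_value)
```

In this toy game the synergy bonus is split equally between the two features that create it, and the values sum to the value of the full feature set, which is the efficiency property of the Shapley value.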

The open radio access network (O-RAN) has emerged as a paradigm for supporting multi-class wireless ser-
vices in 5G and beyond networks. To deal with the two critical issues of load balance and handover control,
Cao et al. [99] propose a federated DRL-based scheme to train the model for user access control in the O-RAN.
Due to the mobility of UEs and the high cost of the handover between BSs, it is necessary for each UE to
access the appropriate BS to optimize its throughput performance. As independent agents, UEs make access
decisions with assistance from a global model server, which updates global DQN parameters by averaging
DQN parameters of selected UEs. Further, the scheme exchanges only part of the DQN parameters
to reduce communication overhead, and uses the dueling structure to help independent
agents converge. Simulation results demonstrate that the scheme increases long-term throughput while avoiding frequent
user handovers and keeping signaling overhead limited.
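
The aggregation step described above, averaging the parameters of selected UEs while exchanging only part of the network, can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of [99]: the parameter names (`feature`, `value_head`, `advantage_head`), the choice of which tensors are shared, and the helper functions `aggregate_selected` and `broadcast` are all hypothetical.

```python
import numpy as np


def aggregate_selected(global_params, local_params, selected, shared_keys):
    """Average the shared tensors of the selected agents into a new global model.

    Tensors outside `shared_keys` are never uploaded, so the global copy
    keeps its previous value for them.
    """
    new_global = {k: v.copy() for k, v in global_params.items()}
    for key in shared_keys:
        stacked = np.stack([local_params[ue][key] for ue in selected])
        new_global[key] = stacked.mean(axis=0)
    return new_global


def broadcast(global_params, local_params, shared_keys):
    """Overwrite only the shared tensors of every agent with the global ones."""
    for params in local_params.values():
        for key in shared_keys:
            params[key] = global_params[key].copy()


# Three hypothetical UEs, each holding a dueling-style DQN split into
# a shared feature layer and two private heads (assumed layout).
local_params = {
    f"ue{i}": {
        "feature": np.full((2, 2), float(i)),
        "value_head": np.full((2, 1), 10.0 + i),
        "advantage_head": np.full((2, 3), 20.0 + i),
    }
    for i in range(3)
}
global_params = {
    "feature": np.zeros((2, 2)),
    "value_head": np.zeros((2, 1)),
    "advantage_head": np.zeros((2, 3)),
}

selected = ["ue0", "ue2"]   # UEs chosen for this round
shared_keys = ["feature"]   # only this tensor is exchanged

new_global = aggregate_selected(global_params, local_params, selected, shared_keys)
broadcast(new_global, local_params, shared_keys)
```

Exchanging only the `feature` tensor means each round transfers 4 of the 12 parameters per UE in this toy layout, while the private heads remain local to each agent.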

The issue of optimizing user access is important in wireless communication systems. FRL can provide interesting
solutions for enabling efficient and privacy-enhanced management of access control. Zhang et al. [100]
study the problem of multi-user access in Wi-Fi networks. To mitigate collisions during channel
access, an enhanced multiple access mechanism based on FRL is proposed for user-dense scenarios. In particular,
distributed stations train their local Q-learning networks using the channel state, access history, and feedback
from the central access point (AP). The AP uses a central aggregation algorithm to update the global model
periodically and broadcasts it to all stations. In addition, a Monte Carlo (MC) reward estimation method is introduced for
the training phase of the local model, which allocates more weight to the reward of the current state
by reducing the previous cumulative reward.

FRL is also studied for intelligent cyber-physical systems (ICPS), which aim to meet the requirements of intelligent
applications for high-precision, low-latency analysis of big data. Given the heterogeneity introduced by
multiple agents, a centralized RL-based resource allocation scheme suffers from non-stationarity and does not
address privacy. Therefore, the work from Xu et al. [101] proposes a multi-agent FRL (MA-FRL) mechanism
which synthesizes a good inferential global policy from encrypted local policies of agents without revealing
private information. The data resource allocation and secure communication problems are formulated as a
