For example, an input combination of 0.04 normalized target health and 0.01 assigned attackers can be examined alongside its resultant output, and an explanation structure can be generated from the membership function and rule labels: "Bid output is Very High because target health is Very Low and assigned attackers is None".
For more extensive fuzzy tree structures, this explanation can be repeated across subsequent cascaded FISs, allowing for the creation of a linguistic explanation of the entire decision process. This form of explainability and transparency will be heavily utilized during the formal verification process, as manual corrections to the post-training model will be performed if formal methods find that any specifications are not adhered to. This requires directly changing the model's code at all levels, not just the input or output layers. A holistic understanding of any modifications made throughout this process is critical for any potential deployment of the system post-modification.
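As a concrete illustration of how such an explanation string can be assembled, the sketch below maps crisp inputs to their highest-membership linguistic labels using triangular membership functions. The label sets, membership parameters, and function names are illustrative assumptions, not the authors' FIS definitions.

```python
# Illustrative sketch: generating a linguistic explanation from membership
# function and rule labels. Label sets and membership parameters here are
# hypothetical, not the paper's actual FIS definitions.

def triangular(x, a, b, c):
    """Triangular membership function with peak at b and support [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical linguistic labels over normalized [0, 1] inputs.
LABELS = {
    "target health": {
        "Very Low": (-0.25, 0.0, 0.25), "Low": (0.0, 0.25, 0.5),
        "Medium": (0.25, 0.5, 0.75), "High": (0.5, 0.75, 1.0),
        "Very High": (0.75, 1.0, 1.25),
    },
    "assigned attackers": {
        "None": (-0.25, 0.0, 0.25), "Few": (0.0, 0.25, 0.5),
        "Many": (0.25, 0.75, 1.25),
    },
}

def best_label(variable, x):
    """Return the linguistic label with the highest membership for x."""
    return max(LABELS[variable], key=lambda lbl: triangular(x, *LABELS[variable][lbl]))

def explain(inputs, output_label):
    """Build the explanation string from the winning label of each input."""
    reasons = " and ".join(f"{var} is {best_label(var, x)}" for var, x in inputs.items())
    return f"Bid output is {output_label} because {reasons}"

print(explain({"target health": 0.04, "assigned attackers": 0.01}, "Very High"))
# -> Bid output is Very High because target health is Very Low
#    and assigned attackers is None
```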
2.2.4. Reinforcement learning
The standard RL process for a GFT is to first create a portfolio of training scenarios over which each individual in the GA population is evaluated. This model was created using an open-source Python package for interfacing with SC2 so that constructive runs through these scenarios are possible [24]. Within this study, a single-mission portfolio is utilized to highlight the formal verification processes, but for most applications a portfolio containing multiple holistic scenarios as well as specific training sub-problems would be included [11,14].
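The text does not identify which interface package [24] refers to; as one hedged possibility, the community python-sc2 package exposes a harness of roughly the following shape, where run_game drives a constructive, faster-than-real-time run of a scenario. The bot class and map name are placeholders for illustration only.

```python
# Hedged sketch of a constructive SC2 scenario run, assuming the community
# python-sc2 package (the paper does not name its interface [24]);
# ScenarioBot and the map choice are illustrative placeholders.
from sc2 import maps
from sc2.bot_ai import BotAI
from sc2.data import Difficulty, Race
from sc2.main import run_game
from sc2.player import Bot, Computer

class ScenarioBot(BotAI):
    async def on_step(self, iteration: int):
        # Placeholder: during training, the GFT under evaluation would
        # issue unit orders here based on the current game state.
        pass

if __name__ == "__main__":
    run_game(
        maps.get("AbyssalReefLE"),  # illustrative map, not the study's scenario
        [Bot(Race.Terran, ScenarioBot()), Computer(Race.Zerg, Difficulty.Medium)],
        realtime=False,  # run as fast as possible for constructive evaluation
    )
```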
The manner in which performance is evaluated must also be defined through the requisite fitness function for the GA. The fitness function utilized within this study is given in Equation 1.
$$\text{fitness} = (\text{marines alive} \times 25.0) + \text{friendly health remaining} - \text{hostile health remaining} - (\text{timesteps} / 100.0) \tag{1}$$
The magnitude range of this fitness function is not critical within EVE; what matters is the ability of the evolutionary process to differentiate the relative fitnesses of potential solutions, or chromosomes, in a manner that thoroughly rewards good behavior and punishes bad. In this example, the terms are a flat 25-point reward for every marine alive at the end of the scenario, added to the total friendly health remaining, including that of the siege tank, which has a notably higher health pool than the marines. The remaining hostile force health pool is then subtracted from this sum. Finally, there is a slight penalty for the number of timesteps it takes to complete the scenario, since, if all other parameters have reached optimality, the solution should ideally execute quickly in case additional threats are inbound to the force. This function can be iterated upon in future work, but it serves as a good basis for the GA to evolve the population of chromosomes.
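Translated directly into code, Equation 1 might look like the following sketch; the variable names are taken from the prose description rather than from any published implementation.

```python
def fitness(marines_alive: int, friendly_health: float,
            hostile_health: float, timesteps: int) -> float:
    """Equation 1: reward surviving marines and remaining friendly health,
    penalize remaining hostile health and slow scenario completion.

    Variable names follow the prose description; they are not the
    authors' published identifiers.
    """
    return (marines_alive * 25.0) + friendly_health \
        - hostile_health - (timesteps / 100.0)
```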
There is an additional complexity with this particular problem due to the nature of StarCraft 2 and this training setup: non-deterministic fitness evaluations. As the fitness value given to any chromosome within the population will drastically affect both its probability of breeding and the relative worth of potential future chromosomes, the ability for a good chromosome to be "unlucky" and a bad one to be "lucky" during their respective evaluations can damage the effectiveness of an evolutionary system. There are mitigating methods, such as evaluating the scenarios, or portions of them, multiple times. Within this study, each chromosome is evaluated a total of three times, with the worst fitness of those three being the actual fitness assigned; this easily mitigates the worst-case risk at the expense of the computational efficiency of each generation's evaluation.
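The worst-of-three assignment reduces to a minimum over repeated runs, as in the sketch below; evaluate_chromosome is a hypothetical callable standing in for one constructive scenario run.

```python
# Sketch of the worst-of-three mitigation for non-deterministic evaluations.
# evaluate_chromosome is a hypothetical callable that runs one constructive
# scenario and returns its fitness; it is not named in the paper.
def assigned_fitness(chromosome, evaluate_chromosome, n_evals: int = 3) -> float:
    """Run n_evals independent evaluations and assign the worst (minimum)
    fitness, trading extra compute for robustness against 'lucky' runs."""
    return min(evaluate_chromosome(chromosome) for _ in range(n_evals))
```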