
Page 20 of 22                  Ernest et al. Complex Eng Syst 2023;3:4  I http://dx.doi.org/10.20517/ces.2022.54


system was verified against the specifications, which were shown to always hold. Although simulation results also show that the system performs as intended in the particular runs executed, the benefit of formal verification is that it holds over any potential run of the given AI system. This is the differentiating factor for formal verification compared to more common methods that rely on numerical evaluation. Formal verification, in this case model checking and SMT solving, can give definitive evidence that the system is correct, whereas numerical evaluation can only return counterexamples it happens to find; it cannot conclusively establish that none exist.

The case of specification 4 presents a unique point: adherence to certain safety specifications may not necessarily improve overall system performance. For example, use cases that require certification often impose strict standards that may slow down or weaken control systems overall, but that are still necessary in order to deploy and trust the system. Consider laws such as speed limits placed on a theoretically optimal driving controller: if the fitness function minimizes travel time and fuel consumption, an unconstrained RL agent would likely optimize toward illegal behavior in some cases. However, for AI systems in mission- and safety-critical applications, the benefit of formal verification cannot be overstated. In this example of specification 4, all friendly fire was strictly disallowed, even if it would lead to the loss of the entire force.
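The speed-limit example above can be sketched as a hard constraint wrapped around an unconstrained policy. All names and values here are hypothetical, not taken from the paper; the point is only that the safety specification is enforced outside the learned optimizer, so it holds regardless of what the policy requests.

```python
# Hypothetical sketch: a hard safety constraint (speed limit) imposed on
# an otherwise-unconstrained learned controller.
SPEED_LIMIT = 30.0  # m/s, the externally imposed legal constraint

def unconstrained_action(state):
    # Stand-in for a policy trained to minimize travel time alone; it may
    # request a speed above the legal limit.
    return state["desired_speed"]

def constrained_action(state):
    # Clamp the policy output so the specification always holds, even at
    # the cost of raw performance, mirroring specification 4's hard ban.
    return min(unconstrained_action(state), SPEED_LIMIT)
```

Because the clamp is applied after the policy, the constrained system can be verified against the specification without reasoning about the learned component's internals.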

Regarding explainability, the structure of the GFT created is such that its actions are easily explained and interpretable by humans. This is of particular interest when auditing the system for explanations of behavior learned throughout the reinforcement learning process. In this case, it also allowed a human designer to correct errors in the AI system, a powerful capability when creating complex AI control systems for safety-critical applications. Although SC2 AI play will likely never be considered "safety-critical", it serves as a good proxy problem for other decision-making applications in notionally similar environments.
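The source of this interpretability can be sketched with a toy fuzzy rule base. The membership functions, rule texts, and state variables below are invented for illustration and are not the paper's GFT; the sketch shows only why each decision comes with a human-readable audit trail, namely the linguistic rule that fired and its strength.

```python
# Illustrative sketch: a fuzzy rule base whose decisions are explainable
# because each rule is an explicit linguistic statement. All rules and
# membership functions here are hypothetical.
def low(x):   # triangular-style membership for "LOW" on [0, 1]
    return max(0.0, min(1.0, (0.5 - x) / 0.5))

def high(x):  # membership for "HIGH" on [0, 1]
    return max(0.0, min(1.0, (x - 0.5) / 0.5))

RULES = [
    ("IF health LOW AND threat HIGH THEN retreat",
     lambda s: min(low(s["health"]), high(s["threat"])), "retreat"),
    ("IF health HIGH THEN attack",
     lambda s: high(s["health"]), "attack"),
]

def explain(state):
    # Pick the strongest-firing rule and report it as the explanation.
    scored = [(fire(state), text, action) for text, fire, action in RULES]
    strength, text, action = max(scored)
    return action, f"{text} (strength {strength:.2f})"

action, why = explain({"health": 0.2, "threat": 0.9})
```

An auditor can read `why` directly, whereas a dense neural policy offers no comparable per-decision rationale.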

               4.1. Future work and extensions
Potentially the most obvious extension of this work is to create a more general AI for SC2 that plays more portions of the game, including base building, resource management, etc. This would demonstrate the applicability and efficacy of GFTs on a well-known benchmark problem in RL. That could then be used to show how explainability in complex AI systems may be desired, especially as a learning tool for humans.

In the future, these scenarios could also be used for comparisons against other RL approaches. Because common RL techniques lack the explainability and formal verifiability of GFTs, the resulting comparison would be purely performance-based.

Another important extension of this work would be to broaden the formal verification methodology to include portions of the Starcraft 2 dynamics. Doing so would allow for reasoning about higher-level behaviors of the system. In this work, only properties of the GFT itself are verified; including dynamic models for portions of Starcraft 2 would allow interesting temporal specifications to be evaluated. For example, if dynamics are included for unit damage, damage inflicted by attacks, cooldowns, unit movements, etc., specifications could be created to reason about, say, whether it is possible for any friendly units to be lost, or to find guaranteed minimums. Doing so would be exceptionally valuable, as it can provide definitive evidence of baseline performance, but it is of course much more difficult than verifying the AI system without dynamics.



               5. CONCLUSION
In conclusion, a GFT model was developed and trained for a particular scenario and then verified against four behavioral specifications. Post-training, the model did not adhere to three of these specifications, and counterexamples were found by the formal methods tools utilized. For these cases, the GFT was evaluated over scenarios within the Starcraft 2 environment, which demonstrated the specific failure modes detailed in the counterexamples. Modifications to the system were then made for each specification until the system was