
Page 20 of 22                  Ernest et al. Complex Eng Syst 2023;3:4  I http://dx.doi.org/10.20517/ces.2022.54


system was verified against the specifications, which were shown to always hold. Although simulation results also show that the system performs as intended in the particular runs executed, the benefit of formal verification is that it holds over any potential run of the given AI system. This is the differentiating factor for formal verification compared to more common methods that rely on numerical evaluation. Formal verification, in this case model checking and SMT solving, can give definitive evidence that the system is correct, whereas numerical evaluation can only return counterexamples it happens to find; it cannot conclusively establish that none exist.

The case of specification 4 presents a unique point: adherence to certain safety specifications may not necessarily improve overall system performance. For example, use cases that require certification often impose strict standards that may slow down or weaken control systems overall, but that are still necessary in order to deploy and trust the system. Consider laws such as speed limits placed on a theoretically optimal driving controller: if the fitness function minimizes travel time and fuel consumption, an unconstrained RL agent would likely optimize toward illegal behavior in some cases. However, for AI systems in mission- and safety-critical applications, the benefit of formal verification cannot be overstated. In this example of specification 4, all friendly fire was strictly disallowed, even if it would lead to the loss of the entire force.
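The speed-limit example above can be sketched as a hard constraint wrapped around an unconstrained policy. All names and values here are hypothetical, not taken from the paper; the point is only that the safety specification is enforced outside the learned optimizer, so it holds regardless of what the policy requests.

```python
# Hypothetical sketch: a hard safety constraint (speed limit) imposed on
# an otherwise-unconstrained learned controller.
SPEED_LIMIT = 30.0  # m/s, the externally imposed legal constraint

def unconstrained_action(state):
    # Stand-in for a policy trained to minimize travel time alone; it may
    # request a speed above the legal limit.
    return state["desired_speed"]

def constrained_action(state):
    # Clamp the policy output so the specification always holds, even at
    # the cost of raw performance, mirroring specification 4's hard ban.
    return min(unconstrained_action(state), SPEED_LIMIT)
```

Because the clamp is applied after the policy, the constrained system can be verified against the specification without reasoning about the learned component's internals.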

Regarding explainability, the structure of the GFT created is such that its actions are easily explained and interpretable by humans. This is of particular interest when auditing the system for explanations of behavior learned throughout the reinforcement learning process. In this case, it also allowed a human designer to correct errors in the AI system, a powerful capability when creating complex AI control systems for safety-critical applications. Although SC2 AI play will likely never be considered "safety-critical", it serves as a good proxy problem for other decision-making applications in notionally similar environments.
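The source of this interpretability can be sketched with a toy fuzzy rule base. The membership functions, rule texts, and state variables below are invented for illustration and are not the paper's GFT; the sketch shows only why each decision comes with a human-readable audit trail, namely the linguistic rule that fired and its strength.

```python
# Illustrative sketch: a fuzzy rule base whose decisions are explainable
# because each rule is an explicit linguistic statement. All rules and
# membership functions here are hypothetical.
def low(x):   # triangular-style membership for "LOW" on [0, 1]
    return max(0.0, min(1.0, (0.5 - x) / 0.5))

def high(x):  # membership for "HIGH" on [0, 1]
    return max(0.0, min(1.0, (x - 0.5) / 0.5))

RULES = [
    ("IF health LOW AND threat HIGH THEN retreat",
     lambda s: min(low(s["health"]), high(s["threat"])), "retreat"),
    ("IF health HIGH THEN attack",
     lambda s: high(s["health"]), "attack"),
]

def explain(state):
    # Pick the strongest-firing rule and report it as the explanation.
    scored = [(fire(state), text, action) for text, fire, action in RULES]
    strength, text, action = max(scored)
    return action, f"{text} (strength {strength:.2f})"

action, why = explain({"health": 0.2, "threat": 0.9})
```

An auditor can read `why` directly, whereas a dense neural policy offers no comparable per-decision rationale.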

               4.1. Future work and extensions
Potentially the most obvious extension of this work is to create a more general AI for SC2 that plays more portions of the game, including base building, resource management, etc. This would demonstrate the applicability and efficacy of GFTs on a well-known benchmark problem in RL. That could then be used to show how explainability in complex AI systems may be desired, especially as a learning tool for humans.

In the future, these scenarios could also be used for comparisons against other RL approaches. Because common RL techniques lack the explainability and formal verifiability of GFTs, the resulting comparison would be purely performance-based.

Another important extension of this work would be to broaden the formal verification methodology to include portions of the Starcraft 2 dynamics. Doing so would allow for reasoning about higher-level behaviors of the system. In this work, only properties of the GFT itself are verified; including dynamic models for portions of Starcraft 2 would allow interesting temporal specifications to be evaluated. For example, if dynamics are included for unit damage, damage inflicted by attacks, cooldowns, unit movements, etc., specifications could be created to reason about, say, whether it is possible for any friendly units to be lost, or to find guaranteed minimums. Doing so would be exceptionally valuable, as it can provide definitive evidence of baseline performance, but it is of course much more difficult than verifying the AI system without dynamics.



               5. CONCLUSION
In conclusion, a GFT model was developed and trained for a particular scenario and then verified against four behavioral specifications. Post-training, the model did not adhere to three of these specifications, and counterexamples were found by the formal methods tools utilized. For these cases, the GFT was evaluated over scenarios within the Starcraft 2 environment, which demonstrated the specific failure modes detailed in the counterexamples. Modifications to the system were then made for each specification until the system was