Finally, Reinforcement Learning (RL) is also related to our approach. RL, as previously mentioned, is capable of solving a wide variety of tasks. Modern RL consists mostly of two intertwined research directions: first, trial-and-error learning inspired by \cite{4066245}; second, Optimal Control, which has its origins in the 1950s and is mostly attributed to Bellman. In \cite{bellman1954} he proposed Dynamic Programming, an approach that could solve optimal control problems. However, this approach did not scale well to higher-dimensional spaces. Sutton and Barto took inspiration from both lines of work and combined them with temporal-difference learning to create modern RL \cite{Sutton1981TowardAM}. Numerous improvements and demonstrations have followed over the last few years \cite{hassabis2017,zhang2019}.
However, a core problem remains: RL agents usually need many training samples before they become operational. Moreover, little research has been done on how to bootstrap an agent with models such as PDDL. Our approach uses a model as a bootstrapping process and further refines its knowledge about the reliability of the available actions by trial and error. This allows our approach to perform reasonably well right from the start, without first having to learn how to operate in the environment. Another problem for RL is constructing a reward function, which is quite tricky to get right \cite{FaultyRewardFunctions}. In our approach, no reward function has to be defined, as we only learn from action failures.

Conclusion

We showed how to adapt SFL to a live setting in order to derive a metric that captures the reliability, or healthiness, of our actions/rules. We used the computed similarity coefficient values directly for selecting future rule sequences that are most likely to succeed in achieving our goals, following the idea that a sequence's risk of failing directly corresponds to the sum of the individual actions' risks of failing as expressed by our SFL metric. We showed how to compute these values dynamically with little effort (in constant time for a single rule) and how to adopt a sliding window if desired for a highly dynamic environment. Combining SFL diagnostics with a planning and execution environment like RBL enabled us to foster intelligent behavior that takes the constantly observed reliability of a system's actions into account. Our experiments showed that we can indeed profit from learning about the actions' reliability.

Although neither using feedback from a plan's execution to improve planning nor using SFL for rule bases is novel in general, combining both and adapting the concept to our context are indeed novel contributions that lead to attractive results. They are (a) easy to adopt and (b) easy to compute, so that the approach also fits applications in embedded cyber-physical systems where resources might come at a premium. Furthermore, we gave a qualitative comparison between RBL and other related approaches. With the help of this comparison, the reader can deduce the different trade-offs of the approaches and select the appropriate one for their scenario.
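To make the constant-time bookkeeping and the risk-based sequence selection concrete, the following minimal Python sketch illustrates the idea. It is our own illustration and not the actual RBL implementation; the identifiers RuleStats, plan_risk, and select_plan are hypothetical.

```python
# Minimal sketch (illustration only, not the actual RBL implementation):
# per-rule spectrum counters updated in constant time per observed plan
# execution, and candidate rule sequences ranked by their summed risk.
from dataclasses import dataclass


@dataclass
class RuleStats:
    involved_failed: int = 0      # failed executions in which the rule took part
    involved_passed: int = 0      # successful executions in which it took part
    not_involved_failed: int = 0  # failed executions without this rule

    def update(self, involved: bool, failed: bool) -> None:
        # Constant-time bookkeeping for a single rule and a single observation.
        if involved and failed:
            self.involved_failed += 1
        elif involved:
            self.involved_passed += 1
        elif failed:
            self.not_involved_failed += 1

    def risk(self) -> float:
        # Similarity-coefficient-based suspiciousness (Jaccard used as an example).
        denom = self.involved_failed + self.not_involved_failed + self.involved_passed
        return self.involved_failed / denom if denom else 0.0


def plan_risk(plan: list[str], stats: dict[str, RuleStats]) -> float:
    # A sequence's risk of failing is approximated by the sum of its rules' risks.
    return sum(stats[rule].risk() for rule in plan if rule in stats)


def select_plan(candidates: list[list[str]], stats: dict[str, RuleStats]) -> list[str]:
    # Prefer the candidate rule sequence with the lowest accumulated risk.
    return min(candidates, key=lambda plan: plan_risk(plan, stats))
```

A sliding window, as mentioned above, would only change the bookkeeping in update (e.g., by subtracting the contribution of the observation that leaves the window) and leaves the selection step untouched.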
While not reported in detail, please note that we experimented with several similarity coefficients and found Jaccard to work best for our configurations. Future experiments that also investigate sliding windows and longer learning phases will have to confirm these first trends, though, also in the context of multiple scenario domains.
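For reference, the Jaccard coefficient as commonly used in the SFL literature can be written as follows; the notation here is our own and transfers the usual component/test-run terminology to rules and plan executions, with $a_{ef}(r)$ counting the failed executions involving rule $r$, $a_{nf}(r)$ the failed executions without $r$, and $a_{ep}(r)$ the passed executions involving $r$:

\[
S_{\mathit{Jaccard}}(r) = \frac{a_{ef}(r)}{a_{ef}(r) + a_{nf}(r) + a_{ep}(r)}
\]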
Further room for improvement lies in tuning the concrete values that are added to the spectrum (currently only 1 and 0). From reinforcement learning, we know that discounting rewards based on their temporal ordering positively influences the learning rate of a system. A similar approach could also be taken for the values in SFL, including exploration stages with specific strategies. That is, entirely unrestricted exploration could be exploited to gather broader knowledge at the cost of performance in the tasks themselves, while limiting exploration to plans whose performance deviates by at most some bound ϵ from the optimal one could provide a more limited but still more educated picture. In such future research, it will also be interesting to consider temporal effects when assigning the blame for a plan's failure to individual actions, or to weight previous executions in the SFL spectrum as less important than recent ones.
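As a purely illustrative sketch of this direction (not part of the evaluated approach; the function discounted_update and the decay factor gamma are hypothetical), the binary spectrum entries could be replaced by exponentially decayed weights so that rules executed closer to an observed failure receive more of the blame:

```python
# Illustrative sketch only: temporally discounted spectrum contributions.
# Rules executed later in a failed sequence receive a larger weight; gamma
# is a hypothetical decay factor controlling how fast earlier steps fade.
def discounted_update(spectrum: dict, plan: list[str], failed: bool,
                      gamma: float = 0.9) -> None:
    n = len(plan)
    for position, rule in enumerate(plan):
        weight = gamma ** (n - 1 - position)  # last executed rule gets weight 1.0
        entry = spectrum.setdefault(rule, {"failed": 0.0, "passed": 0.0})
        entry["failed" if failed else "passed"] += weight


# Hypothetical usage: a three-rule plan whose execution failed.
spectrum: dict = {}
discounted_update(spectrum, ["approach", "grasp", "lift"], failed=True)
```

An analogous decay could also be applied across executions, down-weighting older plan runs relative to recent ones, which would complement the sliding-window variant discussed above.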

Acknowledgements

The financial support by the Austrian Federal Ministry for Digital and Economic Affairs and the National Foundation for Research, Technology, and Development is gratefully acknowledged.

Conflict of interest

The authors declare that they have no financial or commercial conflicts of interest.