Conclusion
We showed how to adapt SFL to a live setting in order to derive a metric that captures the reliability, or health, of our actions/rules. We used the computed similarity coefficient values directly for selecting the future rule sequences that are most likely to achieve our goals, following the idea that a sequence's risk of failing corresponds to the sum of the individual actions' risks of failing as expressed by our SFL metric. We showed how to compute these values dynamically (in constant time for a single rule) and how to adopt a sliding window if desired for a highly dynamic environment. Combining SFL diagnostics with a planning and execution environment like RBL enabled us to foster intelligent behavior that takes the continuously observed reliability of a system's actions into account. Our experiments showed that we can indeed profit from learning about the actions' reliability.
Although neither using feedback from a plan's execution to improve planning nor applying SFL to rule bases is novel in general, combining both and adopting this concept in our context are novel contributions that lead to attractive results and are (a) easy to adopt and (b) easy to compute, so that the approach also fits applications in embedded cyber-physical systems where resources might come at a premium. Furthermore, we gave a qualitative comparison between RBL and other related approaches. With the help of this comparison, the reader can weigh the different trade-offs of the approaches and select the appropriate one for their scenario.
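To make the computation concrete, the following is a minimal sketch only, not the implementation used in our system; the identifiers RuleStats, record_execution, and rank_sequences are hypothetical. It illustrates how per-rule spectrum counters can be updated in constant time per rule after each observed plan execution, and how candidate rule sequences can then be ranked by the sum of their actions' risks as expressed by the similarity coefficient:

class RuleStats:
    """Spectrum counters for a single rule/action."""
    def __init__(self):
        self.n11 = 0  # rule was involved and the plan execution failed
        self.n10 = 0  # rule was involved and the plan execution succeeded
        self.n01 = 0  # rule was not involved but the plan execution failed

    def risk(self):
        # Jaccard-style similarity coefficient as an estimate of the
        # rule's risk of failing (see the formula below).
        denom = self.n11 + self.n10 + self.n01
        return self.n11 / denom if denom > 0 else 0.0

def record_execution(stats, executed_rules, failed):
    # Constant-time update for each individual rule after one plan execution.
    for rule, s in stats.items():
        if rule in executed_rules:
            if failed:
                s.n11 += 1
            else:
                s.n10 += 1
        elif failed:
            s.n01 += 1

def rank_sequences(stats, candidate_sequences):
    # Prefer the candidate whose summed per-action risk is lowest,
    # i.e. the sequence most likely to succeed according to the SFL metric.
    return min(candidate_sequences,
               key=lambda seq: sum(stats[r].risk() for r in seq))

Here, stats maps each rule to its RuleStats instance, executed_rules is the set of rules used in the observed plan, and failed indicates whether the plan missed its goal.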
While not reported in detail, we experimented with several similarity coefficients and found Jaccard to work best for our configurations. Future experiments that also investigate sliding windows and longer learning phases will have to confirm these first trends, also across multiple scenario domains.
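For reference, in the common SFL notation (which may differ slightly from the notation used in the paper body), the Jaccard coefficient of a rule $j$ is
\[
  s_J(j) = \frac{n_{11}(j)}{n_{11}(j) + n_{01}(j) + n_{10}(j)},
\]
where $n_{11}(j)$ counts failed plan executions that involved rule $j$, $n_{01}(j)$ counts failed executions that did not involve $j$, and $n_{10}(j)$ counts successful executions that involved $j$.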
Further room for improvement lies in tuning the concrete values that are added to the spectrum (currently only 1 and 0). From reinforcement learning, we know that discounting rewards based on their temporal ordering positively influences the learning rate of a system. A similar approach could also be taken for the values in SFL, including exploration stages with specific strategies. That is, entirely unrestricted exploration could be employed to gather broader knowledge at the cost of performance in the tasks themselves, while limiting the exploration to plans whose performance deviates only within some bound ϵ from the optimal one could provide a more restricted but still more educated picture. In such future research, it will also be interesting to consider temporal effects when assigning the blame for a plan's failure to individual actions, or to weight previous executions as less important in the SFL spectrum than recent ones; one possible realization is sketched below.
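As an illustration of this future direction only (we did not evaluate it; the decay factor GAMMA and the class DecayedRuleStats are hypothetical), the binary 1/0 spectrum entries could be replaced with exponentially decayed counters so that recent executions dominate the similarity coefficient:

GAMMA = 0.9  # decay factor applied per observed execution, 0 < GAMMA <= 1

class DecayedRuleStats:
    """Spectrum counters in which older observations are discounted."""
    def __init__(self):
        self.n11 = 0.0  # decayed count: rule involved, plan failed
        self.n10 = 0.0  # decayed count: rule involved, plan succeeded
        self.n01 = 0.0  # decayed count: rule not involved, plan failed

    def update(self, involved, failed):
        # Decay all counters first, then add the new observation with weight 1.
        self.n11 *= GAMMA
        self.n10 *= GAMMA
        self.n01 *= GAMMA
        if involved and failed:
            self.n11 += 1.0
        elif involved:
            self.n10 += 1.0
        elif failed:
            self.n01 += 1.0

    def risk(self):
        denom = self.n11 + self.n10 + self.n01
        return self.n11 / denom if denom > 0 else 0.0

With GAMMA = 1 this reduces to the plain 1/0 spectrum described above, so the degree of temporal discounting becomes a single tunable parameter.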
Acknowledgements
The financial support by the Austrian Federal Ministry for Digital and Economic Affairs and the National Foundation for Research, Technology, and Development is gratefully acknowledged.