In Fig. \ref{601366}, we report the average number of steps needed per item fetch over all 100 runs. For all configurations, the intelligent agent in the a posteriori setup needs fewer steps over time. This confirms our hypothesis that our SFL approach can use the reliability measurement to enhance planning. However, the task becomes more difficult as more agents are added, since they generate more random noise, which makes it harder for the agent to distinguish between temporary failures (i.e., other agents) and permanent faults (i.e., shelves).
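To illustrate this distinction, the following minimal sketch (hypothetical; not the implementation used in our experiments) maintains a per-cell reliability estimate updated from observed move outcomes. A cell permanently blocked by a shelf fails every attempt, while a cell that is only occasionally blocked by passing agents fails with some smaller probability; as the transient failure rate grows with the number of agents, the two estimates become harder to tell apart. The function name, update rule, and parameters are illustrative assumptions.

\begin{verbatim}
import random

def estimate_reliability(fail_prob, attempts, alpha=0.3, seed=0):
    """Exponentially weighted reliability estimate of moving into a cell.

    fail_prob: probability that a single move attempt fails
               (1.0 for a cell permanently blocked by a shelf,
                < 1.0 for a cell only occasionally blocked by agents).
    """
    rng = random.Random(seed)
    reliability = 1.0  # start optimistic: assume the cell is free
    for _ in range(attempts):
        success = rng.random() >= fail_prob
        # move the estimate toward the observed outcome
        reliability = (1 - alpha) * reliability + alpha * (1.0 if success else 0.0)
    return reliability

# A shelf blocks every attempt, so its estimate quickly approaches 0.
print(estimate_reliability(fail_prob=1.0, attempts=50))
# A cell crossed by few agents fails rarely and stays close to 1 ...
print(estimate_reliability(fail_prob=0.1, attempts=50))
# ... but with many agents the transient failure rate rises and the
# estimate drifts toward the "shelf" value, blurring the distinction.
print(estimate_reliability(fail_prob=0.5, attempts=50))
\end{verbatim}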
For the a priori setup, we cannot confirm such an improvement. This is not surprising: in this setup, the only unknown information about the world is the movement of the other agents, which is random and therefore not learnable by our intelligent agent. The small variance in the number of steps needed can be explained by the random generation of the item locations. Fig. \ref{598976} shows that the minimum, maximum, and median number of steps needed for the a priori setup also remain consistent over time.
One of our main goals was to compare the a priori and a posteriori setups. Fig. \ref{601366} shows that the a posteriori setup performs worse than the a priori setup. However, solving the a posteriori setup is a much harder problem. First, it involves more possible actions (moves through shelves are also considered during planning for a posteriori), and second, much information about the world, i.e., the locations of the shelves, is unknown to the intelligent agent. It is remarkable that for the $8 \times 8$ a posteriori configuration with 0 agents, a performance similar to a priori is reached after only around 100 fetches (Fig. \ref{601366}a). This could be because using 0 agents makes the scenario static, although it is still unknown to the agent. For the other configurations, we also see a strong trend toward the performance of the a priori configurations; however, in our experiments, they never reach the same performance. It is not clear whether more fetches, i.e., more training data, would suffice to learn to distinguish between shelves and agents, or whether they would never converge toward the a priori performance. To answer this, further experiments with longer fetch sequences are necessary.
During our investigation, we could not yet explain why, for most configurations, the performance degrades significantly during the first few fetches before improving again. The only connection we could draw is that we sometimes observed similar behavior while performing reinforcement learning in a different domain. Further research is necessary to find the root cause of this behavior. However, this was not a major concern for us, as for all configurations the final performance exceeds that of the first fetch.