The intelligent agent knew about the warehouse domain and scenario from PDDL descriptions (Listings 1 and 2). In some experiments, all walls were also described in these files; in others, the intelligent agent had to learn their locations via the feedback of failed moves. Please note that it did not learn whether a move was blocked by an agent or by a wall, so these other agents add noise to the observed learning data.
All experiments use the same PDDL domain file, shown in Listing 1, which describes three actions: move enables the intelligent agent to move as described above, pickup allows it to pick up an item (if it is in the same cell), and put lets it put an item down at a put location.
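Since Listing 1 is not reproduced here, the following is a minimal sketch of how such a domain file could be written. The predicate names at, item-at, holding, and put-location are assumptions for illustration; only connected appears in the text.

\begin{verbatim}
(define (domain warehouse)
  (:requirements :strips :typing)
  (:types cell item)
  (:predicates
    (at ?c - cell)                 ; the agent's current cell
    (item-at ?i - item ?c - cell)  ; an item's location
    (holding ?i - item)            ; the agent carries an item
    (connected ?from ?to - cell)   ; adjacency; omitted where a wall is
    (put-location ?c - cell))      ; the delivery cell

  ;; move: step to an adjacent cell; may fail at run time if blocked
  (:action move
    :parameters (?from ?to - cell)
    :precondition (and (at ?from) (connected ?from ?to))
    :effect (and (at ?to) (not (at ?from))))

  ;; pickup: take an item located in the agent's current cell
  (:action pickup
    :parameters (?i - item ?c - cell)
    :precondition (and (at ?c) (item-at ?i ?c))
    :effect (and (holding ?i) (not (item-at ?i ?c))))

  ;; put: deliver a held item at the put location
  (:action put
    :parameters (?i - item ?c - cell)
    :precondition (and (at ?c) (holding ?i) (put-location ?c))
    :effect (and (item-at ?i ?c) (not (holding ?i)))))
\end{verbatim}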
The basic structure of all our PDDL problem files is shown in Listing 2. Depending on the example configuration, specific atoms, i.e., (connected room_X1_Y1 room_X2_Y2), might be omitted from the initial state to indicate that there is a wall between the two cells. For each fetch task, the intelligent agent receives a new random item location, which is added to the PDDL initial state on-the-fly, together with a PDDL goal to bring the item to the put location. The intelligent agent then starts at (0/0) in the grid, has to go to the item's location, pick it up, go back to the put location (0/0), and, finally, deliver the item by putting it down, thus fulfilling the PDDL goal. If re-planning is necessary, the intelligent agent starts from its current location.
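In the same spirit, a minimal problem-file sketch for a tiny 2x2 grid is shown below; object and predicate names other than connected and room_X_Y are again assumptions. A wall between two cells would be modelled by omitting the corresponding pair of connected atoms, while the item's random location and the goal are the parts added on-the-fly for each fetch task.

\begin{verbatim}
(define (problem fetch-task-1)
  (:domain warehouse)
  (:objects
    room_0_0 room_0_1 room_1_0 room_1_1 - cell
    item_1 - item)
  (:init
    (at room_0_0)              ; the agent starts at (0/0)
    (put-location room_0_0)    ; items are delivered to (0/0)
    ;; adjacency atoms; omit a pair to model a wall
    (connected room_0_0 room_0_1) (connected room_0_1 room_0_0)
    (connected room_0_0 room_1_0) (connected room_1_0 room_0_0)
    (connected room_0_1 room_1_1) (connected room_1_1 room_0_1)
    (connected room_1_0 room_1_1) (connected room_1_1 room_1_0)
    ;; the item's random location, added on-the-fly per fetch task
    (item-at item_1 room_1_1))
  ;; the on-the-fly goal: bring the item to the put location
  (:goal (item-at item_1 room_0_0)))
\end{verbatim}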
We used different warehouse sizes (5x5, 8x8, or 11x11 cells) and numbers of other agents (0, 1, or 4) in our experiments. The last parameter of a configuration is the setup of the experiment in terms of walls and the agent's a priori knowledge of them, as described below. For each configuration, we ran 100 different random fetch sequences and report the average values. Each fetch sequence consisted of 100 fetch tasks. We also show the average performance for each task (over the 100 runs) in order to investigate the performance increase over a sequence. Please note that after a fetch sequence finished, the learned knowledge was discarded. For reasons of space, we report only a few selected configurations in this section. The results for all configurations and the code for the experiments are available on GitHub (see https://github.com/martinzimmermann/RBL-test-programs/releases/tag/CPS-RTSA).
Shelves a priori: The grid world contains shelf cells that the agent cannot enter, around which two normal cells are placed (see Fig. \ref{213840}). Items can only be located next to a shelf. For this setup, the PDDL problem contains the shelves' locations (Lst. 2) so that the agent can move around efficiently. The challenge of this setup is that multiple agents operate in the same warehouse. The intelligent agent knows nothing about their locations and can only learn about them by colliding with them. Still, the intelligent agent is expected to fetch items efficiently.
Shelves a posteriori: This setup is similar to the previous one, but the PDDL problem file does not contain data about the shelves. Instead, the intelligent agent learns these data via the reliability of the move actions into neighboring cells, so that it can still move around the grid efficiently (while also having to avoid other agents).
Maze: The third setup is a randomly generated maze (see Fig. \ref{841210}), where the intelligent agent does not know the layout of the maze but has to learn it by colliding with walls. Please note that since corridors are only one cell wide, it is not possible to bypass other agents; thus, there are no other agents in this setup.

Experimental Results