Avinash Paidi

In Reinforcement Learning (RL), Monte Carlo methods teach agents to make decisions based on the principle of averaging sample returns. Monte Carlo Reinforcement Learning (MCRL) methods interact with an environment and estimate the value functions of states from the experience gained through those interactions. Monte Carlo RL methods need to collect many full trajectories (sequences of states, actions, and rewards) to estimate value functions accurately and to achieve convergence. Even when the state space of the RL problem is only moderately large, the number of trajectories required for accurate value estimation can become very large. Since trajectory lengths also grow with the size of the state space, collecting and processing each trajectory becomes computationally expensive and time-consuming.

To address this problem, in this work we propose a new methodology that accelerates the learning process, making Monte Carlo RL methods more efficient and faster to converge. The proposed methodology performs state space abstraction by clustering the state space, which significantly reduces trajectory lengths. A rough-clustering-based approach is used for the abstraction. An agent first explores the environment and collects trajectories for a number of episodes. From the collected trajectories, a preliminary topological structure of the environment is obtained and initial estimates of the state value functions are calculated. Using this preliminary topological structure and the available value estimates, group-average hierarchical agglomerative clustering is performed to divide the state space into clusters of states. The initial value estimates and the preliminary topological structure are then used again to reshuffle states among the clusters via rough clustering. Several iterations of rough clustering yield better-quality clusters of states.

Options (macro-actions) for traversing between different clusters of the state space are learnt using dynamic programming techniques. Once the options are learnt, at each time step of Monte Carlo learning the agent can choose either a primitive action or, if available, an option. These options significantly accelerate the learning process: they reduce trajectory lengths, and shorter trajectories result in faster convergence of Monte Carlo methods, as shown by the experimental results in this paper.
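
As a concrete illustration of the Monte Carlo estimation step described above, the sketch below averages first-visit returns over collected trajectories. The trajectory format (lists of (state, action, reward) tuples), the discount factor, and the function name are assumptions made for illustration, not the paper's implementation.

```python
from collections import defaultdict

def first_visit_mc_values(episodes, gamma=0.99):
    """Estimate V(s) by averaging first-visit returns over collected episodes.

    `episodes` is assumed to be a list of trajectories, each a list of
    (state, action, reward) tuples collected by the exploring agent.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Discounted return from every time step, computed backwards.
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, _, reward = episode[t]
            G = reward + gamma * G
            returns_at[t] = G
        # Record the return only at the first visit of each state.
        first_visit = {}
        for t, (state, _, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        for state, t in first_visit.items():
            returns_sum[state] += returns_at[t]
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```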
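The clustering step can be sketched with SciPy's hierarchical clustering, where 'average' linkage corresponds to group-average agglomerative clustering. The per-state feature vectors (assumed here to be built from the initial value estimates and simple topology statistics) and the Euclidean distance are illustrative assumptions; the paper's exact features and distance may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_states(state_features, num_clusters):
    """Group-average (UPGMA) agglomerative clustering of states.

    `state_features` is a hypothetical (n_states, n_features) array with one
    row per state, e.g. its initial value estimate plus topological statistics.
    """
    # 'average' linkage implements group-average agglomerative clustering.
    Z = linkage(state_features, method='average', metric='euclidean')
    # Cut the dendrogram into the requested number of clusters.
    return fcluster(Z, t=num_clusters, criterion='maxclust')
```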
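One rough-clustering reshuffling pass can be hinted at as follows: states whose value estimates are close to more than one cluster are placed in the upper approximations of all such clusters, while clearly assigned states go into a lower approximation. The value-based distance and the `tau` threshold below are stand-ins for the paper's topology-aware criterion.

```python
import numpy as np

def rough_reassign(values, clusters, tau=0.05):
    """One rough-clustering pass using lower/upper approximations.

    `values` maps each state to its current value estimate; `clusters` is a
    list of sets of states from the previous iteration.
    """
    means = [np.mean([values[s] for s in c]) if c else 0.0 for c in clusters]
    lower = [set() for _ in clusters]
    upper = [set() for _ in clusters]
    for s, v in values.items():
        dists = [abs(v - m) for m in means]
        nearest = int(np.argmin(dists))
        close = [i for i, d in enumerate(dists) if d - dists[nearest] <= tau]
        if len(close) == 1:
            # Unambiguous state: lower (and upper) approximation of one cluster.
            lower[nearest].add(s)
            upper[nearest].add(s)
        else:
            # Boundary state: upper approximation of every nearby cluster.
            for i in close:
                upper[i].add(s)
    return lower, upper
```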
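Finally, an option for moving from one cluster to a neighbouring one can be learnt with a small dynamic-programming routine that treats arrival in the target cluster as a terminal subgoal. The deterministic `transition(s, a)` model and the pseudo-reward of 1 are assumptions made for this sketch; the paper's option-learning setup may differ.

```python
def learn_option(actions, transition, source_cluster, target_cluster,
                 gamma=0.99, n_sweeps=100):
    """Value-iteration sketch for an option from `source_cluster` to `target_cluster`.

    `transition(s, a)` is an assumed deterministic environment model returning
    the next state; reaching the target cluster ends the option with reward 1.
    """
    V = {s: 0.0 for s in source_cluster}

    def q_value(s, a):
        s_next = transition(s, a)
        if s_next in target_cluster:
            return 1.0                      # subgoal reached
        return gamma * V.get(s_next, 0.0)   # states outside the source cluster default to 0

    for _ in range(n_sweeps):               # value-iteration sweeps over the source cluster
        for s in source_cluster:
            V[s] = max(q_value(s, a) for a in actions)

    # Greedy intra-option policy over the converged values.
    policy = {s: max(actions, key=lambda a: q_value(s, a)) for s in source_cluster}
    return policy, V
```

Once such an option exists for each pair of neighbouring clusters, the Monte Carlo agent can invoke it as a single macro-action, which is what shortens the trajectories it has to collect.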