Scale-Adaptive Balancing of Exploration and Exploitation in Classical Planning
Abstract
Balancing exploration and exploitation has long been an important problem in both adversarial games and automated planning. While it has been extensively analyzed in the Multi-Armed Bandit (MAB) literature, and the game community has achieved great success with MAB-based Monte Carlo Tree Search (MCTS) methods, the symbolic planning community has so far struggled to achieve the same level of success. We identify that a detailed theoretical understanding of the MAB literature is missing from the existing planning literature that uses MCTS / Trial-based Heuristic Tree Search (THTS). The use of UCB1 (Upper Confidence Bound 1) in THTS is ad hoc because the rewards in classical planning do not satisfy the theoretical requirement of UCB1 that they have a known bounded support shared among siblings. To address this issue, we propose a new Gaussian bandit, UCB1-Normal2, and analyze its regret bound. It is variance-aware, like UCB1-Normal and UCB-V, but has a distinct advantage: unlike UCB-V, it does not assume a known bounded support, and unlike UCB1-Normal, its regret bound does not rely on unfounded conjectures about the Student's-t and χ² distributions. The analysis indicates that UCB1-Normal2 performs well when the estimated variance is accurate, which makes it ideal for deterministic, discrete, finite state-space search in classical planning, where the empirical variance can match the true variance. Our experiments suggest that MCTS combined with UCB1-Normal2 outperforms Greedy Best-First Search (the traditional baseline) as well as MCTS with other bandits.
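For concreteness, a hedged sketch of the selection indices under discussion (the exact UCB1-Normal2 index is defined in the body of the paper; the variance-aware form below is only an illustrative assumption). UCB1 selects the arm/child maximizing

    \hat{\mu}_i + \sqrt{\frac{2 \ln N}{n_i}},

where \hat{\mu}_i is the empirical mean reward of arm i, n_i its visit count, and N the total number of visits; the confidence width is valid only when rewards lie in a known bounded support (e.g., [0, 1]) shared among siblings. A variance-aware Gaussian index instead scales exploration by the empirical variance \hat{\sigma}_i^2 of each arm, e.g.

    \hat{\mu}_i + \sqrt{\frac{2 \hat{\sigma}_i^2 \ln N}{n_i}},

so that no bound on the reward support needs to be known in advance.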