Extreme Value Monte Carlo Tree Search (Extended Abstract)
Abstract
Monte-Carlo Tree Search (MCTS) combined with Multi-Armed Bandits (MABs) had seen limited success in domain-independent classical planning until recently. Previous work (Wissow and Asai 2023) showed that UCB1, which is designed for bounded rewards, performs poorly when applied to the cost-to-go estimates of classical planning, which are unbounded in ℝ, and improved performance by instead using a MAB with Gaussian reward assumptions. We further sharpen our understanding of the ideal bandits for planning tasks by resolving three issues. First, Gaussian MABs under-specify the support of cost-to-go estimates as (−∞, ∞). Second, the Full-Bellman backup, which backpropagates the max/min of the samples, lacks theoretical justification. Third, removing dead-ends lacks justification under Monte-Carlo backup. We use Extreme Value Theory Type 2 to resolve all three issues at once, propose two new bandits (UCB1-Uniform and UCB1-Power), and apply them to MCTS for classical planning. We formally prove their regret bounds and empirically demonstrate their performance in classical planning.
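To make the bandit contrast concrete, the following is a minimal, illustrative Python sketch in a reward-maximization setting: it compares the standard UCB1 index, whose exploration bonus assumes rewards bounded in [0, 1], with a variance-scaled Gaussian-style index in the spirit of UCB1-Normal (Auer, Cesa-Bianchi, and Fischer 2002). This is not the paper's UCB1-Uniform/Power construction, and the `Arm`, `ucb1_index`, and `gaussian_index` names are hypothetical.

```python
# Illustrative sketch only: contrasts UCB1 (bounded rewards) with a
# UCB1-Normal-style index (unbounded, Gaussian-like rewards). This is
# NOT the UCB1-Uniform/Power bandits proposed in the paper.
import math
import random


class Arm:
    """Running statistics for one action/successor node (hypothetical)."""

    def __init__(self):
        self.n = 0           # number of pulls
        self.total = 0.0     # sum of observed rewards
        self.total_sq = 0.0  # sum of squared rewards (for the variance)

    def update(self, reward):
        self.n += 1
        self.total += reward
        self.total_sq += reward * reward

    def mean(self):
        return self.total / self.n


def ucb1_index(arm, t):
    """Standard UCB1: its fixed-width bonus assumes rewards in [0, 1]."""
    return arm.mean() + math.sqrt(2.0 * math.log(t) / arm.n)


def gaussian_index(arm, t):
    """UCB1-Normal-style index: the bonus scales with the empirical
    variance, so unbounded (Gaussian-like) rewards are accommodated."""
    if arm.n < 2:
        return float("inf")  # need two pulls for a variance estimate
    var = (arm.total_sq - arm.n * arm.mean() ** 2) / (arm.n - 1)
    return arm.mean() + math.sqrt(16.0 * max(var, 0.0) * math.log(t - 1) / arm.n)


def select(arms, t, index):
    """Pull the arm with the largest optimistic index (unpulled arms first)."""
    return max(arms, key=lambda a: float("inf") if a.n == 0 else index(a, t))


if __name__ == "__main__":
    # Two hypothetical arms with unbounded Gaussian rewards, loosely
    # analogous to cost-to-go samples; UCB1's [0, 1] bonus would be
    # miscalibrated on this scale, while the variance-scaled index adapts.
    arms = [Arm(), Arm()]
    means = [10.0, 12.0]
    for t in range(1, 1001):
        arm = select(arms, t, gaussian_index)
        arm.update(random.gauss(means[arms.index(arm)], 3.0))
    print([a.n for a in arms], [round(a.mean(), 2) for a in arms])
```

Swapping `gaussian_index` for `ucb1_index` in the loop illustrates the mismatch the abstract describes: on rewards of magnitude ~10, the fixed UCB1 bonus is negligible and exploration effectively stops.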