
Optimal Estimation of the Best Mean in Multi-Armed Bandits

Abstract

We study the problem of estimating the mean reward of the best arm in a multi-armed bandit (MAB) setting. Specifically, given a target precision epsilon and a confidence parameter delta, the goal is to return an epsilon-accurate estimate of the largest mean reward with probability at least 1 - delta, while minimizing the number of samples. We propose an algorithm that combines a two-phase sampling strategy with a novel stopping condition based on a confidence ellipsoid. The algorithm is simple, nearly free of hyperparameters, and achieves the asymptotically optimal sample complexity of 2 R^2 f(mu) log(1/delta), where R is the sub-Gaussian parameter of the reward distributions and f(mu) is the value of an instance-specific optimization problem. In contrast to the widely used Track-and-Stop algorithm, which must solve a non-convex optimization problem at every round, our approach solves a single convex optimization problem, and only once. We also establish a matching lower bound, proving that no algorithm can asymptotically outperform this rate. Our analysis introduces several new techniques to handle the non-convexity of the characterizing optimization problem. Experimental results support our theoretical guarantees and demonstrate the practical effectiveness of the proposed method.
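
For concreteness, the guarantee described above can be restated as follows. Here hat{mu} denotes the returned estimate, tau the (random) total number of samples, K the number of arms, and mu = (mu_1, ..., mu_K) the vector of arm means; these symbols are our own illustrative notation, and the limit below is our reading of the stated asymptotic rate rather than a verbatim statement from the paper.

\[
\Pr\left[ \,\bigl| \hat{\mu} - \max_{1 \le i \le K} \mu_i \bigr| \le \epsilon \,\right] \ge 1 - \delta,
\qquad
\lim_{\delta \to 0} \frac{\mathbb{E}[\tau]}{\log(1/\delta)} = 2 R^2 f(\mu),
\]

where R is the sub-Gaussian parameter of the rewards and f(mu) is the value of the instance-specific optimization problem referenced in the abstract.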