Vidushi Sharma, Andy Tek, et al.
NeurIPS 2025
We study the problem of estimating the mean reward of the best arm in a multi-armed bandit (MAB) setting. Specifically, given a target precision epsilon and a confidence level delta, the goal is to return an epsilon-accurate estimate of the largest mean reward with probability at least 1 - delta, while minimizing the number of samples. We propose an algorithm featuring a novel stopping condition based on a confidence ellipsoid and a two-phase sampling strategy. The algorithm is simple, nearly free of hyperparameters, and achieves the asymptotically optimal sample complexity of 2 R^2 f(mu) log(1/delta), where R is the sub-Gaussian parameter of the reward distributions and f(mu) is the value of an instance-specific optimization problem. In contrast to the widely used Track-and-Stop algorithm, which requires solving a non-convex optimization problem at every round, our approach solves a single convex optimization problem just once. We also establish a matching lower bound, proving that no algorithm can asymptotically outperform this rate. Our analysis introduces several new techniques to address the challenge posed by the non-convexity of the characterizing optimization problem. Experimental results support our theoretical guarantees and demonstrate the practical effectiveness of the proposed method.
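The abstract only fixes the high-level structure (an exploration phase, a focused sampling phase, and a sample-then-check stopping rule), so the following is a minimal Python sketch of that generic two-phase pattern, not the paper's method: the function estimate_max_mean, the callback pull, and the crude per-arm sub-Gaussian confidence width are all hypothetical stand-ins, and in particular the width check below replaces, and does not reproduce, the paper's confidence-ellipsoid stopping rule and its one-shot convex program.

```python
import numpy as np

def estimate_max_mean(pull, n_arms, eps, delta, R=1.0, phase1_rounds=100):
    """Hypothetical two-phase sketch (NOT the paper's algorithm).

    pull(i) -> one reward sample from arm i, assumed R-sub-Gaussian.
    Returns an estimate of max_i mu_i intended to be eps-accurate
    with probability roughly 1 - delta under these assumptions.
    """
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)

    # Phase 1: uniform exploration to roughly localize the best arm.
    for _ in range(phase1_rounds):
        for i in range(n_arms):
            sums[i] += pull(i)
            counts[i] += 1

    # Phase 2: concentrate samples on the empirically best arm and
    # stop once a naive sub-Gaussian confidence width is below eps.
    # This is only a crude proxy for the ellipsoid-based stopping
    # condition described in the abstract.
    while True:
        means = sums / counts
        best = int(np.argmax(means))
        width = R * np.sqrt(2.0 * np.log(1.0 / delta) / counts[best])
        if width <= eps:
            return means[best]
        sums[best] += pull(best)
        counts[best] += 1

# Example: three Gaussian arms with means 0.2, 0.5, 0.55.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mus = [0.2, 0.5, 0.55]
    est = estimate_max_mean(lambda i: rng.normal(mus[i], 1.0),
                            n_arms=3, eps=0.1, delta=0.05)
    print(est)  # should land within ~0.1 of 0.55 with high probability
```

Note that this sketch spends far more samples than the 2 R^2 f(mu) log(1/delta) rate claimed in the abstract; achieving that rate is exactly what the paper's ellipsoid stopping rule and single convex optimization are for.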
Robert Farrell, Rajarshi Das, et al.
AAAI-SS 2010
Benjamin Hoover, Zhaoyang Shi, et al.
NeurIPS 2025
Yuanzhe Liu, Ryan Deng, et al.
NeurIPS 2025