Covariance Matrix Adaptation for Multiobjective Multiarmed Bandits.
Journal: IEEE Transactions on Neural Networks and Learning Systems
Published Date: Aug 1, 2019
Abstract
Upper confidence bound (UCB) is a successful multiarmed bandit algorithm for regret minimization. The covariance matrix adaptation for Pareto UCB (CMA-PUCB) algorithm considers stochastic reward vectors with correlated objectives. Using a variant of the Bernstein inequality for matrices, we upper bound the cumulative pseudoregret of pulling suboptimal arms for the CMA-PUCB algorithm by O(ln(nDK) ∑_i (||Σ_i|| / ∆_i)), which is logarithmic in the number of arms K, objectives D, and samples n, where ∆_i is the regret of pulling the suboptimal arm i. For unknown covariance matrices between objectives Σ_i, approximating the covariance matrix from samples yields an upper bound of O(n ln(nDK) + ln(nDK) ∑_i (1/∆_i)). Simulations on a three-objective stochastic environment show the applicability of our method.
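To make the Pareto UCB idea concrete, the following is a minimal sketch of a single arm-selection step for vector-valued rewards. It is not the CMA-PUCB algorithm from the paper: it assumes a generic scalar exploration bonus sqrt(2 ln(n) / n_i) added to every objective, whereas CMA-PUCB would use a term that depends on the covariance between objectives. The function names `pareto_front` and `pareto_ucb_candidates` are hypothetical and chosen only for illustration.

```python
import numpy as np


def pareto_front(points):
    """Return indices of rows not dominated by any other row (maximization)."""
    points = np.asarray(points, dtype=float)
    front = []
    for i, p in enumerate(points):
        # q dominates p if q >= p in every objective and q > p in at least one.
        dominated = any(
            np.all(q >= p) and np.any(q > p)
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


def pareto_ucb_candidates(mean_rewards, pulls, total_pulls):
    """Return arms whose optimistic reward vectors are Pareto optimal.

    Assumption: the same scalar bonus sqrt(2 ln(n) / n_i) is added to each
    objective of arm i. This stands in for the covariance-dependent
    confidence term of CMA-PUCB, which is not reproduced here.
    """
    mean_rewards = np.asarray(mean_rewards, dtype=float)  # shape (K, D)
    pulls = np.asarray(pulls, dtype=float)                # shape (K,)
    bonus = np.sqrt(2.0 * np.log(total_pulls) / pulls)    # shape (K,)
    optimistic = mean_rewards + bonus[:, None]            # broadcast over D
    return pareto_front(optimistic)


# Three arms, three objectives: arm 2 is dominated in its mean rewards.
means = [[0.9, 0.1, 0.5], [0.1, 0.9, 0.5], [0.05, 0.05, 0.05]]
print(pareto_ucb_candidates(means, pulls=[100, 100, 100], total_pulls=300))
print(pareto_ucb_candidates(means, pulls=[100, 100, 1], total_pulls=201))
```

With equal pull counts the bonus is identical for all arms, so the candidates are the arms on the Pareto front of the empirical means. When arm 2 has been pulled only once, its large bonus makes its optimistic vector dominate the others, so it is selected for exploration. A full bandit loop would pick one candidate (for example uniformly at random), pull it, and update its mean vector and pull count.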