Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space

Vijay Subramanian (University of Michigan)

Models of many real-life applications, such as queuing models of communication networks or computing systems, have a countably infinite state-space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite state settings, and do not directly apply to these models. To overcome this lacuna, in this work we study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes (MDPs) governed by an unknown parameter $\theta\in\Theta$, and defined on a countably-infinite state space $\mathcal{X}=\mathbb{Z}_+^d$, with finite action space $\mathcal A$, and an unbounded cost function. We take a Bayesian perspective with the random unknown parameter $\boldsymbol{\theta}^*$ generated via a given fixed prior distribution on $\Theta$. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes’ rule is used to produce a parameter estimate, which then decides the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. From this condition and using the solution of the average cost Bellman equation, we establish an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of our algorithm, where $T$ is the time-horizon. Finally, to elucidate the applicability of our algorithm, we consider two different queuing models with unknown dynamics, and show that our algorithm can be applied to develop approximately optimal control algorithms.

Bio: Vijay Subramanian is an Associate Professor in the EECS department at the University of Michigan. After graduating, he worked in the research arm of the Networks Business Sector of Motorola in Arlington Heights, Illinois, USA until May 2006. In May 2006, he returned to an academic setting. He started with a move to the Hamilton Institute of the National University of Ireland, Maynooth as a Research Fellow. In the summer of 2010, he was a visiting researcher at LIDS MIT. From November 2010 to October 2011, he was a Senior Research Associate in the EECS Department at Northwestern University. From November 2011 until August 2014, he was a Research Assistant Professor in the EECS Department at Northwestern University.

His main research interests are in stochastic modeling, communications, information theory and applied mathematics. A large portion of his past work has been on probabilistic analysis of communication networks, especially analysis of scheduling and routing algorithms. In the past, he has also done some work with applications in immunology and coding of stochastic processes. His current research interests are on game theoretic and economic modeling of socio-technological systems and networks, and the analysis of associated stochastic processes.