[Submitted on 17 Sep 2020 (v1), last revised 25 Oct 2020 (this version, v2)]
Abstract: We considered a novel practical problem of online learning with episodically
revealed rewards, motivated by several real-world applications, where the
contexts are nonstationary across episodes and reward feedback is
not always available to the decision-making agents. For this online
semi-supervised learning setting, we introduced Background Episodic Reward
LinUCB (BerlinUCB), a solution that easily incorporates clustering as a
self-supervision module to provide useful side information when rewards are not
observed. Our experiments on a variety of datasets, in both stationary and
nonstationary environments across six different scenarios, demonstrated clear
advantages of the proposed approach over the standard contextual bandit.
Lastly, we introduced a relevant real-life example where this problem setting
is especially useful.
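
As a rough illustration of the idea described in the abstract (a LinUCB-style bandit that falls back on clustering when the reward is not revealed), the following minimal sketch may help. It is not the paper's BerlinUCB update rule. The class name `ClusterAssistedLinUCB`, the online k-means step, and the choice to impute a missing reward with the mean of rewards previously observed for the same cluster and arm are all assumptions made for this example.

```python
import numpy as np


class ClusterAssistedLinUCB:
    """Hypothetical sketch: disjoint LinUCB plus cluster-based reward imputation."""

    def __init__(self, n_arms, dim, n_clusters=3, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.alpha = alpha  # exploration strength
        # Per-arm ridge-regression statistics, as in standard LinUCB.
        self.A = np.stack([np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))
        # Online k-means state (assumed self-supervision module).
        self.centroids = rng.normal(size=(n_clusters, dim))
        self.counts = np.zeros(n_clusters)
        # Running reward statistics per (cluster, arm), used for imputation.
        self.reward_sum = np.zeros((n_clusters, n_arms))
        self.reward_cnt = np.zeros((n_clusters, n_arms))

    def select(self, x):
        """Return the arm with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def _assign(self, x):
        """Assign x to its nearest centroid and move that centroid toward x."""
        k = int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))
        self.counts[k] += 1
        self.centroids[k] += (x - self.centroids[k]) / self.counts[k]
        return k

    def update(self, x, arm, reward=None):
        """Update with a revealed reward, or with an imputed one if reward is None."""
        k = self._assign(x)
        if reward is not None:
            self.reward_sum[k, arm] += reward
            self.reward_cnt[k, arm] += 1
        elif self.reward_cnt[k, arm] > 0:
            # Assumption: impute from past rewards seen in the same cluster.
            reward = self.reward_sum[k, arm] / self.reward_cnt[k, arm]
        else:
            return  # no side information yet; skip the bandit update
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In an episode where the reward is revealed, one would call `agent.update(x, arm, reward=r)`; otherwise `agent.update(x, arm)` lets the clustering step supply the side information. Other imputation rules would fit the same interface.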