We study a multi-armed bandit problem with dependent arms. The decision maker dynamically chooses one of finitely many arms to maximize her total expected discounted net benefit over an infinite time horizon. Pulling an arm incurs an arm-specific cost and yields a reward. The arms’ rewards depend on one another through a common parameter that is unknown to the decision maker; by pulling one arm, she therefore also collects information about the other arms. Using a continuous-time formulation, we devise a novel tree-based construction approach and explicitly characterize an optimal policy for this problem. Our analysis reveals that a dynamic efficient frontier policy is optimal: there exists a dynamic efficient frontier in the space of immediate expected benefit and information quality of controls such that, under the optimal policy, the decision maker only uses controls on the realized efficient frontier. Distinctively, this efficient frontier is not static; it evolves with the accumulated information, as summarized by the decision maker’s posterior belief. Moreover, it is characterized solely in terms of the problem parameters, without additional computation. We observe that, relative to the optimal policy, the greedy policy (a common heuristic) can significantly underperform in this setting.
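For intuition only, the following minimal Python sketch simulates the kind of dependence the abstract describes: all arm names, parameter values, and the two-point prior are our own assumptions for illustration, not the paper’s model or its tree-based construction. It shows how every pull, regardless of arm, updates a single posterior belief about a common parameter, and it implements the greedy heuristic that ranks arms by immediate expected net benefit while ignoring the information each pull generates.

import random

# Hypothetical two-arm illustration (all numbers are assumptions, not the
# paper's): theta is a common unknown parameter shared by both arms, so
# pulling either arm updates the same belief P(theta = 1).
ARMS = [  # (pulling cost, success prob if theta = 1, success prob if theta = 0)
    (0.10, 0.9, 0.2),  # costlier but more informative arm
    (0.05, 0.6, 0.5),  # cheaper but weakly informative arm
]
DISCOUNT = 0.95

def expected_net_benefit(arm, belief):
    """Immediate expected reward minus cost under the current belief."""
    cost, p_hi, p_lo = arm
    return belief * p_hi + (1.0 - belief) * p_lo - cost

def bayes_update(belief, arm, outcome):
    """Posterior P(theta = 1) after observing one Bernoulli pull."""
    _, p_hi, p_lo = arm
    like_hi = p_hi if outcome else 1.0 - p_hi
    like_lo = p_lo if outcome else 1.0 - p_lo
    num = belief * like_hi
    return num / (num + (1.0 - belief) * like_lo)

def run_greedy(theta, belief=0.5, horizon=200, seed=0):
    """Greedy heuristic: always pull the arm with the highest immediate
    expected net benefit, ignoring the information value of the pull."""
    rng = random.Random(seed)
    total, disc = 0.0, 1.0
    for _ in range(horizon):
        arm = max(ARMS, key=lambda a: expected_net_benefit(a, belief))
        outcome = rng.random() < (arm[1] if theta == 1 else arm[2])
        total += disc * ((1.0 if outcome else 0.0) - arm[0])
        belief = bayes_update(belief, arm, outcome)
        disc *= DISCOUNT
    return total

Because the greedy rule considers only the immediate-benefit axis, it can keep pulling the cheap, uninformative arm and leave the belief nearly flat; the dynamic efficient frontier policy in the paper instead trades off immediate expected benefit against information quality, which is the gap this toy comparison is meant to suggest.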