Some safety proposals, most notably CIRL, argue that uncertainty about the reward function guarantees safety. We'll examine reward uncertainty through the lens of POMDPs, since reward uncertainty is easily encoded as state uncertainty. We'll see why, in principle, an optimal POMDP planning agent with reward uncertainty engages in safe, directed exploration: it avoids doing anything catastrophic under any plausible reward function while efficiently acquiring information about the reward function. We'll then turn to two obstacles that make it very hard to derive practical guarantees from this approach: misspecification and the guarantees' lack of robustness to approximate inference.
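
As a concrete reference point, here is a minimal sketch of that reduction; the symbols Θ, b_0, and R_θ are illustrative notation, not taken from the slides. An MDP with an unknown reward parameter θ becomes a POMDP whose hidden state carries θ:

\[
\begin{aligned}
&\text{MDP } (S, A, T, R_\theta, \gamma),\ \theta \in \Theta \text{ unknown with prior } b_0(\theta)
\;\;\longrightarrow\;\; \text{POMDP with hidden state } (s, \theta):\\
&\tilde{S} = S \times \Theta, \qquad
\tilde{T}\big((s',\theta') \mid (s,\theta), a\big) = T(s' \mid s, a)\,\mathbf{1}[\theta' = \theta], \qquad
\tilde{R}\big((s,\theta), a\big) = R_\theta(s, a).
\end{aligned}
\]

Observations reveal s but not θ; any θ-dependence in the observation likelihood (e.g., human feedback in CIRL) is what lets the belief over θ update. An optimal policy for this POMDP therefore trades off reward under the current belief against the value of information about θ, which is the source of the safe, directed exploration described above.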

Slides

L03_Value_Uncertainty.pdf

Core Readings

Supplemental Readings