Training offline from the fixed logs of an external behavior policy.
Learning on the real system from limited samples.
High-dimensional continuous state and action spaces.
Safety constraints that should never or at least rarely be violated.
Tasks that may be partially observable, which can alternatively be viewed as non-stationary or stochastic.
Reward functions that are unspecified, multi-objective, or risk-sensitive.
System operators who desire explainable policies and actions.
Inference that must happen in real time at the control frequency of the system.
Large and/or unknown delays in the system actuators, sensors, or rewards.
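To make the first challenge concrete, here is a minimal sketch of off-policy evaluation from fixed logs: estimating the value of a new target policy using only data collected by a different behavior policy, via ordinary importance sampling. Everything here is an illustrative assumption (the toy log format, the two policies, the reward model); it is not drawn from any particular system.

```python
import random

random.seed(0)

ACTIONS = [0, 1]

def behavior_prob(action):
    # Hypothetical behavior policy that generated the logs:
    # uniform over the two actions.
    return 0.5

def target_prob(action):
    # Hypothetical target policy we wish to evaluate offline:
    # it prefers action 1.
    return 0.8 if action == 1 else 0.2

# Fixed logs: (action, reward) pairs collected by the behavior policy.
# In this toy reward model, action 1 pays off more often than action 0.
logs = []
for _ in range(10_000):
    a = random.choice(ACTIONS)
    p_reward = 0.7 if a == 1 else 0.3
    r = 1.0 if random.random() < p_reward else 0.0
    logs.append((a, r))

def ois_estimate(logs):
    # Ordinary importance sampling: reweight each logged reward by the
    # likelihood ratio target_prob(a) / behavior_prob(a), so the average
    # estimates the target policy's expected reward without ever
    # executing it on the real system.
    return sum(target_prob(a) / behavior_prob(a) * r for a, r in logs) / len(logs)

value = ois_estimate(logs)
# The true value of the target policy here is 0.8*0.7 + 0.2*0.3 = 0.62;
# the estimate should land close to that.
```

Even this toy example hints at why the challenge is hard: the importance weights grow as the target policy diverges from the behavior policy, inflating the variance of the estimate, and the behavior policy's action probabilities must be known or estimated from the logs.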