The ambition of this web page is to state, refine, clarify, and, most of all, promote discussion of the following scientific hypothesis:
That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).
Is this true? False? A definition? Unfalsifiable? You are encouraged to comment on the hypothesis, even in minimal ways. For example, you might submit an extension "Yes" to indicate that you believe the hypothesis, or similarly "No" or "Not sure". These minimal responses will be collected and tallied at some point, and you may want to change yours later, so please include your name in some way.
This is my favorite "null hypothesis", so much so that I sometimes call it simply the null hypothesis. It feels essential to take a position on this very basic issue before one can talk clearly and sensibly about so much else.
Michael Littman calls this the reinforcement learning hypothesis. That name seems appropriate because it is a distinctive feature of reinforcement learning that it takes this hypothesis seriously. Markov decision processes involve rewards, but only with the onset of reinforcement learning has reward maximization been put forth seriously as a reasonable model of a complete intelligent agent analogous to a human being.
I hold that it would be difficult to refute the reward hypothesis, especially from a biological perspective. It can be seen that most living systems are formed, develop, and act based on a small set of base signals (nervous response, immune activation, chemical stimulus, ...). It can also be argued that most (if not all) of these base signals strive to maximize a single characteristic (signal): the survivability of the organism.
I think the interesting question lies in the identification of the base reward signal. Is the final reward a level of conscious "happiness" or excitation in the agent, or is it a survival value built into the system far below the level of cognitive processes (i.e., the state and response of molecular machinery)? Broken down far enough, is the reward for each component part simply "I survived" or "I didn't", where the system aims to maximize the summed long-term survival of its component parts?
I am skeptical as to whether reward can be thought of solely in terms of survival value; consider the example of the martyr who would rather take the reward that comes with sacrificing his life, over the reward of living another day. Rather, it just happens that the rewards we receive tend to correlate with the survival value of a given state (if they didn't, we wouldn't be alive to talk about this)!
The forces of nature don't care how their populations survive, only that they do survive; my guess is that the concept of reward (in organic creatures) is influenced by many different forces - both cognitive and simpler, basic drives - all working in tandem.
I came to the conclusion that the rewards you give the agent are a crucial, delicate matter. Unless you set the rewards to be 1 for winning and 0 otherwise, as in backgammon, or maybe -1 for each time step and 0 for the goal, it can get very messy. A lot of issues arise, like trading off among time, fuel, car damage, and human damage while learning to drive a car. I am really wondering if there is any sound approach that uses complex reward functions.
So my questions would be: is every success in reinforcement learning based on very simple reward functions? How would one go about designing the reward function for a complex application, such as driving cars? (or even more challenging, and I am sure that's one of the things DARPA is after - robot marines - what's the tradeoff between killing a civilian, missing an enemy and getting hit yourself?).
Maybe one way to approach this problem would be to give the agent a grade at the end of each episode based on how pleased we are with its performance. But this would imply a lot of human work - especially for on-line methods, which need a lot of samples.
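The simple reward schemes mentioned in the comment above, and the messier driving case, can be sketched as reward functions. This is only an illustration: the driving weights below are invented, precisely to show how the tradeoffs the comment worries about end up as numbers someone has to choose.

```python
def backgammon_reward(episode_over, won):
    """Sparse reward as in backgammon: 1 for winning, 0 otherwise."""
    return 1.0 if (episode_over and won) else 0.0

def shortest_path_reward(at_goal):
    """-1 per time step, 0 at the goal: maximizing return minimizes steps."""
    return 0.0 if at_goal else -1.0

def driving_reward(dt, fuel_used, car_damage, human_harm,
                   w_time=1.0, w_fuel=0.1, w_car=10.0, w_human=1e6):
    """A hypothetical scalarized driving reward; the weights encode the
    delicate tradeoffs between time, fuel, car damage, and human damage."""
    return -(w_time * dt + w_fuel * fuel_used
             + w_car * car_damage + w_human * human_harm)
```

The point of the sketch is that in the driving case the designer cannot avoid committing to explicit exchange rates between incommensurable quantities.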
Here are today's questions related to the reward hypothesis:
0. Rich says that it is the user's responsibility to specify a reward function, since the reward function is part of the problem. It is then an RL researcher's duty to come up with an RL algorithm that learns a policy maximizing the reward.
1. It seems to me that many (most?) "real life" tasks we deal with do not immediately involve any reward values. For instance, there is no mention of a scalar signal in any of the following problems: "get from Edmonton to Los Angeles", "find a better job", "marry a nice woman", "be happy", "buy good food", "drive safely", "catch a 5pm plane", "play backgammon well", "fly a helicopter inverted", "make an AIBO run fast", etc. The reward signal was imposed on them by RL researchers (e.g., Tesauro, Ng, Stone, Veloso, et al.). Thus, it is natural to wonder if requiring the client to provide a single scalar signal that captures everything about their problem is reasonable.
2. How can it possibly be unreasonable? After all, we, as the authors of RL algorithms, have the right to introduce consistent assumptions on the problem formulation at will, do we not? Well, here is an outrageously ridiculous example of an unreasonable problem specification requirement. Do you want a simple and guaranteed method to solve any problem in AI? Well, I am going to give it to you free of charge. As the author of this method, I will require that the solution is a part of the problem and is provided to the algorithm as such. Then my algorithm simply takes the solution out of the input and outputs it to the user ;-)
3. Thus, it is reasonable to ask what requirements on the problem specification one may impose. I think what we are missing here is a standard language to describe AI problems. Rich asked me today, "How else can we specify a problem if not via a reward function?" In the last half of the previous century, a similar question was asked by Turing, Church, von Neumann, et al.: how can one describe a general computational problem? The answer found there lay in checking membership in a formal language (the so-called "decision problem"). Recursion Theory builds a large theory around recursive sets (whose membership can be decided by a Turing machine), recursively enumerable sets (whose membership can be semi-decided by a Turing machine), etc.
4. Do we need a different language to represent AI problems? Or should we reason that any AI agent is a computational device and therefore AI problems should be viewed merely as computational problems and formalized through the formal languages as all other computational problems?
P.S. As a side note: how much of the solution does a reward function contain? Since infinitely many reward functions lead to the same policy, are we, as the designers of the RL algorithms, equally satisfied with any such reward function? Or do we want a "better" reward function that will, for instance, speed up learning? [e.g., Andrew Ng was adding hand-engineered shaping rewards]. Then aren't we asking the user to help us solve the problem?
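A toy instance of the decision-problem framing mentioned in point 3: a computational problem is a formal language (a set of strings), and a solver is a membership test. The particular language here is made up purely for illustration.

```python
def in_even_binary(w):
    """Decide membership in the (made-up) language of even binary numerals."""
    return w != "" and set(w) <= {"0", "1"} and w.endswith("0")

# A binary numeral denotes an even number exactly when it ends in 0.
assert in_even_binary("1010")       # 10 in decimal: a member
assert not in_even_binary("111")    # 7 in decimal: not a member
```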
Notice that the reward hypothesis is inconsistent with the belief that somehow concern for risk, say about minimizing the worst that can happen, makes for a different problem. The hypothesis implies that all such cases can be reduced to maximizing expected cumulative reward for some choice of reward. Thus the hypothesis may be at odds with the fact that there exist large bodies of research that treat risk-sensitive planning as a special case. Down with special cases!!
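The claimed reduction can be illustrated in a one-step case: a risk-averse preference between two gambles of equal expected reward becomes plain expected-reward maximization once a concave utility is folded into the reward signal. A minimal sketch (the gambles and the utility are invented, and multi-step risk-sensitive criteria are subtler than this):

```python
import math

# Two one-step gambles, each a list of (probability, reward) pairs:
safe  = [(1.0, 10.0)]
risky = [(0.5, 0.0), (0.5, 20.0)]

def expected(outcomes, f=lambda x: x):
    """Expected value of f(reward) under the gamble's distribution."""
    return sum(p * f(x) for p, x in outcomes)

# Under plain expected reward the two gambles are indistinguishable:
assert expected(safe) == expected(risky)

# Folding a concave utility into the reward makes the risk-averse
# preference itself an instance of expected-reward maximization:
u = lambda x: -math.exp(-0.2 * x)
assert expected(safe, u) > expected(risky, u)
```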
I come at this from the standpoint of an M.B.A. with a focus in finance. It seems to me that your view of goals and purposes is analogous to our concept of net present value in the context of capital rationing.
Under the net present value methodology we use here, we rank projects to fund (decisions to take) based on which will provide the highest net present value of a future stream of cash flows (rewards). If we were not concerned with the time value of rewards, that is, with preferring rewards closer to the present, then we would be indifferent between a decision path that stacks most reward at the end of the time horizon versus the beginning, assuming the total value of the rewards is equal. But since we want our rewards as soon as possible, we prefer the path that stacks the rewards closer to the present.
This is a rational economic argument that seems to share commonalities with RL.
It sounds like Al Paris is thinking about the case of conventional discounted reward, as in http://www.cs.ualberta.ca/%7Esutton/book/ebook/node30.html#eq:return. This seems to be the standard case in economics.
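The correspondence can be made concrete: net present value with per-period discount rate r is exactly the RL discounted return with gamma = 1/(1+r). A minimal sketch with invented cash flows:

```python
def npv(cash_flows, rate):
    """Net present value: cash flows discounted at `rate` per period."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def discounted_return(rewards, gamma):
    """RL discounted return: sum of gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Two projects with the same undiscounted total (300), rewards early vs. late:
early = [200, 50, 50]
late  = [50, 50, 200]
rate = 0.10

assert npv(early, rate) > npv(late, rate)   # we prefer rewards sooner

# NPV and discounted return coincide when gamma = 1/(1+rate):
assert abs(npv(early, rate) - discounted_return(early, 1/(1+rate))) < 1e-9
```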
I can say 'no' to this null hypothesis but can't be confident in 'yes' either. "Maximization of the expected value of the cumulative sum of a received scalar signal" ... why does it have to be 'scalar'? Can the goal be multi-criteria? For example, we might want a hotel that is cheap, near the attractions, and safe. Should we combine these into a scalar? What if the super cheap hotel is not safe?
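The usual (if unsatisfying) answer is to scalarize: combine the criteria with weights. A minimal sketch with made-up hotels and made-up weights; note that a weighted sum only trades safety off against price and distance, it cannot express a hard constraint like "never unsafe":

```python
# Hypothetical hotels: (name, price per night, distance to attractions in km,
# safety score in [0, 1]); all numbers are invented for illustration.
hotels = [
    ("SuperCheap", 20, 5.0, 0.2),
    ("MidRange",   80, 1.0, 0.9),
    ("Luxury",    300, 0.5, 0.95),
]

def score(price, dist, safety, w_price=1.0, w_dist=10.0, w_safety=200.0):
    """One possible scalarization: lower price and distance, higher safety."""
    return -(w_price * price + w_dist * dist) + w_safety * safety

best = max(hotels, key=lambda h: score(h[1], h[2], h[3]))
# With these particular weights, the unsafe SuperCheap hotel loses
# despite its price - but a different weighting could reverse that.
```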
Any 'goals' or 'purposes' must be implicitly represented by our behavior, which is modelled by the policy. So I believe that technically the question is equivalent to finding a reward function for any possible policy (that is, given a policy, find a reward whose optimization results in the same policy). This is trivially possible when the Markov assumption holds - just immediately penalize actions that are not in the behavior. Assuming that the variables of the human mind are entirely observable (including its memory), all human behavior should have an associated reward function, given a suitable representation.
This technical remark does not say anything about the explaining power of rewards. Maybe we are able to define a reward, but we gain no deeper insight as it may become as complicated as the behavior itself.
So technically the answer is perhaps 'yes', but the more important question goes rather: is it always beneficial to describe goals as rewards in practice? Probably not.
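The construction in the comment above (given a policy, immediately penalize any action that deviates from it) can be sketched in a toy setting; the states, actions, and target policy are all invented:

```python
states  = ["home", "road", "work"]
actions = ["stay", "go"]
pi      = {"home": "go", "road": "go", "work": "stay"}   # target behavior

def reward(s, a):
    """0 for the action the policy would take, -1 for any other action."""
    return 0.0 if a == pi[s] else -1.0

# Greedily maximizing this reward in each state recovers the policy -
# and, as the comment notes, the reward is exactly as complicated as
# the behavior itself, so nothing is explained.
recovered = {s: max(actions, key=lambda a: reward(s, a)) for s in states}
assert recovered == pi
```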