The Stanford marshmallow experiment from a machine learning perspective

The Stanford marshmallow experiment was a psychological study on delayed gratification. It was conducted in 1970 by a team led by psychologist Walter Mischel at Stanford University.

The experimental setup and outcome were as follows (from Wikipedia):

In this study, a child was offered a choice between one small but immediate reward, or two small rewards if they waited for a period of time. During this time, the researcher left the child in a room with a single marshmallow for about 15 minutes and then returned. If they did not eat the marshmallow, the reward was either another marshmallow or pretzel stick, depending on the child's preference.
In follow-up studies, the researchers found that children who were able to wait longer for the preferred rewards tended to have better life outcomes, as measured by SAT scores, educational attainment, body mass index (BMI), and other life measures.

(Please check the Wikipedia page for details and references.)

The interpretation of the result seems to be simple: children who are able to resist the temptation of instant gratification will be more successful in the future. We all know that this is at the core of the concept we call discipline or self-control. The result seems to confirm our existing knowledge about character traits which lead people to success.

In this series of posts, I want to study the experiment in detail with three goals in mind:

To show that the outcome can be explained in a surprisingly large number of ways. Only some of them indicate inferior cognitive performance of the kids who eat the marshmallow immediately.
To show that a more formal paradigm (like the one used in machine learning - reinforcement learning in particular) helps to identify more possible causes for an observed behavior.
More importantly, I want to use the experiment later as a case study to develop a more realistic formal paradigm for human-like artificial intelligence. The well-researched and successful reinforcement learning (RL in the following) paradigm from machine learning can be related to functions of the human brain ("dopamine hypothesis": dopamine as a reward signal, etc.). But some modifications and extensions will be required to achieve a really good match.

I will focus on the first two goals in this blog post and write about the third one in a later post.

Let’s collect as many reasons as possible why a kid might eat the marshmallow immediately. I will loosely use reinforcement learning (RL) terminology. The mapping is not perfect, but it helps to structure the different reasons.

Failure in return prediction: this is the standard interpretation. The kid fails to realize that it can double its return (the sum of current and future rewards) by waiting only 15 minutes. Such a failure could indeed be attributed to lower intelligence, which would of course correlate with worse life outcomes.
Stochastic environment - accurate world model: the standard critique of the experiment is that the kid does not trust the experimenter. In machine learning this corresponds to a stochastic environment: the transition to the state after 15 minutes, in which the kid receives a second marshmallow, happens only with a certain transition probability. There are two cases to consider. In the first case, the kid's environment really is stochastic (i.e. its parents, or other caregivers in the past, cannot be trusted, e.g. because they are mentally unstable or addicted to drugs). The decision to eat the marshmallow immediately would then be completely rational (the experimenter could take the first marshmallow away!). But this case, too, would be correlated with worse life outcomes: such unreliable parents (or an unreliable social environment in general) are likely not very good parents in other respects either.
Stochastic environment - inaccurate world model or failure to interpret correct world model: in the second case, the child's social environment was predominantly reliable in the past, but the kid overestimates the uncertainty of the transition to the state in which it receives the second reward. This would again be inferior cognitive performance leading to worse life outcomes. But the experimental setup with an unfamiliar experimenter is unusual for the kid (it’s not the mother offering the treats at home). Therefore the decision to estimate the transition probability as low could be rational even in this case.
Rewards not identical: the kid estimates that the second reward will in fact be zero or even negative. If the kid knows that it cannot eat that many marshmallows within 15 minutes, it will not be keen on marshmallows after eating the first one (i.e. zero reward), or it might even get a stomachache from eating the second one (i.e. negative reward) [1]. In both cases eating the marshmallow immediately is perfectly rational.
Curiosity / exploration: the whole setup of such an experiment is very unusual for a child. It might therefore decide to explore the environment (i.e. learn, which could increase long-term rewards) instead of simply exploiting it (maximizing short-term reward). This might trigger unexpected and seemingly irrational behavior (like not following adult advice) that would again be completely rational.
Expected change of internal state / world model: after eating the first marshmallow (and therefore knowing more about it), the second one might offer a lower, zero or even negative reward (the first one was not as tasty as expected or even disgusting, because it was spoiled or the wrong flavor or brand). The more the kid expects the marshmallow not to be very tasty, the more likely it is to try it immediately: in this case the child tries the food to learn/explore (which can be done immediately) rather than to eat/exploit.
Expected change of (internal) reward function (i.e. goals): eating is an important goal for humans. But, as the experiment shows, it can be delayed or suppressed, unlike e.g. self-preservation. Therefore the kid might have decided to go on a diet in exactly 10 minutes. It could then not enjoy the two treats which arrive later. This is admittedly rather theoretical, but it is mentioned for completeness.
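
The first few cases can be made concrete with a small calculation. The following is only a minimal sketch under strong assumptions, not a model of the actual experiment: the names Scenario, p_second, r_first and r_second are hypothetical, all rewards are measured in "marshmallow units", and a failed wait is assumed to leave the kid with nothing (the experimenter takes the first marshmallow away).

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical parameters of the kid's decision problem."""
    p_second: float = 1.0  # believed probability that waiting pays off
    r_first: float = 1.0   # reward of the first marshmallow
    r_second: float = 1.0  # reward of the second marshmallow


def return_eat_now(s: Scenario) -> float:
    """Return of eating immediately: only the first reward is collected."""
    return s.r_first


def return_wait(s: Scenario) -> float:
    """Expected return of waiting.

    Assumption: with probability p_second the kid gets both treats,
    otherwise it ends up with nothing at all.
    """
    return s.p_second * (s.r_first + s.r_second)


def best_action(s: Scenario) -> str:
    """Pick the action with the higher expected return (ties go to eating)."""
    return "wait" if return_wait(s) > return_eat_now(s) else "eat now"


def break_even_trust(s: Scenario) -> float:
    """Smallest p_second at which waiting is at least as good as eating now."""
    total = s.r_first + s.r_second
    if total <= 0:
        return float("inf")  # waiting can never pay off
    return s.r_first / total


if __name__ == "__main__":
    cases = {
        "standard interpretation": Scenario(),
        "low trust in experimenter": Scenario(p_second=0.4),
        "second reward worth nothing": Scenario(r_second=0.0),
        "second reward causes stomachache": Scenario(r_second=-0.5),
    }
    for name, s in cases.items():
        print(f"{name}: wait={return_wait(s):.2f}, "
              f"eat now={return_eat_now(s):.2f} -> {best_action(s)}")
```

Under these assumptions, waiting is only the better choice in the first case. With two equally valued treats the break-even trust level is one half, so a kid who believes the promise is kept less than half of the time acts rationally by eating immediately.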

We now see that the experiment deserves a rigorous analysis (which my blog post still is not). It is difficult to draw conclusions from a single experiment. Some of the cases listed above have already been explored in follow-up experiments.

[1] Also known as diminishing marginal utility in economics.

