How great minds are made (spoiler: school is not helpful)

“Genius is the ability to independently arrive at and understand concepts that would normally have to be taught by another person.” - Immanuel Kant

This sounds plausible, and the correlation is certainly real: we know that geniuses don’t need much teaching. But what if the causation actually runs the other way round? What if geniuses perform so exceptionally well because they started to learn by themselves early on (instead of being forced to learn by some teaching institution)? What if “not teaching actively” is an important (if perhaps not sufficient) ingredient of “making geniuses”?

In this blog post I will discuss some ideas - based on recent results from machine learning research - for why the second interpretation is actually much more convincing.

We know that knowledge must be acquired in a certain order. You cannot learn algebra if you have not yet understood numbers. You cannot learn physics without some knowledge of algebra, or English literature without knowing the letters of the alphabet.
The same is true for learning machines. The performance of an AI can be increased by presenting the training data in a certain order; in machine learning, this method is known as “curriculum learning”. Let’s have a closer look at why this might yield better results than presenting the material to the model in random order. With some luck we can infer some insights about learning in humans.
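
To make this concrete, here is a minimal sketch (my own toy construction, not taken from any particular paper) of how the presentation order can change the amount of work a learner has to do. A simple perceptron is trained twice on the same data, once in random order and once “easy examples first”; in runs like this, the curriculum order often needs fewer corrective updates:

```python
import numpy as np

def train_perceptron(X, y, order, lr=0.1, epochs=5):
    """Train a perceptron, presenting the samples in the given order.
    Returns the weights and the number of corrective updates needed."""
    w = np.zeros(X.shape[1] + 1)                 # bias + feature weights
    updates = 0
    for _ in range(epochs):
        for i in order:
            pred = 1.0 if w[0] + X[i] @ w[1:] > 0 else -1.0
            if pred != y[i]:                     # misclassified -> correct
                w[0]  += lr * y[i]
                w[1:] += lr * y[i] * X[i]
                updates += 1
    return w, updates

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                     # toy 2D data
y = np.where(X[:, 0] > 0, 1.0, -1.0)             # true boundary: x0 = 0

random_order     = rng.permutation(len(X))
curriculum_order = np.argsort(-np.abs(X[:, 0]))  # far from boundary = easy first

print("random:    ", train_perceptron(X, y, random_order)[1], "updates")
print("curriculum:", train_perceptron(X, y, curriculum_order)[1], "updates")
```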

For the following discussion I will use a very simple problem as an example. Let’s assume that we want to learn how to differentiate between dogs and cats [1]. But we don’t do this from photos or other high-dimensional data, only from two feature variables: the size (say, shoulder height in cm) and the number of stripes on the animal’s fur (say, counted along a horizontal line through the middle of the body).

The available training data might look like this:

[Image 1: chart of the training data]

Note that I introduced some extra complexity by including a few brindle dogs.

Now this data is available for training. Let’s assume that we know only the data points (i.e. size and number of stripes) for each animal, but not the labels (i.e. whether they are dogs or cats):

[Image 2: chart of the training data without labels]

Our job now is to learn from this data: after the learning process, we should be able to differentiate between cats and dogs from their size/stripes data alone. We do this by revealing the labels of the available data points one by one. We should also try to learn efficiently, i.e. reach the goal with as few training data points as possible (revealing a data point usually comes at some cost: the effort of an experiment - like finding animals to measure - or the effort of the learning process itself).

Note that, even though this example is extremely simple, the fundamental principles remain the same even in a 10-billion-parameter “Large Language Model” (and presumably also in our brain, built from biological neurons).

From a single data point we cannot draw conclusions. So let’s reveal two random points:

[Image 3: training data with two points revealed]

Now we can use these two points to create a first version of our (machine) learning model: we draw the perpendicular bisector between the data points as a decision boundary (left side of the following image). Technically, this could be implemented by a single artificial neuron (or any other machine learning model architecture [2]; the basic idea is always the same). We can also visualize what this very first version of our model assumes about all the data points (its world model, right side):

[Image 4: training data with the first decision boundary]

Obviously it’s not very good yet, as many points still get misclassified.
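
For readers who like code: this first one-neuron model is nothing more than the perpendicular bisector between the two labeled points. A minimal sketch (the variable names and example numbers are my own illustration):

```python
import numpy as np

def bisector_classifier(p_cat, p_dog):
    """Build a classifier from the perpendicular bisector between one
    known cat and one known dog. The bisector is the set of points x
    with w @ x + b = 0, where w = p_dog - p_cat and b is chosen so the
    midpoint lies on the boundary - exactly what a single artificial
    neuron with a threshold activation computes."""
    w = p_dog - p_cat                    # normal vector of the boundary
    midpoint = (p_cat + p_dog) / 2
    b = -w @ midpoint                    # boundary passes through midpoint
    return lambda x: "dog" if w @ x + b > 0 else "cat"

# Example: a 30 cm cat with 12 stripes vs. a 60 cm dog with no stripes
classify = bisector_classifier(np.array([30.0, 12.0]), np.array([60.0, 0.0]))
print(classify(np.array([70.0, 1.0])))   # -> "dog"
print(classify(np.array([25.0, 15.0])))  # -> "cat"
```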

Now let’s pick the next data point randomly:

[Image 5: another data point]

This data point gets misclassified by the current model! The model therefore needs to be refined: we have to create another decision boundary:

[Image 6: more decision boundaries]

We can do this with a second artificial neuron.
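
The whole refinement loop can also be sketched in code. Classifying by the nearest revealed point carves the plane with exactly these perpendicular bisectors (a Voronoi partition), so a toy version of the procedure (my own simplification of what the images show) might look like this:

```python
import numpy as np

class BisectorModel:
    """Toy model mimicking the procedure above: keep a set of labeled
    reference points and classify by the nearest one, which implicitly
    separates the plane with perpendicular bisectors. Each stored
    point plays the role of one extra neuron."""
    def __init__(self):
        self.points, self.labels = [], []

    def predict(self, x):
        dists = [np.linalg.norm(x - p) for p in self.points]
        return self.labels[int(np.argmin(dists))]

    def reveal(self, x, label):
        """Reveal one labeled point; refine the model only on a mistake."""
        if not self.points or self.predict(x) != label:
            self.points.append(x)        # a new decision boundary is needed
            self.labels.append(label)
            print(f"model refined, now {len(self.points)} reference points")

model = BisectorModel()
model.reveal(np.array([30.0, 12.0]), "cat")
model.reveal(np.array([60.0,  0.0]), "dog")
model.reveal(np.array([40.0, 10.0]), "dog")  # brindle dog on the "cat" side
```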

Now let’s choose a 4th data point randomly. This one, too, gets misclassified, even by our refined model, and we must create one more decision boundary:

[Image 7: even more decision boundaries]

And unfortunately the process repeats with the 5th data point:

[Image 8: even more decision boundaries!]

Now we have finally arrived at a model which correctly classifies all the training data.
Is this a good result?
Not really. If we had chosen different data points in the beginning, we could have reached our goal much faster. In the following example, the perpendicular bisector between the first two data points already separates the two classes perfectly:

[Image 9: the perfect solution]

This model is not only more elegant. It uses only 25% of the neurons of the previous model (which means that a lot of the model’s capacity remains available to learn other things). And it may also produce its outputs much faster.

The order in which the training data is presented to the model obviously matters a lot.

It is also important to note that in the initial model, the most important concept - “separating the heavy dogs from the lighter cats” - is implemented essentially twice: on the left side of the first decision boundary by neurons 3 and 4, and on the right side by neuron 2. The first decision boundary (which tries to “separate brindle dogs from normal dogs”, a less important distinction) thus forces a duplicate implementation of a fundamental concept. We will soon see why this matters.

In the second - superior - example, I could choose the optimal training data points only because I knew all the labels beforehand (which is not the case in an actual learning task).
Can we still make the learning process more efficient?
What if, after receiving the label for the 3rd data point (image 5), we replaced the initial decision boundary with a single, well-chosen one:

[Image 10: choosing training data wisely]

This would not be perfect yet, but it is much closer to the optimal model than we got by rigidly adding a perpendicular bisector each time we encountered a misclassification. It also becomes quite clear which data points are ideal candidates for testing (and further refining) the model.

Redrawing existing decision boundaries and choosing the data points for learning wisely requires additional computing resources (which means slower learning in the beginning). But both the complex model (image 8) and the simplified model (images 9 and 10) produce exactly the same outputs (i.e. appear equally smart).
So why should we make the extra effort for this kind of smart learning?

The simpler representation…

  • …saves hardware resources (i.e. brain volume in the case of humans)
  • …can often output results faster
  • …speeds up the learning process considerably at later stages (intuition: “a few lines are easier to redraw than a whole jungle of lines”)

To be able to create such a simple representation of the world in our brain, we need…

  • ...enough time, especially at the beginning of our learning journey: we must often reshape decision boundaries.
  • ...to be able to choose the data points for learning ourselves. Learning is a bit like playing Battleships: to win, you have to choose your shots at the “enemy” fleet wisely (not randomly). Only in this way can you discover the structure of the enemy fleet efficiently (see the sketch after this list).
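
In machine learning, this “choose your shots wisely” strategy is known as active learning. A common heuristic is uncertainty sampling: ask for the label of the point the current model is least sure about, i.e. the point closest to the decision boundary. A minimal sketch (reusing the boundary from the cat/dog sketch above; the pool of candidate points is my own illustration):

```python
import numpy as np

def next_query(pool, w, b):
    """Uncertainty sampling: pick the unlabeled point closest to the
    current decision boundary w @ x + b = 0 - the point whose label
    would be most informative, like a well-aimed Battleship shot."""
    distances = np.abs(pool @ w + b) / np.linalg.norm(w)
    return int(np.argmin(distances))

# Boundary between the 30 cm cat and the 60 cm dog from before
w, b = np.array([30.0, -12.0]), -1278.0
pool = np.array([[70.0,  1.0],    # obviously a dog -> low information
                 [44.0,  7.0],    # near the boundary -> high information
                 [25.0, 15.0]])   # obviously a cat -> low information
print(next_query(pool, w, b))     # -> 1
```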

Neither condition is met in schools:

  • The primary performance metric used in school is accuracy immediately after teaching. It is easily possible to score well on this metric with a brain cluttered with unnecessary decision boundaries. On the other hand, if a child insists on learning slowly (which is smart!), it is immediately considered stupid. Nobody looks at (let alone optimizes for) long-term performance (say, at 30 years of age), which is actually the only metric that really matters socially and economically.
  • The order in which the training material is presented to the kids is fixed and the same for all students. This prevents the kids from learning effectively.

Strong long-term cognitive performance needs a very solid foundation of elementary concepts (like language structure, logic, spatial imagination etc.) which cannot be acquired quickly. Furthermore, curiosity is essential for effective learning: it lets the child choose data points (= experiences in its environment) which it assumes will have a strong impact on its current model. This choice therefore always depends on the current model. But kids’ internal cognitive models differ (because of their genetically determined “character”, but also because of often very different life experiences).
Therefore it doesn’t make any sense to feed all kids the same training data.

Curiosity is not just a funny artifact for our entertainment but an essential meta-learning tool. Overriding it comes at the cost of severe long-term performance penalties.

Current teaching methods create human minds in which fundamental concepts are often duplicated (which is inefficient) or incorrectly located. A practical example of the latter might be a person who has studied computer science (and has a perfect understanding of logic in that context) but behaves very irrationally in private life without even being aware of it.
Such problems occur when logic is not learned early enough in life. It then cannot be activated as a fundamental tool of human thinking but remains confined to the context in which it was (too late) learned. This also hampers important creative processes, which often require fundamental concepts to be applied to new domains.

How then can we improve schools?

Many smart minds had good ideas long before me:

“Tell me and I forget. Teach me and I remember. Involve me and I learn.”
- Benjamin Franklin

“I never teach my pupils. I can only attempt to provide the conditions in which they can learn.”
- Albert Einstein

Why does it so often not work?

We don’t trust curiosity anymore. We have created a world of absurd complexity, and learning how this world works in every aspect does not come naturally to most people. Therefore we have to force many to study all kinds of things they don’t find genuinely interesting. And we have become so used to this process of forcing students that we have degraded curiosity to a mere source of entertainment in our leisure time.

I have no idea how to solve this problem without changing society as a whole (which - I believe - is urgently required and also possible).

But maybe we could at least give primary school kids more time to learn according to their curiosity. It must be possible to acquire the fundamental techniques of thinking (see above) by playing in a suitably rich environment.

This is not a new idea either: I remember reading Seymour Papert's great book "Mindstorms" back in 1984. Unfortunately, not much has changed in schools since then.


[1] I have used the same problem in my free AI course for kids: ki-kit.ch (in German, sorry).

[2] The machine learning model closest to the way decision boundaries are drawn here would be a Support Vector Machine (SVM).


Image: Shutterstock / Stamat Vitalii


Follow me on X to get informed about new content on this blog.

I don’t like paywalled content, so I have made everything on this blog freely available to everyone. But I would still love to invest much more time in this blog, which means I need some income from writing. Therefore, if you would like to read my articles more often and can afford $2 a month, please consider supporting me via Patreon. Every contribution motivates me!