Performance rankings of large language models without strong LLM reference or human judgments / gold references (1)

Part 1

Problem, current solutions, description of method and its properties

Many evaluation methods to assess the performance of large language models (LLMs) already exist. Some benchmarks compare the output of the LLM with gold standard outputs written by humans. While very useful, these benchmarks suffer from several drawbacks:

  • Comparing the LLM output to the gold standard answer is a difficult and therefore error-prone task.
  • It's not always easy or even possible to provide such answers (e.g. for tasks which are easy for a machine but very difficult for a human).
  • The method fails completely if humans are not smart enough to provide answers of sufficient quality for difficult questions (i.e. when an AI reaches superhuman performance on a certain task).
  • Creating high quality gold standard answers requires expensive and often hard-to-find human experts.

Other methods use a strong LLM to rate the answers of weaker LLMs. These methods have problems too:

  • Which LLM should be chosen to do the rating if it is not yet known which model performs best? This can easily happen if the LLM is to be used in a very specific knowledge domain. Is the chosen LLM really best suited to rate the other models in this domain?
  • These methods rely on a single model (or a few models) to assess the performance of other models. Such a single source of error can easily produce low quality rankings. The chosen LLM could, for instance, consistently downrate a particular LLM simply because it dislikes its „style of writing“.

In the following I will present a method which uses a network of several models that rate each other to calculate a performance ranking of all the models involved.

I don't know if the method is novel. It came to my mind recently, but it is likely that someone else has had the same (simple) idea before me (please let me know if it has been published before!). In any case, I have decided to play around with this idea in the following weeks and publish some results on this blog.

The method I will discuss in the following has some interesting properties:

  • It needs neither a strong reference LLM nor gold standard answers created by human experts.
  • Because of this, performance rankings can be calculated easily for a broad range of problems. You only need questions, not answers.
  • It works (in theory) even when some or all LLMs perform at a superhuman level.
  • The notion of „performance“ to be used can be specified easily in natural language. For example, a specific requirement like „good answers should have a low racial bias“ could be added to the broad requirement to value „quality output“ (see the example prompt below).
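
To illustrate the last point, here is what such a natural-language definition of „performance“ might look like as a judging prompt. This is only a hypothetical sketch: the wording, the placeholder names and the extra bias requirement are my own assumptions, not a fixed part of the method.

```python
# Hypothetical judging prompt; wording, placeholders and the bias requirement
# are illustrative assumptions, not a fixed part of the method.
JUDGE_PROMPT = """You are rating an answer to the question below.
Rate the answer from 0 to 10 for overall quality.
Good answers should also have a low racial bias.

Question: {question}

Answer given by model {model_name}:
{answer}

Reply with a single number between 0 and 10."""
```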

But where there is light there is also shadow:

  • The method calculates a quality score for each LLM and each problem (i.e. question). The quality score of an LLM depends on the other LLMs used in the calculation: if you, for instance, add another LLM to the group tested, all the scores change. Unlike the methods mentioned above, it is therefore not possible to calculate an absolute benchmark score for a single LLM; the method only ranks a group of LLMs relative to each other.

Now let's finally discuss how the method works. The basic idea is simple:

  1. Let each member of a group of N LLMs answer the same question.
  2. Let each LLM rate the answers of all the LLMs (including its own answer) with a numeric quality rating r (e.g. 0-10).
  3. Normalize the ratings of each LLM to make sure each LLM has the same „voting power“ (some LLMs might on average give more generous ratings than others).
  4. Sum the ratings each LLM received to calculate a performance score p for it.
  5. Sort by the performance score to create a ranking.
$$ p_i = r_{i1} + r_{i2} + \cdots + r_{iN} = \sum_{j = 1}^N r_{ij} $$

where \(p_i\) is the performance score of LLM \(i\) and \(r_{ij}\) is the quality rating given by LLM \(j\) for the answer of LLM \(i\).
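
As a minimal sketch, this naive recipe could look as follows in Python, assuming the ratings have already been collected in a NumPy array R where R[i, j] is the rating judge j gave to the answer of LLM i (the numbers below are made up):

```python
import numpy as np

# Made-up example: R[i, j] is the rating (0-10) that judge LLM j
# gave to the answer of LLM i.
R = np.array([
    [8.0, 7.0, 9.0],
    [6.0, 6.5, 7.0],
    [4.0, 5.0, 5.5],
])

# Step 3: normalize per judge (column-wise), e.g. by dividing by the
# judge's mean rating, so every LLM has the same "voting power".
R_norm = R / R.mean(axis=0, keepdims=True)

# Steps 4 and 5: sum the ratings each answer received and sort.
p = R_norm.sum(axis=1)
ranking = np.argsort(-p)  # indices of the LLMs, best first
print(p, ranking)
```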

Unfortunately, this naive first approach would most probably yield only poor results: weak LLMs have the same influence on the ranking as the strong ones, which introduces a lot of noise. We could improve the algorithm by giving good LLMs more influence on the result than poor ones. We could, for instance, multiply each rating given by an LLM by a weighting factor describing the output quality (i.e. performance) of this LLM. Only after this multiplication do we add up the weighted ratings to get the performance score:

$$ p_i = r_{i1} \cdot p_1 + r_{i2} \cdot p_2 + \cdots + r_{iN} \cdot p_N = \sum_{j = 1}^N r_{ij} \cdot p_j $$

where, as before, \(p_i\) is the performance score of LLM \(i\) and \(r_{ij}\) is the quality rating given by LLM \(j\) for the answer of LLM \(i\); the weighting factor of LLM \(j\) is its own performance score \(p_j\).

But now we obviously have a problem: to calculate the performance score of an LLM, we need the performance scores of the other LLMs! No chicken without an egg, but no egg without a chicken! Can we still make this work somehow? If we write the above recipe in matrix notation, we get the following:

$$ \mathbf R \mathbf p = \mathbf p $$

This is an eigenvalue equation \(\mathbf R \mathbf p = \lambda \mathbf p\) with \(\lambda=1\) (*), and it is often possible to solve such equations (i.e. to calculate the vector \(\mathbf p\) of performance scores from the given matrix \(\mathbf R\) of quality ratings)!
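
A minimal sketch of this eigenvector computation, assuming the normalized ratings are already collected in a nonnegative matrix R (again with R[i, j] being judge j's rating of LLM i's answer; the numbers are made up):

```python
import numpy as np

# Made-up normalized rating matrix: rows = rated LLM i, columns = judge j.
R = np.array([
    [1.2, 1.1, 1.3],
    [0.9, 1.0, 0.9],
    [0.6, 0.8, 0.7],
])

# Solve R p = lambda p and keep the eigenvector belonging to the largest
# real eigenvalue; its entries are the performance scores (up to scale).
eigvals, eigvecs = np.linalg.eig(R)
k = np.argmax(eigvals.real)
p = np.abs(eigvecs[:, k].real)

# The absolute scale is arbitrary (see the note on lambda below),
# so rescale the scores to sum to 1 before ranking.
p = p / p.sum()
ranking = np.argsort(-p)  # indices of the LLMs, best first
print(p, ranking)
```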

For the method to work well some assumptions must be fulfilled:

  • The number of LLMs used must be reasonably large (it is a statistical method).
  • It is assumed that LLMs which are good at solving the given problem (i.e. question) are also good at rating other LLMs' answers to this question. I think this is a reasonable assumption.
  • The weaker LLMs should still be able to assess the quality of the stronger LLMs' answers (at least in a statistical sense). This is probably often the case („I can't lay an egg, but I'm capable of telling a rotten egg from a fresh one“). But it limits the performance range of the LLMs used nonetheless: a cockroach-level AI will produce only noise when asked to rate the output of a capable AI thinking about quantum field theory.
  • The LLMs used should be diverse (i.e. different model architectures, model sizes, training data, etc.). If a significant share of the models give very similar answers, this is equivalent to increasing the score weight of a single model (which would distort the results).

In most cases it makes little sense to calculate a ranking based on a single problem (question). To compare the performance of several LLMs in a domain, ratings have to be calculated for several problems, and a combined ranking can be calculated from these results. If ratings for a large number of problems are available (which is feasible, as they can be calculated without the need for human feedback), it is possible to create tools which allow the user to instantly create performance rankings for any combination of domains (e.g. „which LLM performs best for problems which require law and chemistry knowledge?“).
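
A small sketch of such an aggregation, under the assumption that the per-question score vectors (e.g. those produced by the eigenvector method above) are simply averaged over the questions of the chosen domain(s); other combination rules are of course possible:

```python
import numpy as np

# Made-up per-question score vectors for the same group of three LLMs,
# e.g. produced by the eigenvector method above for each question.
scores_per_question = np.array([
    [0.45, 0.35, 0.20],  # question 1
    [0.40, 0.38, 0.22],  # question 2
    [0.50, 0.30, 0.20],  # question 3
])

# One simple combination rule (an assumption, not the only option):
# average the normalized per-question scores and rank by the result.
combined = scores_per_question.mean(axis=0)
ranking = np.argsort(-combined)  # indices of the LLMs, best first
print(combined, ranking)
```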

(*) Note: If the eigenvalue is not 1, we can divide all matrix elements by \(\lambda\) to get a matrix with the same eigenvector but \(\lambda=1\). We can do this because the absolute scale of the ratings does not matter.

In the next blog post I will discuss implementation details and some first results.

Image: Made by the author with the Stable Diffusion generative AI.


Follow me on Twitter to get informed about new content on this blog.

I don’t like paywalled content. Therefore I have made the content of my blog freely available for everyone. But I would still love to invest much more time into this blog, which means that I need some income from writing. Therefore, if you would like to read articles from me more often and if you can afford $2 once a month, please consider supporting me via Patreon. Every contribution motivates me!