Performance rankings of large language models without strong LLM reference or human judgments / gold references (2)

Part 2

Implementation details and first results

This is part 2 of a series of blog posts. Please read part 1 first.


1. Data collection

To compare the results of a large number of LLMs without too much effort, I used a cloud solution: openrouter.ai. It allowed me to query the various models over the same OpenAI-compatible API; in fact, I used the Python library provided by OpenAI.
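A minimal sketch of what such a call could look like with the OpenAI Python library pointed at openrouter.ai (the model slug, environment variable and variable names are illustrative assumptions, not taken from my actual script):

import os
from openai import OpenAI

# Point the OpenAI client at the OpenRouter endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],   # assumed env variable name
)

question = "Explain the word ketoacidosis."
response = client.chat.completions.create(
    model="openai/gpt-4",                        # OpenRouter model slug (example)
    messages=[{"role": "user", "content": question}],
)
answer = response.choices[0].message.content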

I prepared four questions from the medical domain:

  • „Explain the word ketoacidosis.“
  • „What is the most common treatment for major depression?“
  • „What advantages do SSRI antidepressants have compared to MAOIs?“
  • „What is the prognosis of Hepatitis E?“

Unlike the questions used by most other LLM benchmarking tools, these are questions which require a sophisticated answer. A simple „Yes/No“ (or multiple choice) or „The answer is X“ response is not sufficient. This is important because we want to measure the quality of the answers on a continuous numerical scale (e.g. from 0 to 10).

Then I let the following LLMs answer the above questions:

ID  Name                       Trained by    Short name
0   Claude v2                  Anthropic     CldV2
1   Claude Instant v1          Anthropic     CldIV1
2   PaLM 2 Bison               Google        PaLM2B
3   Llama v2 13B Chat (beta)   Meta          LlV2-13B
4   Llama v2 70B Chat (beta)   Meta          LlV2-70B
5   Hermes Llama2 13B          Nous          HermLl2-13B
6   GPT-3.5 Turbo              OpenAI        GPT3.5T
7   GPT-4                      OpenAI        GPT4
8   Mythalion 13B              PygmalionAI   My13B

The results were stored in an SQLite database.
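For illustration, storing an answer could look roughly like this (the database file name and table schema are assumptions; `question` and `answer` are the values obtained from the API call above):

import sqlite3

conn = sqlite3.connect("llm_ratings.db")
conn.execute("""CREATE TABLE IF NOT EXISTS answers (
                    model TEXT, question TEXT, answer TEXT,
                    PRIMARY KEY (model, question))""")
# Parameter binding takes care of quoting the answer text.
conn.execute("INSERT OR REPLACE INTO answers VALUES (?, ?, ?)",
             ("GPT-4", question, answer))
conn.commit()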

Some sample results:

Answer of „Llama v2 70B Chat (beta)“ to the question „Explain the word ketoacidosis.“:

User: Hey there, doc! I've been feeling really terrible lately. I've been vomiting and having stomach pain. I'm worried it might be something serious. Doctor: Sorry to hear that. Have you noticed any other symptoms?

Answer of „GPT-4“ to the same question:

Ketoacidosis is a serious complication of diabetes that occurs when your body produces high levels of blood acids called ketones. It is caused when the body cannot produce enough insulin. Those with type 1 diabetes are more at risk for ketoacidosis. Symptoms may include excessive thirst, frequent urination, nausea, vomiting, stomach pain, tiredness, or shortness of breath. If untreated, it can become severe and potentially life-threatening.

Then I asked every LLM to rate the answers of all LLMs (including its own) using the following prompt template:

rating_task = """An expert was given the following question to answer:

  "<question>"

  The expert answered as follows:

  "<answer>"

  Rate the quality of the answer with a number between 0 and 10 and provide a short explanation of your rating.
  Output only the rating number on the first line and output the explanation on the following lines.
  """

The last sentence was required to make sure the answers could be processed automatically with a Python script.
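One possible way to assemble the rating prompts from this template (the data structures passed in are assumptions for illustration, not the original code):

def build_rating_prompts(rating_task, models, answers):
    """Fill the template for every combination of rating LLM and stored answer.

    models:  list of model names acting as raters
    answers: dict mapping (answering model, question) -> answer text
    """
    prompts = []
    for rater in models:
        for (answerer, question), answer in answers.items():
            prompt = (rating_task
                      .replace("<question>", question)
                      .replace("<answer>", answer))
            prompts.append((rater, answerer, question, prompt))
    return prompts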

Some rating samples:

The rating of „PaLM 2 Bison“ to the answer of „Llama v2 70B Chat (beta)“ above:

0 The answer is not relevant to the question. The question is asking for an explanation of the word ketoacidosis, but the answer is a conversation between a doctor and a patient.

The rating of „PaLM 2 Bison“ to the answer of „GPT-4“ above:

9 The answer is very detailed and informative. It covers all the important aspects of ketoacidosis.

The results were extracted using a simple Python script. In the rare cases (<5%) where the rating LLM did not adhere to the requested answer structure (e.g. answering with „8/10“ or „8 out of 10“ or worse), the numerical rating had to be extracted manually.
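The extraction could have been done roughly along these lines (a sketch, not the original script):

import re

def parse_rating(text):
    """Extract the numerical rating from a rating answer.

    Expects the number on the first line; falls back to the first number
    found anywhere in the text (covers forms like "8/10" or "8 out of 10")
    and returns None if no number is present, so such cases can be
    inspected manually.
    """
    first_line = (text.strip().splitlines() or [""])[0]
    match = re.match(r"\s*(\d+(?:\.\d+)?)", first_line)
    if match is None:
        match = re.search(r"(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None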

The following table shows the ratings (averaged over all four questions) the LLMs gave each other:

ID  LLM (answer)               CldV2  CldIV1  PaLM2B  LlV2-13B  LlV2-70B  HermLl2-13B  GPT3.5T  GPT4   My13B
0   Claude v2                  8.75   8.75    8.75    8.25      8.00      9.00         8.625    9.625  8.50
1   Claude Instant v1          8.25   9.00    9.00    8.25      8.00      9.00         8.250    9.250  8.50
2   PaLM 2 Bison               8.00   8.25    8.25    8.50      7.25      8.75         8.625    7.750  8.00
3   Llama v2 13B Chat (beta)   4.75   5.50    3.25    8.50      7.25      8.00         4.500    4.250  8.25
4   Llama v2 70B Chat (beta)   3.00   4.25    2.25    8.50      7.00      8.25         5.625    2.250  8.25
5   Hermes Llama2 13B          8.50   8.00    8.00    8.50      7.75      8.50         7.000    7.000  8.50
6   GPT-3.5 Turbo              8.50   9.00    9.00    8.50      8.00      9.00         9.125    9.625  8.75
7   GPT-4                      8.75   8.50    9.00    8.50      8.00      9.00         8.375    9.050  8.50
8   Mythalion 13B              8.75   9.00    9.00    8.50      8.00      9.00         8.875    9.250  8.75

A row contains the ratings the LLM named at the beginning of the row received from all rating LLMs. A column contains all the ratings the LLM named in the column header gave to the answers of the LLMs in the rows.

We can easily identify some patterns:

  • By comparing rows: some LLMs like „Llama v2 70B Chat (beta)“ received consistently worse ratings than other LLMs (like „GPT-4“). This is OK and expected.
  • By comparing columns: some LLMs differentiated clearly between the other LLMs, while others (like „Llama v2 13B Chat (beta)“) almost refused to rate, giving all other LLMs nearly the same rating.

2. Data preprocessing

As the LLMs apply different levels of politeness (ranging from „rude“ - or rather „super honest“ - ratings of 0 to the more „diplomatic“ „8.25 for the worst and 8.5 for the best“), the ratings need to be standardized. Normally we would subtract the mean and divide by the standard deviation to get a zero-centered distribution with standard deviation one for each rating LLM's data. But then we would have negative ratings. That alone would be acceptable, but it could also lead to negative performance scores. While we could still interpret those as „poor performance“, a real problem arises: if a negative rating (a bad rating) is multiplied by a negative performance score (a poor rater), it makes a positive contribution to the rated LLM's performance score. This is not correct! Therefore both the ratings and the performance scores must be strictly positive.

This makes the standardization a bit more difficult. If we set, for instance, the mean of all ratings to 0.5, we might still get some negative ratings. I tried two options:

  • Calculate a shared standard deviation which - at a mean of 0.5 - makes sure all values remain positive
  • Use simple normalization (MinMax rescaling)

With the latter method the standard deviations of the different LLMs are not equal. As the results of the two methods are almost identical, I chose to use the simpler normalization:

$$ r^{normalized} = \dfrac{r - r^{min}}{r^{max} - r^{min}} $$

The resulting scaled ratings now always lie in the interval [0,1].
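As a small sketch (assuming the rescaling is applied per rating LLM, i.e. per column of the rating matrix above):

import numpy as np

def normalize_ratings(ratings):
    """Min-max rescale each rating LLM's column to the interval [0, 1].

    ratings: array of shape (n_answering, n_rating); entry [i, j] is the
    rating that LLM j gave to the answer of LLM i.
    """
    r_min = ratings.min(axis=0, keepdims=True)
    r_max = ratings.max(axis=0, keepdims=True)
    return (ratings - r_min) / (r_max - r_min)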

3. Calculation of the eigenvector

To calculate the eigenvector (containing the performance scores we want) corresponding to the dominant eigenvalue (i.e. the eigenvalue with the largest absolute value), I used the power method:

If we start with any vector \(\mathbf p_0\), we can calculate an approximate eigenvector \(\mathbf p\) as follows:

$$ \mathbf p = \mathbf R^m \mathbf p_0$$

For \(\mathbf p_0\) I used a vector containing only 1's (corresponding to equal performance scores for all LLMs). The vector \(\mathbf p\) has to be rescaled (e.g. to unit length) after each iteration to make sure it cannot grow too large. The exponent m (the number of iterations) has to be chosen such that the eigenvector no longer changes much when \(\mathbf R\) is applied once more:

$$ \|\mathbf{p}_m - \mathbf{p}_{m-1}\|_2^2 < \epsilon $$

Typically 5-10 iterations were required until convergence.

Note that the method is not guaranteed to work (see here for instance for further information and code).
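A minimal sketch of the power iteration as described above (R is the normalized rating matrix; the function name, stopping tolerance and iteration cap are my own illustrative choices):

import numpy as np

def power_method(R, eps=1e-9, max_iter=100):
    """Approximate the eigenvector of R belonging to the dominant eigenvalue."""
    p = np.ones(R.shape[0])                 # equal performance scores as start
    p /= np.linalg.norm(p)                  # rescale to unit length
    for _ in range(max_iter):
        p_next = R @ p
        p_next /= np.linalg.norm(p_next)    # rescale after each iteration
        converged = np.sum((p_next - p) ** 2) < eps
        p = p_next
        if converged:
            break
    return p

# scores = power_method(R)
# ranking = sorted(zip(scores, model_names), reverse=True)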

Now we have a performance score for each LLM. If we sort these values in descending order we can create a ranking:

Rank  Name                       Q-Score
1     GPT-3.5 Turbo              1.2430
2     Mythalion 13B              1.2345
3     GPT-4                      1.1829
4     Claude v2                  1.1813
5     Claude Instant v1          1.1669
6     PaLM 2 Bison               0.9981
7     Hermes Llama2 13B          0.9873
8     Llama v2 13B Chat (beta)   0.5364
9     Llama v2 70B Chat (beta)   0.4695

In the next blog post I will discuss results in detail and compare them to human generated ratings.

Image: Shutterstock / Kjpargeter


Follow me on Twitter to get informed about new content on this blog.

I don’t like paywalled content. Therefore I have made the content of my blog freely available for everyone. But I would still love to invest much more time into this blog, which means that I need some income from writing. Therefore, if you would like to read articles from me more often and if you can afford 2$ once a month, please consider supporting me via Patreon. Every contribution motivates me!