Part 2
Implementation details and first results
This is part 2 of a series of blog posts. Please read part 1 first.
1. Data collection
To compare the results of a large number of LLMs without too much effort, I used a cloud solution: openrouter.ai. It allowed me to access various models over the same OpenAI-compatible API; in fact, I used the Python library provided by OpenAI.
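For illustration, here is a minimal sketch of such a call using the current OpenAI Python client pointed at OpenRouter; the model ID and the environment variable name are my own choices, not necessarily the ones used for this post:

```python
# Minimal sketch: querying one model on OpenRouter via the OpenAI Python client.
# The model ID and the environment variable name are assumptions for illustration.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def ask(model: str, question: str) -> str:
    """Send a single question to the given model and return its answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

print(ask("openai/gpt-4", "Explain the word ketoacidosis."))
```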
I prepared four questions from the medical domain:
- „Explain the word ketoacidosis.“
- „What is the most common treatment for major depression?“
- „What advantages have SSRI antidepressants compared to MOAI?“
- „What is the prognosis of Hepatitis E?“
Unlike the questions used by most other LLM benchmarking tools, these questions require a sophisticated answer. A simple „Yes/No“ (or multiple choice) or „The answer is X“ answer is not sufficient. This is important because we want to measure the quality of the answers on a continuous numerical scale (e.g. from 0 to 10).
Then I let the following LLMs answer the above questions:
ID | Name | Trained by | Short name |
---|---|---|---|
0 | Claude v2 | Anthropic | CldV2 |
1 | Claude Instant v1 | Anthropic | CldIV1 |
2 | PaLM 2 Bison | Google | PaLM2B |
3 | Llama v2 13B Chat (beta) | Meta | LlV2-13B |
4 | Llama v2 70B Chat (beta) | Meta | LlV2-70B |
5 | Hermes Llama2 13B | Nous | HermLl2-13B |
6 | GPT-3.5 Turbo | OpenAI | GPT3.5T |
7 | GPT-4 | OpenAI | GPT4 |
8 | Mythalion 13B | PygmalionAI | My13B |
The results were stored in a SQLite database.
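A minimal sketch of how such storage could look; the table and column names are assumptions, not the schema actually used:

```python
# Minimal sketch of storing answers in SQLite; table and column names are
# assumptions, not the schema actually used for this post.
import sqlite3

conn = sqlite3.connect("llm_benchmark.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS answers (
           model    TEXT,
           question TEXT,
           answer   TEXT
       )"""
)
conn.execute(
    "INSERT INTO answers (model, question, answer) VALUES (?, ?, ?)",
    ("GPT-4", "Explain the word ketoacidosis.", "Ketoacidosis is a serious complication ..."),
)
conn.commit()
conn.close()
```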
Some sample results:
Answer of „Llama v2 70B Chat (beta)“ to the question „Explain the word ketoacidosis.“ [1]:
User: Hey there, doc! I've been feeling really terrible lately. I've been vomiting and having stomach pain. I'm worried it might be something serious. Doctor: Sorry to hear that. Have you noticed any other symptoms?
Answer of „GPT-4“ to the same question:
Ketoacidosis is a serious complication of diabetes that occurs when your body produces high levels of blood acids called ketones. It is caused when the body cannot produce enough insulin. Those with type 1 diabetes are more at risk for ketoacidosis. Symptoms may include excessive thirst, frequent urination, nausea, vomiting, stomach pain, tiredness, or shortness of breath. If untreated, it can become severe and potentially life-threatening.
Then I asked every LLM to rate the answers of all the other LLMs, using the following prompt template:
rating_task = """An expert was given the following question to answer:
"<question>"
The expert answered as follows:
"<answer>"
Rate the quality of the answer with a number between 0 and 10 and provide a short explanation of your rating.
Output only the rating number on the first line and output the explanation on the following lines.
"""
The last sentence was required to make sure the answers could be processed automatically with a Python script.
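Filling the template is straightforward; a minimal sketch (the function name is mine, `rating_task` is the template defined above):

```python
# Sketch: filling the rating template defined above for one (question, answer) pair;
# the function name is my own.
def build_rating_prompt(question: str, answer: str) -> str:
    return rating_task.replace("<question>", question).replace("<answer>", answer)

# Example:
# prompt = build_rating_prompt("Explain the word ketoacidosis.", "Ketoacidosis is ...")
```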
Some rating samples:
The rating of „PaLM 2 Bison“ to the answer of „Llama v2 70B Chat (beta)“ above:
0 The answer is not relevant to the question. The question is asking for an explanation of the word ketoacidosis, but the answer is a conversation between a doctor and a patient.
The rating of „PaLM 2 Bison“ to the answer of „GPT-4“ above:
9 The answer is very detailed and informative. It covers all the important aspects of ketoacidosis.
The ratings were extracted using a simple Python script. In the rare cases (<5%) where the rating LLM did not adhere to the requested answer structure (e.g. answering with „8/10“ or „8 out of 10“ or worse), the numerical rating had to be extracted manually.
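The extraction logic can be as simple as the following sketch, which mirrors the described behaviour but is not necessarily the exact script used:

```python
# Sketch of the automatic extraction: take the first line of the rating output and
# try to parse it as a number; anything else is flagged for manual review.
def parse_rating(raw: str) -> float | None:
    first_line = raw.strip().splitlines()[0].strip()
    try:
        return float(first_line)
    except ValueError:
        return None  # e.g. "8/10" or "8 out of 10" -> handled manually

assert parse_rating("9\nThe answer is very detailed and informative.") == 9.0
assert parse_rating("8 out of 10\nGood answer.") is None
```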
The following table shows the average of the ratings (over all four questions) between the different LLMs:
ID | LLM (answering) | CldV2 | CldIV1 | PaLM2B | LlV2-13B | LlV2-70B | HermLl2-13B | GPT3.5T | GPT4 | My13B |
---|---|---|---|---|---|---|---|---|---|---|
0 | Claude v2 | 8.75 | 8.75 | 8.75 | 8.25 | 8.00 | 9.00 | 8.625 | 9.625 | 8.50 |
1 | Claude Instant v1 | 8.25 | 9.00 | 9.00 | 8.25 | 8.00 | 9.00 | 8.250 | 9.250 | 8.50 |
2 | PaLM 2 Bison | 8.00 | 8.25 | 8.25 | 8.50 | 7.25 | 8.75 | 8.625 | 7.750 | 8.00 |
3 | Llama v2 13B Chat (beta) | 4.75 | 5.50 | 3.25 | 8.50 | 7.25 | 8.00 | 4.500 | 4.250 | 8.25 |
4 | Llama v2 70B Chat (beta) | 3.00 | 4.25 | 2.25 | 8.50 | 7.00 | 8.25 | 5.625 | 2.250 | 8.25 |
5 | Hermes Llama2 13B | 8.50 | 8.00 | 8.00 | 8.50 | 7.75 | 8.50 | 7.000 | 7.000 | 8.50 |
6 | GPT-3.5 Turbo | 8.50 | 9.00 | 9.00 | 8.50 | 8.00 | 9.00 | 9.125 | 9.625 | 8.75 |
7 | GPT-4 | 8.75 | 8.50 | 9.00 | 8.50 | 8.00 | 9.00 | 8.375 | 9.050 | 8.50 |
8 | Mythalion 13B | 8.75 | 9.00 | 9.00 | 8.50 | 8.00 | 9.00 | 8.875 | 9.250 | 8.75 |
A row contains the ratings the LLM named at the start of that row received from the other LLMs. A column contains all the ratings a given LLM gave to the answers of the other LLMs.
We can easily identify some patterns:
- By comparing rows: some LLMs like „Llama v2 70B Chat (beta)“ received consistently worse ratings than other LLMs (like „GPT-4“). This is OK and expected.
- By comparing columns: some LLMs differentiated strongly between the answers they rated, while others (like „Llama v2 13B Chat (beta)“) practically refused to differentiate, giving all other LLMs almost the same rating (see the quick check below).
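The second pattern can be quantified with the standard deviation of a column; a quick check on two columns copied from the table above:

```python
# Quick check of the second pattern: the standard deviation of a column measures how
# much a rating LLM differentiates between answers. Values are copied from the table.
import statistics

ratings_by_llv2_13b = [8.25, 8.25, 8.50, 8.50, 8.50, 8.50, 8.50, 8.50, 8.50]
ratings_by_gpt4     = [9.625, 9.250, 7.750, 4.250, 2.250, 7.000, 9.625, 9.050, 9.250]

print(statistics.pstdev(ratings_by_llv2_13b))  # very small spread: barely differentiates
print(statistics.pstdev(ratings_by_gpt4))      # large spread: strong differentiation
```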
2. Data preprocessing
As the LLMs seem to apply different levels of politeness when rating (from „rude“ - or rather „super honest“ - 0 ratings to the more „diplomatic“ „8.25 for the worst and 8.5 for the best“), the ratings need to be standardized. Normally we would subtract the mean and divide by the standard deviation to get a zero-centered distribution with standard deviation one for each LLM's rating data. But this means we get negative ratings. That by itself seems OK, but it could also lead to negative performance scores. While we could interpret those as „poor performance“ too, we get a problem: if a rating is negative (i.e. a bad rating) and we multiply it by a negative performance score (a poorly performing rater), we get a positive contribution to the corresponding LLM's performance score (for example, a rating of -0.8 weighted by a rater score of -0.5 would contribute +0.4). This is not correct! Therefore both the ratings and the performance scores must be strictly positive.
This makes standardization a bit more difficult. If we set, for instance, the mean of all ratings to 0.5, we might still get some negative ratings. I tried two options:
- Calculate a shared standard deviation which - at mean 0.5 - makes sure all values are positive
- Use simple normalization (MinMax rescaling)
With the latter method the standard deviations of the different LLMs are not equal. As the results of the two methods are almost identical, I chose the simpler normalization: the resulting scaled ratings always lie in the interval [0,1].
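As a sketch of this step: MinMax rescaling maps each rating \(r\) to \( (r - r_{\min}) / (r_{\max} - r_{\min}) \). My assumption here is that the rescaling is applied per rating LLM (i.e. per column of the rating matrix), since the politeness differences we want to remove are differences between raters:

```python
# Sketch of the MinMax rescaling, applied per column (per rating LLM).
# The small matrix below is a subset of the averaged table above, for illustration only;
# the actual computation would use the full rating data.
import numpy as np

R = np.array([[8.75, 8.75, 9.625],    # Claude v2         rated by CldV2, CldIV1, GPT4
              [3.00, 4.25, 2.250],    # Llama v2 70B Chat rated by CldV2, CldIV1, GPT4
              [8.50, 9.00, 9.625]])   # GPT-3.5 Turbo     rated by CldV2, CldIV1, GPT4

col_min = R.min(axis=0, keepdims=True)
col_max = R.max(axis=0, keepdims=True)
R_scaled = (R - col_min) / (col_max - col_min)   # every column now lies in [0, 1]
print(R_scaled)
```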
3. Calculation of the eigenvector
To calculate the eigenvector (containing the performance scores we want) corresponding to the dominant eigenvalue (i.e. the eigenvalue with the largest absolute value), I used the power method:
If we start with any vector \(\mathbf p_0\), we can calculate an approximate eigenvector \(\mathbf p\) as follows:
\( \mathbf p \approx \mathbf R^m \, \mathbf p_0 \)
I used for \(\mathbf p_0\) a vector containing only 1's (corresponding to equal performance scores for all LLMs). The vector \(\mathbf p\) has to be rescaled (e.g. to unit length) after each iteration so that it cannot grow too large. The exponent \(m\) (the number of iterations) has to be chosen such that the eigenvector does not change much further if \(\mathbf R\) is applied once more:
\( \mathbf R^{m+1} \mathbf p_0 \,/\, \lVert \mathbf R^{m+1} \mathbf p_0 \rVert \;\approx\; \mathbf R^{m} \mathbf p_0 \,/\, \lVert \mathbf R^{m} \mathbf p_0 \rVert \)
Typically 5-10 iterations were required until convergence.
Note that the method is not guaranteed to work (see here for instance for further information and code) [2].
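A minimal sketch of this iteration, assuming \(\mathbf R\) is the rescaled, strictly positive rating matrix (rows = answering LLM, columns = rating LLM); the variable names and the toy example values are my own:

```python
# Minimal sketch of the power iteration described above.
# R is the rescaled (strictly positive) rating matrix: rows = answering LLM,
# columns = rating LLM, so (R @ p)[i] sums the ratings LLM i received,
# weighted by the performance scores of the raters.
import numpy as np

def power_method(R: np.ndarray, iterations: int = 10) -> np.ndarray:
    p = np.ones(R.shape[0])            # p_0: equal performance scores for all LLMs
    for _ in range(iterations):
        p = R @ p                      # apply R once more
        p = p / np.linalg.norm(p)      # rescale so the vector cannot grow too large
    return p

# Toy 3x3 matrix with arbitrary illustrative values (already in [0, 1]):
R_toy = np.array([[1.00, 0.85, 0.90],
                  [0.10, 0.00, 0.05],
                  [0.95, 1.00, 1.00]])
scores = power_method(R_toy)
print(np.argsort(-scores))             # indices sorted from best to worst
```

The Q-scores in the ranking below appear to be rescaled so that their average is 1; such a final rescaling does not change the order.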
Now we have a performance score for each LLM. If we sort these scores in descending order, we get a ranking:
Rank | Name | Q-Score |
---|---|---|
1 | GPT-3.5 Turbo | 1.2430 |
2 | Mythalion 13B | 1.2345 |
3 | GPT-4 | 1.1829 |
4 | Claude v2 | 1.1813 |
5 | Claude Instant v1 | 1.1669 |
6 | PaLM 2 Bison | 0.9981 |
7 | Hermes Llama2 13B | 0.9873 |
8 | Llama v2 13B Chat (beta) | 0.5364 |
9 | Llama v2 70B Chat (beta) | 0.4695 |
[1] Edit 29.6.24: This model was most probably not instruction-tuned (i.e. only pre-trained).
[2] Edit 29.6.24: It is worth noting that there might be no dominant eigenvalue (i.e. there might be several large eigenvalues with similar magnitude). In this case the method might not work.
In the next blog post I will discuss results in detail and compare them to human generated ratings.
Image: Shutterstock / Kjpargeter
Follow me on X to get informed about new content on this blog.