Part 3
Comparison to human-generated rankings
This is part 3 of a series of blog posts. Please read part 1 first.
1. Manual rating procedure for LLM answers
I manually rated the answers that all LLMs gave to the following four questions:
- "Explain the word ketoacidosis."
- "What is the most common treatment for major depression?"
- "What advantages have SSRI antidepressants compared to MAOI?"
- "What is the prognosis of Hepatitis E?"
I rated each answer with two numbers ranging from 0 to 10: one for completeness and one for accuracy.
Rating these answers was not easy (I'm not a medical doctor!). Before rating a question, I therefore carefully read the relevant Wikipedia article and other online resources. I then read all answers to a question once to get a feeling for the quality range, and only afterwards rated each answer with scores for completeness and accuracy. The two scores were added to obtain a total score for the answer.
Finally, for each LLM, I averaged the total scores over all four questions.
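To make the aggregation explicit, here is a minimal Python sketch. The example ratings and question labels are invented; only the aggregation logic (completeness plus accuracy per answer, averaged over the four questions) follows the procedure described above.

```python
# ratings[llm][question] = (completeness, accuracy), each on a 0-10 scale.
# The numbers below are purely illustrative.
ratings = {
    "ExampleLLM": {
        "ketoacidosis": (5, 4),
        "depression treatment": (4, 5),
        "SSRI vs. MAOI": (3, 4),
        "hepatitis E prognosis": (4, 5),
    },
}

for llm, per_question in ratings.items():
    # Total score per answer = completeness + accuracy (0-20).
    totals = [completeness + accuracy for completeness, accuracy in per_question.values()]
    # Average total score over the four questions.
    average_total = sum(totals) / len(totals)
    print(f"{llm}: average total score = {average_total:.4f}")
```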
2. Results
The following table contains the results (sorted by average total score) for all LLMs:
| Rank | Name | Average total score (max. 20) |
|---|---|---|
| 1 | GPT-3.5 Turbo | 9.6250 |
| 2 | GPT-4 | 9.3750 |
| 3 | Claude Instant v1 | 8.1250 |
| 4 | Mythalion 13B | 7.9375 |
| 5 | Claude v2 | 7.6250 |
| 6 | PaLM 2 Bison | 6.6250 |
| 7 | Hermes Llama2 13B | 5.2500 |
| 8 | Llama v2 13B Chat (beta) | 1.5000 |
| 9 | Llama v2 70B Chat (beta) | 0.9375 |
Let's compare this result to the ranking generated from LLM ratings (see the previous blog post):
| Rank | Name | Q-Score |
|---|---|---|
| 1 | GPT-3.5 Turbo | 1.2430 |
| 2 | Mythalion 13B | 1.2345 |
| 3 | GPT-4 | 1.1829 |
| 4 | Claude v2 | 1.1813 |
| 5 | Claude Instant v1 | 1.1669 |
| 6 | PaLM 2 Bison | 0.9981 |
| 7 | Hermes Llama2 13B | 0.9873 |
| 8 | Llama v2 13B Chat (beta) | 0.5364 |
| 9 | Llama v2 70B Chat (beta) | 0.4695 |
The two rankings are quite similar, which suggests that the method works as intended.
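To put a number on "quite similar", one could compute Spearman's rank correlation between the two score columns. The sketch below does this with scipy; the values are copied from the two tables above, and the correlation itself was not part of the original evaluation.

```python
# Sketch: Spearman rank correlation between the human average total scores
# and the Q-Scores. Both lists are ordered by the human ranking above.
from scipy.stats import spearmanr

models = [
    "GPT-3.5 Turbo", "GPT-4", "Claude Instant v1", "Mythalion 13B",
    "Claude v2", "PaLM 2 Bison", "Hermes Llama2 13B",
    "Llama v2 13B Chat (beta)", "Llama v2 70B Chat (beta)",
]
human_total = [9.6250, 9.3750, 8.1250, 7.9375, 7.6250, 6.6250, 5.2500, 1.5000, 0.9375]
q_score     = [1.2430, 1.1829, 1.1669, 1.2345, 1.1813, 0.9981, 0.9873, 0.5364, 0.4695]

rho, p_value = spearmanr(human_total, q_score)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.4f})")
```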
3. Discussion and ideas for improvements
Please note that this is not much more than an anecdotal result. To assess the method more thoroughly, tests with a much larger number of questions would be required. The human ratings should also be produced by actual domain experts (e.g. medical doctors in the case of the questions used in this test).
Other ideas for improvements:
- Better normalization/standardization: the min-max rescaling used here does not equalize the variances of the ratings of the different LLMs, so it should be replaced by a more robust standardization technique (see the sketch after this list).
- The answers of the different LLMs vary considerably in length. For a short answer, it is not always clear whether the LLM intended to give a concise answer or simply does not know more about the topic. In my experiments I assumed the latter, but results could possibly be improved by specifying a target answer length (e.g. an ideal number of words) in the prompt.
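To illustrate the first point, the sketch below contrasts the min-max rescaling used in my experiments with per-rater z-score standardization on an invented rating matrix. Z-scoring is one possible replacement that equalizes both mean and variance across rating LLMs, not necessarily the technique I would finally choose.

```python
import numpy as np

# Illustrative ratings: rows = rating LLMs, columns = rated answers.
# The values are invented; only the two rescaling methods are the point.
ratings = np.array([
    [2.0, 5.0, 9.0, 7.0],
    [6.0, 7.0, 8.0, 7.5],   # a rater that compresses its scores into a narrow band
    [1.0, 4.0, 10.0, 6.0],
])

# Min-max rescaling per rater (as used here): maps each rater's scores
# to [0, 1] but leaves the variances of the raters unequal.
min_max = (ratings - ratings.min(axis=1, keepdims=True)) / (
    ratings.max(axis=1, keepdims=True) - ratings.min(axis=1, keepdims=True)
)

# Z-score standardization per rater: zero mean and unit variance per rater,
# so a "harsh" or "compressed" rater no longer dominates the aggregate.
z_score = (ratings - ratings.mean(axis=1, keepdims=True)) / ratings.std(axis=1, keepdims=True)

print(min_max.var(axis=1))  # variances still differ between raters
print(z_score.var(axis=1))  # all equal to 1
```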
Image: generated using DALL-E 3 by author
Follow me on X to stay informed about new content on this blog.