Performance rankings of large language models without strong LLM reference or human judgments / gold references (3)

Part 3

Comparison to human-generated rankings

This is part 3 of a series of blog posts. Please read part 1 first.


1. Manual rating procedure for LLM answers

I manually rated the answers given by all LLMs to the following four questions:

  • "Explain the word ketoacidosis."
  • "What is the most common treatment for major depression?"
  • "What advantages have SSRI antidepressants compared to MOAI?"
  • "What is the prognosis of Hepatitis E?"

I rated each answer with two numbers ranging from 0 to 10: one for completeness and a second for accuracy.

It was not easy to rate the answers to the questions above (I'm not a medical doctor!). Before rating a question, I therefore carefully read the relevant Wikipedia article and other online resources. Then I read all answers to that question once to get a feeling for the quality range, and only then rated each answer with scores for completeness and accuracy. The two scores were added to obtain a total score for the answer.

I then calculated, for each LLM, the average total score over all four questions.
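To make the aggregation explicit, here is a minimal Python sketch of this scoring scheme. The model names and rating values in it are hypothetical placeholders; only the arithmetic (total = completeness + accuracy, averaged over the four questions) reflects the procedure described above.

```python
# Minimal sketch of the manual scoring aggregation (illustrative numbers only).
# Each answer gets two ratings from 0 to 10; their sum is the total score,
# and totals are averaged over the four questions per LLM.

# ratings[llm][question] = (completeness, accuracy) -- hypothetical values
ratings = {
    "model_a": [(5, 5), (4, 5), (5, 5), (4, 5)],
    "model_b": [(3, 4), (3, 3), (4, 3), (3, 4)],
}

def average_total_score(per_question_ratings):
    totals = [completeness + accuracy for completeness, accuracy in per_question_ratings]
    return sum(totals) / len(totals)

for llm, scores in ratings.items():
    print(f"{llm}: {average_total_score(scores):.4f}")
```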

2. Results

The following table contains the results (sorted by average total score) for all LLMs:

Rank  Name                      Average total score (max. 20)
1     GPT-3.5 Turbo             9.6250
2     GPT-4                     9.3750
3     Claude Instant v1         8.1250
4     Mythalion 13B             7.9375
5     Claude v2                 7.6250
6     PaLM 2 Bison              6.6250
7     Hermes Llama2 13B         5.2500
8     Llama v2 13B Chat (beta)  1.5000
9     Llama v2 70B Chat (beta)  0.9375

Let's compare this result to the ranking generated from LLM ratings (see the previous blog post):

Rank  Name                      Q-Score
1     GPT-3.5 Turbo             1.2430
2     Mythalion 13B             1.2345
3     GPT-4                     1.1829
4     Claude v2                 1.1813
5     Claude Instant v1         1.1669
6     PaLM 2 Bison              0.9981
7     Hermes Llama2 13B         0.9873
8     Llama v2 13B Chat (beta)  0.5364
9     Llama v2 70B Chat (beta)  0.4695

The two rankings are quite similar: they agree on the best model and on the last four positions, and only the four models in the middle ranks are shuffled. This indicates that the method works as expected.
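To put a number on "quite similar", we can compute the Spearman rank correlation between the two rankings directly from the two tables above; a value close to 1 means the orderings largely agree. A minimal sketch:

```python
# Spearman rank correlation between the manual ranking and the Q-score ranking,
# using the ranks from the two tables above (no ties, so the simple formula applies).

manual_rank = {
    "GPT-3.5 Turbo": 1, "GPT-4": 2, "Claude Instant v1": 3, "Mythalion 13B": 4,
    "Claude v2": 5, "PaLM 2 Bison": 6, "Hermes Llama2 13B": 7,
    "Llama v2 13B Chat (beta)": 8, "Llama v2 70B Chat (beta)": 9,
}
q_score_rank = {
    "GPT-3.5 Turbo": 1, "Mythalion 13B": 2, "GPT-4": 3, "Claude v2": 4,
    "Claude Instant v1": 5, "PaLM 2 Bison": 6, "Hermes Llama2 13B": 7,
    "Llama v2 13B Chat (beta)": 8, "Llama v2 70B Chat (beta)": 9,
}

n = len(manual_rank)
d_squared = sum((manual_rank[m] - q_score_rank[m]) ** 2 for m in manual_rank)
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(f"Spearman rho = {rho:.3f}")  # ~0.92 for the two rankings above
```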

3. Discussion and ideas for improvements

Please note that this is not much more than an anecdotal result. To assess the method more thoroughly, tests with a much larger number of questions would be required, and the human ratings should be provided by real domain experts (e.g. medical doctors in the case of the questions used in these tests).

Other ideas for improvements:

  • Better normalization/standardization: the MinMax rescaling used here does not equalize the variances of the ratings of the different LLMs, so it should be replaced by a more sophisticated standardization technique (see the sketch after this list).
  • The answers of the different LLMs vary considerably in length. With a short answer it is not always clear whether the LLM intended to be concise or simply does not know more about the topic; in my experiments I assumed the latter. Results could possibly be improved by specifying an ideal number of words for the answer in the prompt.
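As a minimal sketch of the first point: z-score standardization equalizes both the mean and the variance of each rating LLM's scores before aggregation, unlike MinMax rescaling, which only maps them to a common range. The judge names and rating values below are made up for illustration.

```python
# Sketch: z-score standardization of each rating LLM's scores instead of MinMax rescaling.
# This removes differences in both the offset and the spread of the judges' scales.
# Judge names and raw ratings are hypothetical.

from statistics import mean, pstdev

raw_ratings = {
    # judge -> ratings it gave to the answers of the different LLMs (hypothetical)
    "judge_a": [9.0, 8.5, 9.5, 6.0, 3.0],
    "judge_b": [5.0, 4.0, 5.0, 3.5, 2.0],
}

def standardize(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]

standardized = {judge: standardize(vals) for judge, vals in raw_ratings.items()}

# Aggregate per answer by averaging the standardized ratings across judges.
n_answers = len(next(iter(raw_ratings.values())))
aggregated = [
    mean(standardized[judge][i] for judge in standardized) for i in range(n_answers)
]
print(aggregated)
```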

Image: generated using DALL-E 3 by author


Follow me on Twitter to be notified about new content on this blog.

I don't like paywalled content, so I have made the content of my blog freely available for everyone. But I would still love to invest much more time into this blog, which means that I need some income from my writing. So if you would like to read articles from me more often and can afford $2 a month, please consider supporting me via Patreon. Every contribution motivates me!