Performance rankings of large language models without strong LLM reference or human judgments / gold references (3)
Part 3: Comparison to human generated rankings
Part 2: Implementation details and first results
Part 1: Problem, current solutions, description of method and its properties