R&D Hub

Published on Friday, June 7, 2024

New Research Paper on Advances and Challenges in Evaluating LLM-Based Applications

This month, we are excited to share the latest research from the NAEP R&D community, focusing on the evaluation of large language models (LLMs). “The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches” will be presented at the LLM4Eval workshop at the 2024 Association for Computing Machinery Special Interest Group on Information Retrieval (ACM SIGIR) conference on July 18, 2024. This paper from AIR researchers Bhashithe Abeysinghe and Ruhan Circi explores innovative approaches to evaluating custom AI applications, addressing a crucial barrier to faster progress in generative AI.

Abeysinghe and Circi’s work highlights the challenges of evaluating LLM-based applications. Many existing methods fall short by evaluating only a single factor, so the researchers emphasize the importance of understanding the current challenges and exploring ways to build better evaluation frameworks. Existing approaches are valuable but imperfect: automated evaluation is fast and cheap, while human evaluation is slow and reliable but shaped by the biases of human expertise. Indeed, human evaluation, considered the gold standard for LLM assessment, faces major problems with repeatability and human bias.

Effective evaluation methods are critical for assessing changes within AI systems; without them, it is difficult to determine which modifications actually improve performance. In their research, Abeysinghe and Circi took several approaches to this problem, including the integration of a cognitive framework: they used Bloom’s Taxonomy to construct evaluation questions, giving the assessment a structured, comprehensive shape and adding a dimension beyond the traditional factors. They then conducted a rigorous comparison of three evaluation strategies: automated metrics, traditional human evaluation, and LLM-based evaluation.
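
To make this setup concrete, here is a minimal sketch of one way the evaluation items could be organized: questions grouped by Bloom’s Taxonomy level and rated under the three strategies being compared. It is an illustration only, not code from the paper; the names (EvalQuestion, record_rating, summarize) and the 1-to-5 Likert bookkeeping are assumptions made for exposition.

```python
# Illustrative only: organize evaluation questions by Bloom's Taxonomy level and
# collect 1-5 Likert ratings from three evaluator types (automated, human, LLM).
# All names and structures here are hypothetical, not taken from the paper.
from dataclasses import dataclass, field
from statistics import mean

BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]
EVALUATOR_TYPES = ["automated", "human", "llm"]

@dataclass
class EvalQuestion:
    text: str
    bloom_level: str  # one of BLOOM_LEVELS
    ratings: dict = field(default_factory=lambda: {t: [] for t in EVALUATOR_TYPES})

    def record_rating(self, evaluator_type: str, score: int) -> None:
        """Store a 1-5 Likert rating from one of the three evaluator types."""
        if evaluator_type not in EVALUATOR_TYPES:
            raise ValueError(f"unknown evaluator type: {evaluator_type}")
        if not 1 <= score <= 5:
            raise ValueError("Likert ratings are expected on a 1-5 scale")
        self.ratings[evaluator_type].append(score)

def summarize(questions: list[EvalQuestion]) -> dict:
    """Average ratings per Bloom level and evaluator type for side-by-side comparison."""
    summary = {}
    for level in BLOOM_LEVELS:
        summary[level] = {}
        for etype in EVALUATOR_TYPES:
            scores = [s for q in questions if q.bloom_level == level
                      for s in q.ratings[etype]]
            summary[level][etype] = round(mean(scores), 2) if scores else None
    return summary
```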

This comprehensive method offers valuable insights into how different evaluation techniques can be optimized and combined to achieve more accurate and reliable results in LLM-powered applications. In the newest of these approaches, an LLM is used to evaluate the outputs of another LLM, mimicking human evaluation without the associated cost and at a faster pace. Although this approach is still new and has limited support in the literature, its potential as a proper evaluation method is promising.
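
As a rough illustration of the LLM-as-evaluator idea, the snippet below asks a judge model to rate another model’s answer on a 1-to-5 Likert scale. The prompt wording and the caller-supplied complete() function are placeholders standing in for whatever judge model and API are available; none of it comes from the paper.

```python
# Illustrative only: an "LLM as evaluator" pattern where a judge LLM rates another
# model's answer on a 1-5 Likert scale. The prompt text and the caller-supplied
# `complete` function are placeholders, not the prompts or API used in the paper.
JUDGE_PROMPT = """You are evaluating a chatbot's answer.

Question: {question}
Answer: {answer}

Rate the factual accuracy of the answer on a 1-5 Likert scale.
Reply with a single integer from 1 to 5."""

def llm_judge(question: str, answer: str, complete) -> int:
    """Ask a judge LLM (via the caller-supplied `complete` callable) for a rating."""
    reply = complete(JUDGE_PROMPT.format(question=question, answer=answer))
    digits = [ch for ch in reply if ch.isdigit()]
    if not digits:
        raise ValueError(f"could not parse a rating from: {reply!r}")
    return min(max(int(digits[0]), 1), 5)  # clamp to the 1-5 scale
```

Because a call like this can be repeated cheaply with a fixed prompt, this style of evaluation is what makes the approach faster and less costly than recruiting human raters for every system change.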

A figure in the paper shows a multi-dimensional evaluation of EdTalk, a chatbot designed to use LLMs to enhance accessibility to information in education reports. The evaluation encompasses three experiments Abeysinghe and Circi conducted with different types of evaluators: “Novice” evaluators, individuals new to the chatbot’s domain but with some experience with the content; “Expert” evaluators, who have worked in the chatbot’s domain for more than two years; and an “LLM” evaluator. The categories of Bloom’s Taxonomy appear on each graph’s spokes, the radial axis shows the rating given by each evaluator on a 5-point Likert scale, and the lines in the three graphs trace the individual factors each evaluator considered in their analysis.
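
For readers who want to produce a comparable figure from their own ratings, the following matplotlib sketch draws a radar chart of the kind described above: Bloom’s Taxonomy categories on the spokes, 1-to-5 Likert ratings on the radial axis, and one line per evaluated factor. The factor names and ratings are placeholder values, not data from the paper.

```python
# Illustrative only: a radar ("spider") chart with Bloom's Taxonomy categories on
# the spokes and 1-5 Likert ratings on the radial axis. The factor names and the
# ratings below are placeholders, not values reported in the paper.
import numpy as np
import matplotlib.pyplot as plt

categories = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]
factors = {
    "Relevance": [4, 4, 3, 3, 4, 2],   # placeholder ratings
    "Coherence": [5, 4, 4, 3, 3, 3],   # placeholder ratings
}

angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, ratings in factors.items():
    values = ratings + ratings[:1]     # close the polygon
    ax.plot(angles, values, label=name)
    ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 5)                      # 5-point Likert scale
ax.set_title("Illustrative multi-dimensional evaluation (placeholder data)")
ax.legend(loc="upper right")
plt.show()
```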

If this new research interests you, we invite you to explore the 2024 ACM SIGIR conference to see the full text of the paper presented on July 18, 2024. Additionally, consider subscribing to our mailing list to stay up to date on the latest in NAEP research and its intersections with the ever-evolving fields of AI and technology!
