R&D Hub

Published on Friday, February 23, 2024

New Working Paper: Evaluating the Surge in Chatbot Development

Explore the latest insights in research and development from NAEP researchers in the R&D program. Given the current buzz around AI, this month we’re excited to share a working paper on the evaluation of chatbots by Ruhan Circi and Bhashithe Abeysinghe, NAEP researchers from the American Institutes for Research (AIR). The full text of the working paper is included below. Don’t miss out on these valuable perspectives – subscribe now to stay ahead in technology and innovation!


The landscape of chatbot development, propelled by advancements in Large Language Model (LLM) APIs, is rapidly evolving. This evolution is marked by technological advancements and the expanding application of chatbots across various sectors, including education, health, and the workforce. The healthcare sector sees chatbots assisting in patient care and information dissemination (Abd-Alrazaq et al., 2020), while in the workforce they are being adopted for tasks ranging from employee assistance to automating routine operations. Within the educational landscape, particularly in facilitating learning and conducting assessments, a significant shift towards interactive and personalized education has been observed. Chatbots are increasingly utilized in education for personalized learning experiences and student support (Yamkovenko, 2023). This widespread adoption of chatbots into educational frameworks underscores the need for robust evaluation methods to ensure that chatbots meet performance and ethical standards. The challenge lies in bridging the gap between creating an LLM-powered application and developing a reliable system, necessitating careful consideration of the final product’s alignment with requirements (Srivastava et al., 2023). Evaluation must address technical proficiency and trust-oriented aspects, balancing operational efficiency with responsible usage while being mindful of common LLM pitfalls like hallucination and tone issues (Gallegos et al., 2023; Huang et al., 2023).

Development Phases

In the lifecycle of a chatbot, three critical phases are observed: a) selection of the LLM, wherein the foundational technology is chosen, b) iterative development, involving the application’s incremental refinement, and c) operational deployment, wherein the chatbot is introduced into a real-world environment. The underlying quality of the LLM is a pivotal factor, as it shapes the chatbot’s capabilities and risk profile (Guo et al., 2023; Liang et al., 2023). Developers may employ various approaches, like fine-tuning, LLM search with knowledge graphs, and more, each demanding specific evaluation criteria (Gao et al., 2023; Nori et al., 2023).

Chatbot Output Evaluation

Chatbots are designed to resolve user queries, ranging from domain-specific to general-purpose applications. The evaluation focuses on the chatbot’s alignment with its intended use case, assessing its effectiveness in meeting business goals or user expectations. Key evaluation components include:

  • Data Selection: Deciding whether to use entire conversations or individual utterances, and whether to use human-curated or automatically collated data.
  • Output Properties: Assessing the text for correctness, readability, informativeness, relevance, clarity, and avoidance of hallucination.
  • Grading: Approaches include grading individual utterances, entire conversations, or using comparative grading where two outputs are compared.
  • User Experience: Measuring the number of interactions, the helpfulness of chatbot suggestions, and the application’s intuitiveness.
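One way the output-properties and grading components above can be combined in practice is a per-utterance rubric, where each grader scores an utterance on the listed properties and the scores are aggregated. A minimal sketch in Python follows; the property names come from the list above, but the 1–5 scale, class structure, and aggregation rule are illustrative assumptions, not the working paper's method.

```python
from dataclasses import dataclass, field

# Rubric properties drawn from the list above; the 1-5 scale is assumed.
PROPERTIES = ["correctness", "readability", "informativeness",
              "relevance", "clarity", "no_hallucination"]

@dataclass
class UtteranceGrade:
    """One grader's rubric scores (1-5) for a single chatbot utterance."""
    utterance: str
    scores: dict = field(default_factory=dict)

    def overall(self) -> float:
        """Unweighted mean over the rubric properties that were scored."""
        rated = [self.scores[p] for p in PROPERTIES if p in self.scores]
        return sum(rated) / len(rated) if rated else 0.0

grade = UtteranceGrade(
    utterance="Photosynthesis converts light energy into chemical energy.",
    scores={"correctness": 5, "relevance": 5, "clarity": 4},
)
print(round(grade.overall(), 2))  # mean of the three scored properties: 4.67
```

Comparative grading, mentioned above, would instead present two such utterances side by side and record only which one a grader prefers.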

Evaluation Techniques

The evaluation techniques are diverse, including:

  • N-gram-based metrics like BLEU, which offer easy calculation but may lack correlation with human judgment (Papineni et al., 2002).
  • Embedding-based metrics, such as BERTScore, which consider contextual information for a more nuanced analysis (Zhang et al., 2020).
  • Evaluator LLMs like ChatEval and GPTScore are emerging methods that use LLMs to assess chatbot responses, offering efficient, scalable evaluations, but are still in nascent stages (Chan et al., 2023; Fu et al., 2023).
  • Human evaluation, although resource-intensive, provides a crucial perspective, often considered the gold standard in assessing chatbot effectiveness (Finch et al., 2023).
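To make the first of these techniques concrete, here is a simplified, self-contained version of sentence-level BLEU: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. This is a teaching sketch (single reference, n up to 2, no smoothing), not the full metric of Papineni et al. (2002); production work would use an established implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:          # any zero precision collapses the score
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages very short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

ref = "the cat is on the mat"
print(sentence_bleu(ref, "the cat is on the mat"))  # 1.0 for an exact match
print(sentence_bleu(ref, "a cat sat on a mat"))     # 0.0: no bigram overlap
```

The second example illustrates the weakness noted above: a paraphrase with little surface overlap scores zero even when a human might judge it acceptable, which is why embedding-based and LLM-based evaluators were developed.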

Limitations and Recommendations

Automated metrics, while useful, often do not align with human evaluations, highlighting the need for a multi-faceted approach (van der Lee et al., 2019). Gupta and colleagues introduce a framework that uses information retrieval for automatic evaluation, incorporating metrics like coverage and answer rate (Gupta et al., 2022). However, human evaluation remains indispensable for a comprehensive understanding, despite its cost and complexity (Clark et al., 2021).
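A metric like answer rate reduces, at its simplest, to the fraction of user queries the chatbot actually answers rather than deflects. The sketch below illustrates that idea only; the interaction-log structure and field names are hypothetical and do not reproduce Gupta et al.'s (2022) implementation.

```python
def answer_rate(interactions):
    """Fraction of logged queries the chatbot answered rather than
    deflected (e.g., with "I don't know"). Field names are illustrative."""
    if not interactions:
        return 0.0
    answered = sum(1 for i in interactions if i["answered"])
    return answered / len(interactions)

log = [
    {"query": "What is NAEP?", "answered": True},
    {"query": "Explain standard error.", "answered": True},
    {"query": "What's the weather tomorrow?", "answered": False},  # out of scope
]
print(round(answer_rate(log), 2))  # 0.67
```

Deciding whether a response counts as "answered" is itself a judgment call, which is one reason such automated metrics are paired with human review.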


As the integration of chatbots into educational contexts becomes more prevalent, the need for rigorous evaluation and transparency in these systems intensifies. Ensuring the efficacy, ethical integrity, and effectiveness of chatbots in their educational roles necessitates a comprehensive approach. This involves a harmonious blend of both automated and human evaluation methods, tailored to scrutinize the nuanced functionalities and impacts of chatbots. Transparency about the operational mechanisms and decision-making processes of these chatbots is essential, providing users with the clarity and trust needed to fully embrace these advanced technological tools in their learning journeys.


Abd-Alrazaq, A., Safi, Z., Alajlani, M., Warren, J., Househ, M., & Denecke, K. (2020). Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review. Journal of Medical Internet Research, 22(6), e18301. https://doi.org/10.2196/18301

Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., & Liu, Z. (2023). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate (arXiv:2308.07201). arXiv. http://arxiv.org/abs/2308.07201

Clark, E., August, T., Serrano, S., Haduong, N., Gururangan, S., & Smith, N. A. (2021). All That’s “Human” Is Not Gold: Evaluating Human Evaluation of Generated Text (arXiv:2107.00061). arXiv. http://arxiv.org/abs/2107.00061

Finch, S. E., Finch, J. D., & Choi, J. D. (2023). Don’t Forget Your ABC’s: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems (arXiv:2212.09180). arXiv. http://arxiv.org/abs/2212.09180

Fu, J., Ng, S.-K., Jiang, Z., & Liu, P. (2023). GPTScore: Evaluate as You Desire (arXiv:2302.04166). arXiv. http://arxiv.org/abs/2302.04166

Gallegos, I. O., Rossi, R. A., Barrow, J., Tanjim, M. M., Kim, S., Dernoncourt, F., Yu, T., Zhang, R., & Ahmed, N. K. (2023). Bias and Fairness in Large Language Models: A Survey (arXiv:2309.00770). arXiv. https://doi.org/10.48550/arXiv.2309.00770

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., & Wang, H. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey (arXiv:2312.10997). arXiv. https://doi.org/10.48550/arXiv.2312.10997

Guo, Z., Jin, R., Liu, C., Huang, Y., Shi, D., Supryadi, Yu, L., Liu, Y., Li, J., Xiong, B., & Xiong, D. (2023). Evaluating Large Language Models: A Comprehensive Survey (arXiv:2310.19736). arXiv. https://doi.org/10.48550/arXiv.2310.19736

Gupta, P., Rajasekar, A. A., Patel, A., Kulkarni, M., Sunell, A., Kim, K., Ganapathy, K., & Trivedi, A. (2022). Answerability: A custom metric for evaluating chatbot performance. In A. Bosselut, K. Chandu, K. Dhole, V. Gangal, S. Gehrmann, Y. Jernite, J. Novikova, & L. Perez-Beltrachini (Eds.), Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM) (pp. 316–325). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.gem-1.27

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu, T. (2023). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions (arXiv:2311.05232). arXiv. https://doi.org/10.48550/arXiv.2311.05232

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., … Koreeda, Y. (2023). Holistic Evaluation of Language Models (arXiv:2211.09110). arXiv. https://doi.org/10.48550/arXiv.2211.09110

Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., King, N., Larson, J., Li, Y., Liu, W., Luo, R., McKinney, S. M., Ness, R. O., Poon, H., Qin, T., Usuyama, N., White, C., & Horvitz, E. (2023). Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine (arXiv:2311.16452). arXiv. https://doi.org/10.48550/arXiv.2311.16452

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A Method for Automatic Evaluation of Machine Translation. In P. Isabelle, E. Charniak, & D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135

Srivastava, B., Lakkaraju, K., Koppel, T., Narayanan, V., Kundu, A., & Joshi, S. (2023). Evaluating Chatbots to Promote Users’ Trust—Practices and Open Problems (arXiv:2309.05680). arXiv. http://arxiv.org/abs/2309.05680

van der Lee, C., Gatt, A., Van Miltenburg, E., Wubben, S., & Krahmer, E. (2019). Best practices for the human evaluation of automatically generated text. Proceedings of the 12th International Conference on Natural Language Generation, 355–368.

Yamkovenko, S. (2023, May 1). Sal Khan’s 2023 TED Talk: AI in the classroom can transform education. Khan Academy Blog. https://blog.khanacademy.org/sal-khans-2023-ted-talk-ai-in-the-classroom-can-transform-education/

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT (arXiv:1904.09675). arXiv. https://doi.org/10.48550/arXiv.1904.09675
