R&D Hub

Published on Friday, September 6, 2024

Research First Look: Can Large Language Models Transform Automated Scoring Further?

This week, the NAEP R&D Hub offers another sneak peek at upcoming research from experts in large-scale assessment. Authored by Ruhan Circi, a NAEP process data researcher at the American Institutes for Research, together with Maggie Perkoff, one of our 2024 summer doctoral interns, this working paper was produced as part of the AI focus area of the Summer 2024 NAEP Doctoral Student Internship Program. The authors also wish to thank Bhashithe Abeysinghe for his review of the working paper. The paper presents insights from a literature review on using LLMs for automated scoring of constructed response items, briefly explores how advances in LLMs can improve current scoring systems, and highlights remaining challenges and areas for further research. You can learn more about the NAEP Doctoral Student Internship Program and its new AI focus area here; be sure to subscribe to stay tuned for detailed coverage, analyses, and findings when the final version of the working paper becomes public, as well as other updates and opportunities in NAEP research.

Can Large Language Models Transform Automated Scoring Further? Insights for Constructed Response Mathematics Items

The transformative potential of large language models (LLMs) extends across many applications in education, including automated scoring. This working paper focuses on how the emerging wave of LLMs could improve automated scoring of constructed response mathematics items.

Efforts to maximize the potential of automated scoring are not new, dating back to the mid-1960s (Martinez & Bennett, 1992). Over time, these efforts have grown alongside technological advances, emerging digital assessment systems, and evolving methods (e.g., Williamson et al., 2012). Methods have progressed from programs that execute series of hand-written rules to extract mathematical expressions to systems that extract answers from blocks of free-form text (Baral et al., 2021; Shen et al., 2021). Over the last three decades, automated scoring has seen extensive development across content domains, with broad applications in literacy (e.g., essay scoring as in Ramesh & Sanampudi, 2022), science (e.g., Lee et al., 2024), reading (e.g., Fernandez et al., 2022), and mathematics (e.g., Lee et al., 2024).

Assessment is a critical component of the learning process, and constructed-response items are considered highly effective for assessing students’ subject knowledge and retention (Stankous, 2016). However, the variability in responses makes grading these items labor-intensive (e.g., Baral et al., 2021). In large-scale assessments, scoring open-ended responses manually requires significant time and financial resources (Wind, 2019). This has intensified efforts to develop automated systems for scoring open-ended mathematics items. Many recent efforts, particularly in the last decade, have combined Natural Language Processing (NLP) and machine learning techniques of varying complexity for automated scoring. Current approaches primarily rely on supervised learning, in which classifiers are trained, or language models are fine-tuned, on a limited set of responses labeled with human-provided scores.

With advancements in LLM technology already enhancing automated scoring in other subjects, these models are well positioned to address the challenges of scoring students’ written mathematics responses. Recently, emerging work has begun applying these techniques to mathematical content (Baral et al., 2021; Morris et al., 2024). Still, integrating LLMs into automated mathematics scoring systems effectively and transparently remains a complex task.

Automated Scoring of Constructed Response Mathematics Items

Automated scoring (AS) systems are software solutions designed to evaluate student responses to a given item/task in a content area, such as mathematics. These systems typically comprise multiple components: a user interface for data input, a data repository for storing responses and scores, and a backend component responsible for generating scores and evaluating the outputs. Our focus is on the score generation component—the core of an AS system—which processes a student's response to a given assessment item and outputs a score. For this discussion, we define the AS process as encompassing the entire design and implementation of the score generation mechanism. The specifics of this process can vary significantly depending on factors such as the subject domain, the item types, the models employed, and the project’s overall scope.

To illustrate the architecture of a machine-learning-based automated scoring process, we can break it down into three high-level components: 1) data selection and preprocessing, 2) feature embedding, and 3) applying a scoring strategy. Each of these components is crucial and can be independently modified or optimized to enhance the accuracy, transparency, and fairness of the scoring system.

Figure 1. The main components of an automated scoring process
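
To make these components concrete, the minimal Python sketch below walks one hypothetical response through the three stages; all function and class names are illustrative, not part of any specific NAEP scoring system.

```python
# A minimal sketch of the three-stage automated scoring pipeline described
# above. Names and the placeholder logic are illustrative only.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    item_id: str
    text: str
    human_score: Optional[int] = None  # present only for labeled training data


def select_and_preprocess(responses: List[Response]) -> List[Response]:
    """Stage 1: select usable responses and standardize their text."""
    cleaned = []
    for r in responses:
        text = r.text.strip().lower()        # basic standardization
        if text:                             # drop empty responses
            cleaned.append(Response(r.item_id, text, r.human_score))
    return cleaned


def embed(responses: List[Response]) -> List[List[float]]:
    """Stage 2: map each response into a numeric feature/embedding space."""
    # Placeholder features; a real system would use engineered features or a
    # pretrained embedding model (Word2Vec, BERT, MathBERT, ...).
    return [[float(len(r.text)), float(r.text.count("="))] for r in responses]


def score(features: List[List[float]]) -> List[int]:
    """Stage 3: apply a scoring strategy (trained classifier, LLM prompt, ...)."""
    # Placeholder rule: award 1 point if the response contains an equation.
    return [1 if f[1] > 0 else 0 for f in features]


if __name__ == "__main__":
    batch = [Response("item_07", "The answer is x = 4 because 2x = 8.")]
    print(score(embed(select_and_preprocess(batch))))  # [1]
```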

Data selection and preprocessing are crucial in automated scoring systems, as the quality of input data directly affects model performance. Traditionally, data is selected for a specific application based on criteria such as question difficulty, response scores, and content. Recent studies, such as Jung et al. (2022), have shown that using Item Response Theory (IRT) scores aligned with human annotations can improve model accuracy, though this is less effective for challenging items with score discrepancies. When data is limited, techniques such as oversampling or synthetic data generation (e.g., pseudo-labeling) can help, but only in controlled quantities to avoid performance issues (Nakamoto et al., 2023). The amount of available data may also drive model choice, as data-driven models require a sufficient quantity of human-scored answers. Conversely, with only rubrics or small reference samples, models rely on calculating semantic similarity between reference and test answers (Yoon, 2023).
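
As one illustration of handling limited or imbalanced data, the sketch below shows naive oversampling of under-represented score levels with scikit-learn's resample; the data frame and column names are assumptions for demonstration, not the setup used in any of the studies cited above.

```python
# Sketch: naive oversampling of under-represented score levels before training.
# Column names ("response_text", "human_score") are illustrative assumptions.
import pandas as pd
from sklearn.utils import resample


def oversample_scores(df: pd.DataFrame, score_col: str = "human_score") -> pd.DataFrame:
    """Upsample every score level to the size of the largest one, then shuffle."""
    target_n = df[score_col].value_counts().max()
    balanced = [
        resample(group, replace=True, n_samples=target_n, random_state=0)
        for _, group in df.groupby(score_col)
    ]
    return pd.concat(balanced).sample(frac=1, random_state=0)
```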

Preprocessing steps such as removing formatting, standardizing text, and tokenizing are used to prepare data for scoring models. Advanced preprocessing, including spelling correction and resizing for non-textual data such as images, enhances feature extraction and model accuracy (Jung et al., 2022; Tyack et al., 2024). The type of data determines the preprocessing approach, ensuring consistency and reducing variability.
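
A minimal sketch of this kind of preprocessing for text responses is shown below; the specific cleaning steps and the tokenization pattern are illustrative assumptions that would vary with the item type and data source.

```python
# Sketch of basic text preprocessing for constructed responses.
import re


def preprocess(raw: str) -> list:
    """Strip simple markup, standardize whitespace and case, then tokenize."""
    text = re.sub(r"<[^>]+>", " ", raw)         # remove simple markup tags
    text = text.lower()                          # standardize case
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    # Keep words, numbers, and common math symbols as separate tokens.
    return re.findall(r"[a-z]+|\d+(?:\.\d+)?|[=+\-*/^()]", text)


print(preprocess("Because 2x = 8, <b>x = 4</b>."))
# ['because', '2', 'x', '=', '8', 'x', '=', '4']
```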

Feature engineering converts text-based responses into formats suitable for model processing. Traditional automated scoring systems often use engineered features that capture specific elements of text, such as individual word tokens (Woods et al., 2017) or language proxies, or they derive features through unsupervised methods such as Latent Semantic Analysis (LSA; Deerwester et al., 1990) or Latent Dirichlet Allocation (LDA; Blei et al., 2003) to capture deeper semantic patterns. More recent systems employ feature embeddings, in which preprocessed responses are mapped to a vocabulary and semantic space. Embeddings such as Word2Vec (Mikolov et al., 2013), BERT (Devlin et al., 2019), ELMo (Peters et al., 2018), and GPT-3 (Brown et al., 2020) differ in training data, tokenization, model architecture, and training tasks. These embeddings capture contextual relationships between tokens, allowing models to understand nuanced meanings and improving the accuracy and reliability of automated scoring outcomes. MathBERT embeddings (Shen et al., 2021) are a variation of the traditional BERT approach that uses a custom math vocabulary with the BertWordPieceTokenizer; they have proven highly effective for downstream tasks, including scoring answers to open-ended questions.
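
As an illustration of contextual feature embeddings, the sketch below mean-pools BERT token embeddings into one vector per response using the Hugging Face transformers library; the generic bert-base-uncased checkpoint is an assumption here, and a domain-specific model such as MathBERT could be substituted by changing the checkpoint name.

```python
# Sketch: contextual embeddings for student responses with a pretrained
# BERT model via Hugging Face transformers (checkpoint choice is illustrative).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

responses = ["x = 4 because 2x = 8", "the slope is rise over run"]
inputs = tokenizer(responses, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings (ignoring padding) to get one vector per response.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # torch.Size([2, 768])
```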

Scoring strategies in automated scoring systems involve various machine learning and deep learning architectures, such as Recurrent Neural Networks (RNNs) and their variants (Erickson et al., 2020), as well as large pre-trained language models like GPT-2, GPT-3, and BERT (Shen et al., 2021). As noted by Latif and Zhai (2024), many scoring models based on LLMs utilize BERT as their foundation. Approaches to adapting pre-trained language models for scoring tasks generally involve either fine-tuning the models on specific datasets or employing prompting strategies to guide the model's responses without extensive retraining. For example, Senanayake and Asanka (2024) developed algorithms and fine-tuned prompts to assess short answer responses across various subjects using both LLMs and machine learning techniques.
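
The sketch below outlines the fine-tuning route in its most common form, treating scoring as sequence classification with a small classification head on a pretrained encoder; the checkpoint, the 0-2 score scale, and the tiny in-memory dataset are illustrative assumptions rather than the setup of any study discussed here.

```python
# Sketch: fine-tuning a pretrained transformer as a score classifier.
# Checkpoint, score scale (0-2), and the toy dataset are illustrative.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Hypothetical human-labeled responses (scores 0-2).
data = Dataset.from_dict({
    "text": ["x = 4 because 2x = 8", "I don't know", "the answer is 4"],
    "labels": [2, 0, 1],
})


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)


data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="scoring_model", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()  # fine-tunes the encoder and classification head on the labels
```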

Prompting strategies, which adapt pre-trained models to downstream tasks without retraining, have also shown effectiveness. Providing models with explicit details such as scoring rubrics, instructor notes, and example student responses has improved model performance. Baral et al. (2024) explored a zero-shot prompting strategy with GPT-4 for scoring open-ended mathematics responses, wherein they provided the model with the problem, the student’s answer, and a scoring rubric. Insights from the NAEP 2023 mathematics assessment challenge also highlighted that incorporating additional contextual information enhances scoring accuracy. In that competition, teams improved their models by leveraging the structure of items, using augmented paraphrases, and integrating process data (Whitmer et al., 2023). Baral et al. (2024) also compared a fine-tuned LLM derived from Mistral with a non-generative model currently used for automated assessment. This comparison underscores the evolving capabilities of LLMs like GPT-4 in handling complex open-ended responses for assessment.
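
To illustrate the zero-shot rubric-prompting idea, the sketch below assembles a scoring prompt from an item, a rubric, and a student response and sends it to a chat-style LLM API; the prompt wording, the OpenAI client, and the model name are assumptions for demonstration, not the exact protocol used by Baral et al. (2024).

```python
# Sketch: zero-shot rubric-based scoring with a chat-style LLM API.
# Item, rubric, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

item = "Solve for x: 2x = 8. Explain your reasoning."
rubric = ("2 = correct answer with valid reasoning; "
          "1 = correct answer with weak or missing reasoning; "
          "0 = incorrect or blank.")
student_response = "x = 4 because dividing both sides by 2 gives x = 4."

prompt = (f"You are scoring a mathematics constructed response.\n"
          f"Item: {item}\nRubric: {rubric}\n"
          f"Student response: {student_response}\n"
          f"Reply with only the integer score.")

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model could be substituted
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message.content)  # e.g., "2"
```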

Evaluation of Scoring Systems

Selecting the best-performing model in automated scoring systems involves comparing the model's output with human scores (Lottridge et al., 2020). The choice of evaluation metrics can vary depending on the specific application and objectives. Most systems use key metrics such as Quadratic Weighted Kappa (QWK) to assess the agreement between human- and system-generated scores. Other common metrics for evaluating model accuracy and consistency include Mean Squared Error (MSE), Standardized Mean Difference (SMD), Area Under the Curve (AUC), Root Mean Squared Error (RMSE), the Kullback–Leibler divergence, and multi-class Cohen’s Kappa (see Rotou & Rupp, 2020, for key statistics).
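
For example, QWK and MSE between human and machine scores can be computed directly with scikit-learn, as in the brief sketch below; the score vectors are invented for illustration.

```python
# Sketch: computing Quadratic Weighted Kappa (QWK) and MSE between human
# and machine-assigned scores. Score vectors are invented for illustration.
from sklearn.metrics import cohen_kappa_score, mean_squared_error

human_scores   = [0, 1, 2, 2, 1, 0, 2, 1]
machine_scores = [0, 1, 2, 1, 1, 0, 2, 2]

qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
mse = mean_squared_error(human_scores, machine_scores)
print(f"QWK = {qwk:.3f}, MSE = {mse:.3f}")
```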

Conclusion

LLMs have shown promising results in automated scoring across various subject domains and are now being applied to constructed response math items, with initial studies indicating positive outcomes. However, significant challenges remain for large-scale mathematics assessments, prompting researchers and practitioners to explore further solutions. Automated scoring systems often struggle to accurately score more difficult items, items with partial credit options, and items with high variance in student responses (Whitmer et al., 2023; Jung et al., 2022). Additionally, there are concerns about bias in training data, as models can replicate the biases present in human-scored samples. To address this, strategies such as evaluating bias before selecting training data and using an ensemble of models of individual scorer behavior are recommended (Lottridge & Young, 2022; Zhang et al., 2023). Moreover, it is essential to assess scoring systems on matched subgroups rather than just averages to better understand potential biases. While LLMs offer significant potential to improve automated scoring by enhancing sparse data, extracting high-quality features, and providing transparent scoring rationales, more research is needed to achieve robust, accurate, and unbiased performance across diverse question types and student populations.

References

Baral, S., Botelho, A. F., Erickson, J. A., Benachamardi, P., & Heffernan, N. T. (2021). Improving Automated Scoring of Student Open Responses in Mathematics. International Educational Data Mining Society. https://eric.ed.gov/?id=ED615565

Baral, S., Worden, E., Lim, W.-C., Luo, Z., Santorelli, C., & Gurung, A. (2024). Automated Assessment in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses. 732–737. https://doi.org/10.5281/zenodo.12729932

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners. arXiv:2005.14165 [cs]. http://arxiv.org/abs/2005.14165

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs]. http://arxiv.org/abs/1810.04805

Erickson, J. A., Botelho, A. F., McAteer, S., Varatharaj, A., & Heffernan, N. T. (2020). The automated grading of student open responses in mathematics. Proceedings of the Tenth International Conference on Learning Analytics & Knowledge, 615–624. https://doi.org/10.1145/3375462.3375523

Fernandez, N., Ghosh, A., Liu, N., Wang, Z., Choffin, B., Baraniuk, R., & Lan, A. (2022). Automated scoring for reading comprehension via in-context BERT tuning. In M. M. Rodrigo, N. Matsuda, A. I. Cristea, & V. Dimitrova (Eds.), Artificial Intelligence in Education (pp. 691–697). Springer International Publishing. https://doi.org/10.1007/978-3-031-11644-5_69

Jung, J. Y., Tyack, L., & von Davier, M. (2022). Automated scoring of constructed-response items using artificial neural networks in international large-scale assessment. Psychological Test and Assessment Modeling, 64(4), 471–494.

Latif, E., & Zhai, X. (2024). Fine-tuning ChatGPT for automatic scoring. Computers and Education: Artificial Intelligence, 6, 100210. https://doi.org/10.1016/j.caeai.2024.100210

Lee, G.-G., Latif, E., Wu, X., Liu, N., & Zhai, X. (2024). Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence, 6, 100213. https://doi.org/10.1016/j.caeai.2024.100213

Lottridge, S., Godek, B., Jafari, A., & Patel, M. (2020). Comparing the Robustness of Deep Learning and Classical Automated Scoring Approaches to Gaming Strategies.

Lottridge, S., & Young, M. (2022). Examining bias in automated scoring of reading comprehension items. Annual Meeting of the National Council on Measurement in Education.

Martinez, M. E., & Bennett, R. E. (1992). A review of automatically scorable constructed-response item types for large-scale assessment. ETS Research Report Series, 1992(2), i–34.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 3111–3119.

Morris, W., Holmes, L., Choi, J. S., & Crossley, S. (2024). Automated Scoring of Constructed Response Items in Math Assessment Using Large Language Models. International Journal of Artificial Intelligence in Education. https://doi.org/10.1007/s40593-024-00418-w

Nakamoto, R., Flanagan, B., Yamauchi, T., Dai, Y., Takami, K., & Ogata, H. (2023). Enhancing Automated Scoring of Math Self-Explanation Quality Using LLM-Generated Datasets: A Semi-Supervised Approach. Computers, 12(11), Article 11. https://doi.org/10.3390/computers12110217

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep Contextualized Word Representations. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 2227–2237). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1202

Ramesh, D., & Sanampudi, S. K. (2022). An automated essay scoring systems: A systematic literature review. Artificial Intelligence Review, 55(3), 2495–2527. https://doi.org/10.1007/s10462-021-10068-2

Rotou, O., & Rupp, A. A. (2020). Evaluations of Automated Scoring Systems in Practice. ETS Research Report Series, 2020(1), 1–18. https://doi.org/10.1002/ets2.12293

Senanayake, C., & Asanka, D. (2024). Rubric Based Automated Short Answer Scoring using Large Language Models (LLMs). 2024 International Research Conference on Smart Computing and Systems Engineering (SCSE), 7, 1–6. https://doi.org/10.1109/SCSE61872.2024.10550624

Shen, J. T., Yamashita, M., Prihar, E., Heffernan, N., Wu, X., Graff, B., & Lee, D. (2021, June 2). MathBERT: A pre-trained language model for general NLP tasks in mathematics education. arXiv. https://arxiv.org/abs/2106.07340v5

Stankous, N. V. (2016). Constructive response vs. Multiple-choice tests in math: American experience and discussion. 2nd Pan-American Interdisciplinary Conference, PIC 2016 24-26 February, Buenos Aires Argentina, 321.

Tyack, L., Khorramdel, L., & von Davier, M. (2024). Using convolutional neural networks to automatically score eight TIMSS 2019 graphical response items. Computers and Education: Artificial Intelligence, 6, 100249. https://doi.org/10.1016/j.caeai.2024.100249

Whitmer, J., Beiting-Parrish, M., Blankenship, C., Fowler-Dawson, A., & Pitcher, M. (2023). NAEP Math Item Automated Scoring Data Challenge results: High accuracy and potential for additional insights.

Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A Framework for Evaluation and Use of Automated Scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x

Wind, S. A. (2019). Examining the impacts of rater effects in performance assessments. Applied Psychological Measurement, 43(2), 159–171.

Woods, B., Adamson, D., Miel, S., & Mayfield, E. (2017). Formative Essay Feedback Using Predictive Scoring Models. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2071–2080. https://doi.org/10.1145/3097983.3098160

Yoon, S.-Y. (2023, May 29). Short answer grading using one-shot prompting and text similarity scoring model. arXiv. https://arxiv.org/abs/2305.18638v1

Zhang, M., Heffernan, N., & Lan, A. (2023, June 1). Modeling and analyzing scorer preferences in short-answer math questions. arXiv. https://arxiv.org/abs/2306.00791v1
