The NAEP R&D Hub is often able to give readers in the NAEP research community a sneak peek at upcoming research from experts in the field. This week, we’re happy to share work from NAEP researchers Ruhan Circi and Bhashithe Abeysinghe on passage text difficulty and language models, which will soon appear as a white paper titled “A Voice from Past: Passage Text Difficulty in the Hive of Language Models.” Check out the full preview below, and remember to subscribe to stay connected with other exciting research and opportunities to get involved!
A Voice from Past: Passage Text Difficulty in the Hive of Language Models
Introduction
Automated item/passage generation for assessment development offers several key benefits: (a) reduced item generation time, (b) lowered costs for creating items, (c) support for continuous and rapid item development to maintain large item pools, and (d) tailored items for customized measurement and learning needs. This line of research and operational work has a long history (e.g., Circi et al., 2023). With the introduction of large pre-trained language models such as BERT and GPT-2, 3.5, and 4o (e.g., Attali et al., 2022), there is promise of fulfilling the potential of automated item generation in a more effective and cost-efficient manner, and the field is now experimenting with large language models at a faster pace than ever.
For reading passage and question generation, language models offer speed, but they require human supervision over the appropriateness of the content and its difficulty for the intended readers (e.g., grade levels). Special attention to text/passage complexity is needed in reading assessments to maintain the desired level of comparability at the intended level. Various metrics exist to calculate passage difficulty: both open-access (Coleman & Liau, 1975; Flesch, 1979; Fletcher, 2006; Kincaid et al., 1975; Kincaid & Delionbach, 1973; Londoner, 1967; Stenner, 1996) and closed-access (Attali & Burstein, 2006; Hwang & Utami, 2024).
In this research preview, we briefly discuss the potential and limitations of existing metrics and our attempt to combine them into a single model that represents passage difficulty more comprehensively.
Context for Text Difficulty: Why It Is Needed
Engaging readers with texts of appropriate complexity is essential for effective reading comprehension. Specifically, text difficulty plays a crucial role in standardized reading assessments, ensuring comparability across different grade levels. Text difficulty can be defined in various ways but typically refers to vocabulary, sentence complexity, and organization of text, among other factors (Davidson, 2013; Kincaid & Delionbach, 1973; Londoner, 1967; McNamara & Graesser, 2011; Morris, 2017).
Most reading difficulty metrics were developed decades ago, with some dating back to the 1950s, yet they remain relevant. Many of these metrics, however, overlook features beyond simple word and sentence counts. The metrics included in this study aim to measure text complexity, text difficulty, or reading ease (all three terms are used by the metrics themselves). A comprehensive metric that combines qualitative (linguistic) and quantitative features can aid the discussion of appropriate text difficulty.
Available Measures and Related Research
Several metrics have been developed to calculate text difficulty, such as those of Flesch (1979), Liu (2008), and Stenner (1996). We observe two types of metrics: quantitative and qualitative. Quantitative metrics rely on features such as word frequencies and sentence length (e.g., Flesch Reading Ease, Flesch-Kincaid Grade Level, Lexile score, and Gunning Fog). Qualitative metrics, on the other hand, consider linguistic features such as lexical choices, syntactic structures, cohesion, coherence, and more (e.g., dependency distance and Spache).
Many of the prominent metrics belong to the quantitative class, and their implementation is straightforward. However, our experiments show that many of these metrics do not correlate well with each other (as also reported in Mailloux et al., 1994). This underscores the need to account for more features to cover text difficulty comprehensively, as relying on a single type of metric can be misleading or omit the additional information that other metrics provide.
Table 1. Selected prominent metrics
* May not cover all the grade levels
** Suitable for grades below grade 4
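To illustrate how readily the quantitative metrics in Table 1 can be computed, the sketch below scores a single passage with several of them. It is a minimal example assuming the open-source textstat Python package, which is not named in the preview; the passage itself is invented.

```python
# Minimal sketch: scoring one invented passage with several quantitative
# readability metrics, using the open-source `textstat` package (an
# assumption; the preview does not name a specific toolkit).
import textstat

passage = (
    "The fox trotted along the edge of the wood, sniffing the cold air. "
    "Winter was close, and the field mice had already gone to ground."
)

scores = {
    "Flesch Reading Ease": textstat.flesch_reading_ease(passage),
    "Flesch-Kincaid Grade": textstat.flesch_kincaid_grade(passage),
    "Gunning Fog": textstat.gunning_fog(passage),
    "Coleman-Liau Index": textstat.coleman_liau_index(passage),
    "Linsear Write": textstat.linsear_write_formula(passage),
    "Dale-Chall": textstat.dale_chall_readability_score(passage),
    "Spache": textstat.spache_readability(passage),
}

# The printed values typically disagree, which is the point made above:
# each formula weighs word and sentence features differently.
for name, value in scores.items():
    print(f"{name:>22}: {value:.2f}")
```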
For this research preview, we used a dataset from Project Gutenberg consisting of children's literature in text format. In Figure 1, we compare all the metrics that produce a grade-level score via a correlation heatmap. This visualization shows how the features used in each metric lead to variations in the results. For example, the Spache metric uses a list of words that young children should be able to read, whereas the Linsear Write index identifies difficult words based on the number of syllables. Flesch-Kincaid, Gunning Fog, and Linsear Write all use similar approaches to compute the difficulty score, which is reflected in their correlation. The most distinct metric is the Coleman-Liau index, which uses characters instead of words to compute difficulty.
Figure 1. Correlation heatmap of different metrics that often yield varying difficulty scores for the same passage
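A comparison of this kind can be sketched as follows: score every passage with each grade-level metric and correlate the resulting columns. The snippet assumes the textstat, pandas, seaborn, and matplotlib packages, and the two short passages are invented stand-ins for the full Project Gutenberg set.

```python
# Sketch of the comparison behind Figure 1: build a passages-by-metrics
# table, then correlate the metric columns. The two passages are
# placeholders; in practice the full Project Gutenberg set would be used.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import textstat

passages = [
    "The cat sat by the warm stove and watched the snow fall outside.",
    "Notwithstanding the inclemency of the season, the expedition "
    "persevered across the desolate, windswept plateau.",
]

def grade_level_scores(text: str) -> dict:
    """Grade-level style scores for a single passage."""
    return {
        "Flesch-Kincaid": textstat.flesch_kincaid_grade(text),
        "Gunning Fog": textstat.gunning_fog(text),
        "Linsear Write": textstat.linsear_write_formula(text),
        "Coleman-Liau": textstat.coleman_liau_index(text),
        "Spache": textstat.spache_readability(text),
    }

df = pd.DataFrame(grade_level_scores(p) for p in passages)

# Pairwise Pearson correlations between metrics across the passage set.
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, cmap="vlag")
plt.title("Correlation of grade-level readability metrics")
plt.tight_layout()
plt.show()
```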
Figure 2 presents the variability in how each metric scores a sample of 100 passages. The Spache metric, designed for grades below 4, consistently produces lower grade-level scores; applying it to higher grades can yield less stable scores. Similarly, the Gunning Fog index is not defined for grades lower than 6. The Linsear Write index typically generates scores at the higher end of the spectrum, indicating a tendency to overestimate passage difficulty compared to other metrics.
Figure 2. How various metrics score a selected sample of 100 passages in the Project Gutenberg dataset
While each of these metrics measures relevant aspects of text difficulty within its specified context, it is clear that none of them individually provides a complete and comprehensive score, and each requires careful consideration in use.
In our work, we utilize features from the aforementioned metrics to create a composite passage difficulty model. Our approach collects the features underlying all selected difficulty metrics, allowing us to address both quantitative and qualitative aspects of difficulty and to combine features at different levels (e.g., number of words or number of syllables). This gives us a comprehensive feature set on which to train models; a minimal sketch of such feature collection appears below. However, one operational hurdle is the lack of curated grade-level annotations for our dataset with which to explore and verify the outcomes.
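The sketch below collects a few quantitative counts plus one linguistic feature (mean dependency distance) per passage. The specific feature choices and the use of textstat and spaCy are illustrative assumptions, not the authors' actual feature set.

```python
# Hedged sketch of per-passage feature collection: quantitative counts via
# `textstat` plus a mean dependency distance via spaCy (requires the
# `en_core_web_sm` model). The feature list is illustrative only.
import textstat
import spacy

nlp = spacy.load("en_core_web_sm")

def passage_features(text: str) -> dict:
    """Collect quantitative and linguistic features for one passage."""
    doc = nlp(text)
    n_words = textstat.lexicon_count(text)
    # Distance between each token and its syntactic head (root excluded).
    dep_distances = [abs(tok.i - tok.head.i) for tok in doc if tok.head is not tok]
    return {
        "n_words": n_words,
        "n_sentences": textstat.sentence_count(text),
        "n_syllables": textstat.syllable_count(text),
        "difficult_word_ratio": textstat.difficult_words(text) / max(n_words, 1),
        "mean_dependency_distance": sum(dep_distances) / max(len(dep_distances), 1),
    }

print(passage_features("The fox trotted along the edge of the wood."))
```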
Our analysis includes scaling the features to a common range, as the selected features can vary significantly in magnitude. For example, the number of words could range from 50 to 6,000,000, while the ratio of difficult words to all unique words is always between 0 and 1. To prevent the model from being overly sensitive to these large numbers, we normalize all values to the same range. After this, we compute a Uniform Manifold Approximation and Projection (UMAP) embedding (Becht et al., 2019), which allows us to better visualize and cluster the features. Finally, we compute clusters using the MeanShift algorithm (Georgescu et al., 2003), resulting in seven clusters. Further investigation reveals a silhouette score of 0.42 (Shahapure & Nicholas, 2020), indicating reasonably cohesive and well-separated clusters. A sketch of this pipeline appears below.
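The pipeline can be sketched as follows, assuming scikit-learn and the umap-learn package; the feature matrix here is a synthetic placeholder rather than the actual passage features.

```python
# Sketch of the clustering pipeline: scale features to a common range,
# embed with UMAP, cluster with MeanShift, and check the silhouette score.
# `X` is a synthetic placeholder for the real (passages x features) matrix.
import umap  # from the umap-learn package
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import MeanShift
from sklearn.metrics import silhouette_score

# Placeholder feature matrix with built-in structure; replace with the
# per-passage features collected earlier.
X, _ = make_blobs(n_samples=100, n_features=8, centers=5, random_state=0)

# Normalize every feature to [0, 1] so large counts do not dominate.
X_scaled = MinMaxScaler().fit_transform(X)

# Two-dimensional UMAP embedding for visualization and clustering.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X_scaled)

# MeanShift estimates the number of clusters itself.
labels = MeanShift().fit_predict(embedding)

print("clusters found:", len(set(labels)))
print("silhouette score:", round(silhouette_score(embedding, labels), 2))
```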
Figure 3. Clustering via UMAP embeddings
In conclusion, examining the potential of a single model that integrates various metrics into a comprehensive measure of passage difficulty is highly valuable. This approach can contribute to ongoing work on using language models to generate texts and items for educational assessment. Stay tuned for our upcoming research results, which will include reading passages from large-scale assessments and our exploration and verification findings.
References
Attali, Y., & Burstein, J. (2006). Automated Essay Scoring With e-rater® V.2. The Journal of Technology, Learning and Assessment, 4(3), Article 3. https://ejournals.bc.edu/index.php/jtla/article/view/1650
Attali, Y., Runge, A., LaFlair, G. T., Yancey, K., Goodwin, S., Park, Y., & Von Davier, A. A. (2022). The interactive reading task: Transformer-based automatic item generation. Frontiers in Artificial Intelligence, 5, 903077.
Circi, R., Hicks, J., & Sikali, E. (2023). Automatic item generation: Foundations and machine learning-based approaches for assessments. Frontiers in Education, 8, 858273. https://doi.org/10.3389/feduc.2023.858273
Coleman, M., & Liau, T. L. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2), 283–284. https://doi.org/10.1037/h0076540
Dale, E., & Chall, J. S. (1948). A Formula for Predicting Readability: Instructions. Educational Research Bulletin, 27(2), 37–54.
Davidson, M. (2013). Books that children can read. Decodable books and book leveling.
Eltorai, A. E. M., Naqvi, S. S., Ghanian, S., Eberson, C. P., Weiss, A.-P. C., Born, C. T., & Daniels, A. H. (2015). Readability of Invasive Procedure Consent Forms. Clinical and Translational Science, 8(6), 830–833. https://doi.org/10.1111/cts.12364
Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221–233. https://doi.org/10.1037/h0057532
Flesch, R. (1979). How to write plain English. University of Canterbury. Available at http://www.mang.canterbury.ac.nz/writing_guide/writing/flesch.shtml [Retrieved 5 February 2016].
Fletcher, J. M. (2006). Measuring Reading Comprehension. Scientific Studies of Reading, 10(3), 323–330. https://doi.org/10.1207/s1532799xssr1003_7
Georgescu, B., Shimshoni, I., & Meer, P. (2003). Mean shift based clustering in high dimensions: A texture classification example. 456–463.
Gunning, R. (2004). Plain language at work newsletter.
Liu, H. (2008). Dependency Distance as a Metric of Language Comprehension Difficulty. Journal of Cognitive Science, 9(2), 159–191. https://doi.org/10.17791/JCS.2008.9.2.159
Hwang, W.-Y., & Utami, I. Q. (2024). Using GPT and authentic contextual recognition to generate math word problems with difficulty levels. Education and Information Technologies. https://doi.org/10.1007/s10639-024-12537-x
Kincaid, J. P., & Delionbach, L. J. (1973). Validation of the Automated Readability Index: A Follow-Up. Human Factors, 15(1), 17–20. https://doi.org/10.1177/001872087301500103
Kincaid, J. P., Fishburne Jr, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Naval Technical Training Command Millington TN Research Branch.
Londoner, C. A. (1967). A readability analysis of randomly selected basic education and vocational education curriculum materials used at the Atterbury Job Corps Center as measured by the Gunning Fog Index.
Mailloux, S., Johnson, M., Fisher, D., & Pettibone, T. (1994). How reliable is computerized assessment of readability? Computers in Nursing, 13, 221–225.
McNamara, D., & Graesser, A. (2011). Coh-Metrix: An Automated Tool for Theoretical and Applied Natural Language Processing. Applied Natural Language Processing and Content Analysis: Identification, Investigation, and Resolution, 188–205. https://doi.org/10.4018/978-1-60960-741-8.ch011
Morris, D. (2017). The Howard Street Tutoring Manual, Second Edition: Teaching At-Risk Readers in the Primary Grades. Guilford Publications.
Shahapure, K. R., & Nicholas, C. (2020). Cluster Quality Analysis Using Silhouette Score. 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), 747–748. https://doi.org/10.1109/DSAA49011.2020.00096
Spache, G. (1953). A New Readability Formula for Primary-Grade Reading Materials. The Elementary School Journal, 53(7), 410–413. https://doi.org/10.1086/458513
Stenner, A. J. (1996). Measuring Reading Comprehension with the Lexile Framework. 31.