Abstract
Essay writing tests, integral in many educational settings, demand significant resources for manual scoring. Automated essay scoring (AES) can alleviate this burden by automating the process, thereby reducing human effort. However, the multitude of AES models, each varying in its features and scoring approach, complicates the selection of a single optimal model, especially when evaluating diverse content-related aspects across multiple rating items. Therefore, we propose a hierarchical rater model-based approach to integrate predictions from multiple AES models, accounting for their distinct scoring behaviors. We investigated its performance on data from a university essay writing test. The proposed method achieved accuracy comparable to the best individual AES model. This is a promising result because the proposed method additionally reduced the amount of differential item functioning between human and automated scoring and thus established a higher degree of measurement invariance compared to the individual AES models.
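For readers unfamiliar with the hierarchical rater model (HRM), a minimal sketch of its common signal-detection formulation (Patz, Junker, Johnson, & Mariano, 2002) is given below. The notation is taken from that literature, not from the article itself, and the mapping of AES models onto raters is an assumption about the proposed approach.

```latex
% Signal-detection level of the HRM (Patz et al., 2002); notation assumed,
% not quoted from the article. Rater r's observed rating X_{ir} of essay i
% scatters around the essay's latent "ideal" score \xi_i:
P(X_{ir} = k \mid \xi_i) \;\propto\;
  \exp\!\left( -\,\frac{\bigl[\,k - (\xi_i + \phi_r)\,\bigr]^2}{2\,\psi_r^2} \right)
```

Here $\xi_i$ is the essay's latent ideal score (itself typically modeled with an item response theory model at the second level), $\phi_r$ captures rater $r$'s severity or leniency, and $\psi_r$ its variability. Presumably, each AES model in the proposed approach enters as one such rater $r$, so the models' distinct scoring behaviors are absorbed into $(\phi_r, \psi_r)$ rather than biasing the integrated score.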
| Original language | English |
|---|---|
| Pages (from-to) | 209–218 |
| Number of pages | 10 |
| Journal | Zeitschrift für Psychologie / Journal of Psychology |
| Volume | 232 |
| Issue number | 3 |
| Early online date | 12 Jul 2024 |
| DOIs | |
| Publication status | Published - Jul 2024 |
Keywords
- automated essay scoring
- formative assessment
- hierarchical rater model
- natural language processing
- transformer models