Description
This study aims to explore whether different prompt designs produce significant differences in the scores ChatGPT5.0 generates for IELTS Writing Task 2 essays, and to examine to what extent ChatGPT5.0's essay scores align with human ratings. Using a dataset of 56 essays, scores generated under two prompt designs (with and without calibration examples) were compared with each other and with a human benchmark derived from multiple raters. Scores were analyzed across the four IELTS criteria (Task Response, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy) as well as overall performance, using descriptive statistics, repeated measures ANOVA, Pearson correlation, and intraclass correlation coefficients (ICC). Findings revealed that prompt design, specifically the inclusion of calibration examples, influenced ChatGPT's automated essay scoring: scores differed significantly between the two prompting scenarios. While ChatGPT and human scores did not differ significantly at the overall level, systematic discrepancies were detected on three of the four marking criteria, with Task Response being the exception. Specifically, ChatGPT tended to assign higher scores for Coherence and Cohesion and lower scores for the lexical and grammatical criteria. Pearson correlation analyses found moderate to strong relationships between ChatGPT-based scores and human ratings, suggesting that ChatGPT could reliably rank writing performance, whereas lower intraclass correlation coefficients indicated weaker alignment in terms of absolute scores.
Keywords: ChatGPT5.0, IELTS Writing, reliability
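To illustrate the distinction the abstract draws between rank-order consistency (Pearson r) and absolute agreement (ICC), the following is a minimal sketch, not the study's actual analysis code: it computes Pearson r with SciPy and a two-way random-effects, absolute-agreement ICC(2,1) from the standard ANOVA decomposition. The band scores below are hypothetical, and the pairing of a human benchmark column with a ChatGPT column is an assumed setup for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_essays, k_raters) array, one row per essay.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-essay means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Two-way ANOVA sums of squares and mean squares.
    ss_rows = k * np.sum((row_means - grand_mean) ** 2)
    ss_cols = n * np.sum((col_means - grand_mean) ** 2)
    ss_total = np.sum((ratings - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical band scores for a handful of essays (not the study's data):
# column 0 = human benchmark, column 1 = ChatGPT-generated score.
scores = np.array([
    [6.0, 6.5],
    [5.5, 5.5],
    [7.0, 7.5],
    [6.5, 6.0],
    [5.0, 5.5],
    [7.5, 7.0],
])

r, p = pearsonr(scores[:, 0], scores[:, 1])   # rank-order consistency
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
print(f"ICC(2,1)  = {icc_2_1(scores):.2f}")   # absolute agreement
```

A pattern like the one reported, i.e., a high Pearson r alongside a lower ICC, would arise if ChatGPT ordered the essays much as the human raters did but applied a systematic offset on some criteria, since Pearson r is insensitive to such shifts while an absolute-agreement ICC penalizes them.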