Automated Essay Scoring

26 papers with code • 1 benchmark • 1 dataset

Essay scoring: Automated Essay Scoring is the task of assigning a score to an essay, usually in the context of assessing the language ability of a language learner. The quality of an essay is affected by the following four primary dimensions: topic relevance, organization and coherence, word usage and sentence complexity, and grammar and mechanics.

Source: A Joint Model for Multimodal Document Quality Assessment

Benchmarks

Best model: Tran-BERT-MS-ML-R

Most implemented papers

Automated Essay Scoring Based on Two-Stage Learning

Current state-of-the-art feature-engineered and end-to-end Automated Essay Scoring (AES) methods are proven to be unable to detect adversarial samples, e.g., essays composed of permuted sentences and prompt-irrelevant essays.

A Neural Approach to Automated Essay Scoring

nusnlp/nea • EMNLP 2016

SkipFlow: Incorporating Neural Coherence Features for End-to-End Automatic Text Scoring


Our new method proposes a new SkipFlow mechanism that models relationships between snapshots of the hidden representations of a long short-term memory (LSTM) network as it reads.

Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input

Youmna-H/Coherence_AES • NAACL 2018

We demonstrate that current state-of-the-art approaches to Automated Essay Scoring (AES) are not well-suited to capturing adversarially crafted input of grammatical but incoherent sequences of sentences.

Co-Attention Based Neural Network for Source-Dependent Essay Scoring

This paper presents an investigation of using a co-attention based neural network for source-dependent essay scoring.

Language models and Automated Essay Scoring

In this paper, we present a new comparative study on automatic essay scoring (AES).

Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems

midas-research/calling-out-bluff • 14 Jul 2020

This number is increasing further due to COVID-19 and the associated automation of education and testing.

Prompt Agnostic Essay Scorer: A Domain Generalization Approach to Cross-prompt Automated Essay Scoring

Cross-prompt automated essay scoring (AES) requires the system to use non-target-prompt essays to award scores to a target-prompt essay.

Many Hands Make Light Work: Using Essay Traits to Automatically Score Essays

To find out which traits work best for different types of essays, we conduct ablation tests for each of the essay traits.

EXPATS: A Toolkit for Explainable Automated Text Scoring

octanove/expats • 7 Apr 2021

Automated text scoring (ATS) tasks, such as automated essay scoring and readability assessment, are important educational applications of natural language processing.

ORIGINAL RESEARCH article

Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value

Vivekanandan Kumar

  • School of Computing and Information Systems, Faculty of Science and Technology, Athabasca University, Edmonton, AB, Canada

Automated essay scoring (AES) is a compelling topic in Learning Analytics for the primary reason that recent advances in AI find it a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores. Consequently, the AES black box has remained impenetrable. Although several algorithms from Explainable Artificial Intelligence have recently been published, no research has yet investigated the role that these explanation models can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing personalized, formative, and fine-grained feedback to students during the writing process. Building on previous studies where models were trained to predict both the holistic and rubric scores of essays, using the Automated Student Assessment Prize’s essay datasets, this study focuses on predicting the quality of the writing style of Grade-7 essays and exposes the decision processes that lead to these predictions. In doing so, it evaluates the impact of deep learning (multi-layer perceptron neural networks) on the performance of AES. It has been found that the effect of deep learning can be best viewed when assessing the trustworthiness of explanation models. As more hidden layers were added to the neural network, the descriptive accuracy increased by about 10%. This study shows that faster (up to three orders of magnitude) SHAP implementations are as accurate as the slower model-agnostic one. It leverages the state-of-the-art in natural language processing, applying feature selection on a pool of 1592 linguistic indices that measure aspects of text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity. In addition to the list of most globally important features, this study reports (a) a list of features that are important for a specific essay (locally), (b) a range of values for each feature that contribute to higher or lower rubric scores, and (c) a model that makes it possible to quantify the impact of implementing formative feedback.

Automated essay scoring (AES) is a compelling topic in Learning Analytics (LA) for the primary reason that recent advances in AI find it a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; only a few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores (Kumar et al., 2017; Taghipour, 2017; Kumar and Boulanger, 2020). None has attempted to explain the whole decision process of AES, from holistic scores to rubric scores and from rubric scores to writing feature modeling. Although several algorithms from XAI (explainable artificial intelligence) (Adadi and Berrada, 2018; Murdoch et al., 2019) have recently been published (e.g., LIME, SHAP) (Ribeiro et al., 2016; Lundberg and Lee, 2017), no research has yet investigated the role that these explanation models (trained on top of predictive models) can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing teachers and students with personalized, formative, and fine-grained feedback during the writing process.

One of the key anticipated benefits of AES is the elimination of human bias such as rater fatigue, rater’s expertise, severity/leniency, scale shrinkage, stereotyping, halo effect, rater drift, perception difference, and inconsistency (Taghipour, 2017). In turn, AES may suffer from its own set of biases (e.g., imperfections in training data, spurious correlations, overrepresented minority groups), which has incited the research community to look for ways to make AES more transparent, accountable, fair, unbiased, and consequently trustworthy while remaining accurate. This required changing the perception that AES is merely a machine learning and feature engineering task (Madnani et al., 2017; Madnani and Cahill, 2018). Hence, researchers have advocated that AES should be seen as a shared task requiring several methodological design decisions along the way, such as curriculum alignment, construction of training corpora, reliable scoring processes, and rater performance evaluation, where the goal is to build and deploy fair and unbiased scoring models to be used in large-scale assessments and classroom settings (Rupp, 2018; West-Smith et al., 2018; Rupp et al., 2019). Unfortunately, although these measures are intended to design reliable and valid AES systems, they may still fail to build trust among users, keeping the AES black box impenetrable for teachers and students.

It has been previously recognized that divergence of opinion between human and machine graders has only been investigated superficially (Reinertsen, 2018). So far, researchers have investigated, through qualitative analyses, the characteristics of essays that ended up being rejected by AES systems (requiring a human to score them) (Reinertsen, 2018). Others strived to justify predicted scores by identifying essay segments that actually caused the predicted scores. Although these justifications hinted at and quantified the importance of these spatial cues, they did not provide any feedback as to how to improve those suboptimal essay segments (Mizumoto et al., 2019).

Related to this study and the work of Kumar and Boulanger (2020) is Revision Assistant, a commercial AES system developed by Turnitin (Woods et al., 2017; West-Smith et al., 2018), which in addition to predicting essays’ holistic scores provides formative, rubric-specific, and sentence-level feedback over multiple drafts of a student’s essay. The implementation of Revision Assistant moved away from the traditional approach to AES, which consists of using a limited set of features engineered by human experts, representing only high-level characteristics of essays. Like this study, it opted instead for including a large number of low-level writing features, demonstrating that expert-designed features are not required to produce interpretable predictions. Revision Assistant’s performance was reported on two essay datasets, one of which was the Automated Student Assessment Prize (ASAP) dataset. However, performance on the ASAP dataset was reported in terms of quadratic weighted kappa and only for holistic scores. Models predicting rubric scores were trained only with the other dataset, which was hosted on and collected through Revision Assistant itself.

In contrast to feature-based approaches like the one adopted by Revision Assistant, other AES systems are implemented using deep neural networks where features are learned during model training. For example, Taghipour (2017) in his doctoral dissertation leverages a recurrent neural network to improve accuracy in predicting holistic scores, implement rubric scoring (i.e., organization and argument strength), and distinguish between human-written and computer-generated essays. Interestingly, Taghipour compared the performance of his AES system against other AES systems using the ASAP corpora, but he did not use the ASAP corpora to train rubric scoring models, although ASAP provides two corpora provisioning rubric scores (#7 and #8). Finally, research was also undertaken to assess the generalizability of rubric-based models by performing experiments across various datasets. It was found that the predictive power of such rubric-based models was related to how much the underlying feature set covered a rubric’s criteria (Rahimi et al., 2017).

Despite their numbers, rubrics (e.g., organization, prompt adherence, argument strength, essay length, conventions, word choices, readability, coherence, sentence fluency, style, audience, ideas) are usually investigated in isolation and not as a whole, with the exception of Revision Assistant, which provides feedback simultaneously on the following five rubrics: claim, development, audience, cohesion, and conventions. The literature reveals that rubric-specific automated feedback includes numerical rubric scores as well as recommendations on how to improve essay quality and correct errors (Taghipour, 2017). Again, except for Revision Assistant, which undertook a holistic approach to AES including holistic and rubric scoring and the provision of rubric-specific feedback at the sentence level, AES has generally not been investigated as a whole or as an end-to-end product. Hence, the AES system used in this study and developed by Kumar and Boulanger (2020) is unique in that it uses both deep learning (a multi-layer perceptron neural network) and a huge pool of linguistic indices (1592), predicts both holistic and rubric scores, explains holistic scores in terms of rubric scores, and reports which linguistic indices are the most important by rubric. This study, however, goes one step further and showcases how to explain the decision process behind the prediction of a rubric score for a specific essay, one of the main AES limitations identified in the literature (Taghipour, 2017) that this research intends to address, at least partially.

Besides providing explanations of predictions both globally and individually, this study not only goes one step further toward the automated provision of formative feedback but also does so in alignment with the explanation model and the predictive model, making it possible to better map feedback to the actual characteristics of an essay. Woods et al. (2017) succeeded in associating sentence-level expert-derived feedback with strong/weak sentences having the greatest influence on a rubric score, based on the rubric, the essay score, and the sentence characteristics. While Revision Assistant’s feature space consists of counts and binary occurrence indicators of word unigrams, bigrams, and trigrams, character four-grams, and part-of-speech bigrams and trigrams, these are mainly textual and locational indices; by nature they are not descriptive or self-explanatory. This research fills this gap by proposing feedback based on a set of linguistic indices that can encompass several sentences at a time. However, the proposed approach omits locational hints, leaving the merging of the two approaches as the next step to be addressed by the research community.

Although this paper proposes to extend the automated provision of formative feedback through an interpretable machine learning method, it focuses on the feasibility of automating it in the context of AES rather than on evaluating its pedagogical quality (such as the informational and communicational value of feedback messages) or its impact on students’ writing performance, a topic that will be kept for an upcoming study. Having an AES system that is capable of delivering real-time formative feedback sets the stage to investigate (1) when feedback is effective, (2) the types of feedback that are effective, and (3) whether there exist different kinds of behaviors in terms of seeking and using feedback (Goldin et al., 2017). Finally, this paper omits describing the mapping between the AES model’s linguistic indices and a pedagogical language that is easily understandable by students and teachers, which is beyond its scope.

Methodology

This study showcases the application of the PDR framework (Murdoch et al., 2019), which provides three pillars to describe interpretations in the context of the data science life cycle: Predictive accuracy, Descriptive accuracy, and Relevancy to human audience(s). It is important to note that in a broader sense both terms “explainable artificial intelligence” and “interpretable machine learning” can be used interchangeably with the following meaning (Murdoch et al., 2019): “the use of machine-learning models for the extraction of relevant knowledge about domain relationships contained in data.” Here “predictive accuracy” refers to the measurement of a model’s ability to fit data; “descriptive accuracy” is the degree to which the relationships learned by a machine learning model can be objectively captured; and “relevant knowledge” implies that a particular audience gets insights into a chosen domain problem that guide its communication, actions, and discovery (Murdoch et al., 2019).

In the context of this article, formative feedback that assesses students’ writing skills and prescribes remedial writing strategies is the relevant knowledge sought, whose effectiveness on students’ writing performance will be validated in an upcoming study. However, the current study puts forward the tools and evaluates the feasibility of offering this real-time formative feedback. It also measures the predictive and descriptive accuracies of AES and explanation models, two key components for generating trustworthy interpretations (Murdoch et al., 2019). Naturally, the provision of formative feedback is dependent on the speed of training and evaluating new explanation models every time a new essay is ingested by the AES system. That is why this paper investigates the potential of various SHAP implementations for speed optimization without compromising the predictive and descriptive accuracies. This article will show how the insights generated by the explanation model can serve to debug the predictive model and contribute to enhancing the feature selection and/or engineering process (Murdoch et al., 2019), laying the foundation for the provision of actionable and impactful pieces of knowledge to educational audiences, whose relevancy will be judged by the human stakeholders and estimated by the magnitude of resulting changes.

Figure 1 overviews all the elements and steps encompassed by the AES system in this study. The following subsections will address each facet of the overall methodology, from hyperparameter optimization to relevancy to both students and teachers.


Figure 1. A flow chart exhibiting the sequence of activities to develop an end-to-end AES system and how the various elements work together to produce relevant knowledge for the intended stakeholders.

Automated Essay Scoring System, Dataset, and Feature Selection

As previously mentioned, this paper reuses the AES system developed by Kumar and Boulanger (2020). The AES models were trained using ASAP’s seventh essay corpus. These narrative essays were written by Grade-7 students in the setting of state-wide assessments in the United States and had an average length of 171 words. Students were asked to write a story about patience. Kumar and Boulanger’s work consisted of training a predictive model for each of the four rubrics according to which essays were graded: ideas, organization, style, and conventions. Each essay was scored by two human raters on a 0−3 scale (integer scale). Rubric scores were resolved by adding the rubric scores assigned by the two human raters, producing a resolved rubric score between 0 and 6. This paper is a continuation of Boulanger and Kumar (2018, 2019, 2020) and Kumar and Boulanger (2020), where the objective is to open the AES black box to explain the holistic and rubric scores that it predicts. Essentially, the holistic score (Boulanger and Kumar, 2018, 2019) is determined and justified through its four rubrics. Rubric scores, in turn, are investigated to highlight the writing features that play an important role within each rubric (Kumar and Boulanger, 2020). Finally, beyond global feature importance, it is indispensable not only to identify which writing indices are important for a particular essay (local), but also to discover how they contribute to increasing or decreasing the predicted rubric score, and which feature values are more or less desirable (Boulanger and Kumar, 2020). This paper extends these previous works by adding the following links to the AES chain: holistic score, rubric scores, feature importance, explanations, and formative feedback. The objective is to highlight the means for transparent and trustable AES while empowering learning analytics practitioners with the tools to debug these models and equipping educational stakeholders with an AI companion that will semi-autonomously generate formative feedback to teachers and students. Specifically, this paper analyzes the AES reasoning underlying its assessment of the “style” rubric, which looks for command of language, including effective and compelling word choice and varied sentence structure, that clearly supports the writer’s purpose and audience.

This research’s approach to AES leverages a feature-based multi-layer perceptron (MLP) deep neural network to predict rubric scores. The AES system is fed by 1592 linguistic indices quantitatively measured by the Suite of Automatic Linguistic Analysis Tools (SALAT), which assess aspects of grammar and mechanics, sentiment analysis and cognition, text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity (Kumar and Boulanger, 2020). The purpose of using such a huge pool of low-level writing features is to let deep learning extract the most important ones; the literature supports this practice since there is evidence that automatically selected features are not less interpretable than engineered ones (Woods et al., 2017). However, to facilitate this process, this study opted for a semi-automatic strategy that consisted of both filter and embedded methods. Firstly, the original ASAP’s seventh essay dataset consists of a training set of 1567 essays and validation and testing sets of 894 essays combined. While the texts of all 2461 essays are still available to the public, only the labels (the rubric scores of two human raters) of the training set have been shared with the public. Yet, this paper reused the unlabeled 894 essays of the validation and testing sets for feature selection, a process that must be carried out carefully so that it is not informed by the essays that will train the predictive model. Secondly, feature data were normalized, and features with variances lower than 0.01 were pruned. Thirdly, for any pair of features having an absolute Pearson correlation coefficient greater than 0.7, the feature that comes last in terms of the column ordering in the datasets was also pruned. After the application of these filter methods, the number of features was reduced from 1592 to 282. Finally, the Lasso and Ridge regression regularization methods (whose combination is also called ElasticNet) were applied during the training of the rubric scoring models. Lasso is responsible for pruning features further, while Ridge regression is entrusted with eliminating multicollinearity among features.
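
To make the filter portion of this pipeline concrete, the following is a minimal sketch of the variance and correlation pruning steps. It is illustrative only, not the authors’ implementation, and assumes the 1592 SALAT indices for the 894 unlabeled essays sit in a pandas DataFrame named `X`.

```python
# Minimal sketch of the filter-style feature selection described above (not the authors' code).
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def filter_features(X: pd.DataFrame, var_threshold: float = 0.01, corr_threshold: float = 0.7):
    # 1) Normalize feature values.
    X_norm = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
    # 2) Prune features whose variance falls below the threshold.
    X_norm = X_norm.loc[:, X_norm.var() > var_threshold]
    # 3) For any pair with |Pearson r| > 0.7, drop the feature that comes later in column order.
    corr = X_norm.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return [col for col in X_norm.columns if col not in to_drop]

# The retained features (1592 -> 282 in the study) then feed the MLP models, where
# ElasticNet-style L1/L2 regularization prunes further and curbs multicollinearity.
```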

Hyperparameter Optimization and Training

To ensure a fair evaluation of the potential of deep learning, it is of utmost importance to minimally describe this study’s exploration of the hyperparameter space, a step that is often found to be missing when reporting the outcomes of AES models’ performance (Kumar and Boulanger, 2020). First, a study should list the hyperparameters it is going to investigate by testing various values of each hyperparameter. For example, Table 1 lists all hyperparameters explored in this study. Note that L1 and L2 are two regularization hyperparameters contributing to feature selection. Second, each study should also report the range of values of each hyperparameter. Finally, the strategy to explore the selected hyperparameter subspace should be clearly defined. For instance, given the availability of high-performance computing resources and the time/cost of training AES models, one might favor performing a grid search (a systematic testing of all combinations of hyperparameters and hyperparameter values within a subspace), a random search (randomly selecting a hyperparameter value from a range of values per hyperparameter), or both, by first applying random search to identify a good starting candidate and then grid search to test all possible combinations in the vicinity of the starting candidate’s subspace. Of particular interest to this study is the neural network itself, that is, how many hidden layers a neural network should have and how many neurons should compose each hidden layer and the neural network as a whole. These two variables are directly related to the size of the neural network, with the number of hidden layers being a defining trait of deep learning. A vast swath of literature is silent about the application of interpretable machine learning in AES and even more about measuring its descriptive accuracy, the two components of trustworthiness. Hence, this study pioneers the comprehensive assessment of the impact of deep learning on AES’s predictive and descriptive accuracies.


Table 1. Hyperparameter subspace investigated in this article along with best hyperparameter values per neural network architecture.

Consequently, the 1567 labeled essays were divided into a training set (80%) and a testing set (20%). No validation set was put aside; instead, 5-fold cross-validation was used for hyperparameter optimization. Table 1 delineates the hyperparameter subspace from which 800 different combinations of hyperparameter values were randomly selected out of a subspace of 86,248,800 possible combinations. Since this research proposes to investigate the potential of deep learning to predict rubric scores, several architectures consisting of 2 to 6 hidden layers and ranging from 9,156 to 119,312 parameters were tested. Table 1 shows the best hyperparameter values per depth of neural network.
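
The random search itself can be reproduced in spirit with off-the-shelf tooling. The sketch below uses scikit-learn’s MLPRegressor and RandomizedSearchCV; the candidate layer widths and value ranges are assumptions (Table 1’s actual subspace is much larger), and `X_selected`/`y_style` stand for the 282 retained features and the resolved Style rubric scores.

```python
# Illustrative random search over MLP architectures with 5-fold cross-validation.
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y_style, test_size=0.2, random_state=42)   # 80% training, 20% testing

param_space = {
    # Architectures with 2 to 6 hidden layers of assumed widths.
    "hidden_layer_sizes": [(w,) * d for d in range(2, 7) for w in (32, 64, 128)],
    "alpha": loguniform(1e-5, 1e-1),              # L2 penalty
    "learning_rate_init": loguniform(1e-4, 1e-2),
    "batch_size": [32, 64, 128],
}

search = RandomizedSearchCV(
    MLPRegressor(max_iter=2000),
    param_distributions=param_space,
    n_iter=800,                       # 800 randomly sampled combinations, as in the study
    cv=5,                             # 5-fold cross-validation, no held-out validation set
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
```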

Again, the essays of the testing set were never used during the training and cross-validation processes. In order to retrieve the best predictive models during training, the saved model was overwritten every time the validation loss reached a record low. Training stopped when no new record low had been reached for 100 epochs. Moreover, to avoid reporting the performance of overfit models, each model was trained five times using the same set of best hyperparameter values. Finally, for each resulting predictive model, a corresponding ensemble model (bagging) was also obtained from the five models trained during cross-validation.
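
A checkpointing and early-stopping scheme of this kind can be expressed, for example, with tf.keras callbacks. The snippet below is a sketch under assumed layer sizes and file names, not the authors’ training code.

```python
# Sketch: keep the best model seen so far and stop after 100 epochs without improvement.
import tensorflow as tf

def build_mlp(n_features: int, hidden_layers: int = 2, units: int = 64) -> tf.keras.Model:
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(n_features,))])
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(1))        # predicted rubric score on the 0-6 scale
    model.compile(optimizer="adam", loss="mse")
    return model

callbacks = [
    # Overwrite the saved model every time the validation loss reaches a new record low.
    tf.keras.callbacks.ModelCheckpoint("best_style_model.keras", monitor="val_loss",
                                       save_best_only=True),
    # Stop training once no new record low has been reached for 100 epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=100,
                                     restore_best_weights=True),
]

# model = build_mlp(n_features=282, hidden_layers=6)
# model.fit(X_tr, y_tr, validation_data=(X_val, y_val), epochs=5000, callbacks=callbacks)
```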

Predictive Models and Predictive Accuracy

Table 2 delineates the performance of predictive models trained previously by Kumar and Boulanger (2020) on the four scoring rubrics. The first row lists the agreement levels between the resolved and predicted rubric scores measured by the quadratic weighted kappa. The second row is the percentage of accurate predictions; the third row reports the percentages of predictions that are either accurate or off by 1; and the fourth row reports the percentages of predictions that are either accurate or at most off by 2. Prediction of holistic scores is done merely by adding up all rubric scores. Since the scale of rubric scores is 0−6 for every rubric, the scale of holistic scores is 0−24.
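
The metrics in Table 2 can be computed as follows. This is a generic sketch: `y_true` and `y_pred` denote resolved and (rounded, integer) predicted rubric scores on the 0−6 scale.

```python
# Quadratic weighted kappa plus exact and adjacent agreement for one rubric.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def rubric_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    diff = np.abs(y_true - y_pred)
    return {
        "qwk": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
        "exact": np.mean(diff == 0),        # accurate predictions
        "within_1": np.mean(diff <= 1),     # accurate or off by 1
        "within_2": np.mean(diff <= 2),     # accurate or at most off by 2
    }

# The predicted holistic score (0-24 scale) is simply the sum of the four rubric predictions.
```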


Table 2. Rubric scoring models’ performance on testing set.

While each of these rubric scoring models might suffer from its own systemic bias, and these biases might cancel each other out when the rubric scores are added up to derive the holistic score, this study (unlike related works) intends to highlight these biases by exposing the decision-making process underlying the prediction of rubric scores. Although this paper exclusively focuses on the Style rubric, the methodology put forward to analyze the local and global importance of writing indices and their context-specific contributions to predicted rubric scores is applicable to every rubric and makes it possible to control for these biases one rubric at a time. Comparing and contrasting the role that a specific writing index plays within each rubric context deserves its own investigation, which has been partly addressed in the study led by Kumar and Boulanger (2020). Moreover, this paper underscores the necessity to measure the predictive accuracy of rubric-based holistic scoring using additional metrics to account for these rubric-specific biases. For example, there exist several combinations of rubric scores that yield a holistic score of 16 (e.g., 4-4-4-4 vs. 4-3-4-5 vs. 3-5-2-6). Even though the predicted holistic score might be accurate, the rubric scores could all be inaccurate. Similarity or distance metrics (e.g., Manhattan and Euclidean) should then be used to describe the authenticity of the composition of these holistic scores.
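
A toy illustration of this point: two rubric compositions can produce the same holistic score while differing markedly, which Manhattan and Euclidean distances expose (the numbers are made up).

```python
import numpy as np

true_rubrics = np.array([4, 4, 4, 4])   # holistic score = 16
pred_rubrics = np.array([3, 5, 2, 6])   # holistic score = 16 as well

manhattan = np.abs(true_rubrics - pred_rubrics).sum()              # 6
euclidean = np.sqrt(((true_rubrics - pred_rubrics) ** 2).sum())    # ~3.16
print(manhattan, euclidean)   # identical holistic scores, clearly different compositions
```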

According to what Kumar and Boulanger (2020) report on the performance of several state-of-the-art AES systems trained on ASAP’s seventh essay dataset, the AES system they developed, which is reused in this paper, proved competitive while being fully and deeply interpretable, something no other AES system offers. They also supply further information about the study setting, essay datasets, rubrics, features, natural language processing (NLP) tools, model training, and evaluation against human performance. Again, this paper showcases the application of explainable artificial intelligence in automated essay scoring by focusing on the decision process of the Rubric #3 (Style) scoring model. Remember that the same methodology is applicable to each rubric.

Explanation Model: SHAP

SHapley Additive exPlanations (SHAP) is a theoretically justified XAI framework that can provide both local and global explanations simultaneously (Molnar, 2020); that is, SHAP is able to explain individual predictions taking into account the uniqueness of each prediction, while highlighting the global factors influencing the overall performance of a predictive model. SHAP is of keen interest because it unifies all algorithms of the class of additive feature attribution methods, adhering to a set of three properties that are desirable in interpretable machine learning: local accuracy, missingness, and consistency (Lundberg and Lee, 2017). A key advantage of SHAP is that feature contributions are all expressed in terms of the outcome variable (e.g., rubric scores), providing a common scale on which to compare the importance of features against each other. Local accuracy refers to the fact that no matter the explanation model, the sum of all feature contributions is always equal to the prediction explained by these features. The missingness property implies that the prediction is never explained by unmeasured factors, which are always assigned a contribution of zero. However, the converse is not true; a contribution of zero does not imply an unobserved factor, it can also denote a feature irrelevant to explaining the prediction. The consistency property guarantees that a more important feature will always have a greater magnitude than a less important one, no matter how many other features are included in the explanation model. SHAP proves superior to other additive attribution methods such as LIME (Local Interpretable Model-Agnostic Explanations), Shapley values, and DeepLIFT in that these never comply with all three properties, while SHAP does (Lundberg and Lee, 2017). Moreover, the way SHAP assesses the importance of a feature differs from permutation importance methods (e.g., ELI5), which measure the decrease in model performance (accuracy) as a feature is perturbed, in that it is based on how much a feature contributes to every prediction.

Essentially, a SHAP explanation model (linear regression) is trained on top of a predictive model, which in this case is a complex ensemble deep learning model. Table 3 demonstrates, at a small scale, how SHAP values (feature contributions) work. In this example, there are five instances and five features describing each instance (in the context of this paper, an instance is an essay). Predictions are listed in the second to last column, and the base value is the mean of all predictions. The base value constitutes the reference point according to which predictions are explained; in other words, reasons are given to justify the discrepancy between the individual prediction and the mean prediction (the base value). Notice that the table does not contain the actual feature values; these are SHAP values that quantify the contribution of each feature to the predicted score. For example, the prediction of Instance 1 is 2.46, while the base value is 3.76. Adding the feature contributions of Instance 1 to the base value produces the predicted score of 2.46.


Table 3. Array of SHAP values: local and global importance of features and feature coverage per instance.

Hence, the generic equation of the explanation model (Lundberg and Lee, 2017) is:

$$g(x) = \sigma_0 + \sum_{i=1}^{j} \sigma_i x_i$$

where g(x) is the prediction of an individual instance x, σ_0 is the base value, σ_i is the feature contribution of feature x_i, x_i ∈ {0,1} denotes whether feature x_i is part of the individual explanation, and j is the total number of features. Furthermore, the global importance I_i of feature x_i is calculated by adding up the absolute values of its corresponding SHAP values over all instances, where n is the total number of instances and σ_i^(k) is the feature contribution of feature x_i for instance k (Lundberg et al., 2018):

$$I_i = \sum_{k=1}^{n} \left| \sigma_i^{(k)} \right|$$

Therefore, it can be seen that Feature 3 is the most globally important feature, while Feature 2 is the least important one. Similarly, Feature 5 is Instance 3’s most important feature at the local level, while Feature 2 is the least locally important. The reader should also note that a feature is not necessarily assigned a contribution; some features are simply not part of the explanation, such as Feature 2 and Feature 3 in Instance 2. These concepts lay the foundation for the explainable AES system presented in this paper. Just imagine that each instance (essay) will instead be summarized by 282 features and that the explanations of all the testing set’s 314 essays will be provided.
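
The two equations above can be checked numerically on any matrix of SHAP values; the small example below uses made-up numbers (not Table 3’s).

```python
# Rows are instances (essays), columns are features; entries are SHAP values.
import numpy as np

shap_values = np.array([
    [ 0.40, -0.10,  1.20, -0.30,  0.10],
    [ 0.00,  0.00,  0.00,  0.50, -0.20],
    [-0.60,  0.05, -0.90,  0.20,  1.10],
])
base_value = 3.76

# Local accuracy: base value + sum of an instance's contributions = its prediction.
predictions = base_value + shap_values.sum(axis=1)

# Global importance of each feature: sum of its absolute SHAP values over all instances.
global_importance = np.abs(shap_values).sum(axis=0)
print(predictions, global_importance)
```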

Several implementations of SHAP exist: KernelSHAP, DeepSHAP, GradientSHAP, and TreeSHAP, among others. KernelSHAP is model-agnostic and works for any type of predictive model; however, KernelSHAP is very computing-intensive, which makes it undesirable for practical purposes. DeepSHAP and GradientSHAP are two implementations intended for deep learning which take advantage of the known properties of neural networks (i.e., MLP-NN, CNN, or RNN) to accelerate the processing time to explain predictions by up to three orders of magnitude (Chen et al., 2019). Finally, TreeSHAP is the most powerful implementation, intended for tree-based models. TreeSHAP is not only fast; it is also accurate. While the three former implementations estimate SHAP values, TreeSHAP computes them exactly. Moreover, TreeSHAP not only measures the contribution of individual features, but it also considers interactions between pairs of features and assigns them SHAP values. Since one of the goals of this paper is to assess the potential of deep learning on the performance of both predictive and explanation models, this research tested the former three implementations. TreeSHAP is recommended for future work since the interaction among features is critical information to consider. Moreover, KernelSHAP, DeepSHAP, and GradientSHAP all require access to the whole original dataset to derive the explanation of a new instance, another constraint TreeSHAP is not subject to.
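
With the `shap` library, the three tested implementations can be instantiated roughly as follows. This is a sketch: `model`, `X_train`, and `X_test` are placeholders for the trained ensemble and the feature matrices, and the exact configuration used in the study may differ.

```python
import shap

# KernelSHAP: model-agnostic but computing-intensive; a k-means summary of the
# training set (e.g., 50 centroids) can serve as background data to speed it up.
background = shap.kmeans(X_train, 50)
kernel_explainer = shap.KernelExplainer(model.predict, background)
kernel_shap_values = kernel_explainer.shap_values(X_test)

# DeepSHAP and GradientSHAP: tailored to neural networks and roughly up to three
# orders of magnitude faster than KernelSHAP.
deep_explainer = shap.DeepExplainer(model, X_train)
grad_explainer = shap.GradientExplainer(model, X_train)
deep_shap_values = deep_explainer.shap_values(X_test)
grad_shap_values = grad_explainer.shap_values(X_test)
```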

Descriptive Accuracy: Trustworthiness of Explanation Models

This paper reuses and adapts the methodology introduced by Ribeiro et al. (2016). Several explanation models will be trained, using different SHAP implementations and configurations, per deep learning predictive model (for each number of hidden layers). The rationale consists of randomly selecting and ignoring 25% of the 282 features feeding the predictive model (e.g., turning them to zero). If this causes the prediction to change beyond a specific threshold (in this study 0.10 and 0.25 were tested), then the explanation model should also reflect the magnitude of this change while ignoring the contributions of these same features. For example, the original predicted rubric score of an essay might be 5; however, when ignoring the information brought in by a subset of 70 randomly selected features (25% of 282), the prediction may turn to 4. On the other hand, if the explanation model also predicts a 4 while ignoring the contributions of the same subset of features, then the explanation is considered trustworthy. This makes it possible to compute the precision, recall, and F1-score of each explanation model (from the numbers of true and false positives and true and false negatives). The process is repeated 500 times for every essay to determine the average precision and recall of every explanation model.
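
A single trial of this test can be sketched as follows; function and variable names are illustrative, and the study repeats such trials 500 times per essay and per explainer to estimate precision, recall, and F1-score.

```python
import numpy as np

def trust_trial(model, essay_features, essay_shap_values, base_value,
                threshold=0.25, ignore_fraction=0.25, rng=None):
    rng = rng or np.random.default_rng()
    n_features = essay_features.shape[0]
    ignored = rng.choice(n_features, size=int(ignore_fraction * n_features), replace=False)

    # Prediction side: zero out the ignored features and re-predict.
    masked = essay_features.copy()
    masked[ignored] = 0.0
    pred_full = np.ravel(model.predict(essay_features.reshape(1, -1)))[0]
    pred_masked = np.ravel(model.predict(masked.reshape(1, -1)))[0]
    prediction_changed = abs(pred_full - pred_masked) > threshold

    # Explanation side: remove the same features' contributions from the explanation.
    kept = np.delete(essay_shap_values, ignored)
    expl_masked = base_value + kept.sum()
    explanation_changed = abs(pred_full - expl_masked) > threshold

    # Agreement on "changed" counts as a true positive; agreement on "unchanged" as a true negative.
    return prediction_changed, explanation_changed
```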

Judging Relevancy

So far, the consistency of explanations with predictions has been considered. However, consistent explanations do not imply relevant or meaningful explanations. Put another way, explanations only reflect what predictive models have learned during training. How can the black box of these explanations be opened? Looking directly at the numerical SHAP values of each explanation might seem a daunting task, but there exist tools, mainly visualizations (decision plot, summary plot, and dependence plot), that help make sense of these explanations. However, before visualizing these explanations, another question needs to be addressed: which explanations or essays should be picked for further scrutiny of the AES system? Given the huge number of essays to examine and the tedious task of understanding the underpinnings of a single explanation, a small subset of essays that concisely represents the state of correctness of the underlying predictive model should be carefully picked. Again, this study applies and adapts the methodology in Ribeiro et al. (2016). A greedy algorithm selects essays whose predictions are explained by as many features of global importance as possible to optimize feature coverage. Ribeiro et al. demonstrated in unrelated studies (i.e., sentiment analysis) that the correctness of a predictive model can be assessed with as few as four or five well-picked explanations.

For example, Table 3 reveals the global importance of five features. The square root of each feature’s global importance is also computed and considered instead to limit the influence of a small group of very influential features. The feature coverage of Instance 1 is 100% because all features are engaged in the explanation of the prediction. On the other hand, Instance 2 has a feature coverage of 61.5% because only Features 1, 4, and 5 are part of the prediction’s explanation. The feature coverage is calculated by summing the square roots of the global importances of the features taking part in the explanation and dividing by the sum of the square roots of all features’ global importances:

$$\text{coverage}(E) = \frac{\sum_{i \in E} \sqrt{I_i}}{\sum_{i=1}^{j} \sqrt{I_i}}$$

where E denotes the set of features making up the explanation.

Additionally, it can be seen that Instance 4 does not have any zero SHAP value although its feature coverage is only 84.6%. The algorithm was constrained to discard from the explanation any feature whose contribution (local importance) was too close to zero. In the case of Table 3’s example, any feature whose absolute SHAP value is less than 0.10 is ignored, hence leading to the 84.6% feature coverage reported above.

In this paper’s study, the real threshold was 0.01. This constraint was actually a requirement for the DeepSHAP and GradientSHAP implementations because they only output non-zero SHAP values, contrary to KernelSHAP, which generates explanations with a fixed number of features: a non-zero SHAP value indicates that the feature is part of the explanation, while a zero value excludes the feature from the explanation. Without this parameter, all 282 features would be part of the explanation although a huge number only have a trivial (very close to zero) SHAP value. Now, a much smaller but variable subset of features makes up each explanation. This is one way in which Ribeiro et al.’s SP-LIME algorithm (SP stands for Submodular Pick) has been adapted to this study’s needs. In conclusion, notice how Instance 4 would be selected in preference to Instance 5 to explain Table 3’s underlying predictive model. Even though both instances have four features explaining their prediction, Instance 4’s features are more globally important than Instance 5’s features, and therefore Instance 4 has greater feature coverage than Instance 5.

Whereas Table 3’s example exhibits the feature coverage of one instance at a time, this study computes it for a subset of instances, where the absolute SHAP values are aggregated (summed) per candidate subset. When the sum of absolute SHAP values per feature exceeds the set threshold, the feature is considered covered by the selected set of instances. The objective in this study was to optimize the feature coverage while minimizing the number of essays needed to validate the AES model.
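
The adapted submodular-pick step can be approximated with a simple greedy loop over the testing set’s SHAP values. The sketch below uses illustrative names and the thresholds quoted above; it is not the authors’ implementation.

```python
import numpy as np

def greedy_pick(shap_values: np.ndarray, n_essays: int = 4, contrib_threshold: float = 0.01):
    """shap_values: (n_instances, n_features) SHAP values of the testing set."""
    # Square roots of global importances temper a small group of very influential features.
    sqrt_importance = np.sqrt(np.abs(shap_values).sum(axis=0))
    selected = []
    for _ in range(n_essays):
        best_gain, best_idx = -1.0, None
        for i in range(shap_values.shape[0]):
            if i in selected:
                continue
            # A feature counts as covered once the aggregated |SHAP| of the candidate
            # subset of essays exceeds the threshold.
            covered = np.abs(shap_values[selected + [i]]).sum(axis=0) > contrib_threshold
            gain = sqrt_importance[covered].sum()
            if gain > best_gain:
                best_gain, best_idx = gain, i
        selected.append(best_idx)
    covered = np.abs(shap_values[selected]).sum(axis=0) > contrib_threshold
    coverage = sqrt_importance[covered].sum() / sqrt_importance.sum()
    return selected, coverage
```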

Research Questions

One of this article’s objectives is to assess the potential of deep learning in automated essay scoring. The literature has often claimed (Hussein et al., 2019) that there are two approaches to AES, feature-based and deep learning, as though these two approaches were mutually exclusive. Yet, the literature also puts forward that feature-based AES models may be more interpretable than deep learning ones (Amorim et al., 2018). This paper embraces the viewpoint that these two approaches can also be complementary, by leveraging the state-of-the-art in NLP and automatic linguistic analysis, harnessing one of the richest pools of linguistic indices put forward in the research community (Crossley et al., 2016, 2017, 2019; Kyle, 2016; Kyle et al., 2018), and applying a thorough feature selection process powered by deep learning. Moreover, the ability of deep learning to model complex non-linear relationships makes it particularly well-suited for AES, given that the importance of a writing feature is highly dependent on its context, that is, its interactions with other writing features. Besides, this study leverages the SHAP interpretation method, which is well-suited to interpret very complex models. Hence, this study elected to work with deep learning models and ensembles to test SHAP’s ability to explain these complex models. Previously, the literature has revealed the difficulty of having both accurate and interpretable models at the same time (Ribeiro et al., 2016; Murdoch et al., 2019), where favoring one comes at the expense of the other. However, this research shows how XAI now makes it possible to produce both accurate and interpretable models in the area of AES. Since ensembles have been repeatedly shown to boost the accuracy of predictive models, they were included as part of the tested deep learning architectures to maximize generalizability and accuracy, while making these predictive models interpretable and exploring whether deep learning can even enhance their descriptive accuracy further.

This study investigates the trustworthiness of explanation models, and more specifically, those explaining deep learning predictive models. For instance, does the depth, defined as the number of hidden layers, of an MLP neural network increase the trustworthiness of its SHAP explanation model? The answer to this question will help determine whether it is possible to have very accurate AES models while having competitively interpretable/explainable models, the cornerstone for the generation of formative feedback. Remember that formative feedback is defined as “any kind of information provided to students about their actual state of learning or performance in order to modify the learner’s thinking or behavior in the direction of the learning standards” and that formative feedback “conveys where the student is, what are the goals to reach, and how to reach the goals” (Goldin et al., 2017). This notion contrasts with summative feedback, which basically is “a justification of the assessment results” (Hao and Tsikerdekis, 2019).

As pointed out in the previous section, multiple SHAP implementations are evaluated in this study. Hence, this paper showcases whether the faster DeepSHAP and GradientSHAP implementations are as reliable as the slower KernelSHAP implementation. The answer to this research question will shed light on the feasibility of providing immediate formative feedback, multiple times throughout students’ writing processes.

This study also looks at whether a summary of the data produces explanations as trustworthy as those derived from the original data. This question will be of interest to AES researchers and practitioners because it could make it possible to significantly decrease the processing time of the computing-intensive and model-agnostic KernelSHAP implementation and to test further the potential of customizable explanations.

KernelSHAP makes it possible to specify the total number of features that will shape the explanation of a prediction; for instance, this study experiments with explanations of 16 and 32 features and observes whether there exists a statistically significant difference in the reliability of these explanation models. Knowing this will hint at whether simpler or more complex explanations are more desirable when it comes to optimizing their trustworthiness. If there is no statistically significant difference, then AES practitioners are given further flexibility in the selection of SHAP implementations to find the sweet spot between complexity of explanations and speed of processing. For instance, the KernelSHAP implementation allows the number of factors making up an explanation to be customized, while the faster DeepSHAP and GradientSHAP do not.

Finally, this paper highlights the means to debug and compare the performance of predictive models through their explanations. Once a model is debugged, the process can be reused to fine-tune feature selection and/or feature engineering to improve predictive models and for the generation of formative feedback to both students and teachers.

The training, validation, and testing sets consist of 1567 essays, each of which has been scored by two human raters, who assigned a score between 0 and 3 per rubric (ideas, organization, style, and conventions). In particular, this article looks at the predictive and descriptive accuracy of AES models on the third rubric, style. Note that although each essay has been scored by two human raters, the literature (Shermis, 2014) is not explicit about whether only two or more human raters participated in the scoring of all 1567 essays; given the huge number of essays, it is likely that more than two human raters were involved, so that the amount of noise introduced by the various raters’ biases is unknown, while probably being to some degree balanced between the two groups of raters. Figure 2 shows the confusion matrices of human raters on the Style rubric. The diagonal elements (dark gray) correspond to exact matches, whereas the light gray squares indicate adjacent matches. Figure 2A delineates the number of essays per pair of ratings, and Figure 2B shows the percentages per pair of ratings. The agreement level between each pair of human raters, measured by the quadratic weighted kappa, is 0.54; the percentage of exact matches is 65.3%; the percentage of adjacent matches is 34.4%; and 0.3% of essays are neither exact nor adjacent matches. Figures 2A,B specify the distributions of 0−3 ratings per group of human raters. Figure 2C exhibits the distribution of resolved scores (a resolved score is the sum of the two human ratings). The mean is 3.99 (with a standard deviation of 1.10), and the median and mode are 4. It is important to note that the levels of predictive accuracy reported in this article are measured on the scale of resolved scores (0−6) and that larger scales tend to slightly inflate quadratic weighted kappa values, which must be taken into account when comparing against the level of agreement between human raters. Comparison of percentages of exact and adjacent matches must also be made with this scoring-scale discrepancy in mind.


Figure 2. Summary of the essay dataset (1567 Grade-7 narrative essays) investigated in this study. (A) Number of essays per pair of human ratings; the diagonal (dark gray squares) lists the numbers of exact matches while the light-gray squares list the numbers of adjacent matches; and the bottom row and the rightmost column highlight the distributions of ratings for both groups of human raters. (B) Percentages of essays per pair of human ratings; the diagonal (dark gray squares) lists the percentages of exact matches while the light-gray squares list the percentages of adjacent matches; and the bottom row and the rightmost column highlight the distributions (frequencies) of ratings for both groups of human raters. (C) The distribution of resolved rubric scores; a resolved score is the addition of its two constituent human ratings.

Predictive Accuracy and Descriptive Accuracy

Table 4 compiles the performance outcomes of the 10 predictive models evaluated in this study. The reader should remember that the performance of each model was averaged over five iterations and that two models were trained per number of hidden layers, one non-ensemble and one ensemble. Except for the 6-layer models, there is no clear winner among the models. Even the 6-layer models are superior in terms of exact matches, the primary goal for a reliable AES system, but not according to adjacent matches. Nevertheless, on average ensemble models slightly outperform non-ensemble models. Hence, these ensemble models will be retained for the next analysis step. Moreover, given that five ensemble models were trained per neural network depth, the most accurate model among the five is selected and displayed in Table 4.


Table 4. Performance of majority classifier and average/maximal performance of trained predictive models.

Next, several explanation models are trained per selected ensemble predictive model. Every predictive model is explained by the “Deep,” “Grad,” and “Random” explainers, except for the 6-layer model, where it was not possible to train a “Deep” explainer, apparently due to a bug in the original SHAP code caused by either a unique condition in this study’s data or the neural network architecture. Fixing and investigating this issue was beyond the scope of this study. As will be demonstrated, no statistically significant difference exists between the accuracy of these explainers.

The “Random” explainer serves as a baseline model for comparison purposes. Remember that to evaluate the reliability of explanation models, the concurrent impact of randomly selecting and ignoring a subset of features on the prediction and explanation of rubric scores is analyzed. If the prediction changes significantly and its corresponding explanation changes (beyond a set threshold) accordingly (a true positive), or if the prediction remains within the threshold as does the explanation (a true negative), then the explanation is deemed trustworthy. Hence, the Random explainer simulates random explanations by randomly selecting 32 non-zero features from the original set of 282 features. These random explanations consist only of non-zero features because, according to SHAP’s missingness property, a feature with a zero or a missing value never gets assigned any contribution to the prediction. If at least one of these 32 features is also an element of the subset of ignored features, then the explanation is considered untrustworthy, no matter the size of a feature’s contribution.

As for the 2-layer model, six different explanation models are evaluated. Recall that 2-layer models generated the lowest mean squared error (MSE) during hyperparameter optimization (see Table 1). Hence, this specific type of architecture was selected to test the reliability of these various explainers. The “Kernel” explainer is the most computing-intensive and took approximately 8 h of processing. It was trained using the full distributions of feature values in the training set and shaped explanations in terms of 32 features; the “Kernel-16” and “Kernel-32” models were trained on a summary (50 k-means centroids) of the training set to accelerate the processing by about one order of magnitude (less than 1 h). Besides, the “Kernel-16” explainer derived explanations in terms of 16 features, while the “Kernel-32” explainer explained predictions through 32 features. Table 5 exhibits the descriptive accuracy of these various explanation models according to the 0.10 and 0.25 thresholds; in other words, by ignoring a subset of randomly picked features, it assesses whether or not the prediction and explanation change simultaneously. Note also how each explanation model, no matter the underlying predictive model, outperforms the “Random” model.


Table 5. Precision, recall, and F1 scores of the various explainers tested per type of predictive model.

The first research question addressed in this subsection asks whether there exists a statistically significant difference between the “Kernel” explainer, which generates 32-feature explanations and is trained on the whole training set, and the “Kernel-32” explainer, which also generates 32-feature explanations but is trained on a summary of the training set. To determine this, an independent t-test was conducted using the precision, recall, and F1-score distributions (500 iterations) of both explainers. Table 6 reports the p-values of all the tests for the 0.10 and 0.25 thresholds. It reveals that there is no statistically significant difference between the two explainers.
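
The comparison itself is a standard independent two-sample t-test over the two explainers’ 500-iteration metric distributions; a generic sketch with synthetic numbers is shown below.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
f1_kernel = rng.normal(loc=0.85, scale=0.03, size=500)      # e.g., "Kernel" explainer F1 scores
f1_kernel_32 = rng.normal(loc=0.85, scale=0.03, size=500)   # e.g., "Kernel-32" explainer F1 scores

t_stat, p_value = ttest_ind(f1_kernel, f1_kernel_32)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # p > 0.05 -> no statistically significant difference
```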


Table 6. p-values of independent t-tests comparing whether there exist statistically significant differences between the mean precisions, recalls, and F1-scores of the 2-layer explainers and between those of the 2-layer, 4-layer, and 6-layer Gradient explainers.

The next research question tests whether there exists a difference in the trustworthiness of explainers shaping 16- or 32-feature explanations. Again, t-tests were conducted to verify this. Table 6 lists the resulting p-values. Again, there is no statistically significant difference between the average precisions, recalls, and F1-scores of the two explainers.

This leads to investigating whether the “Kernel,” “Deep,” and “Grad” explainers are equivalent. Table 6 exhibits the results of the t-tests conducted to verify this and reveals that none of the explainers produces a statistically significantly better performance than the others.

Armed with this evidence, it is now possible to verify whether deeper MLP neural networks produce more trustworthy explanation models. For this purpose, the performance of the “Grad” explainer for each type of predictive model will be compared against the others. The same methodology as previously applied is employed here. Table 6, again, confirms that the explanation model of the 2-layer predictive model is statistically significantly less trustworthy than the 4-layer’s explanation model; the same can be said of the 4-layer and 6-layer models. The only exception is the difference in average precision between 2-layer and 4-layer models and between 4-layer and 6-layer models; however, there clearly exists a statistically significant difference in terms of precision (and also recall and F1-score) between 2-layer and 6-layer models.

The Best Subset of Essays to Judge AES Relevancy

Table 7 lists the four best essays optimizing feature coverage (93.9%) along with their resolved and predicted scores. Notice how two of the four essays were picked by the adapted SP-LIME algorithm with some strong disagreement between the human and the machine graders, two were picked with short and trivial text, and two were picked exhibiting perfect agreement between the human and machine graders. Interestingly, each pair of longer and shorter essays exposes both strong agreement and strong disagreement between the human and AI agents, offering an opportunity to debug the model and evaluate its ability to detect the presence or absence of more basic (e.g., very small number of words, occurrences of sentence fragments) and more advanced aspects (e.g., cohesion between adjacent sentences, variety of sentence structures) of narrative essay writing and to appropriately reward or penalize them.


Table 7. Set of best essays to evaluate the correctness of the 6-layer ensemble AES model.

Local Explanation: The Decision Plot

The decision plot lists writing features by order of importance from top to bottom. The line segments display the contribution (SHAP value) of each feature to the predicted rubric score. Note that an actual decision plot consists of all 282 features and that only the top portion of it (the 20 most important features) can be displayed (see Figure 3). A decision plot is read from bottom to top. The line starts at the base value and ends at the predicted rubric score. Given that the “Grad” explainer is the only explainer common to all predictive models, it has been selected to derive all explanations. The decision plots in Figure 3 show the explanations of the four essays in Table 7; the dashed line in these plots represents the explanation of the most accurate predictive model, that is, the ensemble model with 6 hidden layers, which also produced the most trustworthy explanation model. The predicted rubric score of each explanation model is listed in the bottom-right legend. Explanations of the writing features follow in a later subsection.


Figure 3. Comparisons of all models’ explanations of the most representative set of four essays: (A) Essay 228, (B) Essay 68, (C) Essay 219, and (D) Essay 124.

Global Explanation: The Summary Plot

It is advantageous to use SHAP to build explanation models because it provides a single framework to discover the writing features that are important to an individual essay (local) or to a set of essays (global). While the decision plots list features of local importance, Figure 4 ’s summary plot ranks writing features by order of global importance (from top to bottom). All 314 essays of the testing set are represented as dots in the scatterplot of each writing feature. The position of a dot on the horizontal axis corresponds to the importance (SHAP value) of the writing feature for a specific essay, and its color indicates the magnitude of the feature value in relation to the range of all 314 feature values. For example, large or small numbers of words within an essay generally contribute to increasing or decreasing rubric scores by up to 1.5 and 1.0, respectively. Decision plots can also be used to find the most important features for a small subset of essays; Figure 5 demonstrates the new ordering of writing indices when aggregating the feature contributions (summing the absolute SHAP values) of the four essays in Table 7 . Moreover, Figure 5 allows the contributions of a feature to various essays to be compared. Note how the orderings in Figures 3 −5 can differ from each other, sharing many features of global importance while also having their own unique features of local importance.
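Continuing the previous sketch, a global summary plot and a multi-essay decision plot can each be produced in a couple of lines; the variable names below are the same illustrative assumptions carried over from the earlier sketch, and the subset indices are placeholders rather than the actual essay IDs.

```python
import shap

# Global importance: rank features by their SHAP values across all test essays.
shap.summary_plot(shap_values, X_test, feature_names=feature_names, max_display=32)

# Local importance aggregated over a small subset (e.g., the four essays of Table 7).
subset = [0, 1, 2, 3]   # illustrative row indices
shap.decision_plot(base_value, shap_values[subset], X_test[subset],
                   feature_names=feature_names)
```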


Figure 4. Summary plot listing the 32 most important features globally.


Figure 5. Decision plot delineating the best model’s explanations of Essays 228, 68, 219, and 124 (6-layer ensemble).

Definition of Important Writing Indices

The reader should understand that a thorough description of all writing features is beyond the scope of this paper. Nevertheless, the summary and decision plots in Figures 4 , 5 make it possible to identify a subset of features that should be examined in order to validate this study’s predictive model. Supplementary Table 1 combines and describes the 38 features in Figures 4 , 5 .

Dependence Plots

Although the summary plot in Figure 4 is insightful for determining whether small or large feature values are desirable, the dependence plots in Figure 6 prove essential for recommending whether a student should aim at increasing or decreasing the value of a specific writing feature. The dependence plots also reveal whether the student should act directly upon the targeted writing feature or indirectly on other features. The horizontal axis in each of the dependence plots in Figure 6 is the scale of the writing feature, and the vertical axis is the scale of the writing feature’s contributions to the predicted rubric scores. Each dot in a dependence plot represents one of the testing set’s 314 essays, that is, the feature value and SHAP value belonging to that essay. The vertical dispersion of the dots over small intervals of the horizontal axis is indicative of interaction with other features ( Molnar, 2020 ). If the vertical dispersion is widespread (e.g., the [50, 100] horizontal-axis interval in the “word_count” dependence plot), then the contribution of the writing feature is most likely dependent to some degree on other writing feature(s).
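A dependence plot of this kind can be drawn directly from the SHAP values computed earlier; the sketch below reuses the assumed `shap_values`, `X_test`, and `feature_names` and uses the “word_count” index purely as an example.

```python
import shap

# Feature value on the x-axis, SHAP value on the y-axis; SHAP colors each dot
# by the feature with which "word_count" appears to interact most strongly.
shap.dependence_plot("word_count", shap_values, X_test,
                     feature_names=feature_names, interaction_index="auto")
```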


Figure 6. Dependence plots: the horizontal axes represent feature values while vertical axes represent feature contributions (SHAP values). Each dot represents one of the 314 essays of the testing set and is colored according to the value of the feature with which it interacts most strongly. (A) word_count. (B) hdd42_aw. (C) ncomp_stdev. (D) dobj_per_cl. (E) grammar. (F) SENTENCE_FRAGMENT. (G) Sv_GI. (H) adjacent_overlap_verb_sent.

The contributions of this paper can be summarized as follows: (1) it proposes a means (SHAP) to explain individual predictions of AES systems and provides flexible guidelines to build powerful predictive models using more complex algorithms such as ensembles and deep learning neural networks; (2) it applies a methodology to quantitatively assess the trustworthiness of explanation models; (3) it tests whether faster SHAP implementations impact the descriptive accuracy of explanation models, giving insight on the applicability of SHAP in real pedagogical contexts such as AES; (4) it offers a toolkit to debug AES models, highlights linguistic intricacies, and underscores the means to offer formative feedback to novice writers; and more importantly, (5) it empowers learning analytics practitioners to make AI pedagogical agents accountable to the human educator, the ultimate problem holder responsible for the decisions and actions of AI ( Abbass, 2019 ). Basically, learning analytics (which encompasses tools such as AES) is characterized as an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that recurrently measures and proactively advances knowledge boundaries in human learning.

To exemplify this, imagine an AES system that supports instructors in the detection of plagiarism, gaming behaviors, and the marking of writing activities. As previously mentioned, essays are marked according to a grid of scoring rubrics: ideas, organization, style, and conventions. While an abundance of data (e.g., the 1592 writing metrics) can be collected by the AES tool, these data might still be insufficient to automate the scoring process of certain rubrics (e.g., ideas). Nevertheless, some scoring subtasks such as assessing a student’s vocabulary, sentence fluency, and conventions might still be assigned to AI since the data types available through existing automatic linguistic analysis tools prove sufficient to reliably alleviate the human marker’s workload. Interestingly, learning analytics is key for the accountability of AI agents to the human problem holder. As the volume of writing data (through a large student population, high-frequency capture of learning episodes, and variety of big learning data) accumulates in the system, new AI agents (predictive models) may apply for the job of “automarker.” These AI agents can be quite transparent through XAI ( Arrieta et al., 2020 ) explanation models, and a human instructor may assess the suitability of an agent for the job and hire the candidate agent that comes closest to human performance. Explanations derived from these models could serve as formative feedback to the students.

The AI marker can be assigned to assess the writing activities that are similar to those previously scored by the human marker(s) from whom it learns. Dissimilar and unseen essays can be automatically assigned to the human marker for reliable scoring, and the AI agent can learn from this manual scoring. To ensure accountability, students should be allowed to appeal the AI agent’s marking to the human marker. In addition, the human marker should be empowered to monitor and validate the scoring of select writing rubrics scored by the AI marker. If the human marker does not agree with the machine scores, the writing assignments may be flagged as incorrectly scored and re-assigned to a human marker. These flagged assignments may serve to update predictive models. Moreover, among the essays that are assigned to the machine marker, a small subset can be simultaneously assigned to the human marker for continuous quality control; that is, to continue comparing whether the agreement level between human and machine markers remains within an acceptable threshold. The human marker should at any time be able to “fire” an AI marker or “hire” an AI marker from a pool of potential machine markers.

This notion of a human-AI fusion has been observed in previous AES systems where the human marker’s workload has been found to be significantly alleviated, dropping from scoring several hundred essays to just a few dozen ( Dronen et al., 2015 ; Hellman et al., 2019 ). As the AES technology matures and as learning analytics tools continue to penetrate the education market, this alliance of semi-autonomous human and AI agents will lead to better evidence-based/informed pedagogy ( Nelson and Campbell, 2017 ). Such a human-AI alliance can also be guided to autonomously self-regulate its own hypothesis-authoring and data-acquisition processes for purposes of measuring and advancing knowledge boundaries in human learning.

Real-Time Formative Pedagogical Feedback

This paper provides evidence that deep learning and SHAP can be used not only to score essays automatically but also to offer explanations in real time. More specifically, the processing time to derive the 314 explanations of the testing set’s essays has been benchmarked for several types of explainers. It was found that the faster DeepSHAP and GradientSHAP implementations, which took only a few seconds of processing, did not produce less accurate explanations than the much slower KernelSHAP. KernelSHAP took approximately 8 h of processing to derive the explanation model of a 2-layer MLP neural network predictive model and 16 h for the 6-layer predictive model.
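A benchmark of this kind can be reproduced, under assumptions, with a few lines of Python; `model`, `background`, and `X_test` are the same illustrative names used in the earlier sketches, and the loop simply measures how long each SHAP implementation takes to explain the whole testing set.

```python
import time
import shap

# Three SHAP implementations over the same predictive model and data
# (all variable names are assumptions, not the study's actual code).
explainers = {
    "Kernel": shap.KernelExplainer(lambda x: model.predict(x), shap.sample(background, 50)),
    "Deep":   shap.DeepExplainer(model, background),
    "Grad":   shap.GradientExplainer(model, background),
}

timings = {}
for name, explainer in explainers.items():
    start = time.time()
    _ = explainer.shap_values(X_test)   # derive explanations for all test essays
    timings[name] = time.time() - start

print(timings)   # KernelSHAP is expected to be orders of magnitude slower
```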

This finding also holds for various configurations of KernelSHAP, where the number of features (16 vs. 32) shaping the explanation (all other features being assigned zero contributions) did not produce a statistically significant difference in the reliability of the explanation models. On average, the models had a precision between 63.9 and 64.1% and a recall between 41.0 and 42.9%. This means that, after perturbation of the predictive and explanation models, on average 64% of the predictions the explanation model identified as changing were accurate. On the other hand, only about 42% of all predictions that changed were detected by the various 2-layer explainers. An explanation was considered untrustworthy if the sum of its feature contributions, when added to the average prediction (base value), was not within 0.1 of the perturbed prediction. Similarly, the average precision and recall of 2-layer explainers for the 0.25-threshold were about 69% and 62%, respectively.
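The full perturbation protocol is described in the methodology, but its core consistency test reduces to checking whether an explanation’s additive reconstruction stays within a tolerance of the corresponding prediction. The sketch below captures only that check, under assumptions; `shap_values`, `base_value`, and `perturbed_preds` are hypothetical names.

```python
import numpy as np

def is_consistent(shap_row, base_value, prediction, tol=0.10):
    """An explanation is deemed consistent (trustworthy) if the base value plus
    the sum of its feature contributions lies within `tol` of the prediction."""
    reconstructed = base_value + float(np.sum(shap_row))
    return abs(reconstructed - prediction) <= tol

# Flag each test essay; precision and recall are then computed by comparing
# these flags against whether the (perturbed) prediction actually changed.
flags = [is_consistent(sv, base_value, p, tol=0.10)
         for sv, p in zip(shap_values, perturbed_preds)]
```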

Impact of Deep Learning on Descriptive Accuracy of Explanations

By analyzing the performance of the various predictive models in Table 4 , no clear conclusion can be reached as to which model should be deemed the most desirable. Although the 6-layer models slightly outperform the other models in terms of accuracy (percentage of exact matches between the resolved [human] and predicted [machine] scores), they are not the best when it comes to the percentages of adjacent (within 1 and 2) matches. Even if the selection of the “best” model is based on the quadratic weighted kappas, the decision remains a nebulous one. Moreover, ensuring that machine learning actually learned something meaningful remains paramount, especially in contexts where the performance of a majority classifier is close to the human and machine performance. For example, a majority classifier model would get 46.3% of predictions accurate ( Table 4 ), while trained predictive models at best produce accurate predictions between 51.9 and 55.1%.

Since the interpretability of a machine learning model should be prioritized over accuracy ( Ribeiro et al., 2016 ; Murdoch et al., 2019 ) for reasons of transparency and trust, this paper investigated whether the impact of the depth of an MLP neural network might be more visible when assessing its interpretability, that is, the trustworthiness of its corresponding SHAP explanation model. The data in Tables 1 , 5 , 6 effectively support the hypothesis that as the depth of the neural network increases, the precision and recall of the corresponding explanation model improve. This observation is particularly interesting because the 4-layer (Grad) explainer, which has hardly more parameters than the 2-layer model, is also more accurate than the 2-layer model, suggesting that the 6-layer explainer is most likely superior to the other explainers not only because of its greater number of parameters, but also because of its number of hidden layers. By increasing the number of hidden layers, the precision and recall of an explanation model rise on average from approximately 64% to 73% and from 42% to 52%, respectively, for the 0.10-threshold; and, for the 0.25-threshold, from 69% to 79% and from 62% to 75%, respectively.

These results imply that the descriptive accuracy of an explanation model is evidence of effective machine learning, which may exceed the level of agreement between the human and machine graders. Moreover, given that the superiority of a trained predictive model over a majority classifier is not always obvious, the consistency of its associated explanation model demonstrates it more convincingly. Note that, theoretically, the SHAP explanation model of a majority classifier should assign a zero contribution to each writing feature since the average prediction of such a model is actually the most frequent rubric score given by the human raters; hence, the base value is the explanation.

An interesting fact emerges from Figure 3 : all explainers (2-layer to 6-layer) behave more or less similarly. They do not appear to contradict each other. More specifically, they all agree on the direction of the contributions of the most important features. In other words, they unanimously determine whether a feature should increase or decrease the predicted score. However, they differ from each other in the magnitude of the feature contributions.

To conclude, this study highlights the need to train predictive models that consider the descriptive accuracy of explanations. Just as explanation models consider predictions to derive explanations, explanations should be considered when training predictive models. This would not only help train interpretable models from the very first iteration but could also break the status quo that may exist among similar explainers and produce more powerful models. In addition, this research calls for a mechanism (e.g., causal diagrams) to allow teachers to guide the training process of predictive models. Put another way, as LA practitioners debug predictive models, their insights should be encoded in a language that the machine understands and that guides the training process, so that the same errors are not relearned and training time is reduced.

Accountable AES

Now that the superiority of the 6-layer predictive and explanation models has been demonstrated, some aspects of the relevancy of explanations should be examined more deeply, knowing that having an explanation model consistent with its underlying predictive model does not guarantee relevant explanations. Table 7 discloses the set of four essays that optimize the coverage of most globally important features to evaluate the correctness of the best AES model. It is quite intriguing to note that two of the four essays are among the 16 essays that have a major disagreement (off by 2) between the resolved and predicted rubric scores (1 vs. 3 and 4 vs. 2). The AES tool clearly overrated Essay 228, while it underrated Essay 219. Naturally, these two essays offer an opportunity to understand what is wrong with the model and ultimately debug the model to improve its accuracy and interpretability.

In particular, Essay 228 raises suspicion about the positive contributions of features such as “Ortho_N,” “lemma_mattr,” “all_logical,” “det_pobj_deps_struct,” and “dobj_per_cl.” Moreover, notice how the remaining 262 less important features (not visible in the decision plot in Figure 5 ) have already inflated the rubric score beyond the base value, more than for any other essay. Given the very short length and very low quality of the essay, whose meaning is seriously undermined by spelling and grammatical errors, it is of utmost importance to verify how some of these features are computed. For example, is the average number of orthographic neighbors (Ortho_N) per token computed for meaningless tokens such as “R” and “whe”? Similarly, are these tokens counted as types in the type-token ratio over lemmas (lemma_mattr)? Given the absence of a meaningful grammatical structure conveying a complete idea through well-articulated words, it becomes obvious that the quality of NLP (natural language processing) parsing may become a source of (measurement) bias impacting both the way some writing features are computed and the predicted rubric score. To remedy this, two solutions are proposed: (1) enhancing the dataset with the part-of-speech sequence or the structure of dependency relationships along with associated confidence levels, or (2) augmenting the essay dataset with essays containing various types of nonsensical content to improve the learning of these feature contributions.

Note that all four essays have a text length below the average of 171 words. Notice also how the “hdd42_aw” and “hdd42_fw” features play a significant role in decreasing the predicted scores of Essays 228 and 68. The reader should note that these metrics require a minimum of 42 tokens in order to compute a non-zero D index, a measure of lexical diversity explained in Supplementary Table 1 . Figure 6B also shows how zero “hdd42_aw” values are heavily penalized. This is extra evidence of the strong role that the number of words plays in determining these rubric scores, especially for very short essays where it is one of the few observations that can be reliably recorded.

Two other issues with the best trained AES model were identified. First, in the eyes of the model, the lower the average number of direct objects per clause (dobj_per_cl), as seen in Figure 6D , the better. This appears to contradict one of the requirements of the “Style” rubric, which looks for a variety of sentence structures. Remember that direct objects imply the presence of transitive verbs (action verbs) and that the balanced usage of linking verbs and action verbs, as well as of transitive and intransitive verbs, is key to meeting the requirement of variety of sentence structures. Moreover, note that the writing feature counts the number of direct objects per clause, not per sentence. Only one direct object is therefore possible per clause. On the other hand, a sentence may contain several clauses, which determines whether the sentence is a simple, compound, or complex sentence. This also means that a sentence may have multiple direct objects and that a high ratio of direct objects per clause is indicative of sentence complexity. Too much complexity is also undesirable. Hence, it is fair to conclude that the higher range of feature values has reasonable feature contributions (SHAP values), while the lower range does not capture well the requirements of the rubric; the dependence plot should rather display a positive peak somewhere in the middle. Notice how the poor quality of Essay 228’s single sentence prevented the proper detection of its single direct object, “broke my finger,” and the so-called absence of direct objects was one of the reasons the predicted rubric score was wrongfully improved.

The model’s second issue discussed here is the presence of sentence fragments, a type of grammatical error. Essentially, a sentence fragment is a clause that misses one of three critical components: a subject, a verb, or a complete idea. Figure 6E shows the contribution model of grammatical errors, all types combined, while Figure 6F shows specifically the contribution model of sentence fragments. It is interesting to see how SHAP further penalizes larger numbers of grammatical errors and that it takes into account the length of the essay (red dots represent essays with larger numbers of words; blue dots represent essays with smaller numbers of words). For example, except for essays with no identified grammatical errors, longer essays are less penalized than shorter ones. This is particularly obvious when there are 2−4 grammatical errors. The model increases the predicted rubric score only when there is no grammatical error. Moreover, the model tolerates longer essays with only one grammatical error, which sounds quite reasonable. On the other hand, the model deems high numbers of sentence fragments, a non-trivial type of grammatical error, desirable. Even worse, the model decreases the rubric score of essays having no sentence fragments. Although grammatical issues are beyond the scope of the “Style” rubric, the model has probably included these features because of their impact on the quality of the assessment of vocabulary usage and sentence fluency. The reader should observe how the very poor quality of an essay can even prevent the detection of fundamental grammatical errors, as in the case of Essay 228, where the AES tool did not find any grammatical error or sentence fragment. Therefore, there should be a way for AES systems to detect a minimum level of text quality before attempting to score an essay. Note that the objective of this section was not to undertake a thorough debugging of the model, but rather to underscore the effectiveness of SHAP in doing so.

Formative Feedback

Once an AES model is considered reasonably valid, SHAP can be a suitable formalism to empower the machine to provide formative feedback. For instance, the explanation of Essay 124, which has been assigned a rubric score of 3 by both the human and machine markers, indicates that the top two factors contributing to decreasing the predicted rubric score are: (1) the essay length being smaller than average, and (2) the average number of verb lemma types occurring at least once in the next sentence (adjacent_overlap_verb_sent). Figures 6A,H give the overall picture in which the realism of the contributions of these two features can be analyzed. More specifically, Essay 124 is one of very few essays ( Figure 6H ) that make redundant usage of the same verbs across adjacent sentences. Moreover, the essay displays poor sentence fluency, where everything is expressed in only two sentences. To understand more accurately the impact of “adjacent_overlap_verb_sent” on the prediction, a few spelling errors have been corrected and the text has been divided into four sentences instead of two. Revision 1 in Table 8 exhibits the corrections made to the original essay. The decision plot’s dashed line in Figure 3D represents the original explanation of Essay 124, while Figure 7A demonstrates the new explanation of the revised essay. It can be seen that the “adjacent_overlap_verb_sent” feature is still the second most important feature in the new explanation of Essay 124, with a feature value of 0.429, still considered very poor according to the dependence plot in Figure 6H .


Table 8. Revisions of Essay 124: improvement of sentence splitting, correction of some spelling errors, and elimination of redundant usage of same verbs (bold for emphasis in Essay 124’s original version; corrections in bold for Revisions 1 and 2).


Figure 7. Explanations of the various versions of Essay 124 and evaluation of feature effect for a range of feature values. (A) Explanation of Essay 124’s first revision. (B) Forecasting the effect of changing the ‘adjacent_overlap_verb_sent’ feature on the rubric score. (C) Explanation of Essay 124’s second revision. (D) Comparison of the explanations of all Essay 124’s versions.

To show how SHAP could be leveraged to offer remedial formative feedback, the revised version of Essay 124 is explained again for eight different values of “adjacent_overlap_verb_sent” (0, 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1.0), while keeping the values of all other features constant. This set of eight essays is explained by a newly trained SHAP explainer (Gradient), producing new SHAP values for each feature and each “revised” essay. Notice how the new model, called the feedback model, makes it possible to foresee by how much a novice writer can hope to improve his/her score according to the “Style” rubric. If the student employs different verbs in every sentence, the feedback model estimates that the rubric score could be improved from 3.47 up to 3.65 ( Figure 7B ). Notice that the dashed line represents Revision 1, while each of the other lines simulates one of the seven altered essays. Moreover, it is important to note how changing the value of a single feature may influence the contributions that other features have on the predicted score. Again, all explanations look similar in terms of direction, but certain features differ in terms of the magnitude of their contributions. However, the reader should observe how the targeted feature varies not only in magnitude, but also in direction, allowing the student to ponder the relevancy of executing the recommended writing strategy.
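Such a feedback model amounts to sweeping one feature value while holding the rest of the essay’s feature vector fixed, then re-predicting and re-explaining each variant. A minimal sketch, reusing the illustrative `model`, `explainer`, and `feature_names` from the earlier sketches and a hypothetical vector `x_revision1` for the revised essay:

```python
import numpy as np

# Vary only "adjacent_overlap_verb_sent" over eight values, all else constant.
idx = feature_names.index("adjacent_overlap_verb_sent")
values = np.linspace(0.0, 1.0, 8)                 # 0, 0.143, ..., 1.0

variants = np.tile(x_revision1, (len(values), 1))
variants[:, idx] = values

scores = model.predict(variants).ravel()          # forecast rubric scores
contribs = explainer.shap_values(variants)        # one new explanation per variant

for v, s in zip(values, scores):
    print(f"adjacent_overlap_verb_sent = {v:.3f} -> predicted rubric score {s:.2f}")
```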

Thus, upon receiving this feedback, assume that a student sets the goal of improving the effectiveness of his/her verb choice by eliminating any redundant verb, producing Revision 2 in Table 8 . The student submits his/her essay again to the AES system, which gives a new rubric score of 3.98, a significant improvement over the previous 3.47, allowing the student to get a 4 instead of a 3. Figure 7C exhibits the decision plot of Revision 2. To better observe how the various revisions of the student’s essay changed over time, their respective explanations have been plotted in the same decision plot ( Figure 7D ). Notice this time that the ordering of the features has changed to list the features of common importance to all of the essay’s versions. The feature ordering in Figures 7A−C complies with the same ordering as in Figure 3D , the decision plot of the original essay. These figures underscore the importance of tracking the interaction between the various features so that the model understands well the impact that changing one feature has on the others. TreeSHAP, an implementation for tree-based models, offers this capability, and its potential for improving the quality of feedback provided to novice writers will be tested in a future version of this AES system.

This paper serves as a proof of concept of the applicability of XAI techniques in automated essay scoring, providing learning analytics practitioners and educators with a methodology on how to “hire” AI markers and make them accountable to their human counterparts. In addition to debugging predictive models, SHAP explanation models can serve as a formalism within a broader learning analytics platform, where aspects of prescriptive analytics (provision of remedial formative feedback) can be added on top of the more pervasive predictive analytics.

However, the main weakness of the approach put forward in this paper lies in its omission of many types of spatio-temporal data. In other words, it ignores precious information inherent to the writing process, which may prove essential to infer the intent of the student, especially in contexts of poor sentence structure and high grammatical inaccuracy. Hence, this paper calls for adapting current NLP technologies to educational purposes, where the quality of writing may be suboptimal, contrary to many idealized scenarios where NLP is used for content analysis, opinion mining, topic modeling, or fact extraction trained on corpora of high-quality texts. By capturing the writing process preceding the submission of an essay to an AES tool, other kinds of explanation models can also be trained to offer feedback not only from a linguistic perspective but also from a behavioral one (e.g., composing vs. revising); that is, the AES system could inform novice writers about suboptimal and optimal writing strategies (e.g., planning a revision phase after bursts of writing).

In addition, associating sections of text with suboptimal writing features, those whose contributions lower the predicted score, would be much more informative. This spatial information would not only point out what is wrong but also where it is wrong, answering more efficiently the question of why an essay is wrong. This problem could be approached through a multiple-input, mixed-data, feature-based (MLP) neural network architecture fed by both linguistic indices and textual data ( n -grams), where the SHAP explanation model would assign feature contributions to both types of features and any potential interaction between them. A more complex approach could address the problem through special types of recurrent neural networks such as Ordered-Neurons LSTMs (long short-term memory), which are well adapted to the parsing of natural language and capture not only the natural sequence of text but also its hierarchy of constituents ( Shen et al., 2018 ). After all, this paper highlights the fact that the potential of deep learning can reach beyond the training of powerful predictive models and be even more visible in the higher trustworthiness of explanation models. This paper also calls for optimizing the training of predictive models by considering the descriptive accuracy of explanations and the human expert’s qualitative knowledge (e.g., indicating the direction of feature contributions) during the training process.
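The multiple-input, mixed-data architecture evoked above could be prototyped, for example, with the Keras functional API; the layer sizes and input dimensions below are placeholders rather than values from the study, and the n-gram branch is only one possible way to feed textual data alongside the linguistic indices.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Branch 1: handcrafted linguistic indices; Branch 2: n-gram counts.
indices_in = keras.Input(shape=(282,), name="linguistic_indices")
ngrams_in = keras.Input(shape=(5000,), name="ngram_counts")

x1 = layers.Dense(128, activation="relu")(indices_in)
x2 = layers.Dense(128, activation="relu")(ngrams_in)

merged = layers.concatenate([x1, x2])
hidden = layers.Dense(64, activation="relu")(merged)
rubric_score = layers.Dense(1, name="style_rubric")(hidden)   # regression output

model = keras.Model(inputs=[indices_in, ngrams_in], outputs=rubric_score)
model.compile(optimizer="adam", loss="mse")
```

A SHAP explainer applied to such a model would attribute contributions to both the indices and the n-grams, which is what would make the question of where an essay loses points answerable.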

Data Availability Statement

The datasets and code of this study can be found in the Open Science Framework online repository: https://osf.io/fxvru/ .

Author Contributions

VK architected the concept of an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that measures and advances knowledge boundaries in human learning, which essentially defines the key traits of learning analytics. DB was responsible for its implementation in the area of explainable automated essay scoring and for the training and validation of the predictive and explanation models. Together they offer an XAI-based proof of concept of a prescriptive model that can offer real-time formative remedial feedback to novice writers. Both authors contributed to the article and approved its publication.

Research reported in this article was supported by the Academic Research Fund (ARF) publication grant of Athabasca University under award number 24087.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2020.572367/full#supplementary-material

  • ^ https://www.kaggle.com/c/asap-aes
  • ^ https://www.linguisticanalysistools.org/

Abbass, H. A. (2019). Social integration of artificial intelligence: functions, automation allocation logic and human-autonomy trust. Cogn. Comput. 11, 159–171. doi: 10.1007/s12559-018-9619-0


Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160. doi: 10.1109/ACCESS.2018.2870052

Amorim, E., Cançado, M., and Veloso, A. (2018). “Automated essay scoring in the presence of biased ratings,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , New Orleans, LA, 229–237.


Arrieta, A. B., Díaz-Rodríguez, N., Ser, J., Del Bennetot, A., Tabik, S., Barbado, A., et al. (2020). Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inform. Fusion 58, 82–115. doi: 10.1016/j.inffus.2019.12.012

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., et al. (2007). The English lexicon project. Behav. Res. Methods 39, 445–459. doi: 10.3758/BF03193014


Boulanger, D., and Kumar, V. (2018). “Deep learning in automated essay scoring,” in Proceedings of the International Conference of Intelligent Tutoring Systems , eds R. Nkambou, R. Azevedo, and J. Vassileva (Cham: Springer International Publishing), 294–299. doi: 10.1007/978-3-319-91464-0_30

Boulanger, D., and Kumar, V. (2019). “Shedding light on the automated essay scoring process,” in Proceedings of the International Conference on Educational Data Mining , 512–515.

Boulanger, D., and Kumar, V. (2020). “SHAPed automated essay scoring: explaining writing features’ contributions to English writing organization,” in Intelligent Tutoring Systems , eds V. Kumar and C. Troussas (Cham: Springer International Publishing), 68–78. doi: 10.1007/978-3-030-49663-0_10

Chen, H., Lundberg, S., and Lee, S.-I. (2019). Explaining models by propagating Shapley values of local components. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1911.11888 (accessed September 22, 2020).

Crossley, S. A., Bradfield, F., and Bustamante, A. (2019). Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. J. Writ. Res. 11, 251–270. doi: 10.17239/jowr-2019.11.02.01

Crossley, S. A., Kyle, K., and McNamara, D. S. (2016). The tool for the automatic analysis of text cohesion (TAACO): automatic assessment of local, global, and text cohesion. Behav. Res. Methods 48, 1227–1237. doi: 10.3758/s13428-015-0651-7

Crossley, S. A., Kyle, K., and McNamara, D. S. (2017). Sentiment analysis and social cognition engine (SEANCE): an automatic tool for sentiment, social cognition, and social-order analysis. Behav. Res. Methods 49, 803–821. doi: 10.3758/s13428-016-0743-z

Dronen, N., Foltz, P. W., and Habermehl, K. (2015). “Effective sampling for large-scale automated writing evaluation systems,” in Proceedings of the Second (2015) ACM Conference on Learning @ Scale , 3–10.

Goldin, I., Narciss, S., Foltz, P., and Bauer, M. (2017). New directions in formative feedback in interactive learning environments. Int. J. Artif. Intellig. Educ. 27, 385–392. doi: 10.1007/s40593-016-0135-7

Hao, Q., and Tsikerdekis, M. (2019). “How automated feedback is delivered matters: formative feedback and knowledge transfer,” in Proceedings of the 2019 IEEE Frontiers in Education Conference (FIE) , Covington, KY, 1–6.

Hellman, S., Rosenstein, M., Gorman, A., Murray, W., Becker, L., Baikadi, A., et al. (2019). “Scaling up writing in the curriculum: batch mode active learning for automated essay scoring,” in Proceedings of the Sixth (2019) ACM Conference on Learning @ Scale , (New York, NY: Association for Computing Machinery).

Hussein, M. A., Hassan, H., and Nassef, M. (2019). Automated language essay scoring systems: a literature review. PeerJ Comput. Sci. 5:e208. doi: 10.7717/peerj-cs.208

Kumar, V., and Boulanger, D. (2020). Automated essay scoring and the deep learning black box: how are rubric scores determined? Int. J. Artif. Intellig. Educ. doi: 10.1007/s40593-020-00211-5

Kumar, V., Fraser, S. N., and Boulanger, D. (2017). Discovering the predictive power of five baseline writing competences. J. Writ. Anal. 1, 176–226.

Kyle, K. (2016). Measuring Syntactic Development In L2 Writing: Fine Grained Indices Of Syntactic Complexity And Usage-Based Indices Of Syntactic Sophistication. Dissertation, Georgia State University, Atlanta, GA.

Kyle, K., Crossley, S., and Berger, C. (2018). The tool for the automatic analysis of lexical sophistication (TAALES): version 2.0. Behav. Res. Methods 50, 1030–1046. doi: 10.3758/s13428-017-0924-4

Lundberg, S. M., Erion, G. G., and Lee, S.-I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1802.03888 (accessed September 22, 2020).

Lundberg, S. M., and Lee, S.-I. (2017). “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems , eds I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, et al. (Red Hook, NY: Curran Associates, Inc), 4765–4774.

Madnani, N., and Cahill, A. (2018). “Automated scoring: beyond natural language processing,” in Proceedings of the 27th International Conference on Computational Linguistics , (Santa Fe: Association for Computational Linguistics), 1099–1109.

Madnani, N., Loukina, A., von Davier, A., Burstein, J., and Cahill, A. (2017). “Building better open-source tools to support fairness in automated scoring,” in Proceedings of the First (ACL) Workshop on Ethics in Natural Language Processing , (Valencia: Association for Computational Linguistics), 41–52.

McCarthy, P. M., and Jarvis, S. (2010). MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment. Behav. Res. Methods 42, 381–392. doi: 10.3758/brm.42.2.381

Mizumoto, T., Ouchi, H., Isobe, Y., Reisert, P., Nagata, R., Sekine, S., et al. (2019). “Analytic score prediction and justification identification in automated short answer scoring,” in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications , Florence, 316–325.

Molnar, C. (2020). Interpretable Machine Learning . Abu Dhabi: Lulu

Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., and Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. U.S.A. 116, 22071–22080. doi: 10.1073/pnas.1900654116

Nelson, J., and Campbell, C. (2017). Evidence-informed practice in education: meanings and applications. Educ. Res. 59, 127–135. doi: 10.1080/00131881.2017.1314115

Rahimi, Z., Litman, D., Correnti, R., Wang, E., and Matsumura, L. C. (2017). Assessing students’ use of evidence and organization in response-to-text writing: using natural language processing for rubric-based automated scoring. Int. J. Artif. Intellig. Educ. 27, 694–728. doi: 10.1007/s40593-017-0143-2

Reinertsen, N. (2018). Why can’t it mark this one? A qualitative analysis of student writing rejected by an automated essay scoring system. English Austral. 53:52.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should i trust you?”: explaining the predictions of any classifier. CoRR, abs/1602.0. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1602.04938 (accessed September 22, 2020).

Rupp, A. A. (2018). Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl. Meas. Educ. 31, 191–214. doi: 10.1080/08957347.2018.1464448

Rupp, A. A., Casabianca, J. M., Krüger, M., Keller, S., and Köller, O. (2019). Automated essay scoring at scale: a case study in Switzerland and Germany. ETS Res. Rep. Ser. 2019, 1–23. doi: 10.1002/ets2.12249

Shen, Y., Tan, S., Sordoni, A., and Courville, A. C. (2018). Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. CoRR, abs/1810.0. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1810.09536 (accessed September 22, 2020).

Shermis, M. D. (2014). State-of-the-art automated essay scoring: competition, results, and future directions from a United States demonstration. Assess. Writ. 20, 53–76. doi: 10.1016/j.asw.2013.04.001

Taghipour, K. (2017). Robust Trait-Specific Essay Scoring using Neural Networks and Density Estimators. Dissertation, National University of Singapore, Singapore.

West-Smith, P., Butler, S., and Mayfield, E. (2018). “Trustworthy automated essay scoring without explicit construct validity,” in Proceedings of the 2018 AAAI Spring Symposium Series , (New York, NY: ACM).

Woods, B., Adamson, D., Miel, S., and Mayfield, E. (2017). “Formative essay feedback using predictive scoring models,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , (New York, NY: ACM), 2071–2080.

Keywords : explainable artificial intelligence, SHAP, automated essay scoring, deep learning, trust, learning analytics, feedback, rubric

Citation: Kumar V and Boulanger D (2020) Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value. Front. Educ. 5:572367. doi: 10.3389/feduc.2020.572367

Received: 14 June 2020; Accepted: 09 September 2020; Published: 06 October 2020.


Copyright © 2020 Kumar and Boulanger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: David Boulanger, [email protected]

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.


What is Automated Essay Scoring?

Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment.  In fact, it’s been around far longer than “machine learning” and “artificial intelligence” have been buzzwords in the general public!  The field of psychometrics has been doing such groundbreaking work for decades.

So how does AES work, and how can you apply it?

What is automated essay scoring?

The first and most critical thing to know is that there is not an algorithm that “reads” the student essays.  Instead, you need to train an algorithm.  That is, if you are a teacher and don’t want to grade your essays, you can’t just throw them in an essay scoring system.  You have to  actually grade the essays (or at least a large sample of them) and then use that data to fit a machine learning algorithm.  Data scientists use the term train the model , which sounds complicated, but if you have ever done simple linear regression, you have experience with training models.

There are three steps for automated essay scoring:

  • Establish your data set. Begin by gathering a substantial collection of student essays, ensuring a diverse range of topics and writing styles. Each essay should be meticulously graded by human experts to create a reliable and accurate benchmark. This data set forms the foundation of your automated scoring system, providing the necessary examples for the machine learning model to learn from.
  • Determine the features. Identify the key features that will serve as predictor variables in your model. These features might include grammar, syntax, vocabulary usage, coherence, structure, and argument strength. Carefully selecting these attributes is crucial as they directly impact the model’s ability to assess essays accurately. The goal is to choose features that are indicative of overall writing quality and are relevant to the scoring criteria.
  • Train the machine learning model. Use the established data set and selected features to train your machine learning model. This involves feeding the graded essays into the model, allowing it to learn the relationship between the features and the assigned grades. Through iterative training and validation processes, the model adjusts its algorithms to improve accuracy. Continuous refinement and testing ensure that the model can reliably score new, unseen essays with a high degree of precision.

Here’s an extremely oversimplified example:

  • You have a set of 100 student essays, which you have scored on a scale of 0 to 5 points.
  • The essay is on Napoleon Bonaparte, and you want students to know certain facts, so you want to give them “credit” in the model if they use words like: Corsica, Consul, Josephine, Emperor, Waterloo, Austerlitz, St. Helena.  You might also add other Features such as Word Count, number of grammar errors, number of spelling errors, etc.
  • You create a map of which students used each of these words, as 0/1 indicator variables.  You can then fit a multiple regression with 7 predictor variables (did they use each of the 7 words) and the 5 point scale as your criterion variable.  You can then use this model to predict each student’s score from just their essay text.

Obviously, this example is too simple to be of use, but the same general idea is done with massive, complex studies.  The establishment of the core features (predictive variables) can be much more complex, and models are going to be much more complex than multiple regression (neural networks, random forests, support vector machines).
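A hedged sketch of that toy setup in Python, using scikit-learn: the keyword list comes from the example above, while `essays` and `scores` (the 100 texts and their 0-5 human scores) are assumed to exist.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

keywords = ["corsica", "consul", "josephine", "emperor",
            "waterloo", "austerlitz", "st. helena"]

def keyword_indicators(essay_text):
    """0/1 indicator for each keyword: the predictor variables of the toy model."""
    text = essay_text.lower()
    return [int(k in text) for k in keywords]

# essays: list of 100 essay strings; scores: their 0-5 human scores (assumed).
X = np.array([keyword_indicators(e) for e in essays])
y = np.array(scores)

model = LinearRegression().fit(X, y)     # "training the model"
predicted = model.predict(X)             # predicted score for each essay
```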

Here’s an example of the very start of a data matrix for features, from actual student essays.  Imagine that you also have data on the final scores, 0 to 5 points.  You can see how this is then a regression situation.

Examinee   Word Count   i_have   best_jump   move   and_that   the_kids   well
1          307          0        1           2      0          0          1
2          164          0        0           1      0          0          0
3          348          1        0           1      0          0          0
4          371          0        1           1      0          0          0
5          446          0        0           0      0          0          2
6          364          1        0           0      0          1          1

How do you score the essay?

If the essays are on paper, then automated essay scoring won’t work unless you have extremely good character-recognition software that converts them into a digital database of text.  Most likely, you have delivered the exam as an online assessment and already have the database.  If so, your platform should include functionality to manage the scoring process, including multiple custom rubrics.  An example of our FastTest platform is provided below.

(Screenshot: essay marking in the FastTest platform)

Some rubrics you might use:

  • Supporting arguments
  • Organization
  • Vocabulary / word choice

How do you pick the Features?

This is one of the key research problems.  In some cases, it might be something similar to the Napoleon example.  Suppose you had a complex item on Accounting, where examinees review reports and spreadsheets and need to summarize a few key points.  You might pull out a few key terms (mortgage amortization) or numbers (2.375%) and consider them to be Features.  I saw a presentation at Innovations In Testing 2022 that did exactly this.  Think of it as giving the students “points” for using those keywords, though because you are using complex machine learning models, it is not simply a single unit point per keyword; each one contributes toward a regression-like model with a positive slope.

In other cases, you might not know.  Maybe it is an item on an English test being delivered to English language learners, and you ask them to write about what country they want to visit someday.  You have no idea what they will write about.  But what you can do is tell the algorithm to find the words or terms that are used most often, and try to predict the scores with those.  Maybe words like “jetlag” or “edification” show up in students that tend to get high scores, while words like “clubbing” or “someday” tend to be used by students with lower scores.  The AI might also pick up on spelling errors.  I worked as an essay scorer in grad school, and I can’t tell you how many times I saw kids use “ludacris” (name of an American rap artist) instead of “ludicrous” when trying to describe an argument.  They had literally never seen the word used or spelled correctly.  Maybe the AI model learns to give that a negative weight.  That’s the next section!
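Here’s one hedged way that idea could look in code, using scikit-learn to let the most frequent terms become the features and a regularized regression learn their weights; `essays` and `scores` are assumed to exist, and the thresholds are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Let the data suggest the features: keep the most frequent terms and let the
# model learn which ones carry positive or negative weight.
vectorizer = CountVectorizer(max_features=500, ngram_range=(1, 2))
X = vectorizer.fit_transform(essays)          # essays: list of essay strings (assumed)

model = Ridge(alpha=1.0).fit(X, scores)       # scores: human-assigned scores (assumed)

# Large positive or negative coefficients flag terms (perhaps "edification",
# perhaps a common misspelling) associated with higher or lower scores.
weights = sorted(zip(vectorizer.get_feature_names_out(), model.coef_),
                 key=lambda pair: pair[1])
print(weights[:10])    # most negative terms
print(weights[-10:])   # most positive terms
```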

How do you train a model?

Well, if you are familiar with data science, you know there are TONS of models, and many of them have a bunch of parameterization options.  This is where more research is required.  What model works the best on your particular essay, and doesn’t take 5 days to run on your data set?  That’s for you to figure out.  There is a trade-off between simplicity and accuracy.  Complex models might be accurate but take days to run.  A simpler model might take 2 hours but with a 5% drop in accuracy.  It’s up to you to evaluate.

If you have experience with Python and R, you know that there are many packages which provide this analysis out of the box – it is a matter of selecting a model that works.
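As a rough illustration of that trade-off, the sketch below cross-validates a simple and a more complex scikit-learn model on the same data and times each one; `X` and `y` are assumed to be your feature matrix and human scores.

```python
import time
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

candidates = [
    ("ridge", Ridge()),                                           # fast, simple baseline
    ("random_forest", RandomForestRegressor(n_estimators=300)),   # slower, more flexible
]

for name, model in candidates:
    start = time.time()
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {r2:.3f}, time = {time.time() - start:.1f}s")
```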

How effective is automated essay scoring?

Well, as psychometricians love to say, “it depends.”  You need to do the model fitting research for each prompt and rubric.  It will work better for some than others.  The general consensus in research is that AES algorithms work as well as a second human, and therefore serve very well in that role.  But you shouldn’t use them as the only score; of course, that’s impossible in many cases.

Here’s a graph from some research we did on our algorithm, showing the correlation of human to AES.  The three lines are for the proportion of sample used in the training set; we saw decent results from only 10% in this case!  Some of the models correlated above 0.80 with humans, even though this is a small data set.   We found that the Cubist model took a fraction of the time needed by complex models like Neural Net or Random Forest; in this case it might be sufficiently powerful.

(Graph: correlations between human and automated scores for several models and training-set proportions)

How can I implement automated essay scoring without writing code from scratch?

There are several products on the market.  Some are standalone, some are integrated with a human-based essay scoring platform.  ASC’s platform for automated essay scoring is SmartMarq; click here to learn more.  It currently uses a standalone approach like you see below, making it extremely easy to use.  It is also in the process of being integrated into our online assessment platform, alongside human scoring, to provide an efficient and easy way of obtaining a second or third rater for QA purposes.

Want to learn more?  Contact us to request a demonstration .

(Screenshot: SmartMarq automated essay scoring)


Nathan Thompson, PhD


e-rater® Scoring Engine

Evaluates students’ writing proficiency with automatic scoring and feedback

Select an option below to learn more.

How the e-rater engine uses AI technology

ETS is a global leader in educational assessment, measurement and learning science. Our AI technology, such as the e-rater ® scoring engine, informs decisions and creates opportunities for learners around the world.

The e-rater engine automatically:

  • assesses and nurtures key writing skills
  • scores essays and provides feedback on writing using a model built on the theory of writing to assess both analytical and independent writing skills

About the e-rater Engine

This ETS capability identifies features related to writing proficiency.

How It Works

See how the e-rater engine provides scoring and writing feedback.

Custom Applications

Use standard prompts or develop your own custom model with ETS’s expertise.

Use in Criterion ® Service

Learn how the e-rater engine is used in the Criterion ® Service.

FEATURED RESEARCH

E-rater as a Quality Control on Human Scores



Ready to begin? Contact us to learn how the e-rater service can enhance your existing program.


Automated language essay scoring systems: a literature review

Mohamed Abdellatif Hussein

1 Information and Operations, National Center for Examination and Educational Evaluation, Cairo, Egypt

Hesham Hassan

2 Faculty of Computers and Information, Computer Science Department, Cairo University, Cairo, Egypt

Mohammad Nassef

Associated Data

The following information was supplied regarding data availability:

As this is a literature review, there was no raw data.

Writing composition is a significant factor for measuring test-takers’ ability in any language exam. However, the assessment (scoring) of these writing compositions or essays is a very challenging process in terms of reliability and time. The demand for objective and quick scores has raised the need for a computer system that can automatically grade essay questions targeting specific prompts. Automated Essay Scoring (AES) systems are used to overcome the challenges of scoring writing tasks by using Natural Language Processing (NLP) and machine learning techniques. The purpose of this paper is to review the literature on the AES systems used for grading essay questions.

Methodology

We have reviewed the existing literature using Google Scholar, EBSCO and ERIC to search for the terms “AES”, “Automated Essay Scoring”, “Automated Essay Grading”, or “Automatic Essay” for essays written in English language. Two categories have been identified: handcrafted features and automatically featured AES systems. The systems of the former category are closely bonded to the quality of the designed features. On the other hand, the systems of the latter category are based on the automatic learning of the features and relations between an essay and its score without any handcrafted features. We reviewed the systems of the two categories in terms of system primary focus, technique(s) used in the system, the need for training data, instructional application (feedback system), and the correlation between e-scores and human scores. The paper includes three main sections. First, we present a structured literature review of the available Handcrafted Features AES systems. Second, we present a structured literature review of the available Automatic Featuring AES systems. Finally, we draw a set of discussions and conclusions.

AES models have been found to utilize a broad range of manually-tuned shallow and deep linguistic features. AES systems have many strengths: reducing labor-intensive marking activities, ensuring a consistent application of scoring criteria, and ensuring the objectivity of scoring. Although many techniques have been implemented to improve AES systems, three primary challenges have been identified: the lack of the sense of the rater as a person, the potential for the systems to be deceived into giving an essay a lower or higher score than it deserves, and the limited ability to assess the creativity of ideas and propositions and evaluate their practicality. Existing techniques have only addressed the first two challenges.

Introduction

Test items (questions) are usually classified into two types: selected-response (SR), and constructed-response (CR). The SR items, such as true/false, matching or multiple-choice, are much easier than the CR items in terms of objective scoring ( Isaacs et al., 2013 ). SR questions are commonly used for gathering information about knowledge, facts, higher-order thinking, and problem-solving skills. However, considerable skill is required to develop test items that measure analysis, evaluation, and other higher cognitive skills ( Stecher et al., 1997 ).

CR items, sometimes called open-ended, include two sub-types: restricted-response and extended-response items ( Nitko & Brookhart, 2007 ). Extended-response items, such as essays, problem-based examinations, and scenarios, are like restricted-response items except that they extend the demands made on test-takers to include more complex situations, more difficult reasoning, and higher levels of understanding. They are based on real-life situations that require test-takers to apply their knowledge and skills to new settings or situations ( Isaacs et al., 2013 ).

In language tests, test-takers are usually required to write an essay about a given topic. Human raters score these essays based on specific scoring rubrics or schemes. The scores assigned to an essay by different human raters often vary substantially because human scoring is subjective ( Peng, Ke & Xu, 2012 ). As human scoring takes much time and effort and is not always as objective as required, there is a need for automated essay scoring systems that reduce cost and time and determine accurate and reliable scores.

Automated Essay Scoring (AES) systems usually utilize Natural Language Processing and machine learning techniques to automatically rate essays written for a target prompt ( Dikli, 2006 ). Many AES systems have been developed over the past decades. They focus on automatically analyzing the quality of the composition and assigning a score to the text. Typically, AES models exploit a wide range of manually-tuned shallow and deep linguistic features ( Farag, Yannakoudakis & Briscoe, 2018 ). Recent advances in the deep learning approach have shown that applying neural network approaches to AES systems has accomplished state-of-the-art results ( Page, 2003 ; Valenti, Neri & Cucchiarelli, 2017 ) with the additional benefit of using features that are automatically learnt from the data.

Survey methodology

The purpose of this paper is to review the AES systems literature pertaining to scoring extended-response items in language writing exams. Using Google Scholar, EBSCO and ERIC, we searched the terms “AES”, “Automated Essay Scoring”, “Automated Essay Grading”, or “Automatic Essay” for essays written in English language. AES systems which score objective or restricted-response items are excluded from the current research.

The most common models found for AES systems are based on Natural Language Processing (NLP), Bayesian text classification, Latent Semantic Analysis (LSA), or Neural Networks. We have categorized the reviewed AES systems into two main categories. The former is based on handcrafted discrete features bounded to specific domains. The latter is based on automatic feature extraction. For instance, Artificial Neural Network (ANN)-based approaches are capable of automatically inducing dense syntactic and semantic features from a text.

The literature of the two categories has been structurally reviewed and evaluated based on certain factors including: system primary focus, technique(s) used in the system, the need for training data, instructional application (feedback system), and the correlation between e-scores and human scores.

Handcrafted features AES systems

Project Essay Grader™ (PEG)

Ellis Page developed PEG in 1966; it is considered the earliest AES system built in this field. It uses correlation coefficients to predict the intrinsic quality of a text, relying on the terms “trins” and “proxes” to assign a score. Whereas “trins” refers to intrinsic variables such as diction, fluency, punctuation, and grammar, “proxes” refers to observable correlates of those intrinsic variables, such as the average word length in a text and/or the text length ( Dikli, 2006 ; Valenti, Neri & Cucchiarelli, 2017 ).

PEG uses a simple two-stage scoring methodology: a training stage followed by a scoring stage. PEG is trained on a sample of 100 to 400 essays; the output of the training stage is a set of coefficients (β weights) for the proxy variables in the regression equation. In the scoring stage, the proxes are identified for each essay and inserted into the prediction equation. Finally, a score is determined by applying the β weights estimated in the training stage ( Dikli, 2006 ).
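
To make the two-stage idea concrete, the sketch below trains a linear regression on a handful of proxy-style features and applies it to a new essay; the features, toy data, and library choice are illustrative and are not PEG's actual proxes or implementation.

```python
# Minimal sketch of PEG-style two-stage scoring: estimate beta weights on
# proxy features of pre-scored essays (training stage), then apply them to a
# new essay (scoring stage). The proxy features are illustrative stand-ins.
import re
import numpy as np
from sklearn.linear_model import LinearRegression

def proxy_features(essay: str) -> list[float]:
    words = re.findall(r"[A-Za-z']+", essay)
    n_words = len(words) or 1
    return [
        float(n_words),                                # text length
        sum(len(w) for w in words) / n_words,          # average word length
        float(essay.count(",") + essay.count(";")),    # punctuation usage
    ]

# Training stage: fit the regression equation on pre-scored essays.
train_essays = ["First sample essay text ...", "A second, somewhat longer sample essay text ..."]
train_scores = [3.0, 4.0]
X = np.array([proxy_features(e) for e in train_essays])
model = LinearRegression().fit(X, train_scores)        # coefficients play the role of beta weights

# Scoring stage: extract proxes for a new essay and apply the prediction equation.
new_essay = "An unseen essay to be scored ..."
predicted_score = model.predict([proxy_features(new_essay)])[0]
print(round(predicted_score, 2))
```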

Several criticisms have been levelled at PEG, such as disregarding the semantic side of essays, focusing on surface structures, and not working effectively when student responses are received directly (which might ignore writing errors). A modified version of PEG was released in 1990 that focuses on grammar checking, with a correlation between human assessors and the system of r = 0.87 ( Dikli, 2006 ; Page, 1994 ; Refaat, Ewees & Eisa, 2012 ).

Measurement Inc. acquired the rights of PEG in 2002 and continued to develop it. The modified PEG analyzes the training essays and calculates more than 500 features that reflect intrinsic characteristics of writing, such as fluency, diction, grammar, and construction. Once the features have been calculated, the PEG uses them to build statistical and linguistic models for the accurate prediction of essay scores ( Home—Measurement Incorporated, 2019 ).

Intelligent Essay Assessor™ (IEA)

IEA was developed by Landauer (2003) . IEA uses a statistical combination of several measures to produce an overall score. It relies on Latent Semantic Analysis (LSA), a machine-learning model of human understanding of text whose performance depends on the training and calibration methods of the model and the ways it is used tutorially ( Dikli, 2006 ; Foltz, Gilliam & Kendall, 2003 ; Refaat, Ewees & Eisa, 2012 ).

IEA can handle students’ innovative answers by using a mix of scored essays and domain content text in the training stage. It also spots plagiarism and provides feedback ( Dikli, 2006 ; Landauer, 2003 ). Scores are assigned through a process that begins with comparing the essays in a set to each other. LSA detects extremely similar essays: irrespective of paraphrasing, synonym replacement, or reorganization of sentences, two such essays will appear similar to LSA. Plagiarism detection is an essential feature for addressing academic dishonesty, which is difficult for human-raters to detect, especially when grading a large number of essays ( Dikli, 2006 ; Landauer, 2003 ). Figure 1 represents the IEA architecture ( Landauer, 2003 ). IEA requires a smaller number of pre-scored essays for training than other AES systems: only about 100 pre-scored training essays per prompt, versus 300–500 for other systems ( Dikli, 2006 ).

Figure 1: IEA architecture ( Landauer, 2003 ).

Landauer (2003) used IEA to score more than 800 students’ answers in middle school. The results showed a correlation of 0.90 between IEA and the human-raters. He attributed this high correlation to several reasons, including that human-raters could not compare each of the 800 essays to one another, while IEA can do so ( Dikli, 2006 ; Landauer, 2003 ).
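
The core LSA operation that IEA builds on, projecting essays into a low-dimensional semantic space and comparing them there, can be sketched as follows; the pipeline, toy essays, and nearest-neighbour scoring rule are illustrative and are not IEA's proprietary implementation.

```python
# Minimal sketch of LSA-based scoring: embed pre-scored essays in a latent
# semantic space and relate a new essay to its most similar neighbour.
# Toy data and component count are illustrative, not IEA's configuration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

scored_essays = [
    "The water cycle moves water between oceans, air and land ...",
    "Evaporation and condensation drive precipitation ...",
    "My summer holiday was fun and I played football every day ...",
]
scores = [5.0, 4.0, 2.0]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(scored_essays)
svd = TruncatedSVD(n_components=2, random_state=0)      # latent semantic space
latent = svd.fit_transform(tfidf)

new_essay = "Rain forms when evaporated water condenses in clouds ..."
new_latent = svd.transform(vectorizer.transform([new_essay]))

# Compare the new essay to the pre-scored essays in the latent space.
sims = cosine_similarity(new_latent, latent)[0]
nearest = int(np.argmax(sims))
print(f"Most similar pre-scored essay: #{nearest}, suggested score: {scores[nearest]}")
```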

E-rater ®

Educational Testing Services (ETS) developed E-rater in 1998 to estimate the quality of essays in various assessments. It relies on using a combination of statistical and NLP techniques to extract linguistic features (such as grammar, usage, mechanics, development) from text to start processing, then compares scores with human graded essays ( Attali & Burstein, 2014 ; Dikli, 2006 ; Ramineni & Williamson, 2018 ).

The E-rater system is upgraded annually. The current version uses 11 features divided into two areas: writing quality (grammar, usage, mechanics, style, organization, development, word choice, average word length, proper prepositions, and collocation usage), and content or use of prompt-specific vocabulary ( Ramineni & Williamson, 2018 ).

The E-rater scoring model consists of two stages: the model of the training stage, and the model of the evaluation stage. Human scores are used for training and evaluating the E-rater scoring models. The quality of the E-rater models and its effective functioning in an operational environment depend on the nature and quality of the training and evaluation data ( Williamson, Xi & Breyer, 2012 ). The correlation between human assessors and the system ranged from 0.87 to 0.94 ( Refaat, Ewees & Eisa, 2012 ).

Criterion SM

Criterion is a web-based scoring and feedback system based on ETS text analysis tools: E-rater ® and Critique. As a text analysis tool, Critique integrates a collection of modules that detect faults in usage, grammar, and mechanics, and recognizes discourse and undesirable style elements in writing. It provides immediate holistic scores as well ( Crozier & Kennedy, 1994 ; Dikli, 2006 ).

Criterion similarly gives personalized diagnostic feedback reports based on the types of assessment instructors give when they comment on students’ writings. This component of Criterion is called the advisory component. It is added to the score, but it does not control it [18]. The types of feedback the advisory component may provide include the following:

  • The text is too brief (a student may write more).
  • The essay text does not look like other essays on the topic (the essay is off-topic).
  • The essay text is overly repetitive (the student may use more synonyms) ( Crozier & Kennedy, 1994 ).

IntelliMetric™

Vantage Learning developed the IntelliMetric systems in 1998. It is considered the first AES system which relies on Artificial Intelligence (AI) to simulate the manual scoring process carried out by human-raters under the traditions of cognitive processing, computational linguistics, and classification ( Dikli, 2006 ; Refaat, Ewees & Eisa, 2012 ).

IntelliMetric relies on using a combination of Artificial Intelligence (AI), Natural Language Processing (NLP) techniques, and statistical techniques. It uses CogniSearch and Quantum Reasoning technologies that were designed to enable IntelliMetric to understand the natural language to support essay scoring ( Dikli, 2006 ).

IntelliMetric uses three steps to score essays as follows:

  • a) First, the training step provides the system with essays of known scores.
  • b) Second, the validation step examines the scoring model against a smaller set of essays with known scores.
  • c) Finally, the model is applied to new essays with unknown scores ( Learning, 2000 ; Learning, 2003 ; Shermis & Barrera, 2002 ).

IntelliMetric identifies text related characteristics as larger categories called Latent Semantic Dimensions (LSD). ( Figure 2 ) represents the IntelliMetric features model.

Figure 2: The IntelliMetric features model.

IntelliMetric scores essays in several languages including English, French, German, Arabic, Hebrew, Portuguese, Spanish, Dutch, Italian, and Japanese ( Elliot, 2003 ). According to Rudner, Garcia, and Welch ( Rudner, Garcia & Welch, 2006 ), the average of the correlations between IntelliMetric and human-raters was 0.83 ( Refaat, Ewees & Eisa, 2012 ).

MY Access is a web-based writing assessment system based on the IntelliMetric AES system. The primary aim of this system is to provide immediate scoring and diagnostic feedback for the students’ writings in order to motivate them to improve their writing proficiency on the topic ( Dikli, 2006 ).

MY Access system contains more than 200 prompts that assist in an immediate analysis of the essay. It can provide personalized Spanish and Chinese feedback on several genres of writing such as narrative, persuasive, and informative essays. Moreover, it provides multilevel feedback—developing, proficient, and advanced—as well ( Dikli, 2006 ; Learning, 2003 ).

Bayesian Essay Test Scoring System™ (BETSY)

BETSY classifies text based on trained material. It was developed in 2002 by Lawrence Rudner at the College Park campus of the University of Maryland with funds from the US Department of Education ( Valenti, Neri & Cucchiarelli, 2017 ). It was designed to automate essay scoring, but it can be applied to any text classification task ( Taylor, 2005 ).

BETSY needs to be trained on a large number (about 1,000 texts) of human-classified essays to learn how to classify new essays. The goal of the system is to determine the most likely classification of an essay into a set of groups, such as (Pass–Fail) or (Advanced–Proficient–Basic–Below Basic) ( Dikli, 2006 ; Valenti, Neri & Cucchiarelli, 2017 ). It learns how to classify a new document through the following steps:

The first step, word training, is concerned with training on individual words: evaluating database statistics, eliminating infrequent words, and determining stop words.

The second step, word-pair training, is concerned with evaluating database statistics, eliminating infrequent word pairs, optionally scoring the training set, and trimming misclassified training samples.

Finally, BETSY can be applied to a set of experimental texts to identify the classification precision for several new texts or a single text ( Dikli, 2006 ).

BETSY achieved an accuracy of over 80% when trained on 462 essays and tested on 80 essays ( Rudner & Liang, 2002 ).
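
The underlying idea of BETSY, Bayesian text classification over word and word-pair features, can be sketched with a standard multinomial Naive Bayes classifier; the toy essays and categories below are illustrative only, and the system's feature-trimming steps are omitted.

```python
# Minimal sketch of BETSY-style Bayesian essay classification: word and
# word-pair (bigram) features feed a multinomial Naive Bayes model that
# assigns each essay to a score category. Toy data are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_essays = [
    "A clearly organised argument with varied vocabulary and strong evidence ...",
    "Good structure and mostly accurate grammar, but some repetition ...",
    "Short answer, little structure, frequent grammar mistakes ...",
    "Off topic response with very limited vocabulary ...",
]
train_labels = ["Advanced", "Proficient", "Basic", "Below Basic"]

classifier = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=1),  # words and word pairs
    MultinomialNB(),
)
classifier.fit(train_essays, train_labels)

new_essay = "A structured essay with some grammar slips but clear evidence ..."
print(classifier.predict([new_essay])[0])            # most likely category
print(classifier.predict_proba([new_essay]))         # posterior over categories
```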

Automatic featuring AES systems

Automatic text scoring using neural networks

Alikaniotis, Yannakoudakis, and Rei introduced in 2016 a deep neural network model capable of learning features automatically to score essays. This model has introduced a novel method to identify the more discriminative regions of the text using: (1) a Score-Specific Word Embedding (SSWE) to represent words and (2) a two-layer Bidirectional Long-Short-Term Memory (LSTM) network to learn essay representations. ( Alikaniotis, Yannakoudakis & Rei, 2016 ; Taghipour & Ng, 2016 ).

Alikaniotis and his colleagues extended the C&W embeddings model into the Augmented C&W model to capture not only the local linguistic environment of each word, but also how each word contributes to the overall score of an essay. In order to capture SSWEs, a further linear unit was added to the output layer of the previous model, which performs linear regression to predict the essay score ( Alikaniotis, Yannakoudakis & Rei, 2016 ). Figure 3 shows the architectures of the two models: (A) the original C&W model and (B) the Augmented C&W model. Figure 4 compares (A) standard neural embeddings with (B) SSWE word embeddings.

Figure 3: (A) Original C&W model. (B) Augmented C&W model.

Figure 4: (A) Standard neural embeddings. (B) SSWE word embeddings.

The SSWEs obtained by their model were used to derive continuous representations for each essay, where each essay is represented as a sequence of tokens. Uni- and bi-directional LSTMs were used efficiently for embedding long sequences ( Alikaniotis, Yannakoudakis & Rei, 2016 ).

They used the Kaggle ASAP ( https://www.kaggle.com/c/asap-aes/data ) contest dataset. It consists of 12,976 essays, with an average length of 150 to 550 words per essay, each double-marked (Cohen’s κ = 0.86). The essays cover eight different prompts, each with distinct marking criteria and score range.

The results showed that the SSWE and LSTM approach, without any prior knowledge of the language grammar or the text domain, was able to mark the essays in a very human-like way, beating other state-of-the-art systems. They did not perform any preprocessing of the text other than simple tokenization, while tuning the models’ hyperparameters on a separate validation set ( Alikaniotis, Yannakoudakis & Rei, 2016 ). Combining SSWE with the LSTM also outperformed the traditional SVM model, whereas the LSTM alone did not give significantly higher accuracy than the SVM.

According to Alikaniotis, Yannakoudakis, and Rei ( Alikaniotis, Yannakoudakis & Rei, 2016 ), the combination of SSWE with the two-layer bi-directional LSTM had the highest correlation value on the test set averaged 0.91 (Spearman) and 0.96 (Pearson).

A neural network approach to automated essay scoring

Taghipour and Ng developed in 2016 a recurrent neural network (RNN) approach that automatically learns the relation between an essay and its grade. Since the system is based on RNNs, it can use non-linear neural layers to identify and learn complex patterns in the data, and encode all the information required for essay evaluation and scoring ( Taghipour & Ng, 2016 ).

The designed model architecture can be presented in five layers as follows (a minimal code sketch follows the list):

  • a) The Lookup Table Layer; which builds a d_LT-dimensional space containing the projection of each word.
  • b) The Convolution Layer; which extracts feature vectors from n-grams. It can possibly capture local contextual dependencies in writing and therefore enhance the performance of the system.
  • c) The Recurrent Layer; which processes the input to generate a representation for the given essay.
  • d) The Mean over Time; which aggregates the variable number of inputs into a fixed length vector.
  • e) The Linear Layer with Sigmoid Activation; which maps the generated output vector from the mean-over-time layer to a scalar value ( Taghipour & Ng, 2016 ).
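
As referenced above, a minimal Keras sketch of these five layers might look like the following; the hyperparameters, vocabulary size, and framework choice are illustrative assumptions, not the values or code used by Taghipour & Ng (2016).

```python
# Illustrative sketch of the five-layer architecture described above:
# lookup table (embedding), convolution, recurrent layer, mean over time,
# and a sigmoid-activated linear output.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, EMB_DIM, MAX_LEN = 4000, 50, 500  # placeholder hyperparameters

model = models.Sequential([
    tf.keras.Input(shape=(MAX_LEN,), dtype="int32"),
    layers.Embedding(VOCAB_SIZE, EMB_DIM),                     # lookup table layer
    layers.Conv1D(64, 3, padding="same", activation="relu"),   # n-gram feature extraction
    layers.LSTM(128, return_sequences=True),                   # recurrent layer
    layers.GlobalAveragePooling1D(),                           # mean over time
    layers.Dense(1, activation="sigmoid"),                     # score scaled to [0, 1]
])

# Scores would be normalised to [0, 1] for training with MSE and mapped back
# to the original score range at prediction time.
model.compile(optimizer="rmsprop", loss="mse")
model.summary()
```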

Taghipour and his colleagues used the Kaggle ASAP contest dataset. They split the dataset into 60% training, 20% development, and 20% testing, and used Quadratic Weighted Kappa (QWK) as the evaluation metric. To evaluate the performance of the system, they compared it to an available open-source AES system called the ‘Enhanced AI Scoring Engine’ (EASE) ( https://github.com/edx/ease ). To identify the best model, they performed several experiments, such as convolutional vs. recurrent neural networks, basic RNN vs. Gated Recurrent Units (GRU) vs. LSTM, unidirectional vs. bidirectional LSTM, and with vs. without the mean-over-time layer ( Taghipour & Ng, 2016 ).

The results showed multiple observations according to ( Taghipour & Ng, 2016 ), summarized as follows:

  • a) The basic RNN failed to achieve results as accurate as LSTM or GRU, and the other models outperformed it. This was possibly due to the relatively long sequences of words in writing.
  • b) The neural network's performance degraded significantly in the absence of the mean-over-time layer; as a result, it did not learn the task properly.
  • c) The best model was an ensemble of ten LSTM instances and ten CNN instances. It outperformed the baseline EASE system by 5.6%, with an average QWK value of 0.76.

Automatic features for essay scoring—an empirical study

Dong and Zhang provided in 2016 an empirical study to examine a neural network method to learn syntactic and semantic characteristics automatically for AES, without the need for external pre-processing. They built a hierarchical Convolutional Neural Network (CNN) structure with two levels in order to model sentences separately ( Dasgupta et al., 2018 ; Dong & Zhang, 2016 ).

Dong and his colleague built a model with two parts, summarized as follows:

  • a) Word Representations: A word embedding is used that does not rely on POS tagging or other pre-processing.
  • b) CNN Model: They treated essay scoring as a regression task and employed a two-layer CNN model, in which one convolutional layer extracts sentence representations and the other is stacked on the sentence vectors to learn essay representations.

The dataset they employed in the experiments is the Kaggle ASAP contest dataset. The data preparation settings followed those of Phandi, Chai, and Ng ( Phandi, Chai & Ng, 2015 ). For the domain adaptation (cross-domain) experiments, they also followed Phandi, Chai, and Ng ( Phandi, Chai & Ng, 2015 ) by picking four pairs of essay prompts, namely 1→2, 3→4, 5→6 and 7→8, where 1→2 denotes prompt 1 as the source domain and prompt 2 as the target domain. They used Quadratic Weighted Kappa (QWK) as the evaluation metric.

In order to evaluate the performance of the system, they compared it to the EASE system (an open-source AES system available to the public) with both of its models: Bayesian Linear Ridge Regression (BLRR) and Support Vector Regression (SVR).

The empirical results showed that the two-layer Convolutional Neural Network (CNN) outperformed the other baselines (e.g., Bayesian Linear Ridge Regression) on both the in-domain and the domain-adaptation experiments on the Kaggle ASAP contest dataset. The neural features learned by the CNN were thus very effective in essay marking, handling more high-level and abstract information than manual feature templates. The average in-domain QWK value was 0.73, versus 0.75 for the human raters ( Dong & Zhang, 2016 ).

Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring

In 2018, Dasgupta et al. (2018) proposed a qualitatively enhanced deep convolution recurrent neural network architecture to score essays automatically. The model considers both word- and sentence-level representations. Using a hierarchical CNN connected with a bidirectional LSTM model, they were able to consider linguistic, psychological and cognitive feature embeddings within a text ( Dasgupta et al., 2018 ).

The designed model architecture for the linguistically informed convolution RNN can be presented in five layers as follows:

  • a) Generating Embeddings Layer: Its primary function is to construct the previously trained sentence vectors. The sentence vectors extracted from every input essay are appended with the vector formed from the linguistic features determined for that sentence.
  • b) Convolution Layer: For a given sequence of vectors with K windows, this layer applies a linear transformation to all K windows. It is fed with each of the word embeddings generated in the previous layer.
  • c) Long Short-Term Memory Layer: The main function of this layer is to examine the future and past sequence context by connecting bidirectional LSTM (Bi-LSTM) networks.
  • d) Activation Layer: This layer obtains the intermediate hidden states h_1, h_2, …, h_T from the Bi-LSTM layer and, in order to calculate the weight of each sentence's contribution to the final essay score (the quality of the essay), applies an attention pooling layer over the sentence representations.
  • e) The Sigmoid Activation Function Layer: The main function of this layer is to perform a linear transformation of the input vector and convert it to a (continuous) scalar value ( Dasgupta et al., 2018 ).

Figure 5 represents the proposed linguistically informed Convolution Recurrent Neural Network architecture.

Figure 5: The proposed linguistically informed convolution recurrent neural network architecture.

Dasgupta and his colleagues employed the Kaggle ASAP contest dataset in their experiments. They used 7-fold cross-validation to assess their models; in every fold, 80% of the data was used for training, 10% for development, and the remaining 10% for testing. They used Quadratic Weighted Kappa (QWK) as the evaluation metric.

The results showed that, in terms of all these parameters, the Qualitatively Enhanced Deep Convolution LSTM (Qe-C-LSTM) system performed better than the existing LSTM, Bi-LSTM and EASE models. It achieved a Pearson's and a Spearman's correlation of 0.94 and 0.97, respectively, compared to 0.91 and 0.96 in Alikaniotis, Yannakoudakis & Rei (2016) . They also obtained an RMSE score of 2.09 and computed a pairwise Cohen's κ value of 0.97 ( Dasgupta et al., 2018 ).

Summary and Discussion

Over the past four decades, there have been several studies that examined the approaches of applying computer technologies on scoring essay questions. Recently, computer technologies have been able to assess the quality of writing using AES technology. Many attempts have taken place in developing AES systems in the past years ( Dikli, 2006 ).

The AES systems do not assess the intrinsic qualities of an essay directly as human-raters do, but they utilize the correlation coefficients of the intrinsic qualities to predict the score to be assigned to an essay. The performance of these systems is evaluated based on the comparison of the scores assigned to a set of essays scored by expert humans.

The AES systems have many strengths, mainly in reducing labor-intensive marking activities, saving time and cost, and improving the reliability of writing assessment. Besides, they ensure a consistent application of marking criteria, therefore facilitating equity in scoring. However, there is a substantial manual effort involved in reaching these results on different domains, genres, prompts, and so forth. Moreover, the linguistic features intended to capture the aspects of writing to be assessed are hand-selected and tuned for specific domains. In order to perform well on different data, separate models with distinct feature sets are typically tuned ( Burstein, 2003 ; Dikli, 2006 ; Hamp-Lyons, 2001 ; Rudner & Gagne, 2001 ; Rudner & Liang, 2002 ). Despite their weaknesses, AES systems continue to attract the attention of public schools, universities, testing agencies, researchers and educators ( Dikli, 2006 ).

The AES systems described in this paper under the first category are based on handcrafted features and, usually, rely on regression methods. They employ several methods to obtain the scores. While E-rater and IntelliMetric use NLP techniques, the IEA system utilizes LSA. Moreover, PEG utilizes proxy measures (proxes), and BETSY™ uses Bayesian procedures to evaluate the quality of a text.

While E-rater, IntelliMetric, and BETSY evaluate both style and semantic content of essays, PEG only evaluates style and ignores the semantic aspect of essays. Furthermore, IEA is exclusively concerned with semantic content. PEG, E-rater, IntelliMetric, and IEA need relatively small numbers of pre-scored essays for training, in contrast with BETSY, which needs a huge number of pre-scored training essays.

The systems in the first category have high correlations with human-raters. While PEG, E-rater, IEA, and BETSY evaluate only English language essay responses, IntelliMetric evaluates essay responses in multiple languages.

Contrary to PEG, IEA, and BETSY, E-rater and IntelliMetric have instructional or immediate feedback applications (i.e., Criterion and MY Access!). These instruction-oriented AES systems support formative assessment by allowing students to save their writing drafts on the system. Thus, students can revise their writing in light of the formative feedback received from either the system or the teacher. The recent version of MY Access! (6.0) provides online portfolios and peer review.

The drawbacks of this category may include the following: (a) feature engineering can be time-consuming, since features need to be carefully handcrafted and selected to fit the appropriate model, and (b) such systems are sparse and instantiated by discrete pattern-matching.

The AES systems described in this paper under the second category are usually based on neural networks. Neural network approaches, especially deep learning techniques, have been shown to be capable of inducing dense syntactic and semantic features automatically, applying them to text analysis and classification problems including AES systems ( Alikaniotis, Yannakoudakis & Rei, 2016 ; Dong & Zhang, 2016 ; Taghipour & Ng, 2016 ), and giving better results than the statistical models used with handcrafted features ( Dong & Zhang, 2016 ).

Recent advances in Deep Learning have shown that neural approaches to AES achieve state-of-the-art results ( Alikaniotis, Yannakoudakis & Rei, 2016 ; Taghipour & Ng, 2016 ) with the additional advantage of utilizing features that are automatically learned from the data. In order to facilitate interpretability of neural models, a number of visualization techniques have been proposed to identify textual (superficial) features that contribute to model performance [7].

While Alikaniotis and his colleagues ( 2016 ) employed a two-layer bidirectional LSTM combined with SSWE for essay scoring tasks, Taghipour & Ng (2016) adopted the LSTM model and combined it with a CNN, and Dong & Zhang (2016) developed a two-layer CNN. Unlike these three, Dasgupta and his colleagues ( 2018 ), who proposed a Qualitatively Enhanced Deep Convolution LSTM, were interested in word-level and sentence-level representations as well as linguistic, cognitive and psychological feature embeddings. All linguistic and qualitative features were computed off-line and then fed into the deep learning architecture.

Although deep learning-based approaches have achieved better performance than the previous approaches, the performance gain may not hold for the complex linguistic and cognitive characteristics that are very important in modeling such essays. See Table 1 for a comparison of the AES systems.

Table 1: Comparison of the reviewed AES systems.

System (developer) | Year | Primary focus | Technique | Training data required | Instructional application | Agreement with human raters
PEG™ (Ellis Page) | 1966 | Style | Statistical | Yes (100–400) | No | 0.87
IEA™ (Landauer, Foltz & Laham) | 1997 | Content | LSA (KAT engine by PEARSON) | Yes (~100) | Yes | 0.90
E-rater® (ETS development team) | 1998 | Style & content | NLP | Yes (~400) | Yes (Criterion) | ~0.91
IntelliMetric™ (Vantage Learning) | 1998 | Style & content | NLP | Yes (~300) | Yes (MY Access!) | ~0.83
BETSY™ (Rudner) | 1998 | Style & content | Bayesian text classification | Yes (1,000) | No | ~0.80
Alikaniotis, Yannakoudakis & Rei | 2016 | Style & content | SSWE + two-layer Bi-LSTM | Yes (~8,000) | No | ~0.91 (Spearman), ~0.96 (Pearson)
Taghipour & Ng | 2016 | Style & content | Adapted LSTM | Yes (~7,786) | No | QWK ~0.761
Dong & Zhang | 2016 | Syntactic & semantic features | Word embedding + two-layer CNN | Yes (~1,500–1,800) | No | Average kappa ~0.734 vs. 0.754 for human
Dasgupta, Naskar, Dey & Saha | 2018 | Style, content, linguistic & psychological | Deep convolution recurrent neural network | Yes (~8,000–10,000) | No | Pearson 0.94, Spearman 0.97

In general, there are three primary challenges to AES systems. First, they are not able to assess essays as human-raters do, because they do what they have been programmed to do ( Page, 2003 ). They eliminate the human element in writing assessment and lack the sense of the rater as a person ( Hamp-Lyons, 2001 ). This shortcoming has been somewhat mitigated by the high correlations obtained between computer and human-raters ( Page, 2003 ), although it remains a challenge.

The second challenge is whether the computer can be fooled by students ( Dikli, 2006 ). It may be possible to “trick” the system, for example by writing a longer essay to obtain a higher score ( Kukich, 2000 ). Studies, such as the GRE study in 2001, examined whether a computer could be deceived into assigning a lower or higher score to an essay than it deserves. The results revealed that it might reward a poor essay ( Dikli, 2006 ). The developers of AES systems have been utilizing algorithms to detect students who try to cheat.

Although automatic-learning AES systems are based on neural network algorithms, handcrafted AES systems surpass them in one important respect. Handcrafted systems are closely tied to the scoring rubrics that were designed as the criteria for assessing a specific essay, and human-raters use these rubrics to score essays as well. The objectivity of human-raters is measured by their commitment to the scoring rubrics. On the contrary, automatic-learning systems extract the scoring criteria using machine learning and neural networks, which may include factors that are not part of the scoring rubric and are hence reminiscent of raters' subjectivity (e.g., mood, the nature of a rater's character). Considering this point, handcrafted AES systems may be considered more objective and fairer to students from the viewpoint of educational assessment.

The third challenge is measuring the creativity of human writing. Assessing the creativity of ideas and propositions and evaluating their practicality is still a pending challenge for both categories of AES systems and needs further research.

Funding Statement

The authors received no funding for this work.

Additional Information and Declarations

The authors declare there are no competing interests.

Mohamed Abdellatif Hussein conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables.

Hesham Hassan and Mohammad Nassef authored or reviewed drafts of the paper, approved the final draft.


From Automation to Augmentation: Large Language Models Elevating Essay Scoring Landscape

Receiving immediate and personalized feedback is crucial for second-language learners, and Automated Essay Scoring (AES) systems are a vital resource when human instructors are unavailable. This study investigates the effectiveness of Large Language Models (LLMs), specifically GPT-4 and fine-tuned GPT-3.5, as tools for AES. Our comprehensive set of experiments, conducted on both public and private datasets, highlights the remarkable advantages of LLM-based AES systems: superior accuracy, consistency, generalizability, and interpretability, with fine-tuned GPT-3.5 surpassing traditional grading models. Additionally, we undertake LLM-assisted human evaluation experiments involving both novice and expert graders. One pivotal discovery is that LLMs not only automate the grading process but also enhance the performance of human graders. Novice graders, when provided with feedback generated by LLMs, achieve a level of accuracy on par with experts, while experts become more efficient and maintain greater consistency in their assessments. These results underscore the potential of LLMs in educational technology, paving the way for effective collaboration between humans and AI, ultimately leading to transformative learning experiences through AI-generated feedback.

Changrong Xiao 1, Wenxing Ma 2, Sean Xin Xu 1, Kunpeng Zhang 3, Yufang Wang 4, Qi Fu 4
1 Center for AI and Management (AIM), School of Economics and Management, Tsinghua University
2 School of Economics and Management, Tsinghua University
3 Department of Decision, Operations & Information Technologies, University of Maryland
4 Beijing Xicheng Educational Research Institute
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

1 Introduction


English learning is an integral part of the high school curriculum in China, with a particular emphasis on writing practice. While timely and reliable feedback is essential for improving students’ proficiency, it presents a significant challenge for educators to provide individualized feedback, due to the high student-teacher ratio in China. This limitation hinders students’ academic progress, especially those who aspire to enhance their self-directed learning. Hence, the development of automated systems capable of delivering accurate and constructive feedback and assessment scores carries immense importance in this context.

Automated Essay Scoring (AES) systems provide valuable assistance to students by offering immediate and consistent feedback on their work, while also simplifying the grading process for educators. However, the effective implementation of AES systems in real-world educational settings presents several challenges. One of the primary challenges is the diverse range of exercise contexts and the inherent ambiguity in scoring rubrics. Take, for example, the case of Chinese high school students who engage in various writing exercises as part of their preparation for the College Entrance Examination. Although established scoring guidelines for these exams are widely recognized by English educators, they often lack the necessary granularity, especially when assessing abstract evaluation criteria such as logical structure. Furthermore, interviews conducted with high school teachers have revealed that subjective elements and personal experiences frequently exert influence over the grading process. These intricacies and complexities introduce significant hurdles when it comes to ensuring the accuracy, generalizability, and interpretability of AES systems.

To address these challenges, it is important to highlight recent advancements in the field of Natural Language Processing (NLP), particularly the development of large language models (LLMs). A notable example is OpenAI’s ChatGPT ( https://chat.openai.com ), which showcases impressive capabilities. ChatGPT not only demonstrates robust logical reasoning but also displays a remarkable ability to comprehend and adhere to human instructions (Ouyang et al., 2022 ) . Moreover, recent studies have further underscored the potential of leveraging LLMs in AES tasks (Mizumoto and Eguchi, 2023 ; Yancey et al., 2023 ; Naismith et al., 2023 ) .

In this study, we employed GPT-3.5 and GPT-4 as the foundational LLMs for our investigation. We carefully designed appropriate prompts for the LLMs, instructing them to evaluate essays and provide detailed explanations. Additionally, we enhanced the performance of GPT-3.5 by fine-tuning it on annotated datasets. We conducted extensive experiments on both publicly available essay-scoring datasets and a proprietary private dataset of student essays. To further assess the potential of LLMs in enhancing human grading, we conducted human evaluation experiments involving both novice and expert graders. These experiments yielded compelling insights into the educational context and potential avenues for effective collaboration between humans and AI. In summary, our study makes three significant contributions:

We pioneer the exploration of LLMs’ capabilities as AES systems, especially in intricate scenarios with tailored grading criteria. Our best fine-tuned GPT-3.5 model exhibits superior accuracy, consistency, and generalizability, coupled with the ability to provide detailed explanations and recommendations.

We introduce a substantial essay-scoring dataset, comprising 6,559 essays written by Chinese high school students, along with multi-dimensional scores provided by expert educators. This dataset enriches the resources available for research in the field of AI in Education (AIEd). Code and resources can be found in our GitHub repository: https://github.com/Xiaochr/LLM-AES .

The most significant implications of our study emerge from the LLM-assisted human evaluation experiments. Our findings underscore the potential of LLM-generated feedback to elevate the capabilities of individuals with limited domain knowledge to a level comparable to experts. These insights pave the way for future research in the realm of human-AI collaboration and AI-assisted learning in educational contexts.

2 Related Work

2.1 Automated Essay Scoring (AES)

Automated Essay Scoring (AES) stands as a pivotal research area at the intersection of NLP and education. Traditional AES methods usually involve a two-stage process, as outlined in (Ramesh and Sanampudi, 2022 ) . First, features are extracted from the essay, including statistical features (Miltsakaki and Kukich, 2004 ; Ridley et al., 2020 ) and latent vector representations (Mikolov et al., 2013 ; Pennington et al., 2014 ) . Subsequently, regression-based or classification-based machine learning models are employed to predict the essay’s score (Sultan et al., 2016 ; Mathias and Bhattacharyya, 2018b , a ; Salim et al., 2019 ) .

With the advancement of deep learning, AES has witnessed the integration of advanced techniques such as convolutional neural networks (CNNs), long short-term memory networks (LSTMs), Attention-based models, and other deep learning technologies. These innovations have led to more precise score predictions (Dong and Zhang, 2016 ; Taghipour and Ng, 2016 ; Riordan et al., 2017 ) .

2.2 LLM Applications in AES

The domain of AES has also experienced advancements with the incorporation of pre-trained language models to enhance performance. Rodriguez et al. ( 2019 ); Lun et al. ( 2020 ) utilized Bidirectional Encoder Representations from Transformers (BERT (Devlin et al., 2018 ) ) to automatically evaluate essays and short answers. Additionally, Yang et al. ( 2020 ) improved BERT’s performance by fine-tuning it through a combination of regression and ranking loss, while Wang et al. ( 2022 ) employed BERT for jointly learning multi-scale essay representations.

Recent studies have explored the potential of leveraging the capabilities of modern LLMs in AES tasks. Mizumoto and Eguchi ( 2023 ) provided ChatGPT with specific IELTS scoring rubrics for essay evaluation but found limited improvements when incorporating GPT scores into the regression model. Similarly, Han et al. ( 2023 ) introduced an automated scoring framework that did not outperform the BERT baseline. In a different approach, Yancey et al. ( 2023 ) used GPT-4’s few-shot capabilities to predict Common European Framework of Reference for Languages (CEFR) levels for short essays written by second-language learners. However, the Quadratic Weighted Kappa (QWK) scores did not surpass those achieved by the baseline model trained with XGBoost or by human annotators.

Building on these insights, our study aims to further investigate the effectiveness of LLMs in AES tasks. We focus on more complex contexts and leverage domain-specific datasets to fine-tune LLMs for enhanced prediction performance. This research area offers promising avenues for future exploration and improvement.

ASAP dataset

Our Chinese Student English Essay Dataset

4.1 Essay Scoring

In this section, we will present the methods employed in the experiments of this study. The methods can be broadly divided into two main components: prompt engineering and further fine-tuning with the training dataset. We harnessed the power of OpenAI’s GPT-series models to generate both essay scores and feedback, specifically leveraging the zero-shot and few-shot capabilities of gpt-4 , as well as gpt-3.5-turbo for fine-tuning purposes.

To create appropriate prompts for different scenarios, our approach began with the development of initial instructions, followed by their refinement using GPT-4. An illustrative example of a prompt and its corresponding model-generated output can be found in Table 9 in the Appendices.

In this study, we considered various grading approaches, including zero-shot, few-shot, fine-tuning, and the baseline, which are as follows:

GPT-4, zero-shot, without rubrics

In this setting, we simply provide the prompt and the target essay to GPT-4. The model then evaluates the essay and assigns a score based on its comprehension within the specified score range.

GPT-4, zero-shot, with rubrics

Alongside the prompt and the target essay, we also provide GPT-4 with explicit scoring rubrics, guiding its evaluation.

GPT-4, few-shot, with rubrics

In addition to the zero-shot settings, the few-shot prompts include sample essays and their corresponding scores. This assists GPT-4 in understanding the latent scoring patterns. With the given prompt, target essay, scoring rubrics, and a set of k essay examples, GPT-4 provides an appropriate score reflecting this enriched context.

As indicated by prior studies on AES tasks (Yancey et al., 2023 ) , increasing the value of k did not consistently yield better results, showing diminishing marginal returns. Therefore, we chose k = 3, as in that study.

We explored two approaches for selecting the sample essays. The first approach involved randomly selecting essays from various levels of quality to help the LLM understand the approximate level of the target essay. The second method adopted a retrieval-based approach, which has been shown to be effective in enhancing LLM performance (Khandelwal et al., 2020 ; Shi et al., 2023 ; Ram et al., 2023 ) . Leveraging OpenAI’s text-embedding-ada-002 model, we calculated an embedding for each essay and identified the top-k similar essays by cosine similarity (excluding the target essay from the selection). Our experiments demonstrated that this retrieval strategy consistently yielded superior results, so we focus on its outcomes in the subsequent sections.

In all these configurations, we adopted the Chain-of-Thought (CoT) (Wei et al., 2022 ) strategy. This approach instructed the LLM to first analyze and explain the provided materials before making final score determinations. Research studies (Lampinen et al., 2022 ; Zhou et al., 2023 ; Li et al., 2023 ) have shown that this structured approach significantly enhances the capabilities of the LLM, optimizing performance in tasks that require inference and reasoning.
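
A rough sketch of how such a retrieval-based, few-shot, chain-of-thought scoring call could be assembled is shown below; the prompt wording, retrieval helper, and output handling are illustrative assumptions rather than the authors' exact implementation, and essay embeddings are assumed to be precomputed (e.g., with text-embedding-ada-002).

```python
# Illustrative sketch of the few-shot scoring setup: retrieve the k most
# similar pre-scored essays by embedding cosine similarity, assemble a
# rubric + chain-of-thought prompt, and query the chat model.
import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def top_k_examples(target_emb, example_embs, examples, k=3):
    """Return the k scored example essays whose embeddings are closest to the target."""
    sims = example_embs @ target_emb / (
        np.linalg.norm(example_embs, axis=1) * np.linalg.norm(target_emb)
    )
    return [examples[i] for i in np.argsort(sims)[::-1][:k]]

def score_essay(essay, rubrics, examples):
    shots = "\n\n".join(
        f"Example essay:\n{e['text']}\nScore: {e['score']}" for e in examples
    )
    messages = [
        {"role": "system", "content": "You are an experienced English essay grader."},
        {"role": "user", "content": (
            f"Scoring rubrics:\n{rubrics}\n\n{shots}\n\n"
            f"Target essay:\n{essay}\n\n"
            "First analyse the essay against the rubrics step by step, "
            "then give a final score on the last line as 'Score: <number>'."
        )},
    ]
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, temperature=0
    )
    return response.choices[0].message.content
```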

Fine-tuned GPT-3.5

We conducted further investigations into the effectiveness of supervised fine-tuning methods. Specifically, we employed OpenAI’s gpt-3.5-turbo and fine-tuned it individually for each dataset. This fine-tuning process incorporated prompts containing scoring rubrics, examples, and the target essays. Given that our private dataset includes scores in three sub-dimensions and an overall score, we explored two fine-tuning strategies. The first approach directly generates all four scores simultaneously. Alternatively, we experimented with training three specialized models, each focusing on a distinct scoring dimension. We then combined the scores from these expert models to obtain the overall score.
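
For illustration, one training record for OpenAI's chat-style fine-tuning format might look like the sketch below; the prompt wording, score fields, and rubric text are hypothetical, not the authors' actual data.

```python
# Sketch of building a chat-format JSONL file for gpt-3.5-turbo fine-tuning:
# one record per essay, with rubrics in the prompt and the human-assigned
# scores as the target completion. Field names, rubric text, and score
# ranges are illustrative placeholders.
import json

def to_record(essay: str, rubrics: str, scores: dict) -> dict:
    return {
        "messages": [
            {"role": "system", "content": "You are an experienced English essay grader."},
            {"role": "user", "content": f"Scoring rubrics:\n{rubrics}\n\nEssay:\n{essay}\n\nReturn the scores."},
            {"role": "assistant", "content": json.dumps(scores)},  # overall + sub-dimension scores
        ]
    }

essays = [
    {"text": "Student essay text ...", "scores": {"overall": 17, "content": 7, "language": 6, "structure": 4}},
]
rubrics = "Hypothetical rubric: overall 0-20; content 0-8; language 0-8; structure 0-4 ..."

with open("train.jsonl", "w", encoding="utf-8") as f:
    for item in essays:
        record = to_record(item["text"], rubrics, item["scores"])
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```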

BERT baseline

Similar to the models used by Yang et al. ( 2020 ) and Han et al. ( 2023 ) , we implemented a simple yet effective baseline model for score prediction based on BERT. This model integrated a softmax prediction layer following the BERT output, and the BERT parameters remained unfrozen during training. Both the BERT model and the softmax layer were jointly trained on the training essay set. For a detailed description of the model and its training configurations, please refer to Appendix B .
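
As a rough sketch of such a baseline (the authors' exact configuration is given in their Appendix B; the model choice, hyperparameters, and toy data here are placeholders):

```python
# Minimal sketch of a BERT baseline for score classification: a pretrained
# BERT encoder with a softmax classification head, trained end to end
# (BERT weights unfrozen). Hyperparameters and data are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_SCORE_CLASSES = 7  # e.g. integer scores 0-6 for one hypothetical subset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_SCORE_CLASSES
)

essays = ["First training essay ...", "Second training essay ..."]
labels = torch.tensor([3, 5])

enc = tokenizer(essays, truncation=True, padding=True, max_length=512, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy batch
    out = model(**enc, labels=labels)  # cross-entropy over the softmax head
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)
print(pred.tolist())
```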

4.2 Feedback Generation

One of the primary advantages of LLMs over traditional methods lies in their capacity to generate clear, explainable feedback in natural language. This capability is especially valuable to humans seeking to understand and improve their performance. In our AES system (Figure 2 ), we have seamlessly integrated a feedback generation module to leverage this unique feature. Utilizing GPT-4, this module is equipped with essential materials and the scores predicted by the scoring model. It is then tasked with clarifying these scores in alignment with the specified rubrics and providing detailed suggestions for improvement. Through subsequent human evaluations, we have discovered that such personalized and explainable feedback is vital in facilitating the human grading process.


5 Experimental Results

Table 1: QWK scores on the eight ASAP subsets.

Method | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 | Set 8
GPT-4, zero-shot, w/o rubrics | 0.0423 | 0.4017 | 0.2805 | 0.5571 | 0.3659 | 0.5021 | 0.0809 | 0.4188
GPT-4, zero-shot, with rubrics | 0.0715 | 0.3003 | 0.3661 | 0.6266 | 0.5227 | 0.3448 | 0.1101 | 0.4072
GPT-4, few-shot, with rubrics | 0.2801 | 0.3376 | 0.3308 | 0.7839 | 0.6226 | 0.7284 | 0.2570 | 0.4541
Fine-tuned GPT-3.5 | 0.7406 | 0.6183 | 0.7041 | 0.8593 | 0.7959 | 0.8480 | 0.7271 | 0.6135
BERT baseline | 0.6486 | 0.6284 | 0.7327 | 0.7669 | 0.7432 | 0.6810 | 0.7165 | 0.4624

Table 2: QWK scores on the private Chinese student essay dataset.

Method | Overall | Content | Language | Structure
GPT-4, zero-shot, w/o rubrics | 0.4688 | 0.4412 | 0.3081 | 0.5757
GPT-4, zero-shot, with rubrics | 0.5344 | 0.5391 | 0.4660 | 0.4256
GPT-4, few-shot, with rubrics | 0.6729 | 0.6484 | 0.6278 | 0.4661
Fine-tuned GPT-3.5, multi-task | 0.7620 | 0.7224 | 0.7494 | 0.6422
Fine-tuned GPT-3.5, ensemble | 0.7806 | 0.7141 | 0.7605 | 0.6811
BERT baseline | 0.7250 | 0.6911 | 0.6980 | 0.6358

5.1 LLM Performance on both the Public and Private Datasets

We conducted experiments using the LLM-based methods and the baseline approach across all 8 subsets of the ASAP dataset. We employed Cohen’s Quadratic Weighted Kappa (QWK) as our primary evaluation metric, which is the most widely recognized automatic metric in AES tasks (Ramesh and Sanampudi, 2022 ) . QWK reflects the degree of agreement between the prediction and the ground truth. A QWK score of 1 denotes perfect agreement among raters. Conversely, a score of 0 suggests that the agreement is merely by chance, while a negative score indicates worse than a random agreement.
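
The QWK metric itself is straightforward to compute with scikit-learn's weighted Cohen's kappa; the short sketch below uses toy scores, not the paper's data.

```python
# Quadratic Weighted Kappa (QWK): Cohen's kappa with quadratic weights,
# penalising predictions more heavily the further they fall from the
# human score. Toy scores are for illustration only.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 4, 1, 5, 3, 2]
model_scores = [2, 3, 3, 4, 2, 5, 4, 2]

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")  # 1 = perfect agreement, 0 = chance-level agreement
```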

For methods that require a training dataset, such as the BERT baseline and the fine-tuned method, we performed data division for each subset using an 80:20 split between training and testing data. Supervised training was carried out on the training set, and the subsequent predictions were evaluated on the testing set. The retrieval-based approaches utilized the same training and testing datasets. The training sets served as the retrieval database and the performances were assessed on the testing set.

Our extensive experiments (as shown in Table 1 ) revealed that models trained via supervised methods exhibited the best performances. Despite its straightforward structure and training approach, our BERT classifier achieved relatively high performance. Fine-tuned with the essays and their corresponding scores, GPT-3.5 reached even higher QWK values. As shown in Table 1 , the fine-tuned GPT-3.5 outperformed the BERT baseline in six of the eight subsets within the ASAP dataset. This underscores the potential of domain-specific LLMs in achieving superior performance.

In contrast, the zero-shot and few-shot capabilities of LLMs in the ASAP context did not show as significant results as their performance on the CEFR scale, as noted in the study by Yancey et al. ( 2023 ) . In zero-shot scenarios, GPT-4 often displayed low scoring performance, with some subsets performing almost as poorly as random guessing (e.g., Set 1 with a QWK of 0.0423 and Set 7 with a QWK of 0.0809). This underperformance might be attributed to the broader scoring ranges and more intricate rubrics in ASAP compared to CEFR. The addition of rubrics information did little to enhance performance, suggesting that even GPT-4, the most advanced LLM to date, may struggle with fully comprehending and following complicated human instructions. In the few-shot setting, although there was an improvement in scoring performance over the zero-shot scenarios, particularly for Sets 4-6, GPT-4’s performance still lagged significantly behind the BERT baseline.

Furthermore, we conducted experiments using our private essay set. The results (refer to Table 2 ) align with the observations from the ASAP dataset. When provided with detailed information such as rubrics and examples, GPT-4 shows improved performance, yet it still does not exceed the BERT baseline. The fine-tuned GPT-3.5 models exhibited the highest levels of performance. We also explored the two fine-tuning approaches mentioned previously, and discovered that the ensemble of grading expert models yielded the best QWK values of the Overall score, Content score, and Structure score.

5.2 Further Analyses

Table 3: Cross-prompt generalization (QWK): models trained on one ASAP subset and evaluated on the others ("-" marks the training subset).

Trained on | Model | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 | Set 8
Set 1 | BERT baseline | - | 0.3299 | 0.1680 | 0.1380 | 0.3045 | 0.1234 | 0.3002 | 0.1541
Set 1 | Fine-tuned GPT-3.5 | - | 0.5216 | 0.5405 | 0.4891 | 0.5076 | 0.6344 | 0.6306 | 0.3126
Set 2 | BERT baseline | 0.2776 | - | 0.1975 | 0.2392 | 0.1750 | 0.1453 | 0.2474 | 0.3783
Set 2 | Fine-tuned GPT-3.5 | 0.4270 | - | 0.4131 | 0.4619 | 0.5958 | 0.5579 | 0.5438 | 0.6684
Set 3 | BERT baseline | 0.3468 | 0.4444 | - | 0.6230 | 0.6319 | 0.5299 | 0.4368 | 0.2427
Set 3 | Fine-tuned GPT-3.5 | 0.3991 | 0.2488 | - | 0.7674 | 0.7714 | 0.7150 | 0.4964 | 0.1134
Set 4 | BERT baseline | 0.3257 | 0.5332 | 0.6267 | - | 0.5483 | 0.4959 | 0.4659 | 0.3204
Set 4 | Fine-tuned GPT-3.5 | 0.0631 | 0.3493 | 0.4908 | - | 0.6515 | 0.7420 | 0.0865 | 0.3419
Set 5 | BERT baseline | 0.4051 | 0.3341 | 0.4264 | 0.4202 | - | 0.5243 | 0.3255 | 0.2035
Set 5 | Fine-tuned GPT-3.5 | 0.4354 | 0.4301 | 0.5765 | 0.6877 | - | 0.7368 | 0.1061 | 0.3118
Set 6 | BERT baseline | 0.3164 | 0.3462 | 0.4000 | 0.3067 | 0.4882 | - | 0.2303 | 0.3047
Set 6 | Fine-tuned GPT-3.5 | 0.1342 | 0.3607 | 0.4579 | 0.3157 | 0.3734 | - | 0.0061 | 0.0859
Set 7 | BERT baseline | 0.0975 | 0.0086 | 0.1854 | 0.0328 | 0.0554 | 0.1244 | - | 0.2917
Set 7 | Fine-tuned GPT-3.5 | 0.5862 | 0.3993 | 0.4865 | 0.4425 | 0.4494 | 0.4417 | - | 0.2157
Set 8 | BERT baseline | 0.0560 | 0.1102 | 0.0110 | 0.0164 | 0.0371 | 0.0454 | 0.1777 | -
Set 8 | Fine-tuned GPT-3.5 | 0.2714 | 0.4822 | 0.4768 | 0.6009 | 0.4199 | 0.3231 | 0.5460 | -

Consistency of LLM-graded scores

To assess the consistency of the predicted scores obtained from LLM-based methods, we replicated the same experiment three times, setting the temperature parameter of both GPT-4 and GPT-3.5 to 0. More than 80% of the ratings remained unchanged across runs. The small variation across these trials indicates a high level of consistency in the scoring by LLM-based systems. Subsequently, we computed the average of the three values to determine the final score.

Generalizability of the LLM-based AES system

The eight subsets of the ASAP dataset, with their diverse scoring criteria and ranges, provide an excellent basis for evaluating a model’s generalization capabilities. The GPT-4, zero-shot, with rubrics approach, which doesn’t require training data, can adapt to different scoring rubrics and ranges according to the instructions (as seen in Table 1 ). For methods like fine-tuning and the BERT baseline that require training data, we first trained the models on one subset and then evaluated their performance across the remaining datasets. For instance, we would train on Set 1 and test on Sets 2-8, keeping the model weights fixed.

The results, detailed in Table 3 , indicate that our fine-tuned GPT-3.5 generally outperforms the BERT baseline, although there are instances of underperformance (notably when trained on Set 4 and tested on Sets 1 and 7). The BERT model especially exhibits weak generalization when trained on Sets 7 and 8, with performances close to random guessing. This limitation may stem from BERT’s inability to adapt to score ranges outside its training scope. For instance, a BERT model trained on Set 3, with a score range of 0-3, struggles to predict scores in the 4-6 range for Set 2, which has a score range of 1-6. In contrast, after fine-tuning, LLMs are capable of following new prompts and generating scores beyond their initial training range. In some cases, the generalization performance of fine-tuned GPT-3.5 even exceeded the zero-shot and few-shot capabilities of GPT-4. This superior understanding and adherence to new prompts, scoring rubrics, and ranges, along with better alignment to human instructions, are key factors in the enhanced generalization performance of our AES system.

6 LLM-Assisted Human Evaluation Experiment

When assessing the efficacy of generative AI, it’s crucial to go beyond automated metrics and also evaluate the experiences and preferences of human users. In the previous section, our findings indicated that fine-tuned LLM achieved high accuracy, consistency, and generalizability. Moreover, LLM-based systems could provide comprehensive explanations for the scores, an aspect that cannot be measured by automated metrics like QWK.

Consequently, in this section, we carried out human evaluation experiments to investigate the perceptions and views of real-world graders toward LLM’s automated scoring. Through the carefully designed experiments, we revealed significant insights relevant to AI-assisted human grading and broader human-AI collaboration contexts.

6.1 Experiment Design

For our experiments, we randomly selected 50 essays from the test set of our private dataset, all on the same topic. We recruited 10 college students from a Normal University in Beijing, prospective high school teachers but with no grading experience currently, as novice evaluators, and 5 experienced high school English teachers as expert evaluators. Initially, evaluators graded the essays using standard rubrics. Subsequently, they were provided with the corresponding LLM rating scores and explanations, generated by our best-performing Fine-tuned GPT-3.5 ensemble model. Evaluators then had the option to revise their initial scores based on this information. Finally, we distributed questionnaires to gather evaluators’ feedback. They were asked to rate each question on a 5-point Likert scale, with higher scores indicating better performance.


On the other hand, for an intriguing comparative analysis, we provided the GPT-4, few-shot, with rubrics model with scores and explanations produced by the Fine-tuned GPT-3.5 ensemble , which has a superior grading performance. Our objective is to explore whether GPT-4 can critically evaluate this supplementary reference information and enhance its performance. This investigation aligns with the concepts of self-reflection (Ji et al., 2023 ; Wang et al., 2023 ) , LLM-as-a-Judge (Zheng et al., 2023 ; Goes et al., 2022 ) , and the mixture-of-experts (Shen et al., 2023 ) .

We mainly focus on three research questions:

  1. Can LLM-generated AES feedback enhance the grading performance of novice evaluators?
  2. Can LLM-generated AES feedback enhance the grading performance of expert evaluators?
  3. Can superior LLM-generated AES feedback enhance the grading performance of the general LLM evaluator?

6.2 Results

Feedback generated by LLM elevates novice evaluators to expert level.

As illustrated in Figure 4 and Table 4 , we found that novice graders, with the assistance of LLM-generated feedback (including both scores and explanations), achieved an average QWK of 0.6609. This performance is significantly higher than their initial average QWK of 0.5256, with a p-value of less than 0.01. Furthermore, comparing the LLM-assisted novices (mean QWK 0.6609) with the expert graders (mean QWK 0.6965), we found no statistical difference between the two groups (p-value = 0.43). This suggests that the LLM-assisted novice evaluators attained a level of grading proficiency comparable to that of experts. We focus on the evaluation of overall rating scores in this section. Similar trends were observed in the content, language, and structure scores, with detailed results available in Table 5 in the Appendices.

| Comparison | Diff. | t statistic | p-value |
|---|---|---|---|
| Expert vs Novice | 0.1709** | 2.966 | 0.0109 |
| Novice+LLM vs Novice | 0.1353*** | 2.8882 | 0.0098 |
| Expert+LLM vs Expert | 0.0395 | 0.9501 | 0.3699 |
| Novice+LLM vs Expert | -0.0356 | -0.8175 | 0.4284 |
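The comparisons in the table above can be reproduced in outline by reducing each evaluator to a single QWK against the ground-truth scores and comparing the two groups of per-evaluator QWKs with a t-test. The sketch below assumes an independent-samples test and placeholder rating arrays; the exact test variant used by the authors is not stated in this excerpt.

```python
# Compare two groups of evaluators (e.g. experts vs. novices) on per-evaluator QWK.
# Each group is a list of integer score arrays, one per evaluator, aligned with
# ground_truth. These are placeholders, not the study's actual data.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.metrics import cohen_kappa_score

def per_evaluator_qwk(ratings_by_evaluator, ground_truth):
    return np.array([
        cohen_kappa_score(ground_truth, ratings, weights="quadratic")
        for ratings in ratings_by_evaluator
    ])

def compare_groups(group_a, group_b, ground_truth):
    qwk_a = per_evaluator_qwk(group_a, ground_truth)
    qwk_b = per_evaluator_qwk(group_b, ground_truth)
    t_stat, p_value = ttest_ind(qwk_a, qwk_b)   # independent-samples t-test
    return qwk_a.mean() - qwk_b.mean(), t_stat, p_value
```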


Feedback generated by LLM boosts expert efficiency and consistency.

The introduction of LLM-generated feedback to experts resulted in their average QWK increasing from 0.6965 to 0.7360. However, this improvement is not statistically significant (p-value = 0.37). The value of LLM augmentation for the experts manifests in other ways: experts needed less time to complete grading tasks when assisted by LLM, as indicated in their self-report questionnaires (see Table 5 ). Additionally, we noted a decrease in the standard deviation of expert ratings, especially in the Language dimension (Table 5 in the Appendices), indicating greater consensus among them. This suggests that LLM-generated feedback helps eliminate some divergent opinions, leading to more consistent evaluations of student essays. Experienced domain experts also commended the accuracy and helpfulness of the LLM-generated feedback, highlighting its potential in real-world educational settings for enhancing grading efficiency.

| Questionnaire item | Score |
|---|---|
| Perceived accuracy of LLM overall score | 4.3/5 |
| Perceived accuracy of LLM content score | 4.0/5 |
| Perceived accuracy of LLM language score | 3.9/5 |
| Perceived accuracy of LLM structure score | 3.8/5 |
| Helpfulness of LLM score | 4.7/5 |
| Helpfulness of LLM explanations | 4.8/5 |
| Efficiency of LLM assistance | 4.4/5 |
| Willingness to utilize the LLM assistant | 4.3/5 |

Human evaluators outperform LLM evaluators when it comes to effectively leveraging high-quality feedback.

The GPT-4, few-shot, with rubrics configuration served as the baseline general LLM evaluator, achieving a QWK score of 0.6729, which is comparable to expert-level performance. However, despite a carefully crafted prompt instructing the GPT-4 evaluator to utilize the additional feedback, there was no substantial improvement in its performance (see Figure 4). In fact, in the Content and Language dimensions, the QWK scores showed a slight decline. In contrast, as previously discussed, both novice and expert human evaluators effectively utilized the superior LLM-generated feedback to enhance their grading in terms of accuracy, consistency, and efficiency. This indicates that while GPT-4 can match expert-level grading in general performance, its capacity to critically improve with additional information is limited compared to human evaluators’ reflective and adaptive abilities. This observation highlights the potential need for more advanced ensemble approaches that enable LLMs to better integrate additional expert feedback for improved performance.

Both human and LLM evaluators face challenges in fully utilizing additional information.

In our analysis, we defined the upper bound for the utilization of additional information as the scenario in which the most advantageous outcome is consistently chosen in the combined ratings. For example, an expert evaluator using LLM-generated feedback reaches her performance upper bound (a QWK of 0.8020) by adopting the superior LLM advice where applicable and relying on her own judgments when they are closer to the ground truth. Our findings indicate that none of the three experiment groups (human novices with fine-tuned feedback, human experts with fine-tuned feedback, and GPT-4 with fine-tuned feedback) managed to surpass the performance of the fine-tuned GPT-3.5, nor to approach their respective upper bounds. Notably, novices altered their initial scores in 69.5% of cases, yet only 60% of these modifications improved performance, with the rest being counterproductive. Conversely, the experts, displaying greater confidence in their judgments, adopted the LLM-generated feedback in just 26% of instances, but 84.6% of these adjustments were beneficial, bringing them closer to the ground-truth scores. This raises compelling questions about why both humans and LLMs fall short of outperforming the superior reference scores and how more nuanced design strategies might elevate their performance, presenting a promising direction for future research.
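As a rough illustration of this upper-bound construction, the sketch below forms, for each essay, an oracle combination that keeps whichever of the evaluator's or the LLM's score lies closer to the ground truth, and then measures QWK on that combination; the array names are illustrative placeholders.

```python
# Oracle upper bound: per essay, keep the score (human or LLM) closest to the
# ground truth, then compute the QWK of this best-case combination.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def oracle_upper_bound_qwk(human_scores, llm_scores, ground_truth):
    human = np.asarray(human_scores)
    llm = np.asarray(llm_scores)
    truth = np.asarray(ground_truth)
    combined = np.where(np.abs(human - truth) <= np.abs(llm - truth), human, llm)
    return cohen_kappa_score(truth, combined, weights="quadratic")
```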

7 Conclusion

In this study, we investigated the potential of LLMs, with a particular focus on OpenAI’s GPT-4 and a fine-tuned version of GPT-3.5, in the context of Automated Essay Scoring (AES) systems. Equipped with comprehensive contexts, clear rubrics, and high-quality examples, GPT-4 demonstrated satisfactory performance and an ability to generalize effectively. Furthermore, our proposed ensemble of fine-tuned GPT-3.5 models exhibited superior accuracy, consistency, and generalizability, surpassing the traditional BERT baseline. Notably, leveraging LLMs allowed us to provide explanations and suggestions alongside the rating scores.

We extended our investigation by conducting experiments involving human evaluators, augmented by LLM assistance, encompassing individuals ranging from novices to experts. A significant revelation emerged: LLMs not only automated the grading process but also enhanced the proficiency of human graders. Novice graders, when aided by LLM feedback, achieved levels of accuracy akin to those of experts, while expert graders improved in efficiency and consistency. However, it was observed that neither human nor LLM evaluators could surpass the constraints imposed by the reference information, preventing them from reaching the upper bounds of performance. These insights underscore the transformative potential of AI-generated feedback, particularly in elevating individuals with limited domain knowledge to expert-level proficiency. Our research sheds light on future avenues in the realm of human-AI collaboration and the evolving role of AI in the field of education.

Limitations

This study is not without limitations. Firstly, our private dataset was exclusively derived from a single-semester final exam administered to high school English learners in China. While we have endeavored to assess the generalizability of the Fine-tuned GPT-3.5 model using the ASAP dataset, concerns may still arise regarding the robustness of our proposed AES system when applied to a broader range of topics and student demographics. Moreover, it is worth noting that our study involved a relatively limited number of essay samples and human evaluators. In future research endeavors, we intend to incorporate a more extensive sample size to augment the validity of our findings. This expanded dataset will enable us to delve deeper into the mechanisms that underlie the observed improvements in evaluators’ performance facilitated by the feedback from LLMs.

  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dong and Zhang (2016) Fei Dong and Yue Zhang. 2016. Automatic features for essay scoring–an empirical study. In Proceedings of the 2016 conference on empirical methods in natural language processing , pages 1072–1077.
  • Goes et al. (2022) Fabricio Goes, Zisen Zhou, Piotr Sawicki, Marek Grzes, and Daniel G Brown. 2022. Crowd Score: A method for the evaluation of jokes using large language model AI voters as judges. arXiv preprint arXiv:2212.11214.
  • Han et al. (2023) Jieun Han, Haneul Yoo, Junho Myung, Minsun Kim, Hyunseung Lim, Yoonsu Kim, Tak Yeon Lee, Hwajung Hong, Juho Kim, So-Yeon Ahn, et al. 2023. Fabric: Automated scoring and feedback generation for essays. arXiv preprint arXiv:2310.05191 .
  • Ji et al. (2023) Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. 2023. Towards mitigating LLM hallucination via self reflection . In Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 1827–1843, Singapore. Association for Computational Linguistics.
  • Khandelwal et al. (2020) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models . In International Conference on Learning Representations .
  • Lampinen et al. (2022) Andrew Lampinen, Ishita Dasgupta, Stephanie Chan, Kory Mathewson, Mh Tessler, Antonia Creswell, James McClelland, Jane Wang, and Felix Hill. 2022. Can language models learn from explanations in context? In Findings of the Association for Computational Linguistics: EMNLP 2022 , pages 537–563, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Li et al. (2023) Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2023. Making language models better reasoners with step-aware verifier . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 5315–5333, Toronto, Canada. Association for Computational Linguistics.
  • Lun et al. (2020) Jiaqi Lun, Jia Zhu, Yong Tang, and Min Yang. 2020. Multiple data augmentation strategies for improving performance on automatic short answer scoring. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 34, pages 13389–13396.
  • Mathias and Bhattacharyya (2018a) Sandeep Mathias and Pushpak Bhattacharyya. 2018a. ASAP++: Enriching the ASAP automated essay grading dataset with essay attribute scores. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  • Mathias and Bhattacharyya (2018b) Sandeep Mathias and Pushpak Bhattacharyya. 2018b. Thank “goodness”! a way to measure style in student essays. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications , pages 35–41.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 .
  • Miltsakaki and Kukich (2004) Eleni Miltsakaki and Karen Kukich. 2004. Evaluation of text coherence for electronic essay scoring systems. Natural Language Engineering , 10(1):25–55.
  • Mizumoto and Eguchi (2023) Atsushi Mizumoto and Masaki Eguchi. 2023. Exploring the potential of using an ai language model for automated essay scoring. Research Methods in Applied Linguistics , 2(2):100050.
  • Naismith et al. (2023) Ben Naismith, Phoebe Mulcaire, and Jill Burstein. 2023. Automated evaluation of written discourse coherence using GPT-4 . In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) , pages 394–403, Toronto, Canada. Association for Computational Linguistics.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation . In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Ram et al. (2023) Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083 .
  • Ramesh and Sanampudi (2022) Dadi Ramesh and Suresh Kumar Sanampudi. 2022. An automated essay scoring systems: a systematic literature review. Artificial Intelligence Review , 55(3):2495–2527.
  • Ridley et al. (2020) Robert Ridley, Liang He, Xinyu Dai, Shujian Huang, and Jiajun Chen. 2020. Prompt agnostic essay scorer: a domain generalization approach to cross-prompt automated essay scoring. arXiv preprint arXiv:2008.01441 .
  • Riordan et al. (2017) Brian Riordan, Andrea Horbach, Aoife Cahill, Torsten Zesch, and Chungmin Lee. 2017. Investigating neural architectures for short answer scoring. In Proceedings of the 12th workshop on innovative use of NLP for building educational applications , pages 159–168.
  • Rodriguez et al. (2019) Pedro Uria Rodriguez, Amir Jafari, and Christopher M Ormerod. 2019. Language models and automated essay scoring. arXiv preprint arXiv:1909.09482 .
  • Salim et al. (2019) Yafet Salim, Valdi Stevanus, Edwardo Barlian, Azani Cempaka Sari, and Derwin Suhartono. 2019. Automated english digital essay grader using machine learning. In 2019 IEEE International Conference on Engineering, Technology and Education (TALE) , pages 1–6. IEEE.
  • Shen et al. (2023) Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, et al. 2023. Mixture-of-experts meets instruction tuning: A winning combination for large language models. arXiv preprint arXiv:2305.14705 .
  • Shi et al. (2023) Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652 .
  • Sultan et al. (2016) Md Arafat Sultan, Cristobal Salazar, and Tamara Sumner. 2016. Fast and easy short answer grading with high accuracy. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 1070–1075.
  • Taghipour and Ng (2016) Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 conference on empirical methods in natural language processing , pages 1882–1891.
  • Wang et al. (2022) Yongjie Wang, Chuang Wang, Ruobing Li, and Hui Lin. 2022. On the use of bert for automated essay scoring: Joint learning of multi-scale essay representation . In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 3416–3425, Seattle, United States. Association for Computational Linguistics.
  • Wang et al. (2023) Ziqi Wang, Le Hou, Tianjian Lu, Yuexin Wu, Yunxuan Li, Hongkun Yu, and Heng Ji. 2023. Enable language models to implicitly learn self-improvement from data. arXiv preprint arXiv:2310.00898 .
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems , 35:24824–24837.
  • Yancey et al. (2023) Kevin P. Yancey, Geoffrey Laflair, Anthony Verardi, and Jill Burstein. 2023. Rating short L2 essays on the CEFR scale with GPT-4 . In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) , pages 576–584, Toronto, Canada. Association for Computational Linguistics.
  • Yang et al. (2020) Ruosong Yang, Jiannong Cao, Zhiyuan Wen, Youzheng Wu, and Xiaodong He. 2020. Enhancing automated essay scoring performance via fine-tuning pre-trained language models with combination of regression and ranking. In Findings of the Association for Computational Linguistics: EMNLP 2020 , pages 1560–1569.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
  • Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. 2023. Least-to-most prompting enables complex reasoning in large language models . In The Eleventh International Conference on Learning Representations .

Appendix A Datasets

The details of the ASAP dataset are presented in Table 7 . As previously mentioned, this dataset is composed of 8 subsets, each with unique prompts and scoring rubrics. Our private Chinese Student English Essay dataset consists of 6,559 essays, along with their corresponding scores carefully rated by experienced English teachers based on the scoring standards in the Chinese National College Entrance Examination (Table 8 ). The basic statistics of this dataset are outlined in Table 6 .

Chinese Student English Essay Dataset

| Statistic | Value |
|---|---|
| # of schools | 29 |
| # of essays | 6,559 |
| Avg. essay length | 120.92 |
| Avg. Overall score | 10.18 |
| Avg. Content score | 3.92 |
| Avg. Language score | 3.80 |
| Avg. Structure score | 2.46 |
| Essay Set | Essay Type | Grade Level | # of Essays | Avg. Length | Score Range |
|---|---|---|---|---|---|
| 1 | Persuasive/Narrative/Expository | 8 | 1783 | 350 | [2, 12] |
| 2 | Persuasive/Narrative/Expository | 10 | 1800 | 350 | [1, 6] |
| 3 | Source Dependent Responses | 10 | 1726 | 150 | [0, 3] |
| 4 | Source Dependent Responses | 10 | 1772 | 150 | [0, 3] |
| 5 | Source Dependent Responses | 8 | 1805 | 150 | [0, 4] |
| 6 | Source Dependent Responses | 10 | 1800 | 150 | [0, 4] |
| 7 | Persuasive/Narrative/Expository | 7 | 1569 | 300 | [0, 12] |
| 8 | Persuasive/Narrative/Expository | 10 | 723 | 650 | [0, 36] |
Rubrics

Total Score (20 points) = Content Score (8 points) + Language Score (8 points) + Structure Score (4 points)

  • Content Dimension (8 points in total)
  • Language Dimension (8 points in total)
  • Structure Dimension (4 points in total)

Appendix B BERT Baseline

We employed the bert-base-uncased BERT model from the huggingface transformers library (https://huggingface.co/docs/transformers/) using PyTorch. A simple softmax layer was added to perform the classification task. The datasets were divided into training and testing sets at an 8:2 ratio. To ensure better reproducibility, we set all random seeds, including those for dataset splitting and model training, to the value 42. During training, we used cross-entropy loss as our loss function. We allowed BERT parameters to be fine-tuned, without freezing them, in line with the objective function. AdamW was chosen as the optimizer, with a learning rate of 10^-5 and an epsilon of 10^-6. With a batch size of 16 and a maximum of 10 training epochs, we also integrated an early stopping strategy to mitigate potential overfitting.
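A condensed sketch of this configuration is given below. It approximates the setup described above rather than reproducing the exact training script; in particular, the data interface, the early-stopping patience, and stopping on training loss (rather than a held-out validation split) are assumptions.

```python
# BERT baseline sketch: bert-base-uncased with a classification head, AdamW
# (lr=1e-5, eps=1e-6), batch size 16, up to 10 epochs, 8:2 split, seeds = 42,
# simple early stopping. Scores are assumed to be mapped to labels 0..num_labels-1.
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from transformers import BertForSequenceClassification, BertTokenizerFast

def train_bert_baseline(essays, labels, num_labels, patience=2):
    random.seed(42); np.random.seed(42); torch.manual_seed(42)

    train_x, test_x, train_y, test_y = train_test_split(
        essays, labels, test_size=0.2, random_state=42)

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=num_labels)      # softmax classification head
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, eps=1e-6)

    enc = tokenizer(train_x, padding=True, truncation=True, return_tensors="pt")
    loader = DataLoader(
        TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(train_y)),
        batch_size=16, shuffle=True)

    best_loss, bad_epochs = float("inf"), 0
    for epoch in range(10):                              # at most 10 epochs
        model.train()
        epoch_loss = 0.0
        for input_ids, attention_mask, y in loader:
            optimizer.zero_grad()
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            out.loss.backward()                          # cross-entropy computed internally
            optimizer.step()
            epoch_loss += out.loss.item()
        if epoch_loss < best_loss:                       # naive early stopping
            best_loss, bad_epochs = epoch_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return model, tokenizer, (test_x, test_y)
```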

Appendix C LLM Prompts

The prompts used for LLMs in our study fall into three distinct categories: firstly, the zero-shot and few-shot configurations of GPT-4 ; secondly, the instructions for training and inference of the Fine-tuned GPT-3.5 ; and thirdly, the prompts for the GPT-4 evaluator which critically utilized the LLM-generated feedback in our human evaluation experiments. Notably, the prompts for the few-shot scenario incorporate those used in the zero-shot setting and overlap with the fine-tuning prompts. Therefore, for clarity and conciseness, we present examples of the GPT-4, few-shot, with rubrics and the GPT-4 evaluator prompts in Table 9 .

Example Prompt / Example Output

Here are the specific guidelines for each score:

{Rubrics}

Sample Essay Prompt:

{Essay Prompt}

The graded example essays:

{Examples}

Student’s Essay to Evaluate:

{Target Student Essay}

Task Breakdown:

1. Carefully read the provided essay prompt, scoring guidelines, and the student’s essay.

2. In the Explanations part, identify specific elements in the essay referring to the rubrics. In the language dimension, list all the spelling and grammar errors, and count the number of them to determine the Language Score. The Explanations for each dimension should be as detailed as possible.

3. Determine the appropriate scores according to the analysis above.

Please present your evaluation in the following manner:

Explanations: …, Content Score: …

Explanations: …, Language Score: …

Explanations: …, Structure Score: …

Explanations: …, Total Score: …

Your final evaluation:

[Total Score: …, Content Score: …, Language Score: …, Structure Score: …]

Content Score: 8

Explanations: There are no language errors; the grammar and spelling are all accurate. The student used formal, respectful language suitable for this kind of communication and also included diversity in sentence structures.

Language Score: 8

Explanations: The essay is well-structured and organized. It follows the accepted structure of a reply email, with an opening greeting, a body containing the main content, and a closing remark. Transitions are also smooth and coherent, logically linking one thought to another.

Structure Score: 4

Total Score: This is the sum of the content, language, and structure scores. Total Score: 20

Your final evaluation:

[Total Score: 20, Content Score: 8, Language Score: 8, Structure Score: 4]

Objective: Evaluate student essays using an expert rating model’s scores. Critically analyze these essays against a specific rubric to determine the appropriateness of the model’s scores.

Scoring Criteria:

{Rubrics}

Expert Rating Model Context:

- Accuracy: High accuracy, trained on extensive data.

- Limitations: Struggles with extreme cases.

- Usage: Generally, model scores are reliable. Direct use is often appropriate. Adjust scores only when you think it is necessary.

Instances of Inaccurate and Accurate Model Scoring:

{Examples}

Materials for Evaluation:

1. Essay Prompt: {Essay Prompt}

2. Student Essay: {Target student Essay}

3. Evaluation by the Expert Model: {Feedback of Fine-tuned GPT-3.5}

Evaluation Format:

[Total Score: …, Content Score: …, Language Score: …, Structure Score: …, Explanations: …]

Explanations: The student essay introduces the design ideas and explains the reasons behind the design, which aligns with the essay prompt. The content is mostly complete, but could benefit from more detailed explanations, particularly regarding the reasons behind the design choices. This warrants a Content Score of 4, placing it in Level 2. The language used in the essay is not quite accurate, with some errors such as ’Wondering you have experiences about it’ instead of ’Wondering if you have experience with it’. The sentence structures show some variation, so a Language Score of 4 is appropriate, also placing it in Level 2. The essay is mostly clearly and logically structured, with a discernible introduction, body, and conclusion. However, the transitions between ideas could be smoother, and the organization of the content could be more coherent. Therefore, a Structure Score of 3 is fitting, which is at the high end of Level 2. The expert model’s scores seem to slightly underestimate the student’s performance, particularly in the content and structure dimensions.]
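To make the use of these templates concrete, the following sketch (not the authors' code) shows how an abbreviated version of the GPT-4, few-shot, with rubrics prompt could be assembled and sent with the OpenAI Python SDK (openai>=1.0); the model name, decoding settings, and the shortened template wording are illustrative assumptions.

```python
# Assemble an abbreviated few-shot grading prompt and request an evaluation.
from openai import OpenAI

PROMPT_TEMPLATE = """Here are the specific guidelines for each score:
{rubrics}

Sample Essay Prompt:
{essay_prompt}

The graded example essays:
{examples}

Student's Essay to Evaluate:
{target_essay}

Please present your evaluation in the following manner:
Explanations: ..., Content Score: ...
Explanations: ..., Language Score: ...
Explanations: ..., Structure Score: ...
Explanations: ..., Total Score: ...
Your final evaluation:
[Total Score: ..., Content Score: ..., Language Score: ..., Structure Score: ...]"""

def grade_essay(client, rubrics, essay_prompt, examples, target_essay):
    prompt = PROMPT_TEMPLATE.format(rubrics=rubrics, essay_prompt=essay_prompt,
                                    examples=examples, target_essay=target_essay)
    response = client.chat.completions.create(
        model="gpt-4",                                   # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                                   # deterministic scoring
    )
    return response.choices[0].message.content

# Usage (API key read from the OPENAI_API_KEY environment variable):
# client = OpenAI()
# print(grade_essay(client, rubrics, essay_prompt, examples, student_essay))
```

The GPT-4 evaluator prompt above can be built the same way, with the fine-tuned GPT-3.5 feedback passed in as an additional template field.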

Appendix D Experiment Details

In our LLM-assisted human evaluation experiments, the 10 participating college students were all from a normal university in Beijing, with a male-to-female ratio of 4:6 and ages ranging from 19 to 23 years (freshmen to seniors). Their English proficiency was certified by China’s College English Test (CET). None of the novices had experience grading high school student essays. The group of 5 expert evaluators comprised experienced English teachers from Beijing high schools, with teaching tenures ranging from 8 to 20 years. Before undertaking the evaluation tasks, all participants received training on the standard scoring rubrics. They were also offered appropriate remuneration for their participation.

The LLM-assisted human evaluation experiment results of Overall, Content, Language, and Structure scores are presented in Figure 5 . We observed that the Content and Language scores exhibit a similar trend as the Overall score discussed in the Results section. However, an anomaly was noted in the experts’ performance in the Structure score, which was unexpectedly low. This deviation was attributed to an outlier among the expert evaluators who performed poorly. Additionally, the expert evaluators also noted that the Structure dimension is the most ambiguous and difficult part of the grading task. Upon exclusion of this particular sample, the experts’ performance in the Structure dimension aligned back to expected levels.


Automated Essay Scoring and Revising Based on Open-Source Large Language Models


Recommendations

Automatic Essay Scoring: Design and Implementation of an Automatic Amharic Essay Scoring System Using Latent Semantic Analysis

A Ranked-Based Learning Approach to Automated Essay Scoring

Automated essay scoring refers to the computer techniques and algorithms that evaluate and score essays automatically. Compared with human raters, automated essay scoring has the advantages of fairness, lower human resource cost, and timely feedback. In previous ...

Automated Essay Scoring via Example-Based Learning

Automated essay scoring (AES) is the task of assigning grades to essays. It can be applied for quality assessment as well as pricing of User Generated Content. Previous works mainly consider using the prompt information for scoring. However, some ...

Information

Publisher: IEEE Computer Society Press, Washington, DC, United States

Publication type: Research article


  • Corpus ID: 18122360

Automated Essay Grading

  • Alex Adamson, Andrew Lamb, Ralph Ma
  • Published 2014
  • Computer Science, Education


6 Citations

  • Factograde: Automated Essay Scoring System
  • Evaluation of Essay Using Machine Learning Techniques
  • Coherence-Based Automatic Short Answer Scoring Using Sentence Embedding
  • An Automated Essay Scoring Systems: A Systematic Literature Review
  • Proactive and Reactive Engagement of Artificial Intelligence Methods for Education: A Review
  • The State-of-the-Art in Twitter Sentiment Analysis

4 References

  • Evaluating Multiple Aspects of Coherence in Student Essays
  • Book Review: Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper
  • Random Indexing of Text Samples for Latent Semantic Analysis

Automatic Essay Grading System Using Deep Neural Network

  • Conference paper
  • First Online: 02 October 2023

  • Vikkurty Sireesha 6 ,
  • Nagaratna P. Hegde 6 ,
  • Sriperambuduri Vinay Kumar 6 ,
  • Alekhya Naravajhula 7 &
  • Dulugunti Sai Haritha 8  

Part of the book series: Cognitive Science and Technology ((CSAT))

Included in the following conference series:

  • International Conference on Information and Management Engineering


Essays are important for assessing students’ academic performance, creativity, and retention of what they have studied, but grading them manually is expensive and time-consuming when the number of essays is large. This project aims to implement and train neural networks to assess and grade essays automatically. The grades generated by the automatic essay grading system should consistently match the grades given by human raters, with minimal error. Automated essay grading can be used to evaluate essays written in response to specific prompts or topics. It is the process of scoring essays with computer programs, without any human intervention. This system is most beneficial for educators, since it reduces manual work and saves a great deal of time; it also speeds up the feedback loop for learners. We used deep neural networks in our system instead of traditional machine learning models.


Similar content being viewed by others

  • Smart Grading System Using Bi LSTM with Attention Mechanism
  • Automatically Grading Brazilian Student Essays
  • A review of deep-neural automated essay scoring models

Dong F, Zhang Y, Yang J (2017) Attention-based recurrent convolutional neural network for automatic essay scoring. In: Proceedings of the 21st conference on computational natural language learning (CoNLL 2017), pp 153–162.


dos Santos CN, Gatti M (2014) Deep convolutional neural networks for sentiment analysis of short texts. In: Proceedings of the 25th international conference on computational linguistics (COLING), Dublin, Ireland

Alikaniotis D, Yannakoudakis H, Rei M (2016) Automatic text scoring using neural networks. ArXiv:1606.04289

Boulanger D, Kumar V (2019) Shedding light on the automated essay scoring process. In Proceedings of the 12th international conference on educational data mining (EDM)

Liang G, On B-W, Jeong D, Kim H-C, Choi G (2018) Automated essay scoring: a Siamese bidirectional LSTM neural network architecture. Symmetry 10(12):682


Cozma M, Butnaru AM, Ionescu RT (2018) Automated essay scoring with string kernels and word embeddings. ArXiv:1804.07954


Acknowledgements

We thank Vasavi College of Engineering (Autonomous), Hyderabad for the support extended toward this work.

Author information

Authors and affiliations.

Vasavi College of Engineering, Hyderabad, India

Vikkurty Sireesha, Nagaratna P. Hegde & Sriperambuduri Vinay Kumar

Accolite Digital, Hyderabad, India

Alekhya Naravajhula

Providence, Hyderabad, India

Dulugunti Sai Haritha


Corresponding author

Correspondence to Vikkurty Sireesha .

Editor information

Editors and affiliations.

BioAxis DNA Research Centre Private Limited, Hyderabad, Andhra Pradesh, India

Department of Computer Science, Brunel University, Uxbridge, UK

Gheorghita Ghinea

CMR College of Engineering and Technology, Hyderabad, India

Suresh Merugu

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Sireesha, V., Hegde, N.P., Kumar, S.V., Naravajhula, A., Haritha, D.S. (2023). Automatic Essay Grading System Using Deep Neural Network. In: Kumar, A., Ghinea, G., Merugu, S. (eds) Proceedings of the 2nd International Conference on Cognitive and Intelligent Computing. ICCIC 2022. Cognitive Science and Technology. Springer, Singapore. https://doi.org/10.1007/978-981-99-2746-3_53


DOI: https://doi.org/10.1007/978-981-99-2746-3_53

Published: 02 October 2023

Publisher Name: Springer, Singapore

Print ISBN: 978-981-99-2745-6

Online ISBN: 978-981-99-2746-3

eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)


Automated Essay Scoring in the Presence of Biased Ratings

Evelin Amorim, Marcia Cançado, Adriano Veloso

Evelin Amorim, Marcia Cançado, and Adriano Veloso. 2018. Automated Essay Scoring in the Presence of Biased Ratings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 229–237, New Orleans, Louisiana. Association for Computational Linguistics. https://aclanthology.org/N18-1021

IMPLEMENTASI METODE NATURAL LANGUAGE PROCESSING (NLP) PADA AUTOMATED ESSAY SCORING (AES) [Implementation of the Natural Language Processing (NLP) Method in Automated Essay Scoring (AES)]

Akbar, Ghauzar Andhika (2024) IMPLEMENTASI METODE NATURAL LANGUAGE PROCESSING (NLP) PADA AUTOMATED ESSAY SCORING (AES). Undergraduate thesis (Skripsi S1), Universitas Muhammadiyah Ponorogo.



Evaluation in the educational process is necessary to determine the success rate of students' learning. One of the methods used is through essay exams. Challenges arise during the essay exam assessment process as it requires significant time and effort. Additionally, there are issues with inconsistent grading, such as differences in scores despite similar meaning in the answers. Therefore, a practical and efficient method for essay answer assessment is needed. Information technology can be applied in this case using Transformer networks. The method used is to assess essay answers based on the semantic similarity between the key answers and the students' answers through the semantic similarity task. This research aims to implement an NLP model trained in Indonesian, namely IndoBERT, in Automated Essay Scoring. The results of the research, from 10 sets of essay questions, showed a Quadratic Weighted Kappa score ranging from a minimum of 0.17771 to a maximum of 0.80654, and a Root Mean Square Error ranging from a minimum of 1.6329 to a maximum of 5.0197.
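The scoring idea described in this abstract can be sketched as follows, under several assumptions (a publicly available IndoBERT checkpoint, mean pooling, and a simple linear mapping from cosine similarity to a score) that may differ from the thesis's exact setup.

```python
# Score a student answer by its semantic similarity to the key answer, using an
# IndoBERT encoder. Checkpoint name, pooling, and the similarity-to-score
# mapping are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
encoder = AutoModel.from_pretrained("indobenchmark/indobert-base-p1")

def embed(text):
    enc = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state        # (1, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pooled embedding

def similarity_score(key_answer, student_answer, max_score=10):
    sim = torch.nn.functional.cosine_similarity(
        embed(key_answer), embed(student_answer)).item() # in [-1, 1]
    return max(0.0, sim) * max_score                     # map similarity to a score
```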

Item Type: Thesis (Skripsi (S1))
Uncontrolled Keywords: Automated Essay Scoring, Cosine Similarity, Essay scoring, IndoBERT, Transformer Architecture


COMMENTS

  1. Automated Essay Scoring

    Essay scoring: **Automated Essay Scoring** is the task of assigning a score to an essay, usually in the context of assessing the language ability of a language learner. The quality of an essay is affected by the following four primary dimensions: topic relevance, organization and coherence, word usage and sentence complexity, and grammar and mechanics.

  2. An automated essay scoring systems: a systematic literature review

    Automated essay scoring (AES) is a computer-based assessment system that automatically scores or grades the student responses by considering appropriate features. The AES research started in 1966 with the Project Essay Grader (PEG) by Ajay et al. PEG evaluates the writing characteristics such as grammar, diction, construction, etc., to grade ...

  3. Automated essay scoring

    Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a form of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding ...

  4. [2401.05655] Unveiling the Tapestry of Automated Essay Scoring: A

    Automatic Essay Scoring (AES) is a well-established educational pursuit that employs machine learning to evaluate student-authored essays. While much effort has been made in this area, current research primarily focuses on either (i) boosting the predictive accuracy of an AES model for a specific prompt (i.e., developing prompt-specific models), which often heavily relies on the use of the ...

  5. Explainable Automated Essay Scoring: Deep Learning Really Has

    Automated essay scoring (AES) is a compelling topic in Learning Analytics for the primary reason that recent advances in AI find it as a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; few have even developed AES models at the rubric level, the very first ...

  6. A review of deep-neural automated essay scoring models

    Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to grading by humans. Although traditional AES models typically rely on manually designed features, deep neural network (DNN)-based AES models that obviate the need for feature engineering have recently attracted increased attention. Various DNN-AES models with different characteristics have ...

  7. Automated Essay Scoring Systems

    The first widely known automated scoring system, Project Essay Grader (PEG), was conceptualized by Ellis Batten Page in the late 1960s (Page, 1966, 1968). PEG relies on proxy measures, such as average word length, essay length, number of certain punctuation marks, and so forth, to determine the quality of an open-ended response item.

  8. About the e-rater Scoring Engine

    The e-rater automated scoring engine uses AI technology and Natural Language Processing (NLP) to evaluate the writing proficiency of student essays by providing automatic scoring and feedback. The engine provides descriptive feedback on the writer's grammar, mechanics, word use and complexity, style, organization and more.

  9. A Comprehensive Review of Automated Essay Scoring (AES) Research and

    Automated Essay Scoring (AES) is a service or software that can predictively grade essay based on a pre-trained computational model. It has gained a lot of research interest in educational ...

  10. Automated Essay Scoring via Pairwise Contrastive Regression

    Automated essay scoring (AES) involves the prediction of a score relating to the writing quality of an essay. Most existing works in AES utilize regression objectives or ranking objectives respectively. However, the two types of methods are highly complementary. To this end, in this paper we take inspiration from contrastive learning and ...

  11. PDF An Overview of Automated Scoring of Essays

    Automated Essay Scoring Systems: Project Essay Grader™ (PEG). Project Essay Grader™ (PEG) was developed by Ellis Page in 1966 upon the request of the College Board, which wanted to make the large-scale essay scoring process more practical and effective (Rudner & Gagne, 2001; Page, 2003). PEG™ uses correlation to predict the intrinsic ...

  12. What is Automated Essay Scoring, Marking, Grading?

    Nathan Thompson, PhD. April 25, 2023. Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment. In fact, it's been around far longer than "machine learning" and "artificial intelligence" have been buzzwords in the general public!

  13. The e-rater Scoring Engine

    Our AI technology, such as the e-rater® scoring engine, informs decisions and creates opportunities for learners around the world. The e-rater engine automatically assesses and nurtures key writing skills, scores essays, and provides feedback on writing using a model built on the theory of writing to assess both analytical and independent ...

  14. Automated language essay scoring systems: a literature review

    Automated Essay Scoring (AES) systems usually utilize Natural Language Processing and machine learning techniques to automatically rate essays written for a target prompt (Dikli, 2006). Many AES systems have been developed over the past decades. They focus on automatically analyzing the quality of the composition and assigning a score to the text.

  15. From Automation to Augmentation: Large Language Models Elevating Essay

    Automated Essay Scoring (AES) systems provide valuable assistance to students by offering immediate and consistent feedback on their work, while also simplifying the grading process for educators. However, the effective implementation of AES systems in real-world educational settings presents several challenges. One of the primary challenges is ...

  16. Automated Grading of Essays: A Review

    Automated grading of essays extracts syntactic and semantic features from student answers and reference answers, then constructs a machine learning model that relates these features to the final scores assigned by evaluators. This trained model is used to predict the scores of unseen essays.

  17. Automated Essay Scoring and Revising Based on Open-Source Large

    Automated essay scoring refers to the computer techniques and algorithms that evaluate and score essays automatically. Compared with human raters, automated essay scoring has the advantages of fairness, lower human resource cost, and timely feedback.

  18. [PDF] Automated Essay Grading

    Automated Essay Grading. Alex Adamson, Andrew Lamb, Ralph Ma. Published 2014. Computer Science, Education. This work trained different models using word features, per-essay statistics, and metrics of similarity and coherence between essays and documents to make predictions that closely match those made by human graders.

  19. Task-Independent Features for Automated Essay Grading

    Torsten Zesch, Michael Wojatzki, and Dirk Scholten-Akoun. 2015. Task-Independent Features for Automated Essay Grading. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 224–232, Denver, Colorado. Association for Computational Linguistics.

  20. Large language models and automated essay scoring of English language

    1. Introduction. One of the earliest papers on automated essay scoring (AES) laments the plight of English teachers who labor under the burden of "exorbitant" grading responsibilities and how computerized essay scoring could prove a brilliant solution (Page, 1966). 50 years later objections over teacher workload persist, as do dreams of offloading that work to automation (Godwin-Jones, 2022).

  21. Automatic essay exam scoring system: a systematic literature review

    We synthesize the results to enrich our understanding of the automated essay exam scoring system. The expected result of this research is that it can contribute to further research related to the automated essay exam scoring system, especially in terms of considering methods and dataset forms. © 2022 The Authors. Published by ELSEVIER B.V.

  22. Automatic Essay Grading System Using Deep Neural Network

    The model, on the other hand, can be used as a benchmark for future work in the field of automated essay grading for essays in many domains. By combining content and advanced NLP features, performance on topic-specific and richer essays can be improved. Complex recurrent neural networks with contextual information can also improve the accuracy ...

  23. PDF Neural Networks for Automated Essay Grading

    ... accurate automated essay grading system to solve this problem. Attempts to build an automated essay grading system date back to 1966, when Ellis B. Page proved in The Phi Delta Kappan that a computer could do as well as a single human judge [1]. Since then, much effort has been put into building the perfect system. Intelligent ...

  24. Automated Essay Scoring in the Presence of Biased Ratings

    This may affect automated essay scoring models in many ways, as these models are typically designed to model (potentially biased) essay raters. While there is sizeable literature on rater effects in general settings, it remains unknown how rater bias affects automated essay scoring. To this end, we present a new annotated corpus containing ...

  25. Implementasi Metode Natural Language Processing (Nlp) Pada Automated

    Evaluation in the educational process is necessary to determine the success rate of students' learning. One of the methods used is through essay exams. Challenges arise during the essay exam assessment process as it requires significant time and effort. Additionally, there are issues with inconsistent grading, such as differences in scores despite similar meaning in the answers.