Automated Essay Scoring: Applications to Educational Technology

 

Peter W. Foltz

Dept. of Psychology,
Box 30001, Dept. 3452
New Mexico State University
Las Cruces, NM, 88003
pfoltz@nmsu.edu

Darrell Laham

Knowledge Analysis Technologies
625 Utica Avenue
Boulder, CO, 80304
dlaham@psych.colorado.edu

Thomas K. Landauer

Dept. of Psychology
Box 344
University of Colorado
Boulder, CO 80309
landauer@psych.colorado.edu

 

Abstract

The Intelligent Essay Assessor (IEA) is a set of software tools for scoring the quality of essay content. The IEA uses Latent Semantic Analysis (LSA), which is both a computational model of human knowledge representation and a method for extracting semantic similarity of words and passages from text. Simulations of psycholinguistic phenomena show that LSA reflects similarities of human meaning effectively. To assess essay quality, LSA is first trained on domain-representative text. Then student essays are characterized by LSA representations of the meaning of their contained words and compared with essays of known quality on degree of conceptual relevance and amount of relevant content. Over many diverse topics, the IEA scores agreed with human experts as accurately as expert scores agreed with each other. Implications are discussed for incorporating automatic essay scoring in more general forms of educational technology.

Introduction

While writing is an essential part of the educational process, many instructors find it difficult to incorporate large numbers of writing assignments in their courses due to the effort required to evaluate them. However, the ability to convey information verbally is an important educational achievement in its own right, and one that is not sufficiently well assessed by other kinds of tests. In addition, essay-based testing is thought to encourage better conceptual understanding of the material and to reflect a deeper, more useful level of knowledge and application on the part of students. Thus, grading and critiquing written products is important not only as an assessment method, but also as a feedback device to help students better learn both content and the skills of thinking and writing. Nevertheless, essays have been neglected in many computer-based assessment applications because few techniques exist to score essays directly by computer. In this paper we describe a method for automated scoring of the conceptual content of essays. Based on a statistical analysis of the essays and of content information from the domain, the technique provides scores that prove to be an accurate measure of essay quality.

The text analysis underlying the essay grading schemes is based on Latent Semantic Analysis (LSA). Detailed treatments of LSA, both as a theory of aspects of human knowledge acquisition and representation and as a method for extracting the semantic content of text, are beyond the scope of this article. They are fully presented elsewhere (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Landauer & Dumais, 1997; Landauer, Foltz & Laham, 1998), as are a number of simulations of cognitive and psycholinguistic phenomena that show that LSA captures a great deal of the similarity of meanings expressed in discourse (Rehder, Schreiner, Wolfe, Laham, Landauer, & Kintsch, 1998; Wolfe, Schreiner, Rehder, Laham, Foltz, Kintsch, & Landauer, 1998).

The LSA similarity between words and passages is measured by the cosine of the angle between their vectors in a 300-dimensional "semantic space". LSA-measured similarities have been shown to closely mimic human judgments of meaning similarity, and human performance based on such similarity, in a variety of ways. For example, after training on about 2,000 pages of English text, LSA scored as well as the average test-taker on the synonym portion of the ETS Test of English as a Foreign Language (TOEFL) (Landauer & Dumais, 1997). After training on an introductory psychology textbook, it achieved passing scores on two different multiple-choice exams used in introductory psychology courses (Landauer, Foltz & Laham, in preparation). This ability to compare the similarity of meaning between passages is the basis for automated essay scoring with LSA.
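
As a concrete illustration of the similarity measure (a sketch for exposition, not the authors' implementation), the comparison reduces to the cosine of the angle between two vectors; the vectors below are random placeholders standing in for 300-dimensional LSA passage representations.

    import numpy as np

    def cosine(u, v):
        """Cosine of the angle between two vectors; 1.0 means identical direction."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Placeholder vectors standing in for LSA representations of two passages.
    passage_a = np.random.rand(300)
    passage_b = np.random.rand(300)
    print(f"similarity = {cosine(passage_a, passage_b):.3f}")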

Automated scoring with LSA

While other approaches to automatic evaluation of written work have focused on mechanical features such as grammar, spelling, and punctuation, there are other factors involved in writing a good essay. For example, at an abstract level one can distinguish three properties of a student essay that are desirable to assess: the correctness and completeness of its conceptual knowledge, the soundness of the arguments it presents in discussing issues, and the fluency, elegance, and comprehensibility of its writing. Evaluation of superficial mechanical and syntactic features is fairly easy to separate from the other factors, but the rest (content, argument, comprehensibility, and aesthetic style) are likely to be difficult to pull apart because each influences the others, if only because each depends on the choice of words.

Although previous attempts to develop computational techniques for scoring essays have focused primarily on measures of style (e.g., Page, 1994), indices of content have remained secondary, indirect and superficial. In contrast to earlier approaches, LSA methods concentrate on the conceptual content, the knowledge conveyed in an essay, rather than its style, or even its syntax or argument structure.

To assess the quality of essays, LSA is first trained on domain-representative text. Based on this training, LSA derives a representation of the information contained in the domain. Student essays are then characterized by LSA vectors based on the combination of all their words. These vectors can be compared with vectors for essays or texts of known content quality. The angle between two vectors represents the degree to which the two essays discuss information in a similar way. For example, an ungraded essay can be compared to essays that have already been graded: if the angle between two essay vectors is small, the essays should be similar in content. Thus, the semantic or conceptual content of two essays can be compared and a score derived from their similarity. Note that two essays can be considered to have almost identical content, even if they contain few or none of the same words, as long as they convey the same meaning.
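
The following sketch, using scikit-learn, is meant only to illustrate the train-then-compare procedure described above; it is not the authors' implementation. LSA as published typically uses log-entropy term weighting and roughly 300 dimensions, whereas this toy example substitutes TF-IDF weighting and two dimensions so that it runs on a handful of sentences.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    # Stand-ins for domain-representative training text and pre-graded essays.
    domain_text = [
        "The heart pumps blood through the arteries to the rest of the body.",
        "Oxygenated blood leaves the left ventricle through the aorta.",
        "Veins return deoxygenated blood to the right atrium of the heart.",
        "The lungs oxygenate blood that arrives from the right ventricle.",
    ]
    graded_essays = [
        "Blood is pumped by the heart through arteries and returns in veins.",
        "The heart is an organ located in the chest.",
    ]
    new_essay = ["The left ventricle sends oxygen-rich blood into the aorta."]

    vectorizer = TfidfVectorizer(stop_words="english")
    svd = TruncatedSVD(n_components=2)   # roughly 300 dimensions in practice

    # Train the semantic space on the domain text only.
    svd.fit(vectorizer.fit_transform(domain_text))

    # Fold the essays into the same space and compare them by cosine.
    graded_vecs = svd.transform(vectorizer.transform(graded_essays))
    new_vec = svd.transform(vectorizer.transform(new_essay))
    print(cosine_similarity(new_vec, graded_vecs))   # similarity to each graded essay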

Evaluating the effectiveness of automated scoring

Based on comparing conceptual content, several techniques have been developed for assessing essays. Details of these techniques have been published elsewhere and summaries of particular results will be provided below. One technique is to compare essays to ones that have been previously graded. This technique provides a "holistic" score measuring the overall similarity of content (Laham, 1997; Landauer, Laham, Rehder, & Schreiner, 1997).
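
As an illustration of how such similarities might be turned into a holistic grade (see Laham, 1997, for the weighting actually used; the scheme below is only a plausible sketch), a new essay can be assigned the similarity-weighted average of the grades of the k pre-graded essays it most resembles.

    import numpy as np

    def holistic_score(new_vec, graded_vecs, grades, k=10):
        """Similarity-weighted average of the grades of the k most similar essays.

        new_vec: (d,) LSA vector of the ungraded essay
        graded_vecs: (n, d) LSA vectors of pre-graded essays
        grades: (n,) instructor-assigned grades
        """
        sims = graded_vecs @ new_vec / (
            np.linalg.norm(graded_vecs, axis=1) * np.linalg.norm(new_vec))
        top = np.argsort(sims)[-k:]               # k nearest graded essays
        weights = np.maximum(sims[top], 1e-6)     # keep weights positive
        return float(np.average(grades[top], weights=weights))

    # Toy data standing in for LSA vectors and instructor-assigned grades.
    rng = np.random.default_rng(0)
    graded_vecs = rng.normal(size=(40, 300))
    grades = rng.integers(60, 101, size=40).astype(float)
    new_vec = rng.normal(size=300)
    print(round(holistic_score(new_vec, graded_vecs, grades), 1))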

The holistic method has been tested on a large number of essays over a diverse set of topics. The essays have ranged in level from middle school through high school, college, and college graduate writing. The topics have included essays from classes in introductory psychology, biology, and history, as well as essays from standardized tests, such as analyses of arguments and analyses of issues from the ETS Graduate Management Admission Test (GMAT). For each of these sets of essays, LSA is first trained on a set of texts related to the domain. The content of each new essay is then compared against the content of a set of pre-graded essays on the topic. In each case, the essays were also graded by at least two course instructors or expert graders, for example professional readers from the Educational Testing Service or other national testing organizations. Across the data sets, LSA's scores showed reliabilities within the range of the comparable inter-rater reliabilities and within the generally accepted guidelines for minimum reliability coefficients. For example, in a set of 188 essays written on the functioning of the human heart, the average correlation between two graders was 0.83, while the correlation of LSA's scores with the graders was 0.80. A summary of the performance of LSA's scoring compared to grader-to-grader performance across a diverse set of 1205 essays on 12 topics is shown in Figure 1. The results indicate that LSA's reliability in scoring is equivalent to that of human graders.

In a more recent study, the holistic method was used to grade two additional questions from the GMAT standardized test, and its performance was compared against that of two trained ETS graders. For one question, a set of 695 opinion essays, the correlation between the two graders was 0.86, while LSA's correlation with the ETS grades was also 0.86. For the second question, a set of 668 analysis-of-argument essays, the correlation between the two graders was 0.87, while LSA's correlation with the ETS grades was 0.86. Thus, LSA performed at nearly the same reliability level as the trained ETS graders.

 

Figure 1. Summary of reliability results (N = 1205 Essays on 12 Diverse Topics)

While the holistic technique relies on comparing essays against a set of pre-graded essays, other techniques have been developed that also effectively characterize the quality of essays. A second technique is to compare essays to an ideal essay, or "gold standard" (Wolfe et al., 1998). In this case, a teacher writes his or her ideal essay, and all student essays are judged by how close they are to the teacher's essay. In two further techniques, essays can be compared to portions of the original text, or to sub-components of texts or essays (Foltz, 1996; Foltz, Britt & Perfetti, 1996). In this componential approach, individual sentences from a student's essay are compared against a set of predetermined subtopics, which permits determining whether an essay sufficiently covers those subtopics. LSA-derived scores based on the degree of subtopic coverage correlate with human graders as well as the graders correlate with each other.
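
A minimal sketch of the gold-standard comparison follows, under the assumption (ours, for illustration only) that the cosine between a student essay and the instructor's ideal essay is mapped linearly onto the grading scale.

    import numpy as np

    def gold_standard_grade(student_vec, ideal_vec, low=50.0, high=100.0):
        """Map the cosine to the instructor's ideal essay onto a grade scale.
        The linear mapping and its endpoints are illustrative assumptions."""
        cos = float(np.dot(student_vec, ideal_vec) /
                    (np.linalg.norm(student_vec) * np.linalg.norm(ideal_vec)))
        return low + max(cos, 0.0) * (high - low)   # cosine 0 -> low, cosine 1 -> high

    ideal_vec = np.random.rand(300)      # placeholder for the ideal essay's LSA vector
    student_vec = np.random.rand(300)    # placeholder for a student essay's LSA vector
    print(round(gold_standard_grade(student_vec, ideal_vec), 1))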

Anomalous essay checking

While it is important to verify the effectiveness of computer-based essay grading, it is also important that such a grader be able to determine when it cannot grade an essay reliably. Thus, a number of additional techniques have been developed to detect "anomalous" essays. If an essay is sufficiently different from the essays on which the system was trained, the computer can flag it for human evaluation. We currently flag essays that are highly creative, off topic, or that violate standard essay formats or structures. In addition, the computer can determine whether an essay is too similar to other essays or to the textbook, and it is thus able to detect different levels of plagiarism. If an essay is flagged as anomalous for any reason, it can be automatically forwarded to the instructor for additional evaluation.
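
A simplified sketch of such checks: an essay whose best match among the reference essays is very low is flagged as off topic, while one whose best match is suspiciously high is flagged as possible plagiarism. The thresholds below are illustrative assumptions, not values from the published system.

    import numpy as np

    def check_anomalies(essay_vec, reference_vecs, low=0.3, high=0.95):
        """Flag essays that are too unlike, or too like, the reference set."""
        sims = reference_vecs @ essay_vec / (
            np.linalg.norm(reference_vecs, axis=1) * np.linalg.norm(essay_vec))
        if sims.max() < low:
            return "flag: off topic or unusual; route to instructor"
        if sims.max() > high:
            return "flag: too similar to an existing text; possible plagiarism"
        return "ok: score automatically"

    # Placeholder vectors standing in for LSA representations.
    rng = np.random.default_rng(1)
    print(check_anomalies(rng.normal(size=300), rng.normal(size=(40, 300))))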

Experiences in the classroom: grading and feedback

Over the past two years, the Intelligent Essay Assessor has been used in a course in Psycholinguistics at New Mexico State University. Designed as a web-based application, it permits students to submit essays on a particular topic from their web browsers. Within about 20 seconds, students receive feedback with an estimated grade for their essay and a set of questions and statements of additional subtopics that are missing from their essays. Students can revise their essays immediately and resubmit. A demonstration is available at: http://psych.nmsu.edu/essay

To create this system, LSA was trained on a portion of the psycholinguistics textbook used in the class. The holistic grading method was used to provide an overall grade for each essay: each essay was compared against 40 essays from previous years that had been graded by three different graders. As a check on the accuracy of this approach, the average correlation among the three human graders was 0.73, while the average correlation of LSA's holistic grade with the individual graders was 0.80. To provide feedback about missing information, individual sentences in each essay were compared against sentences corresponding to subtopics of the essay topic. If no sentence was found that matched a subtopic, the student received feedback that the essay did not adequately cover that subtopic.
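
A sketch of that feedback step, assuming sentence and subtopic vectors have already been produced by the LSA space; the 0.4 threshold is an illustrative assumption rather than the value used in the course system.

    import numpy as np

    def missing_subtopics(sentence_vecs, subtopic_vecs, subtopic_labels, threshold=0.4):
        """Return labels of subtopics not matched by any sentence in the essay.

        sentence_vecs: (m, d) LSA vectors of the essay's sentences
        subtopic_vecs: (k, d) LSA vectors of instructor-defined subtopic sentences
        """
        sims = (sentence_vecs @ subtopic_vecs.T /
                (np.linalg.norm(sentence_vecs, axis=1, keepdims=True) *
                 np.linalg.norm(subtopic_vecs, axis=1)))
        covered = (sims >= threshold).any(axis=0)   # subtopic hit by at least one sentence
        return [label for label, hit in zip(subtopic_labels, covered) if not hit]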

Students were permitted to use the system independently to write their essays and were encouraged to revise and resubmit them as many times as they wanted until they were satisfied with their grades. The average grade for the students' first essays was 85 (out of 100); by the last revision, the average grade was 92. Students' essays improved from revision to revision, with improvements in scores ranging from 0 to 33 points over an average of 3 revisions. An additional trial is underway with a similar system in a Boulder, Colorado, middle school in which students summarize texts on sources of energy. Preliminary results indicate that two-thirds of the students were able to improve their summaries based on the IEA's feedback.

In both the undergraduate and middle school trials, students and teachers have enjoyed and valued using the system. A survey of usability and preferences in the psycholinguistics course showed that 98 percent of the students indicated that they would definitely or probably use such a system if it were available for their other classes. Overall, the results show that the IEA helps students improve the quality of their essays by providing immediate and accurate feedback.

Implications of computer-based essay grading for education

The IEA can be applied in a variety of ways within education. At a minimal level, it can be used as a consistency checker: the teacher grades the essays, the IEA re-grades them, and discrepancies between the two sets of grades are flagged. Because the IEA is not influenced by fatigue, deadlines, or biases, it can provide a consistent and objective view of the quality of the essays. The IEA can further be used in large-scale standardized testing or large classes, either by providing consistency checks or by serving as an automatic grader.

At a more interactive level, the IEA can be used to help students improve their writing by assessing and commenting on it. By providing instantaneous feedback about the quality of their essays, as well as indications of information missing from them, the IEA serves as a tool with which students can practice writing content-based essays. Due to a lack of sufficient teachers and aides in many large-section courses, writing assignments and essay exams have often been replaced by multiple-choice questions. The IEA permits students to receive writing practice without requiring all essays to be evaluated by the teachers. Because the IEA's evaluations are immediate, students can receive feedback and make multiple revisions over the course of one session. This approach is consistent with the goals of "Writing Across the Curriculum" in that it allows the introduction of more writing assignments in courses outside of English and the humanities. Thus, the IEA can serve to improve learning through writing.

Finally, the IEA can be integrated with other software tools for education. In software for distance education, tools for administering writing assignments are often neglected in favor of tools for creating and grading multiple-choice exams. The IEA permits writing to be a more central focus. For example, in web-based training systems, the IEA can be incorporated as an additional module that permits teachers to add writing assignments to their web courses. Essays can be evaluated on a secure server and scores returned directly to the students or to the teachers along with the essays. Textbook supplements can similarly use the IEA for automated study guides. At the end of each chapter, students can be asked to answer essay questions addressing topics covered in the chapter. Based on an analysis of their essays, the software can suggest sections of the textbook that the student needs to review before taking an exam.
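
One way such a study guide could work, sketched under our own assumptions (the section titles, ranking rule, and number of suggestions are hypothetical, not features of an existing product): the essay's LSA vector is compared against vectors for each section of the chapter, and the least-covered sections are suggested for review.

    import numpy as np

    def sections_to_review(essay_vec, section_vecs, section_titles, n=2):
        """Suggest the n chapter sections least similar to the student's essay."""
        sims = section_vecs @ essay_vec / (
            np.linalg.norm(section_vecs, axis=1) * np.linalg.norm(essay_vec))
        order = np.argsort(sims)             # least similar sections come first
        return [section_titles[i] for i in order[:n]]

    # Hypothetical section titles and placeholder LSA vectors.
    titles = ["Speech perception", "Lexical access", "Sentence parsing", "Discourse"]
    rng = np.random.default_rng(2)
    print(sections_to_review(rng.normal(size=300), rng.normal(size=(4, 300)), titles))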

Conclusions

The Intelligent Essay Assessor presents a novel technique for assessing the quality of the content knowledge expressed in essays. It permits the automatic scoring of the kind of short essays that might be used in any content-based course. Based on evaluations over a wide range of essay topics and levels, the IEA proves as reliable at evaluating essay quality as human graders, and its scores are highly consistent and objective. While the IEA is not designed for assessing creative writing, it can detect that an essay is sufficiently different that it should be graded by the instructor. The IEA can be applied both in distance education and in classroom training. In both cases it may be used for assessing students' content knowledge as well as for providing feedback that helps improve their learning. Overall, the IEA permits increasing the number of writing assignments without unduly increasing the grading load on the teacher. Because writing is critical to helping students acquire both better content knowledge and better critical thinking skills, the IEA can serve as an effective tool for increasing students' exposure to writing.

 

Acknowledgements

The web site http://lsa.colorado.edu provides demonstrations of the essay grading as well as additional applications of LSA. In addition, it provides links to many of the articles cited. This research was supported in part by a contract from the Defense Advanced Research Projects Agency-Computer Aided Education and Training Initiative to Thomas Landauer and Walter Kintsch, a grant from the McDonnell Foundation’s Cognitive Science in Educational Practice program to W. Kintsch, T. Landauer, & G. Fischer, and an NMSU Dean's small grant to Peter Foltz.

The "Intelligent Essay Assessor" has a patent pending: Methods for Analysis and Evaluation of the Semantic Content of Writing by P. W. Foltz, D. Laham, T. K. Landauer, W. Kintsch & B. Rehder, held by the University of Colorado

 

References

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing By Latent Semantic Analysis. Journal of the American Society For Information Science, 41, 391-407.

Foltz, P. W. (1996). Latent Semantic Analysis for text-based research. Behavior Research Methods, Instruments and Computers, 28(2), 197-202.

Foltz, P. W., Britt, M. A., & Perfetti, C. A. (1996). Reasoning from multiple texts: An automatic analysis of readers' situation models. In G. W. Cottrell (Ed.), Proceedings of the 18th Annual Cognitive Science Conference (pp. 110-115). Hillsdale, NJ: Lawrence Erlbaum Associates.

Laham, D. (1997). Automated holistic scoring of the quality of content in directed student essays through Latent Semantic Analysis. Unpublished master’s thesis, University of Colorado, Boulder, Colorado.

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

Landauer, T. K., Foltz, P. W., & Laham, D. (in preparation). Latent Semantic Analysis passes the test: Knowledge representation and multiple-choice testing. Manuscript in preparation.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to Latent Semantic Analysis. Discourse Processes, 25(2&3), 259-284.

Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans. In M. G. Shafto & P. Langley (Eds.), Proceedings of the 19th Annual Meeting of the Cognitive Science Society (pp. 412-417). Mahwah, NJ: Erlbaum.

Page, E. B. (1994). Computer grading of student prose using modern concepts and software. Journal of Experimental Education, 62, 127-142.

Rehder, B., Schreiner, M. E., Wolfe, B. W., Laham, D., Landauer, T. K., & Kintsch, W. (in press). Using Latent Semantic Analysis to assess knowledge: Some technical considerations. Discourse Processes.

Wolfe, M. B., Schreiner, M. E., Rehder, B., Laham, D., Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). Learning from text: Matching readers and texts by Latent Semantic Analysis. Discourse Processes, 25(2&3), 309-336.