Thursday 22 August 2019

The perils of AI essay assessment

    Dear reader, given a choice between doing a job slowly at greater cost, and doing it fast, inaccurately and, most importantly, cheaply, which would you choose? Need more context? I am here to discuss with you today a phenomenon sweeping across global standardised testing: the rise of computers, instead of human beings, grading not only objective multiple-choice questions but also subjective prose.
    The right to education brings in increasing numbers of students every year, given the rise in population. Investment in the education sector in increasing the number of service providers, that is teachers, does not keep pace with the rate of growth of students. As a result, teachers are often over-burdened. There are two primary aspects of education: one, where the student hears or sees, and the second, where the student writes or performs based on what s/he has learnt. The first aspect, making students hear or see a part of the curriculum, poses less difficulty with growing student populations (solution: cram the classrooms, or pre-record the lectures and make the students watch the same spiel without any fresh contribution from the teacher). What the education sector struggles with is assessing the students' contributions.
    If the baby’s bath water is dirty, you throw out the water, and sometimes also the baby. When subjective answers, where students had to write or perform a calculation in detail, became difficult to tackle, the first change that policy makers adopted was to abandon this assessment system and adopt a multiple-choice pattern, where one had to fill in a circle beside the correct answer. The standard mark of the filled-in circle was read by an optical mark reader, thereby eliminating the need for subjective assessment by human assessors.
    Some policy makers still wanted to incorporate subjective prose answers as part of the assessment exercise. The problem was how to involve fewer human beings in the exercise and find a cheaper solution. A solution has been found, fashionably labelled ‘artificial intelligence’, whose earlier manifestations were more humbly termed ‘pattern recognition’. What it implies is that computer programmes are trained on a large sample set of human-graded essays: they find patterns in it, look for those patterns in other essays and mark them the way human beings had marked the sample set. Where lies the problem, you may ask. Isn’t that the way all human learning is structured?
    We learn to speak our first words by observing our guardians and repeating the noises they make while doing the things that we want to do. The problem is that our guardians are not always the best role models to follow. It is true that our guardians shape our consciousness significantly, but we are not circumscribed by our circumstances. Some revolutionaries begin by rebelling against their guardians and their morals. Likewise, the sample set of patterns is neither always the ideal set of patterns nor always inclusive of all iterations. A pattern is a trait which has multiple iterations. Original coinages (neologisms) or sole iterations (hapax legomena) are not part of existing patterns. If one were to compose out-of-pattern utterings, machine learning would fail to recognise them as conforming to high standards and would penalise such composition. Thus, machine learning rewards conventionality. Machine learning and artificial intelligence are often described as self-learning algorithms. The problem with a self-learning machine is that if it is both learning and performing its task (grading essays, in this case) at the same time, then by the time a neologism becomes a pattern through multiple iterations, many essays featuring it have already been marked down for not conforming to high standards.
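    The pattern-matching described above can be sketched in a few lines. What follows is a deliberately toy model, not any testing agency's actual system: the essays, grades and scoring rule are all invented for illustration. The grader learns word ‘patterns’ from human-graded samples and gives no credit to words it has never seen, so a neologism drags a score down however apt it may be.

```python
# Toy illustration of pattern-based essay grading (invented example,
# not any real product). The "grader" learns which words appear in
# well-graded sample essays, then scores new essays by how closely
# their vocabulary matches those learnt patterns.
from collections import Counter

def train(sample_essays):
    """sample_essays: list of (text, human_grade) pairs.
    Returns a weight per word: the average human grade
    of the sample essays in which that word appears."""
    totals, counts = Counter(), Counter()
    for text, grade in sample_essays:
        for word in set(text.lower().split()):
            totals[word] += grade
            counts[word] += 1
    return {w: totals[w] / counts[w] for w in totals}

def score(weights, essay):
    """Average learnt weight over the essay's words. Words never
    seen in training (neologisms) contribute zero credit."""
    words = essay.lower().split()
    return sum(weights.get(w, 0.0) for w in words) / len(words)

# An invented sample set of human-graded essays.
samples = [
    ("the argument is coherent and the evidence is strong", 5),
    ("the argument is weak and the evidence is thin", 2),
]
weights = train(samples)

conventional = "the evidence is strong and the argument is coherent"
original = "the evidence is strong yet bescatters itself ungatheredly"

# The conventional essay reuses rewarded patterns and scores high;
# the original one contains unseen coinages and is marked down.
```

    The point of the sketch is structural, not numerical: any scorer built purely from a finite sample set must treat an out-of-pattern word as worthless, because it has no evidence either way.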
    One may argue that learning while performing a task is one of the fundamental ways of learning on the job. Medical trials follow this process: the effects of a placebo are compared to the effects of the drug under study, and whichever set of patients comes off worse is considered collateral damage for the greater good. However, in medical trials there is often no alternative by which damage can be minimised. In the case of self-learning algorithms, by contrast, an alternative exists: if humans were to assess the essays, there would be less damage to the greater good.
    If an expression is a sole iteration, such algorithms fail to recognise its validity. One is thus cast into an echo chamber where each cry amplifies the dominant echo, and it takes a deluge of a different sort to drown out the initial echo. However, drowning out an initial echo with a different one merely means that the new echo has become the convention. What the system rewards is conventionality. There is little scope for original expression.
    Even if one were to assess conventionality, however huge a sample set is, it is rarely inclusive of the entire corpus. The entire corpus of English prose writing is unimaginably vast; like the present moment, it cannot be grasped whole. Every culture and sub-group has its own style of writing. Black American English, Dalit Indian English, Raymond Chandler and James Joyce: all have significantly different language registers. A software package that a testing agency buys may not have been trained on such texts. Different communities may then find their essays graded low for not conforming to the ‘high standards’ of the sample set on which the algorithm was trained. Thus, a cultural hegemony of sorts is imposed.
    Would this bias not exist had there been human assessors instead of computers? Would they not subconsciously incorporate their own biases while grading subjective prose? They would, but human assessors are trained to grade answers according to coherence of argument, accuracy of claims and usage of language. It is easier for a human being to spot neologisms or the specificities of a language register than it is for a computer. Such features of human cognition have not yet been successfully incorporated into computer ‘natural language processing’.
    Coherence of argument is also a critical aspect of assessment where humans outperform computers. The scope of arguments is infinite. Diversions are often the means of progression in an argument. Extremely long diversions, such as Homeric epic similes, may not be found in most essays fed to a computer. Nor is allusive discourse very prominent. Answers written in the style of George Orwell’s Animal Farm may not be common among student essays. The point, however, is that they can be there. Unconventional forms of writing are what make reading and writing about common topics less tedious. More than tedium, writing is a form of human expression and one of the fundamental rights. One world, one style, the universality of a single style, is not only politically incorrect but also hugely restrictive.
    Dear reader, I do not wish to come across as a Luddite, forsaking technological experimentation for the sake of safety. However, the effort should be to minimise the damage. Many American testing agencies already employ AI-driven essay assessment, and their makers have sometimes publicly acknowledged the concerns that this piece raises.
    Since solutions have not yet been found, it is better to conduct trials with small sets where both a human and a machine are asked to assess the same essay, and a third human adjudicator chooses which grade to formally attach to it. However, financial imperatives demand urgent reduction of investment in human capital; as a result, machines are often what we are left with. Even where this technological investment has already been made, it is best to retain a human assessor, one who uses the software’s warnings about spelling errors and grammatical inconsistencies in students’ digital output to make a more human assessment of an exercise that is difficult to straitjacket. Countries such as India, which have not yet invested much in this technological experimentation, are better off carrying out small tests before implementing such models on a large scale.