Edited by: Brantina Chirinda, University of California, Berkeley, United States
Reviewed by: Joseph Njiku, Dar es Salaam University College of Education, Tanzania
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
The rapid advancements in artificial intelligence (AI) have sparked interest in its application within mathematics education, particularly in automating the coding and grading of student solutions. This study investigates the potential of ChatGPT, specifically the GPT-4 Turbo model, to assess student solutions to procedural mathematics tasks, focusing on its ability to identify correctness and to categorize errors into two domains: “knowledge of the procedure” and “arithmetic/algebraic skills.” The research is motivated by the need to reduce the time-intensive nature of coding and grading and to explore AI's reliability in this context. The study employed a two-phase approach using a dataset of handwritten student solutions to a system of linear equations: first, ChatGPT was trained using student solutions that had been rewritten by one of the authors to ensure a consistent handwriting style; its performance was then tested with additional solutions in the same handwriting. The findings reveal significant challenges, including frequent errors in handwriting recognition, misinterpretation of mathematical symbols, and inconsistencies in the categorization of mistakes. Despite iterative feedback and prompt adjustments, ChatGPT's performance remained inconsistent, with only partial success in accurately coding solutions. The study concludes that while ChatGPT shows promise as a coding aid, its current limitations—particularly in recognizing handwritten inputs and maintaining consistency—highlight the need for improvement. These findings contribute to the growing discourse on AI's role in education, emphasizing the importance of refining AI tools for practical classroom and research applications.
The field of artificial intelligence (AI) has seen significant advancements that continue to astonish. AI systems can now produce texts, generate images, and analyze images. The rapid development of AI capabilities has given rise to discussions about its use in the field of mathematics education (e.g., Pepin et al.,
Correcting and grading student answers is a time-consuming and labor-intensive undertaking for teachers (Liu et al.,
Although computer-based mathematics examinations focus on standard procedures because these are easier to implement in this format (Hoogland and Tout,
Since we are focusing on tasks that test procedural knowledge, it is important to clarify what we mean by procedural knowledge. One particularly recent definition is that proposed by Altieri (
Initial research aim: In light of this, the original objective of our study was to assess the efficacy of ChatGPT in identifying and categorizing mistakes in handwritten student solutions to procedural tasks. It was important to us that this research process take place within an easy-to-imitate framework. This would allow us to check whether teachers or researchers with limited AI knowledge could use ChatGPT for such purposes.
For our research aim, we needed both a training dataset (the size of one school class) and a test dataset of the same size. The data of the OFF project (Ableitinger and Dorner,
As a start, we first selected one task—PA07—in the sense of a case study in which mistakes have occurred that are as different as possible, and which therefore provide an interesting database for training or testing an AI as a coding aid:
Task PA07: System of linear equations
Find the solution set of the system of equations
in ℝ² using the addition (elimination) method.
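The concrete system of equations from Task PA07 is not reproduced here. Purely as an illustration of what the addition (elimination) method asks of students, a hypothetical system of the same type could be worked as follows (LaTeX, assuming the amsmath package):

% Illustrative example only; this is not the original system from Task PA07.
\begin{align*}
\text{I:}\quad  2x + 3y &= 7\\
\text{II:}\quad 4x - 3y &= 5\\
\text{I} + \text{II}:\quad 6x &= 12 \;\Rightarrow\; x = 2\\
\text{substitute into I:}\quad 2\cdot 2 + 3y &= 7 \;\Rightarrow\; y = 1\\
L &= \{(2,\,1)\}
\end{align*}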
In a first step, we trained ChatGPT-4 Turbo
We generally formulated role-based prompts that assign ChatGPT an expert role in order to achieve the associated positive effects, e.g., clarity and depth of output, professionalism, and use of appropriate technical language (Kambhatla et al.,
We entered the following two prompts into the chat one after the other:
Prompt 1 (definition):
Prompt 2 (training):
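The exact wording of Prompts 1 and 2 is not reproduced above. As a minimal sketch of how such a role-based setup could be scripted, assuming the OpenAI chat completions API rather than the ChatGPT chat interface actually used in the study, and with purely illustrative prompt wording:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

messages = [
    # Role prompt: assign ChatGPT an expert role (cf. the role-based prompting above).
    {"role": "system", "content": (
        "You are an expert in mathematics education who codes student solutions "
        "to procedural tasks."
    )},
    # Prompt 1 (definition), illustrative wording: the two mistake categories.
    {"role": "user", "content": (
        "Assess each solution as correct or incorrect. Assign every incorrect "
        "solution to exactly one category: 'knowledge of the procedure' or "
        "'arithmetic/algebraic skills'."
    )},
]

# Prompt 2 (training) would follow as further user messages containing the
# rewritten student solutions together with their reference codings.
reply = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
print(reply.choices[0].message.content)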
If the student's solution was coded in the same way as in the double coding by Dorner et al. (
Student solutions to Task PA07 and version digitized by ChatGPT.
Our response was: “Indeed, the solution is incorrect. But: There is a mistake in the knowledge of the procedure, because he chose the wrong equivalent transformation (:2
In a second step, we wanted to test how well the trained ChatGPT could code further student solutions of the same task. To do this, we used another 18 student solutions (again rewritten by the same author) to Task PA07 and prompted the following to test ChatGPT:
Prompt 3 (test):
When evaluating the initial test data, we made an observation that led us to modify the original research project. The first batch of test data entered into ChatGPT, an image file containing 10 student solutions, was coded correctly in all respects, both in the identification of correctness and in the mistake categorization. Interestingly, the second file, which contained 8 student solutions, was not handled as well as the first: only 4 of its solutions were correctly coded.
This drew our attention to the potential issue of deficient handwriting recognition. It is conceivable that ChatGPT is not yet capable of accurately identifying handwritten solutions, even though they had been rewritten in a new, easily legible handwriting. This led us to a new perspective and to the following research question:
How effective is GPT-4 Turbo in the digitization of handwritten procedural task solutions, and what types of errors occur during this process?
To determine how effectively ChatGPT (specifically the GPT-4 Turbo version) can recognize handwritten solutions to procedural mathematics tasks and convert them into a digital format, we used the following prompt twice. The first time, we provided a file containing 13 student solutions to Task PA07. The second time, we used a different file containing 9 student solutions:
Prompt 4 (digitization):
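The wording of Prompt 4 is likewise not reproduced above. A minimal sketch of the digitization step, again assuming the OpenAI chat completions API, with an illustrative file name and instruction text:

import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Illustrative file name; each file in the study contained several handwritten solutions.
with open("pa07_solutions_file1.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

digitize_instruction = (
    "Transcribe every handwritten student solution in this image line by line. "
    "Reproduce the solutions exactly as written, including any mistakes; do not "
    "correct, complete, or simplify anything."
)

response = client.chat.completions.create(
    model="gpt-4-turbo",  # vision-capable model version used in the study
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": digitize_instruction},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
)
print(response.choices[0].message.content)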
The student solutions in the two files were each labeled with a unique student code. Out of the total of 22 students, 2 did not complete the tasks. Of the remaining 20 solutions, ChatGPT successfully recognized and digitized 4 without any errors, while the other 16 were digitized with inaccuracies.
During the digitization process by ChatGPT, a number of errors occurred, some of which are quite interesting. For instance, in the student solution labeled 21_S03_A03, ChatGPT failed to adhere to bracketing conventions. For example, the expression should read 3
In the case of 21_S03_A06, multiple translation errors occurred simultaneously. For instance, instead of writing /·(−2), ChatGPT incorrectly used /:(−2) in the first line. This is particularly noteworthy because the student had indeed applied the operation /·(−2) correctly, meaning that ChatGPT introduced a mistake during the digitization process. This operation, as misrepresented by ChatGPT, does not align with the equation written by the student in the subsequent line. Later, we will observe the reverse phenomenon: despite being explicitly instructed in the prompt not to do so, ChatGPT corrects mistakes in the student solutions. Additionally, the horizontal line that should appear between the second and third lines is missing in ChatGPT's output. The operation +4
The incorrect transcription of the equation 2
In 21_S05_A01, the student incorrectly simplified the equation 11
In the next step, we provided ChatGPT with feedback on all these errors and included the necessary corrections. ChatGPT incorporated these corrections into a revised output. However, this led to new errors that were not present in the initial output. In one instance, ChatGPT even transferred three lines from solution 21_S03_A07 into solution 21_S03_A06. Additionally, we sought to address the issue of ChatGPT making corrections to the student solutions despite our explicit instructions in the prompt not to do so. To tackle this, we asked ChatGPT how we should formulate our prompt to prevent such behavior and subsequently adjusted the prompt accordingly:
Prompt 5 (accuracy):
During this correction cycle, the error rate was reduced to approximately half of that in the initial output. In total, we conducted four such correction cycles. After the fourth cycle, only the student solution 21_S03_A06 remained erroneous. To address this, we provided ChatGPT with line-by-line instructions for this specific solution.
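These correction cycles amount to a multi-turn feedback loop: our feedback is appended to the conversation and a revised transcription is requested. A minimal sketch of one such cycle, assuming the OpenAI chat completions API and illustrative feedback wording:

from openai import OpenAI

client = OpenAI()

# 'history' would hold the digitization request and ChatGPT's previous output;
# it is left empty here only to keep the sketch self-contained.
history = []

def correction_cycle(history, feedback):
    """Append feedback on the previous output and request a revised transcription."""
    history.append({"role": "user", "content": feedback})
    reply = client.chat.completions.create(model="gpt-4-turbo", messages=history)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})
    return history[-1]["content"]

# Illustrative feedback wording, modeled on the kind of corrections described above.
revised = correction_cycle(
    history,
    "In solution 21_S03_A06, the operation in the first line must be "
    "multiplication by (-2), not division by (-2). Revise the transcription "
    "and change nothing else.",
)
print(revised)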
The transfer of lines from one solution into another suggests that ChatGPT may struggle to handle the digitization of multiple student solutions simultaneously. To investigate this, we attempted to have ChatGPT digitize the student solutions individually (i.e., each in a separate file). However, similar errors occurred as in the case when multiple solutions were processed at once. In one instance, ChatGPT even stopped midway through the digitization process, producing an incomplete output.
Our initial euphoria ultimately turned into disappointment with ChatGPT's performance. According to Hoogland and Tout (
As previously mentioned, Liu et al. (
However, it should be noted that some problems could have arisen from the intermediate step of character recognition and might not have occurred if, as initially planned, we had only considered ChatGPT's output of the mistake analysis (e.g., due to internal working processes of ChatGPT that are not visible to us). Further research is required in this area.
This also has the following educational implications: As long as AI models cannot reliably recognize students' handwriting, they are not a reliable aid for teachers in grading and diagnosis. We recommend using AI at most for initial analysis, which must in any case be checked and adapted by the teacher (cf. Liu et al.,
Our conclusion regarding this research framework and its use in the classroom or in research for coding or grading student solutions is: not yet.
The datasets presented in this article are not readily available because the data collected may only be used for scientific purposes and may only be published in anonymised form. Requests to access the datasets should be directed to Christoph Ableitinger,
CA: Methodology, Project administration, Resources, Data curation, Validation, Visualization, Conceptualization, Investigation, Formal analysis, Writing – original draft, Writing – review & editing, Software. CD: Investigation, Resources, Software, Visualization, Data curation, Conceptualization, Formal analysis, Validation, Project administration, Methodology, Writing – original draft, Writing – review & editing.
The author(s) declare that no financial support was received for the research and/or publication of this article.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The author(s) declare that Gen AI was used in the creation of this manuscript. Gen AI was used for language improvement of the article and ChatGPT was used as a tool in the study for recognising handwriting and coding solutions provided by students.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
1ChatGPT-4 Turbo is an optimized variant of GPT-4, designed to deliver comparable performance with significantly improved speed and efficiency.