OpenAI’s ChatGPT (Chat Generative Pre-trained Transformer) is a publicly available, interactive artificial intelligence (AI) service that generates responses to users’ queries using a pre-trained deep learning model. ChatGPT is the latest in the class of large-scale language models known as large language models (LLMs). LLMs are pre-trained language models that can be further trained or fine-tuned, which expands their range of applications and scope1,2). Compared with other LLMs, ChatGPT has been reported to show a marked improvement in conversational output3). Because it was designed to respond to a wide range of topics, ChatGPT goes beyond answering user-written questions to writing essays, translating languages, and generating program code4). As a generative AI, ChatGPT autonomously creates text, images, and videos using a self-learning algorithm, and it is often referred to as a “hyperscale AI” because of its capacity to process vast amounts of data2). Unlike search engines such as Google, which return a list of websites related to the user’s search, ChatGPT interacts conversationally and provides answers within a natural context.
ChatGPT has been used for various tasks, including programming, translation, writing, speech writing, and, more recently, research1,2). Sarraju et al.5) evaluated ChatGPT’s responses to 25 questions on the prevention of cardiovascular disease to determine how effectively it could handle questions about basic concepts in cardiovascular disease prevention. ChatGPT provided appropriate responses to 21 of the 25 questions, an adequacy rate of 84%. These findings highlighted the potential of conversational AI such as ChatGPT to enhance clinical workflows, particularly by facilitating patient education and improving communication between patients and physicians about the prevention of cardiovascular disease5). Savery et al.6) assessed ChatGPT’s capability in domain-specific conversations as an interactive chatbot for psychotherapy. LLMs have also been explored in medicine as tools for direct interaction with patients and as educational resources7). In addition, ChatGPT’s performance on the United States Medical Licensing Examination (USMLE) has been evaluated; the model correctly answered over 60% of the Step 1 and Step 2 questions. This level of performance, comparable to that of a third-year medical student, suggested ChatGPT’s viability as an educational tool in medical training2). Wójcik et al.8) evaluated the performance of ChatGPT in medical education and suggested that it may serve as a valuable assistive tool, while emphasizing that it cannot completely replace human knowledge and expertise. In Korea, Huh9) suggested that ChatGPT’s knowledge and interpretation ability on a parasitology examination were not yet comparable to those of medical students.
For ChatGPT to be used as an effective educational tool in dental hygiene and dentistry, the answers it provides must be academically valid and reliable, and their accuracy must therefore undergo rigorous evaluation. However, few studies have explored the use of ChatGPT in educational contexts, particularly in dental hygiene in Korea10). The national examination for dental hygienists in Korea has been administered since 1975, initially by the government, with responsibility for its administration transferred in 1998 to the Korea Health Personnel Licensing Examination Institute. The primary objective of this examination is to assess whether dental hygienists possess the clinical practice competencies required for dental hygiene. Thus, the national dental hygienist examination serves as the standard for determining whether a dental hygienist has the knowledge, attitude, and skills required to fulfill their professional responsibilities11).
This study aimed to evaluate the accuracy of ChatGPT’s responses to questions from the national dental hygienist examination, which require the application of relevant knowledge, including theories and laws pertinent to dental hygiene. In addition, by analyzing ChatGPT’s incorrect responses, this study sought to identify the predominant types of errors. The overall goal was to assess ChatGPT’s potential as a learning tool in dental hygiene education.
This exploratory study examined the applicability of ChatGPT as a learning tool in dental hygiene by entering questions from the national dental hygienist examination into ChatGPT and analyzing the content it generated.
The subjects of this study were the responses generated by ChatGPT-3.5 to the 200 questions of the 49th national dental hygienist examination in Korea. The examination questions were obtained from the publicly available data on the website of the Korea Health Personnel Licensing Examination Institute, which releases the national examination questions and the national examination analysis report in the second half of every year. At the time of this study, the analysis report for the 50th national dental hygienist examination had not yet been published; therefore, the 49th examination was chosen as the most recent examination with a published analysis report. The national dental hygienist examination consisted of 20 questions on health and medical laws, 80 questions on dental hygiene I (basic dental hygiene and dental hygiene management), and 100 questions on dental hygiene II (clinical dental hygiene).
In this study, the ChatGPT-3.5 version was selected for its advanced natural language processing capabilities to generate responses to questions from the national dental hygienist examination. To systematically analyze the answers provided by ChatGPT, a comprehensive analysis framework was developed. This framework included several components: the original questions in Korean, questions augmented with English terminology, question types, ChatGPT’s responses, categorization into correct and incorrect responses, and classification of the types of inaccuracies observed. This structured approach allowed for an evaluation of ChatGPT-3.5’s utility as an educational tool in the field of dental hygiene, focusing on its ability to accurately interpret and respond to examination questions12).
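The analysis framework described above can be represented as a simple per-question record. The following Python sketch is illustrative only; the class and field names (e.g., ExamItemRecord, question_ko) are hypothetical and are not taken from the study’s actual instrument.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record mirroring the analysis framework described above;
# field names are illustrative, not the authors' actual instrument.
@dataclass
class ExamItemRecord:
    item_number: int                  # 1-200, question number on the 49th examination
    question_ko: str                  # original question in Korean
    question_en_terms: str            # question augmented with English terminology
    question_type: str                # "recall", "interpretation", or "problem-solving"
    chatgpt_response: str             # full response generated by ChatGPT-3.5
    is_correct: bool                  # response judged correct against textbooks
    error_type: Optional[str] = None  # "logical", "information", "statistical", or None
    is_hallucination: bool = False    # untrue content presented as if true

# Example usage with a fabricated placeholder entry
record = ExamItemRecord(
    item_number=1,
    question_ko="...",
    question_en_terms="...",
    question_type="recall",
    chatgpt_response="...",
    is_correct=False,
    error_type="logical",
)
```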
To systematically analyze the national dental hygienist examination using ChatGPT-3.5, a structured process was followed, as illustrated in Fig. 1.
To evaluate ChatGPT-3.5’s performance according to question type, five researchers repeatedly read the 200 questions of the 49th national dental hygienist examination and classified them into 66 recall-type, 112 interpretation-type, and 22 problem-solving-type questions based on the definitions of question types provided by the Korea Health Personnel Licensing Examination Institute13). The institute defines a recall-type question as one that can be answered simply by remembering memorized learning content. An interpretation-type question is defined as one that requires a full understanding of the knowledge and the ability to deal with new phenomena based on that knowledge. A problem-solving-type question is defined as one that requires applying knowledge to solve a specific problem, which demands not only interpreting the information in the question but also interpreting the meaning of each option.
2) Modification of the examination questions
Given the disparity in ChatGPT’s training data volume between English and Korean, which potentially affects its comprehension of technical terms and nuanced Korean sentence constructions14), we made targeted modifications to the national examination questions. To counteract potential misunderstandings arising from implied meanings or technical terminology in Korean, sentences ending with phrases such as “∼Which is correct?,” “∼structure?,” or “∼tooth?” were rephrased for clarity. For instance, questions were standardized to the format “Which of the following is correct as an explanation for ∼?”, converting implicit queries into explicit ones to better convey the intended meaning. For example, the following question was entered in Korean: “Which of the following is a correct description of the operation of a hand instrument when it is used in the patient’s mouth?” Furthermore, to bridge the language gap, English terms were added wherever a question or its options contained only Korean or Chinese terms.
Moreover, to assess ChatGPT-3.5’s ability to apply previously acquired knowledge, the problem-solving questions were first converted into subjective (open-ended) form (Table 1), removing the original multiple-choice framework of these items. When ChatGPT-3.5 generated an incorrect response to a subjective problem-solving question, we re-entered the question in its original multiple-choice (objective) form and compared the accuracy of the responses between the subjective and objective formats. All questions were entered verbatim into the ChatGPT-3.5 prompt in their modified form, as in the examples in Table 1; a minimal programmatic sketch of this conversion follows Table 1.
Table 1. Example of Converting an Objective Problem-Solving Type Question into a Subjective Question
Objective problem-solving type question: This is the result of an oral examination of a 5-year-old child. What is the appropriate space retainer? ∙The left mandibular canine is healthy. ∙The mandibular left first premolar is missing. ∙Mandibular left second premolar is pulp treated. ① Lingual arch ② Distal shoe ③ Band & loop ④ Crown & loop ⑤ Nance holding arch
Subjective problem-solving type question: Below are the results of an oral examination of a 5-year-old child. Which space maintainer is appropriate for this child? ∙The left mandibular canine is healthy. ∙The mandibular left first premolar is missing. ∙Mandibular left second premolar is pulp treated.
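As a rough illustration of the conversion shown in Table 1, the sketch below strips the circled option markers (①–⑤) from an objective item to produce its subjective form. This is a hypothetical helper for illustration only; in the study the conversion was performed manually by the researchers, and the rephrasing of the question stem itself was also done by hand.

```python
import re

def to_subjective(objective_item: str) -> str:
    """Remove the multiple-choice options (marked with ①-⑤) from an objective item,
    leaving only the question stem and the clinical findings."""
    # Cut the item off at the first circled-number option marker, if present.
    return re.split(r"[①②③④⑤]", objective_item)[0].strip()

objective = (
    "This is the result of an oral examination of a 5-year-old child. "
    "What is the appropriate space retainer? "
    "The left mandibular canine is healthy. "
    "The mandibular left first premolar is missing. "
    "Mandibular left second premolar is pulp treated. "
    "① Lingual arch ② Distal shoe ③ Band & loop ④ Crown & loop ⑤ Nance holding arch"
)
print(to_subjective(objective))  # stem and findings only, options removed
```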
An expert review was conducted on the categorization of the national examination questions initially classified by the five researchers according to the question-type definitions provided by the Korea Health Personnel Licensing Examination Institute. The expert was a professor with more than 10 years of experience teaching dental hygiene courses and with experience developing national dental hygienist examination questions. Based on the expert’s review, the 200 questions of the national dental hygienist examination were finally categorized into 63 recall-type, 114 interpretation-type, and 23 problem-solving-type questions.
4) Analysis of ChatGPT-3.5 performance
This study evaluated the accuracy of ChatGPT-3.5’s responses by entering the 200 questions into ChatGPT-3.5. The researchers verified the answers generated by ChatGPT by repeatedly comparing them against reliable references such as dental hygiene textbooks. Accuracy was calculated as the percentage of correct answers among all of ChatGPT’s answers. When ChatGPT’s responses were incorrect, the types of incorrect answers were classified as logical, information, or statistical errors1). A logical error was defined as one in which ChatGPT found the information relevant to a question but did not correctly derive the answer from it. An informational error was defined as one in which ChatGPT failed to identify the key information needed to solve the input question. A statistical error was defined as an arithmetic error in a question requiring calculation. Finally, a response was classified as a hallucination when ChatGPT provided misleading information by presenting something untrue as if it were true12,14). The error classification of each question was verified twice, cross-review within the research team was conducted twice, and the results were summarized. The kappa coefficient for interrater reliability was 0.778.
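As an illustration of how the accuracy rate and the interrater kappa coefficient reported above can be computed, the following Python sketch uses scikit-learn’s cohen_kappa_score on fabricated placeholder labels; the study’s own computations were performed with SPSS, and the counts used in the accuracy example are those reported later in Table 2.

```python
from sklearn.metrics import cohen_kappa_score

# Fabricated placeholder labels for illustration only: each element is the error
# category assigned to one incorrect response by each of two reviewers.
rater_1 = ["logical", "logical", "information", "logical", "information", "logical"]
rater_2 = ["logical", "information", "information", "logical", "information", "logical"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Interrater kappa: {kappa:.3f}")

# Accuracy as the percentage of correct answers among all ChatGPT responses
# (counts taken from Table 2 in the Results section).
n_correct, n_total = 91, 200
accuracy = n_correct / n_total * 100
print(f"Accuracy: {accuracy:.1f}%")  # 45.5%
```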
Statistical analyses were performed using IBM SPSS Statistics ver. 28.0 (IBM Corp., Armonk, NY, USA). Frequency analysis was conducted on the correct and incorrect responses to the 200 questions to calculate the correct and incorrect response rates. A chi-square test was used to examine differences in response accuracy by question type, and a McNemar test was used to examine whether response accuracy changed significantly between the subjective and objective formats of the problem-solving questions. p-values of less than 0.05 were considered significant.
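The chi-square test described above can be reproduced approximately from the counts in Table 2 using SciPy, as in the sketch below; the original analysis was run in SPSS, so minor numerical differences are possible.

```python
from scipy.stats import chi2_contingency

# Correct/incorrect counts by question type, taken from Table 2.
observed = [
    [38, 25],   # recall type
    [50, 64],   # interpretation type
    [3, 20],    # problem-solving type
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4g}")  # p < 0.001
```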
The accuracy of ChatGPT-3.5 answers to the questions is shown in Table 2. The accuracy of ChatGPT answers for all 200 questions on the national exam was 45.5%. According to the question type, recall-type questions had the highest accuracy at 60.3%, while problem-solving-type questions had the lowest accuracy at 13.0% (p<0.001).
Table 2. Accuracy of ChatGPT Responses by National Dental Hygienist Examination Question Type
| Question type | Number of questions | Correct | Incorrect | p-value^a |
|---|---|---|---|---|
| Total | 200 (100.0) | 91 (45.5) | 109 (54.5) | <0.001 |
| Recall type | 63 (31.5) | 38 (60.3) | 25 (39.7) | |
| Interpretation type | 114 (57.0) | 50 (43.9) | 64 (56.1) | |
| Problem-solving type | 23 (11.5) | 3 (13.0) | 20 (87.0) | |
Values are presented as n (%).
^a Calculated by the chi-square test.
ChatGPT-3.5 showed an accuracy of 13.0% for the subjective problem-solving questions and 43.5% for the objective problem-solving questions, indicating that restoring the multiple-choice framework significantly increased accuracy (p=0.016) (Table 3).
Table 3. Accuracy for Subjective and Objective Problem-Solving Questions
| Question type | Number of questions | Subjective format, correct | Subjective format, incorrect | Objective format, correct | Objective format, incorrect | p-value^a |
|---|---|---|---|---|---|---|
| Problem-solving type | 23 | 3 (13.0) | 20 (87.0) | 10 (43.5) | 13 (56.5) | 0.016 |
Values are presented as n (%).
^a Calculated by the McNemar test.
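The McNemar result in Table 3 can be reproduced with an exact test on the paired outcomes. The 2x2 table below assumes that the three items answered correctly in the subjective format remained correct in the objective format; this pairing is our assumption, as only the marginal counts are reported, but under it the exact test yields p≈0.016, matching the reported value.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Paired outcomes for the 23 problem-solving items:
# rows = subjective format (correct, incorrect), columns = objective format (correct, incorrect).
# The off-diagonal counts assume the 3 subjectively correct items stayed correct.
table = [
    [3, 0],    # correct subjectively: 3 also correct objectively, 0 became incorrect
    [7, 13],   # incorrect subjectively: 7 became correct objectively, 13 stayed incorrect
]

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"p = {result.pvalue:.3f}")    # ~0.016
```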
ChatGPT-3.5 answered 109 of the 200 questions incorrectly. Among these 109 incorrect responses, 65.1% were logical errors and 34.9% were information errors, with no statistical errors. When the error types were analyzed by question type, logical errors were the most common error for every question type, with the highest proportion in recall-type questions (Table 4).
Table 4. Error Classification for Incorrect Responses of ChatGPT According to Question Type
| Question type | Number of incorrect responses | Logical error | Information error | Statistical error | p-value^a |
|---|---|---|---|---|---|
| Total | 109 | 71 (65.1) | 38 (34.9) | 0 (0.0) | 0.848 |
| Recall type | 25 | 17 (68.0) | 8 (32.0) | 0 (0.0) | |
| Interpretation type | 64 | 42 (65.6) | 22 (34.4) | 0 (0.0) | |
| Problem-solving type | 20 | 12 (60.0) | 8 (40.0) | 0 (0.0) | |
Values are presented as n (%).
^a Calculated by the chi-square test.
When the problem-solving items were presented in their objective (multiple-choice) form, 13 of them were answered incorrectly. Of these 13 incorrect responses, 76.9% were logical errors and 23.1% were informational errors, with no statistical errors. Compared with the subjective problem-solving questions, the proportion of information errors among the incorrect responses decreased for the objective problem-solving questions (Table 5). With the problem-solving items counted in their objective form, a total of 102 questions were answered incorrectly, and 100 of these incorrect responses were categorized as hallucinations.
Table 5. Errors in Incorrect Responses of ChatGPT According to Problem-Solving Question Format
| Question format | Number of incorrect responses | Logical error | Information error | Statistical error |
|---|---|---|---|---|
| Subjective problem-solving questions | 20 | 12 (60.0) | 8 (40.0) | 0 (0.0) |
| Objective problem-solving questions | 13 | 10 (76.9) | 3 (23.1) | 0 (0.0) |
Values are presented as n (%).
In this study, the 200 questions of the 49th national dental hygienist examination were entered into ChatGPT-3.5 one by one to evaluate whether it could derive the correct answers using appropriate dental hygiene knowledge; the generated answers were then checked for accuracy, and the types of incorrect answers were analyzed. ChatGPT-3.5 showed an accuracy of 45.5% for the 200 questions. In contrast, Kung et al.1) reported that ChatGPT answered more than 60% of the USMLE Step 1 and Step 2 questions correctly. A previous study by Kim et al.12) showed that users could obtain answers closer to the desired content when entering questions into ChatGPT-3.5 in English rather than in Korean. The correct answer rate in this study might therefore have been higher had the questions been entered in English.
The accuracy by question type on the national dental hygienist examination was 60.3% for recall-type questions, 43.9% for interpretation-type questions, and 13.0% for problem-solving-type questions, with recall-type questions showing the highest accuracy. Recall-type questions can be answered by recalling memorized learning content, whereas interpretation-type questions test the ability to fully understand the learned knowledge and cope with new situations based on it. These results suggest that ChatGPT-3.5’s ability to apply knowledge in judging given clinical data is weaker than its ability to simply reproduce memorized content.
To interpret ChatGPT’s problem-solving performance more precisely, we first evaluated its accuracy on the subjective problem-solving questions, from which the multiple-choice options had been removed; its accuracy on these questions was 13.0%. When the questions answered incorrectly were re-entered with their multiple-choice options, ChatGPT-3.5 answered some of them correctly, raising the accuracy for problem-solving questions to 43.5%. Thus, ChatGPT-3.5’s accuracy on problem-solving questions was higher with multiple-choice options than without them. Gilson et al.2) found that ChatGPT’s percentage of correct answers increased when additional information was provided during the problem-solving process, which is consistent with the results of this study. It therefore appears that ChatGPT-3.5 makes use of the additional information contained in answer options, or otherwise supplied by the user, to produce more accurate answers to problem-solving questions.
In this study, we also analyzed the errors by question type and found that logical errors were more frequent than informational errors for recall-, interpretation-, and problem-solving-type questions, with no statistical errors. Logical errors occurred when ChatGPT found the information relevant to a problem but failed to translate it properly into a response, whereas informational errors occurred when it failed to identify the key information needed to solve the given problem. These results suggest that ChatGPT-3.5 was relatively good at identifying the key information in the input question but was limited in its ability to translate that information into a correct response. A study evaluating the accuracy of ChatGPT on the USMLE2) likewise reported that inaccurate responses were more likely to contain logical errors than information errors. Because ChatGPT’s learning approach changes with each model and its capability depends on the amount of information it has covered and accumulated, it is difficult to draw firm conclusions about its accuracy and error types at this point; nevertheless, the results of this study and those reported to date show similar patterns.
In addition, for 100 of the 102 questions categorized as incorrect, ChatGPT-3.5 stated untrue information as if it were true. This suggests that its ability to find the relevant information for a question and derive the right answer lags behind its ability to produce confident, plausible-sounding responses. A previous study2) likewise cautioned that answers given by ChatGPT may appear well-founded yet be inaccurate, and other studies have suggested that ChatGPT can be a valuable tool in medical education while emphasizing that it cannot fully replace human expertise8,9). Users should therefore assess the accuracy of ChatGPT’s responses independently of their impression of its ability to answer any question. Based on this study and previous reports, the responses of ChatGPT-3.5 cannot yet be fully trusted. Accordingly, if ChatGPT-3.5 is to be used for educational and clinical applications in dental hygiene, dental hygiene majors and dental hygienists with dental hygiene knowledge should verify the accuracy of its answers rather than blindly trusting them. In other words, dental hygienists in educational and clinical settings need both sound knowledge and the ability to evaluate AI-generated information critically, and they should be able to respond to each type of error by drawing comprehensively on their knowledge and expertise.
Based on these results, the direction of dental hygiene education in the era of the Fourth Industrial Revolution should also be considered. Various attempts have been made to use AI tools such as ChatGPT in medical and educational fields, but they cannot completely replace human intelligence. In an era in which vast amounts of diverse information are produced, it is necessary to identify the fields and competencies that cannot be handled by AI alone and to train professionals who can perform them, while also training professionals who can fully utilize and apply AI as it continues to develop in various forms.
This study has several limitations. When this study was conducted, the ChatGPT-4 model had not yet been released, so functional differences from ChatGPT-4 could not be examined. Future studies should therefore evaluate whether the ChatGPT-4 model can draw on a correct dental hygiene knowledge base to generate appropriate answers. In addition, because ChatGPT’s Korean training data are substantially more limited than its English data, caution is needed when interpreting the accuracy of ChatGPT-3.5 based on the results of this study, and a study entering the questions in English is needed to fully evaluate ChatGPT’s capability. Because ChatGPT is continually being improved and supplemented, it should also be kept in mind that re-entering the national dental hygienist examination questions used in this study into ChatGPT in the future might generate different answers and a higher correct answer rate15). Nevertheless, the significance of this study is that it is the first to evaluate the accuracy and usefulness of ChatGPT in Korean dental hygiene education. Based on these results, dental hygiene majors and dental hygienists who use ChatGPT as a tool for dental hygiene education and practice should be able to independently verify the accuracy of its responses.
In this evaluation of ChatGPT-3.5’s performance on the Korean national dental hygienist examination questions, the overall accuracy of its responses was 45.5%. By question type, the percentage of correct responses was significantly lowest for problem-solving questions and highest for recall-type questions. Among the incorrect responses, logical errors were more frequent than information errors. ChatGPT-3.5 was relatively unable to provide accurate answers during the problem-solving process for the Korean national dental hygienist examination, particularly for interpretation- and problem-solving-type questions that require applying knowledge, compared with recall-type questions that can be solved with simple knowledge. Therefore, dental hygiene majors and dental hygienists who wish to use ChatGPT in dental hygiene should be able to independently verify the accuracy of its responses.
None.
No potential conflict of interest relevant to this article was reported.
Because this study was not research involving humans, Institutional Review Board approval was not required.
Conceptualization: Soo-Myoung Bae, Hye-Rim Jeon, Gyoung-Nam Kim, Seon-Hui Kwak, and Hyo-Jin Lee. Data acquisition: Soo-Myoung Bae, Hye-Rim Jeon, Gyoung-Nam Kim, and Seon-Hui Kwak. Formal analysis: Soo-Myoung Bae, Hye-Rim Jeon, Gyoung-Nam Kim, Seon-Hui Kwak, and Hyo-Jin Lee. Supervision: Soo-Myoung Bae and Hyo-Jin Lee. Writing-original draft: Hye-Rim Jeon and Gyoung-Nam Kim. Writing-review & editing: Soo-Myoung Bae, Seon-Hui Kwak, and Hyo-Jin Lee.
None.
Raw data are available from the corresponding author upon reasonable request.