To the editor
Artificial intelligence (AI) medical devices were expected to be introduced in medical imaging earlier than in other fields, and AI-based bone age (BA) devices are in fact already used in many primary medical institutions. However, few studies have compared BA and final adult height (FAH) predictions made by humans with those made by AI [1-3].
AI reading of medical images relies on deep learning based on the Greulich-Pyle (GP) and Tanner-Whitehouse methods. This study aimed to compare the accuracy of BA and FAH prediction between VUNO Med-BoneAge (VUNO Inc., Seoul, Korea), the most commonly used AI program in Korea, and a specialist.
Our study included 190 children and adolescents (73 males and 117 females) aged 8–12 years who visited the Growth Clinic of Pediatric Endocrinology in 2012 to evaluate their BA and predicted adult height (PAH). Height, weight, body mass index, and the heights of the father and mother were retrospectively reviewed. BA and PAH were predicted in 2012 by a pediatric endocrinologist and a musculoskeletal radiology specialist using the GP method, and the heights of the parents and subjects in 2021 were collected by telephone survey [4, 5].
We excluded those who had chronic diseases, had received treatment to improve growth, had a height below the 3rd percentile or above the 97th percentile in 2012, had a difference of more than 2 years between chronological age and BA, or had not reached FAH by 2021. Of a total of 961 individuals (322 males and 639 females), 190 (73 males and 117 females) were studied.
VUNO Med-BoneAge was used as the AI medical device in this study. It uses deep learning to find the atlas images that are statistically most similar in BA and, from their matching rates, reports a final value to the first decimal place.
The average adult height predicted by the specialist was 174.8 cm for males and 159.3 cm for females, and that predicted by AI was 175.8 cm for males and 160.5 cm for females. The FAH reported by telephone was 173 cm for males and 160.5 cm for females. When subjects were divided by sex, the BA and PAH values differed significantly from the FAH and PAH values in both the specialist and AI groups, especially in males. This difference was smaller in females (Tables 1, 2).
In the Bland-Altman plots for the specialist and AI, 93% of the specialist's GP-method predictions (mean±1.96 standard deviation [SD]: -6.88 to 6.96) and 78% of the AI predictions (mean±1.96 SD: -5.56 to 3.42) fell within the limits of agreement. On this basis, the predictive accuracy was 93% for the specialist and 78% for AI (Fig. 1).
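As an illustration of how the limits of agreement above are obtained, the following is a minimal sketch in Python, not the analysis code used in this study. It assumes paired PAH and FAH values in centimeters. The function name `bland_altman` and the sample heights are hypothetical and are not study data.

```python
from statistics import mean, stdev


def bland_altman(predicted, observed):
    """Return bias, lower and upper limits of agreement, and the
    fraction of paired differences that fall inside those limits."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must be paired")
    # Difference for each subject: predicted minus observed height (cm)
    diffs = [p - o for p, o in zip(predicted, observed)]
    bias = mean(diffs)       # mean difference
    sd = stdev(diffs)        # sample SD of the differences
    lower = bias - 1.96 * sd
    upper = bias + 1.96 * sd
    inside = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return bias, lower, upper, inside


# Hypothetical example values (cm), for illustration only
pah = [175.0, 160.0, 172.0, 158.0, 180.0]
fah = [173.0, 161.0, 170.0, 159.0, 176.0]
bias, lower, upper, inside = bland_altman(pah, fah)
print(f"bias={bias:.2f} cm, limits=({lower:.2f}, {upper:.2f}), inside={inside:.0%}")
```

Here the limits are computed from the same differences that are then checked against them. Whether the study computed them this way is an assumption of the sketch.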
However, when the subjects were divided by sex and pubertal status, the difference between PAH and FAH was not statistically significant in pubertal males and prepubertal females for the specialist, or in prepubertal females for AI.
The largest difference was observed in prepubertal males for the specialist and in pubertal males for AI. Other studies comparing BA using another AI program (BoneXpert, Hørsholm, Denmark) showed that, in both males and females, BA tended to be read as younger at prepubertal ages (males, 0.001–0.61 years; females, 0.02–0.76 years) and as older at pubertal ages (males, 0.43–1.64 years; females, 0.03–1.24 years); those studies also found the greatest difference in pubertal males [1, 6].
In most studies, both specialists and AI predict FAH more accurately in females, presumably because most children visiting growth clinics are girls whose families are worried about precocious puberty [7, 8]. The prevalence of precocious puberty in Korea is reported to be 40 times higher in females than in males; both specialists and AI therefore have more experience in reading female growth plates, which allows more accurate prediction [9].
In this study, the Bland-Altman plot was used to evaluate the accuracy of the prediction. Other studies of BA and PAH have also used Bland-Altman plots to compare accuracy between methods or with AI. Jeong et al. [10] confirmed that the differences between the expected adult height and the FAH calculated using the BP method fell within the limits of agreement. Kim et al. [2] showed that most values lay within the limits of agreement for both the specialist's predictions and the BoneXpert predictions, with little difference between the 2 methods.
One limitation of this study is that only about 40% of those contacted reported their FAH by telephone, because the calls were made about 10 years after the outpatient visit. Selection bias may therefore have occurred, as those whose FAH did not reach the PAH may have been more likely to exclude themselves. In addition, because of the nature of a telephone survey, the reported height may be greater than the actual measured value [4, 5]. The specialist's BA reading and the AI prediction were performed in 2012 and 2021, respectively, so the latter could have had an advantage.
In addition, this study compared the BA results of a single pediatric endocrinologist with those of AI, and most current studies are single-center studies. Later studies should therefore include multiple pediatric endocrinologists and results from multiple centers.
With the spread of AI medical devices, doctors who are not pediatric endocrinologists may rely on AI to diagnose growth problems from BA alone, and as a result there are cases where proper evaluation and treatment of diseases are delayed or missed.
In conclusion, the accuracy of the PAH assessed by a specialist in this study was 93%, while that assessed by AI was 78%. This suggests that AI prediction may still need monitoring by specialists.
This study was conducted on a small group randomly selected from a single center. It should not be interpreted as showing that AI can predict BA or PAH in place of pediatric endocrinologists, and AI companies should avoid using this paper commercially.