
P-R curve analysis
To compare the predictions of the CenterNet, RetinaNet, EfficientDet, Faster R-CNN, YOLOv4, YOLOv5, YOLOv7, and YOLOX models for tomato ripeness and fruit peduncles, P-R curves were plotted separately for each model on the test set (200 images in total, containing 149 green, 62 half, 231 red, and 30 stem instances). The larger the area under a P-R curve, the better the model performs on that category. The results are shown in Fig. 6 below.
P-R curves for different models on each class. (a) P-R curve for the “red” category; (b) P-R curve for the “half” category; (c) P-R curve for the “green” category; (d) P-R curve for the “stem” category.
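To make the area-under-the-curve criterion concrete, the following is a minimal sketch (not the evaluation code used in this work) of how per-class precision-recall points and AP can be computed from confidence-sorted detections; the function name `pr_curve_and_ap` and the example detections are illustrative.

```python
import numpy as np

def pr_curve_and_ap(confidences, matched, num_gt):
    """Compute a per-class P-R curve and its AP (area under the curve).

    confidences : detection scores for one class, in any order
    matched     : 1 if the detection matches a ground-truth box (TP), else 0 (FP)
    num_gt      : number of ground-truth instances of this class in the test set
    """
    order = np.argsort(-np.asarray(confidences))            # sort by descending confidence
    hits = np.asarray(matched, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)
    # all-point interpolation: integrate precision over recall
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]                 # precision envelope (non-increasing)
    ap = np.sum((r[1:] - r[:-1]) * p[1:])
    return recall, precision, ap

# Example with hypothetical detections for the "stem" class (30 ground-truth instances)
rec, prec, ap = pr_curve_and_ap(
    confidences=[0.95, 0.90, 0.80, 0.60, 0.40],
    matched=[1, 1, 0, 1, 0],
    num_gt=30,
)
print(f"AP = {ap:.4f}")
```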
For the P-R curve of the red ripe tomato category (red), except for the YOLOv5, CenterNet and Faster R-CNN models, all curves lie toward the upper part of the plot. Although the differences are small, the YOLOX curve occupies the uppermost position before a Recall value of 0.9, so its area under the curve is larger than that of the other models, making it more suitable for this category.
The P-R curve for the half-ripe tomato category (half) shows that the YOLOX, YOLOv7 and YOLOv4 models have overlapping curves before a Recall value of 0.4; after that point, the YOLOv7 and YOLOv4 curves begin to decline gradually, while the YOLOX curve still occupies the uppermost position. Overall, the YOLOX model has the larger area under the curve and is more suitable for this category.
The P-R curve for the green ripe tomato category (green) shows that, except for Faster R-CNN, RetinaNet and YOLOv5, the curves lie close to the upper right of the plot with small gaps between them. At a Recall value of 0.9, however, the YOLOX curve decreases more slowly than the others and remains closer to the top, so its area under the curve is slightly larger, making the YOLOX model more suitable for this category.
The P-R curve for the tomato fruit stalk category (stem) shows that, except for the CenterNet model, the Precision of each model decreases as Recall increases, with the Faster R-CNN, RetinaNet and YOLOX curves lying relatively higher. For this category, the overall gap between these three models is small and all of them can be used to recognize it.
Combining the above P-R curve analyses for each tomato ripeness and fruit stalk category, the YOLOX model achieves the best overall performance in multi-category recognition and is therefore the most suitable for detecting and recognizing tomato ripeness and fruit stalks in this dataset.
Comparison of recognition results of YOLO series models
To assess the recognition performance of the YOLO series models on the tomato dataset, this study trained four network models, YOLOv4, YOLOv5, YOLOv7, and YOLOX, on the aforementioned training set of 640 images. The experimental comparison results are summarized in Table 2.
The comparison results in Table 2 show that the mAP of the YOLOX model on the tomato dataset is 8.12%, 21.04%, and 5.41% higher than that of the YOLOv4, YOLOv5, and YOLOv7 models, respectively; on the imbalanced "half" class, YOLOX is 17.21%, 24.98%, and 3.29% higher than the same models, respectively, and it achieves the highest F1 score of 85.25%. The YOLOX model integrates several mechanisms that distinguish it from the other models in the series. Its anchor-free design simplifies the architecture and reduces computational complexity, giving the model more flexibility in handling objects of varying scales and aspect ratios. The decoupled head improves the model's ability to classify and localize objects accurately. In addition, SimOTA (Simplified Optimal Transport Assignment) optimizes the matching strategy between predicted boxes and ground truth, improving training efficiency and detection performance. With these three mechanisms, the YOLOX model achieves a markedly higher detection rate, converges faster, and is more stable overall than the other models. Therefore, this paper adopts the YOLOX model as the base algorithm for this experiment and improves it to increase the accuracy of identifying tomatoes at different ripeness levels and their fruit stalks.
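As an illustration of the decoupled head design mentioned above, the following is a simplified PyTorch-style sketch, not the exact YOLOX implementation; the channel width, activation choice, and class count are assumptions for this example.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Simplified decoupled detection head: separate branches for
    classification and for box regression/objectness (cf. YOLOX)."""

    def __init__(self, in_channels=256, num_classes=4):  # 4 classes: red, half, green, stem
        super().__init__()
        self.stem = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        # classification branch
        self.cls_convs = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
        )
        self.cls_pred = nn.Conv2d(in_channels, num_classes, 1)
        # regression branch (shared by box and objectness predictions)
        self.reg_convs = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
        )
        self.reg_pred = nn.Conv2d(in_channels, 4, 1)   # anchor-free: (x, y, w, h) per location
        self.obj_pred = nn.Conv2d(in_channels, 1, 1)   # objectness

    def forward(self, x):
        x = self.stem(x)
        cls_out = self.cls_pred(self.cls_convs(x))
        reg_feat = self.reg_convs(x)
        return cls_out, self.reg_pred(reg_feat), self.obj_pred(reg_feat)

# one FPN level, e.g. a 20x20 feature map
cls, box, obj = DecoupledHead()(torch.randn(1, 256, 20, 20))
print(cls.shape, box.shape, obj.shape)  # [1,4,20,20] [1,4,20,20] [1,1,20,20]
```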
Comparison results of YOLOX model optimization
Results and analysis of attention mechanism experiments
To improve the model’s recognition accuracy on imbalanced samples, five attention modules (SE, ECA, BAM, CBAM, and NAM) were each added to the end of the neck network of the YOLOX model, and experiments on the test set of 200 images were used to validate the effect of the different attention mechanisms on recognition accuracy. The P-R curves of the models with the added attention modules are compared in Fig. 7 below.

P-R curves of YOLOX models with different attention modules on each class. (a) P-R curve for the “red” category; (b) P-R curve for the “half” category; (c) P-R curve for the “green” category; (d) P-R curve for the “stem” category.
The P-R curves for the red ripe tomato category (red) show that the model curves lie in the upper right of the plot with small gaps, except for the YOLOX+CBAM model, which sits noticeably below the others. The YOLOX+SE curve occupies the uppermost position over the Recall range of 0-0.6, so its area under the curve is larger than that of the other models, making it more suitable for this category.
The P-R curve for the half-ripe tomato category (half) shows that before a Recall value of 0.25 the YOLOX+BAM curve is higher than those of the other models, but after a Recall value of 0.4 the YOLOX+SE curve progressively rises above it, and the YOLOX+BAM curve drops faster than that of YOLOX+SE. Overall, the YOLOX+SE model has the larger area under the curve and is more suitable for this category.
The P-R curve for the green ripe tomato category (green) shows that, up to a Recall value of 0.8, all curves lie near the top of the plot, except for the YOLOX curve, which falls noticeably below the others after a Recall value of 0.4. At a Recall value of 0.9, all curves except that of the YOLOX+SE model decrease to different degrees; the YOLOX+SE curve decreases more slowly and remains at the top, so its area under the curve is slightly larger, making it more suitable for this category.
For the tomato fruit stalk category (stem), all curves except those of YOLOX, YOLOX+CBAM and YOLOX+BAM lie near the top of the plot and almost overlap. For this category, the gap between the YOLOX+SE, YOLOX+NAM and YOLOX+ECA models is small, and any of them can be used to identify it.
Combining the above P-R curve analyses for each tomato ripeness and stalk category, the gaps between the curves are small in the red, green and stem categories, because these classes account for a larger share of the data and their target features are clearer. In the half category, which has fewer samples, the advantage of the YOLOX+SE model is more pronounced, so it achieves the best overall performance in multi-category recognition and is the most suitable for detecting and identifying tomato ripeness and fruit stems in this dataset.
Attention module comparison results
To assess the impact of various attention mechanisms on model recognition accuracy, five attention modules (SE, ECA, BAM, CBAM, and NAM) were integrated at the end of the neck network of the YOLOX model for experimentation. The results are displayed in Table 3:
The YOLOX model enhanced with the SE attention module shows an increase in mean average precision (mAP) of 0.92% over the original model, with an F1 score of 86.00%; Precision and Recall improve by 2.97% and 0.82%, respectively. This indicates that the SE attention module raises the precision of the half category and also strengthens the relationships between features, improving the overall precision of the model. Conversely, the BAM, NAM, CBAM, and ECA attention modules decrease the mAP by 0.41%, 0.08%, 2.94%, and 1.04%, respectively, relative to the original model, and their other performance metrics also decline to varying extents. This suggests that spatial attention mechanisms, while adding more parameters, may overlook crucial information of specific categories within complex images, thereby impairing the model’s performance.
The per-category average precision (AP) in Table 3 shows that the SE attention module enhances the model’s focus on useful channel information by learning adaptive channel weights. This allows the model to better capture the complex interrelations between channels, improving performance on all four classes. Notably, the detection precision for the imbalanced sample reaches 88.03%, an improvement of 2.38% over the original model. With the SE attention module added, the accuracy and stability of the YOLOX model on imbalanced samples are significantly enhanced, making it more effective at identifying tomatoes at different ripeness levels and their stalks.
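For reference, a squeeze-and-excitation block of the kind attached to the end of the neck network can be sketched as follows; this is a minimal illustration rather than the exact module configuration used here, and the channel count and reduction ratio are assumed values.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block: learns per-channel weights
    via global average pooling and a small bottleneck MLP."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                 # excitation: channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                      # recalibrate channel responses

# e.g. applied to one neck output feature map
feat = torch.randn(1, 256, 40, 40)
print(SEBlock(256)(feat).shape)  # torch.Size([1, 256, 40, 40])
```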
Experimental results and analysis of loss function optimization
To select the loss function best suited to this paper’s dataset and to improve the robustness and overall detection accuracy of the model, this experiment replaces the IoU loss in YOLOX with the DIoU, CIoU and GIoU losses in turn and compares them on the test set of 200 images. The P-R curves of the models with the replaced loss functions are compared in Fig. 8 below.

P-R curves of YOLOX models with different loss functions on each class. (a) P-R curve for the “red” category; (b) P-R curve for the “half” category; (c) P-R curve for the “green” category; (d) P-R curve for the “stem” category.
The P-R curve for the red ripe tomato category (red) shows that the YOLOX curve lies clearly below those of the other models, which differ only slightly from one another. After a Recall value of 0.9, the YOLOX-GIoU curve occupies the uppermost position, so its area under the curve is larger than that of the other models, making it more suitable for this category.
The P-R curves for the half-ripe tomato category (half) show that the models decline to different degrees after a Recall value of 0.2 but remain nearly coincident until a Recall value of 0.5, where a gap appears between the curves. From a Recall value of 0.6 to 0.85, the YOLOX-GIoU curve consistently holds the top position. Considering the whole curve, the YOLOX-GIoU model stays at the top more often than the other models, so it has a larger area under the curve and is more suitable for this category.
The P-R curve for the green ripe tomato category (green) shows that the curves of all models remain coincident for Recall values of 0-0.5, after which the YOLOX curve falls below the others. At a Recall value of 0.9, the YOLOX-GIoU curve remains at the top while all the other curves begin to decrease, so the YOLOX-GIoU model is more suitable for this category.
For the tomato fruit stalk category (stem), the Precision of every model decreases as Recall increases, but the YOLOX-GIoU curve stays relatively higher than the others and declines more gently. For this category, the YOLOX-GIoU model is therefore the most applicable.
Combining the above P-R curve analyses for each tomato ripeness and fruit stalk category, the YOLOX-GIoU model exhibits superior multi-category recognition capabilities, making it particularly well-suited for detecting and recognizing varying stages of tomato ripeness and fruit stalks.
Loss function comparison results
To identify the most effective loss function for this study’s dataset and to enhance the model’s robustness and overall detection accuracy, this experiment compared DIoU loss, CIoU loss, and GIoU loss against the standard IoU loss in YOLOX. The comparative results are displayed in Table 4:
As can be seen from Table 4, under the same conditions the GIoU loss function improves the mAP by 0.44% and the F1 score by 1.00% relative to the original model. This shows that, on this experimental dataset, GIoU loss better reflects the overlap between the predicted box and the ground truth box while also capturing the differences between fruits and fruit stalks, which improves the recognition accuracy of each category and increases the overall stability of the model. Replacing the loss function with DIoU loss or CIoU loss also yields some improvement over the original model, of 0.07% and 0.35% in mAP, respectively. However, DIoU loss is lower than the original model by 1.86%, 2.35%, and 2.00% in detection precision, recall, and F1 score, respectively, indicating that although DIoU loss and CIoU loss speed up convergence of the regression process, they reduce recognition performance. CIoU loss improves on DIoU loss by considering the aspect ratio of the bounding box, but its overall performance is still lower than that of GIoU loss.
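For clarity, the GIoU loss compared above can be sketched as follows for boxes in (x1, y1, x2, y2) format; this is an illustrative implementation, not the exact code of the detection framework used in this work.

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4).
    GIoU = IoU - (area(C) - area(A union B)) / area(C), with C the smallest enclosing box."""
    # intersection
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # smallest enclosing box C
    cx1 = torch.min(pred[:, 0], target[:, 0])
    cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2])
    cy2 = torch.max(pred[:, 3], target[:, 3])
    area_c = (cx2 - cx1) * (cy2 - cy1)

    giou = iou - (area_c - union) / (area_c + eps)
    return (1.0 - giou).mean()

# a disjoint pair still yields a useful gradient signal, unlike plain IoU loss
pred = torch.tensor([[0.0, 0.0, 2.0, 2.0]])
gt   = torch.tensor([[3.0, 3.0, 5.0, 5.0]])
print(giou_loss(pred, gt))  # greater than 1, since GIoU is negative for disjoint boxes
```

Unlike plain IoU loss, the enclosing-box penalty keeps the gradient informative when predicted and ground truth boxes do not overlap, which is common for small targets such as stems.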
In summary, replacing the model’s loss function with GIoU loss alleviates, relative to the original YOLOX model, the low recognition accuracy caused by the large difference in scale between the fruit and the fruit stalk, indicating that this loss function is better suited to the tomato dataset of this experiment.
Ablation experiment
To reflect the effect of adding both the SE attention module and the GIoU loss on the performance of the YOLOX model, comparative validation was performed using ablation experiments on the test set of 200 images.
A comparison of the model P-R curves is shown in Fig. 9 below.

P-R curves of all models in each category under ablation experiment. (a) P-R curve for the “red” category; (b) P-R curve for the “half” category; (c) P-R curve for the “green” category; (d) P-R curve for the “stem” category.
The P-R curve for the red ripe tomato category (red) shows that all model curves lie near the top of the plot with small gaps, but after a Recall value of 0.8 the YOLOX-SE-GIoU curve occupies the uppermost position, so its area under the curve is larger than that of the other models, making it more suitable for this category.
The P-R curve for the half-ripe tomato category (half) shows that before a Recall value of 0.2 the YOLOX-GIoU curve is the highest, while after a Recall value of 0.2 the YOLOX-SE-GIoU curve consistently lies above the others. Overall, the YOLOX-SE-GIoU model has the larger area under the curve and is more suitable for this category.
The P-R curve for the green ripe tomato category (green) shows that the YOLOX curve is clearly lower than those of the other models, with the smallest area under the curve. In the later part of the curve, the YOLOX-SE-GIoU model declines more slowly than the other models and encloses a larger area, so it is more suitable for this category.
The P-R curve for the tomato fruit stalk category (stem) shows that the YOLOX-GIoU and YOLOX+SE curves lie at the top of the plot before a Recall value of 0.2, and between Recall values of 0.2 and 0.7 the YOLOX-GIoU curve is uppermost. After a Recall value of 0.7, however, the YOLOX-SE-GIoU curve is higher than the others and decreases more slowly. For this category, although the gap between these models is small in the early stage, all curves begin to fall as Recall increases in the later stage, and the improved YOLOX-SE-GIoU curve falls more slowly, indicating that the improved model is more stable and performs better.
Combining the above P-R curve analyses for each tomato ripeness and fruit stalk category, the improved YOLOX-SE-GIoU model achieves better overall multi-category recognition than the other models and is more suitable for detecting and recognizing tomato ripeness and fruit stalks in this dataset.
Comparison of ablation performance of YOLOX model
To reflect the combined effect of adding the SE attention module and GIoU loss on model performance, an ablation study was conducted for comparative validation. The experimental results are shown in Table 5.
As can be seen from Table 5, adding the SE attention module enhances the model’s ability to focus on salient tomato features, such as color variations and shape details, by adaptively recalibrating channel-wise feature responses; this improves feature discriminability and raises the mAP by 0.92%. To obtain more accurate prediction boxes and reduce the accuracy degradation caused by differing target scales, the loss function in YOLOX is replaced with the GIoU loss, which provides a better measure of the overlap between predicted and ground truth boxes, especially for small, elongated structures such as tomato stems, and raises the mAP by 0.44%. To improve the overall performance and stability of the model, the two optimization strategies are combined. The final YOLOX-SE-GIoU model benefits from both the enhanced feature discrimination for tomato characteristics and the more accurate bounding box regression for stems, improving the AP of the imbalanced "half" class by 1.88% and that of the small-scale "stem" class by 3.78% compared with the original model; its mAP and F1 score also exceed those of either single optimization strategy, effectively reducing missed and false detections in tomato identification. In addition, inference speed was evaluated with the FPS metric: the YOLOX-SE-GIoU model runs at about 45 FPS, which essentially meets the real-time processing requirements of embedded harvesting devices.
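The reported frame rate can be obtained by averaging the per-image inference time and taking its reciprocal. The following is a minimal sketch of such a measurement, where `model`, the input size, and the warm-up count are placeholders rather than the exact benchmarking setup used in this work.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, num_images=200, img_size=640, device="cuda", warmup=10):
    """Estimate inference FPS by averaging the per-image forward-pass time."""
    model.eval().to(device)
    dummy = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(warmup):                  # warm up kernels and caches
        model(dummy)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_images):
        model(dummy)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return num_images / elapsed

# fps = measure_fps(yolox_se_giou_model)    # hypothetical model handle
# print(f"{fps:.1f} FPS")
```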
In summary, YOLOX-SE-GIoU was confirmed as the final model for this experiment, and it was applied to the tomato test set to verify the effect.
Model detection effectiveness analysis
A comparison of the detection results of the YOLOX-SE-GIoU model on the test set of 200 images is shown in Fig. 10. Under direct lighting, a half-ripe tomato is misclassified as a green ripe tomato in Fig. 10a; under backlighting, a red ripe tomato is misclassified as a half-ripe tomato in Fig. 10b, while a half-ripe tomato is detected as both a red ripe and a half-ripe tomato; owing to occlusion by leaves, a red ripe tomato is detected as both half-ripe and red ripe in Fig. 10c; owing to occlusion by an object, the fruit stalk of a green ripe tomato is missed in Fig. 10d. Although the improved model also fails to detect this fruit stalk, its overall confidence is higher than that of the original YOLOX model. In addition, the fruit stalk of a red ripe tomato is not recognized because its color is similar to that of a green ripe tomato, the short fruit stalk of a green ripe tomato is missed because of occlusion by leaves, and the half-ripe tomato is misclassified as a red ripe tomato in Fig. 10e. Figure 10i further shows that the improved model still misses some detections under partial occlusion and overlap, which is an area for future improvement. In conclusion, as shown in Fig. 10f–h,j, the YOLOX-SE-GIoU model is largely unaffected by occlusion, lighting, distance, and other factors, and can accurately identify tomatoes at different ripeness levels and their fruit stalks, with higher confidence than the original model.

Comparison of visual results of YOLOX-SE-GIoU(f–j) and YOLOX(a–e) tomato fruit and stem detection.