In this section, we describe data acquisition, preparation, and preprocessing in detail and present the proposed Bayesian optimized hybrid methodology. A comprehensive illustration of the proposed methodology is shown in Fig. 1.
Dataset acquisition
A large amount of representative and authentic data is required to construct effective and optimized crop disease classification and detection models. Machine learning and deep learning systems perform better when fed a large volume of data. In this proposed methodology, we used the diverse, open-access, research-oriented ‘PlantVillage’ dataset70,71 for tomato leaf disease classification. It is a versatile dataset that contains more than 54,000 images of 14 types of crops (i.e., fruits and vegetables). This expert-contributed dataset provides images in color, segmented, and grayscale variations. For our research, we used 18,159 colored and labeled images of tomato leaves divided into 10 different classes (16,568 images for 9 disease categories, i.e., Bacterial spot: 2127, Early blight: 1000, Late blight: 1908, Leaf mold: 952, Septoria leaf spot: 1771, Target spot: 1404, Tomato mosaic virus (TMV): 373, Two-spotted spider mites (TSSM): 1676, Tomato yellow leaf curl virus (TYLCV): 5357, and 1591 images for the Healthy class). Some of the images were blurry, noisy, and had uneven lighting and low contrast. Sample images of the tomato leaf diseases are depicted in Fig. 2.
The following are the signs and symptoms of the nine tomato leaf diseases:
Target spot is a result of the fungal pathogen Corynespora cassiicola. Initial symptoms include small, irregularly shaped lesions, which later develop into larger, ring-shaped lesions. This leads to the yellowing and necrosis of the leaves.
Tomato yellow leaf curl virus (TYLCV) is transmitted by silverleaf whitefly (Bemisia tabaci). TYLCV infection induces severe symptoms in tomato plants, including yellowing of leaves, curling of shoots, and shortened shoots. This type of infection can result in considerable yield reductions of up to 100%.
Tomato late blight is attributed to the oomycete pathogen Phytophthora infestans (P. infestans). Symptoms of late blight on tomato leaves and fruit are characterized by irregularly shaped, water-soaked lesions and fruit rot, distinguishing it from early blight.
Early blight of tomatoes, mainly caused by the fungus Alternaria linariae (= A. tomatophila, previously known as A. solani), affects the foliage, fruit, and stems of tomato plants. It initially shows up as lesions on the lower leaves and can eventually spread to the entire fruit.
Bacterial spot, caused by Xanthomonas vesicatoria, Xanthomonas euvesicatoria, Xanthomonas gardneri, and Xanthomonas perforans, manifests as brown, circular, water-soaked spots on the leaves. Under favorable conditions, these spots can merge to form long, dark streaks.
Tomato leaf mold is initiated by a fungal pathogen called Passalora fulva (also known as Cladosporium fulvum). The symptoms of this disease appear as light green spots on the leaves, which later develop into olive-green conidia and necrotic spots.
Tomato mosaic virus is transmitted from one plant to another by several species of aphids. Infected plants exhibit a range of symptoms, such as irregular fruit shapes, fruit lesions, and reduced fruit size. Other symptoms include deformed growth points, unusual leaf coloration, shapes, patterns, twisted stems, and overall plant distortion and stunting.
Two-spotted spider mites cause tan or yellow, crusty-textured symptoms on the undersides of affected leaves.
Septoria leaf spot is caused by the fungus Septoria lycopersici. Initial symptoms include small, water-soaked, circular spots on the undersides of older leaves, which then develop gray to tan centers with a dark-brown margin.
Data preparation and preprocessing
An essential step in the pipeline of the disease classification model is image preparation and pre-processing. Since the raw images might vary in size, contain noise, or have uneven illumination, pre-processing is necessary before applying them to any deep learning models72. Additionally, proper preprocessing of data impacts the model’s performance. First, we renamed the lengthy class labels to a more readable format (i.e., ‘tomato_yellow_leaf_curl_virus’ to ‘TYLCV’, ‘tomato_mosaic_virus’ to ‘TMV’, and ‘two-spotted_spider_mites’ to ‘TSSM’).
In the second stage of data preparation and preprocessing, we balanced the sizes of the classes available in the dataset. For this class-specific data preparation, we applied two popular techniques: data augmentation and random downsampling. The adopted approaches are described in detail in the “Data augmentation” and “Random downsampling” sections.
Dataset splitting
The original dataset was divided into three sub-datasets: training, validation, and testing. The division was executed by randomly selecting 70% of the data for training, allocating 15% for validation, and reserving the remaining 15% for testing. The training dataset was used to fit the models, and the validation dataset was used to assess the models’ performance after training. A validation dataset is also needed to evaluate the tuned hyperparameters in order to improve the models’ performance. Once the models were verified on the validation dataset, the testing dataset was used for the final evaluation. A comprehensive description of these three datasets is presented in Tables S1 and S2. The three datasets were stored locally in three folders, and each folder was divided into ten subfolders named after the disease classes. The folder arrangement is graphically depicted in Fig. S1.
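For illustration, the following minimal sketch (not our exact script) shows how such a 70/15/15 split can be produced while preserving the class-subfolder layout of Fig. S1. The folder names and random seed are placeholders.

```python
# Split class folders into train/validation/test at a 70/15/15 ratio.
import os, random, shutil

SRC = "plantvillage_tomato"   # hypothetical source folder, one subfolder per class
DST = "dataset"               # hypothetical destination root
random.seed(42)

for cls in os.listdir(SRC):
    images = os.listdir(os.path.join(SRC, cls))
    random.shuffle(images)
    n = len(images)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    splits = {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }
    for split, files in splits.items():
        out_dir = os.path.join(DST, split, cls)
        os.makedirs(out_dir, exist_ok=True)
        for f in files:
            shutil.copy(os.path.join(SRC, cls, f), os.path.join(out_dir, f))
```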
Data augmentation
As some of the classes possessed a higher number of images than others, an imbalance problem arose in the training dataset. Training a deep learning model on an imbalanced dataset can result in poor generalization, because the model may become biased toward the majority classes and fail to perform well on the minority classes. To prevent this issue, a widely used method known as data augmentation was applied. Data augmentation serves as a regularization strategy that helps avoid overfitting and enhances the model’s robustness by introducing variability into the dataset. It also helps the model categorize real-life plant leaf diseases more accurately73. Fundamentally, the aim of data augmentation is to increase the dataset’s size artificially74. In this study, we applied data augmentation to nine classes using the ‘ImageDataGenerator’ class from Keras, generating augmented images from the original images to balance the training dataset by increasing the number of images in those classes. This process preserved the labels of the augmented images. We kept the per-class image counts in the range of roughly 1400 to 2130 to create a balanced dataset, although not a perfectly balanced one, since perfectly balanced datasets are rare in real life. The number of augmented images was determined by setting the required number of image samples per class and subtracting the number of images already present in the training dataset. We increased the image samples for the ‘Bacterial spot’, ‘Early blight’, ‘Healthy’, ‘Late blight’, ‘Leaf mold’, ‘Septoria leaf spot’, ‘Target spot’, ‘TMV’, and ‘TSSM’ classes from 1489, 700, 1114, 1336, 667, 1240, 983, 262, and 1174 to 2127, 1600, 1591, 1909, 1852, 1771, 1404, 1990, and 1676, respectively, by using different augmentation techniques. The applied augmentation parameters were rotation_range = 30, width_shift_range = 0.1, height_shift_range = 0.1, shear_range = 0.1, zoom_range = 0.2, horizontal_flip = True, fill_mode = ”reflect”, and brightness_range = (0.5, 1.5).
The augmented images were saved with the “.jpg” extension in the augmented folder, and the ‘aug_’ prefix was used to differentiate the original training images from the augmented images (Fig. S2). Then the original and augmented images were merged to generate the final training dataset. The distribution of the images is illustrated in Fig. S3.
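A minimal sketch of this class-wise augmentation step is shown below, using the parameters listed above. The directory names and the illustrative TARGETS dictionary are assumptions, not our exact script; only the augmentation settings, the 256 × 256 target size, and the ‘aug_’ prefix follow the description in the text.

```python
# Generate augmented images per class until the target count is reached.
import os
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=30, width_shift_range=0.1, height_shift_range=0.1,
    shear_range=0.1, zoom_range=0.2, horizontal_flip=True,
    fill_mode="reflect", brightness_range=(0.5, 1.5))

TARGETS = {"early_blight": 1600, "TMV": 1990}   # example per-class targets

for cls, target in TARGETS.items():
    out = os.path.join("train_augmented", cls)
    os.makedirs(out, exist_ok=True)
    n_existing = len(os.listdir(os.path.join("train", cls)))
    n_needed = max(target - n_existing, 0)       # augmented images to generate
    flow = augmenter.flow_from_directory(
        "train", classes=[cls], target_size=(256, 256), batch_size=32,
        save_to_dir=out, save_prefix="aug_", save_format="jpg")
    generated = 0
    while generated < n_needed:                  # each next() writes one batch to disk
        x_batch, y_batch = next(flow)
        generated += len(x_batch)
```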
Random downsampling
Random downsampling is the technique of removing data samples from the majority class at random in order to prevent them from dominating the learning algorithm. In machine learning, it is frequently used to balance the dataset so that the training data is more representative and consumes less memory. In the ‘TYLCV’ class, the image samples were comparatively higher than in other classes in the original training dataset. So, for this class, we downsampled the images from 3750 to 2000 by randomly selecting 2000 images using the ‘random.sample’ function of the Python ‘random’ module.
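A minimal sketch of this downsampling step is given below; the folder paths and random seed are illustrative, while the use of ‘random.sample’ and the target of 2000 images follow the description above.

```python
# Randomly keep 2000 images of the majority 'TYLCV' class.
import os, random, shutil

random.seed(0)
src = os.path.join("train", "TYLCV")
dst = os.path.join("train_downsampled", "TYLCV")
os.makedirs(dst, exist_ok=True)

kept = random.sample(os.listdir(src), 2000)   # 2000 randomly chosen filenames
for fname in kept:
    shutil.copy(os.path.join(src, fname), os.path.join(dst, fname))
```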
The following Algorithm 1 depicts the process of data augmentation and random downsampling.
Resizing and rescaling
Resizing operations in deep learning produce homogeneous images by downscaling the input images. The performance and training time of deep learning models are significantly influenced by image resolution75,76,77,78. In general, CNNs operate with low- to mid-level image resolutions, commonly ranging between 64 × 64 and 256 × 25676. Several studies have shown that deep learning models perform better at a 256 × 256 image resolution than at other resolutions76,77. Therefore, to resize the images to a target resolution of 256 × 256 pixels, we utilized the ‘flow_from_directory’ method of the ‘ImageDataGenerator’ class from the ‘Keras’ API. After resizing, we applied a rescaling (also called normalization) operation to the images using ImageDataGenerator (rescale = 1/255) from the tf.keras.preprocessing.image module. With this rescaling technique, the input images’ pixel values were rescaled from the range [0, 255] to the range [0, 1] and stored as NumPy arrays. NumPy arrays are a memory-efficient, fast data structure for numerical computation, model training, and evaluation. These arrays were passed as input to the custom CNN model for feature extraction.
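The following minimal sketch illustrates this resizing and rescaling stage; the directory names are placeholders, while the 256 × 256 target size, the rescale = 1/255 setting, and the batch size of 8 follow the values reported in this study.

```python
# Resize to 256x256 and rescale pixel values into [0, 1] while streaming batches.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_directory(
    "dataset/train", target_size=(256, 256), batch_size=8,
    class_mode="categorical", shuffle=True)
val_gen = datagen.flow_from_directory(
    "dataset/val", target_size=(256, 256), batch_size=8,
    class_mode="categorical", shuffle=False)
test_gen = datagen.flow_from_directory(
    "dataset/test", target_size=(256, 256), batch_size=8,
    class_mode="categorical", shuffle=False)

x_batch, y_batch = next(train_gen)   # NumPy arrays of shape (8, 256, 256, 3) in [0, 1]
```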
Proposed deep hybrid learning (DHL) approach
In the proposed Bayesian optimized deep hybrid learning approach, we employed deep learning for feature extraction and machine learning for feature categorization. A custom CNN model was used as the deep learning framework, and seven advanced supervised learning algorithms were applied as the machine learning framework. The hyperparameters of the deep learning and machine learning models were optimized using the Bayesian optimization approach.
Bayesian optimization
Machine learning algorithms require precise adjustment and optimization of learning parameters and hyperparameters. Hyperparameters (e.g., the kernel in SVM, n_estimators in a random forest, n_neighbors in KNN, and the learning rate and number of hidden layers in a neural network) are configuration settings that we can alter according to the nature of the dataset and the specific problem at hand. Mathematically, hyperparameter optimization can be represented79 as \({\text{x}}^{\prime} = \underset{\text{x} \in \text{X}}{\arg\min}\; \text{f}(\text{x})\), where X is the search domain, f(x) is the objective function (i.e., the loss), and \({\text{x}}^{\prime}\) is the set of hyperparameters that gives the lowest value of f(x). There are various techniques available to optimize the hyperparameters. Sequential Model-based Optimization (SMBO), also known as Bayesian optimization, is considered the most promising approach compared with other existing hyperparameter tuning methods such as manual search, grid search, and random search. Manual search takes a lot of time, effort, and domain knowledge to locate the ideal hyperparameters. Grid search and random search are also time-expensive processes compared to Bayesian optimization42. Bayesian optimization (BO) is a reliable iterative method that is well-known for its effectiveness, since it uses previous results to inform future evaluations. It is particularly effective for optimizing costly black-box functions with multiple external dependencies. BO’s effectiveness hinges on two primary components: surrogate models and acquisition functions. The surrogate model mimics the objective function using a probability distribution, and the acquisition function directs the search by balancing exploration and exploitation. Exploration involves investigating unexplored regions, whereas exploitation focuses on sampling locations where the surrogate model predicts a high objective value. For Bayesian optimization, Gaussian processes (GP) are the popular surrogate models since they are simple to use, easily adaptable to new data, and offer a confidence level for each of their predictions. The Gaussian process model generates a probability distribution over possible functions. This distribution is defined by a mean function, or the average appearance of these possible functions, and a kernel function, or the degree to which these functions change across inputs. On the other hand, the commonly used acquisition functions are the GP Upper Confidence Bound (GP-UCB), Probability of Improvement (PI), and Expected Improvement (EI).
Bayesian optimization relies on constructing and updating a probabilistic model of the objective function, which can be computationally intensive80, especially for high-dimensional or noisy functions. Additionally, striking the right balance between exploration and exploitation is crucial for effective Bayesian optimization81. Zoubin Ghahramani82 noted that the current most effective global optimization methods maintain a Bayesian representation of the probability distribution over the uncertain function ‘f’ being optimized and utilize this uncertainty to determine the next query location within the search space ‘X’. Hence, despite its limitations, Bayesian optimization is widely favored in professional settings due to its efficient optimization of costly evaluations using probabilistic models. The present study implemented Bayesian optimization (with a Gaussian process-based surrogate) during the deep learning phase, utilizing the KerasTuner framework. The optimization was performed over 100 trials, and the optimal hyperparameters were obtained from the trial with the highest validation accuracy.
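A minimal sketch of such a KerasTuner search is shown below. The build_model function and its search space are hypothetical placeholders (the actual search space is summarized in Table S3); only the Bayesian tuner, the 100 trials, and the validation-accuracy objective follow the description above.

```python
# Bayesian hyperparameter search over a small illustrative CNN search space.
import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    model = keras.Sequential()
    model.add(keras.layers.Input(shape=(256, 256, 3)))
    for i in range(hp.Int("num_conv", 2, 4)):                  # number of conv blocks
        model.add(keras.layers.Conv2D(
            filters=hp.Choice(f"filters_{i}", [16, 32, 64]),
            kernel_size=hp.Choice(f"kernel_{i}", [2, 3]),
            padding="same",
            activation=hp.Choice(f"act_{i}", ["relu", "tanh"])))
        model.add(keras.layers.BatchNormalization())
        model.add(keras.layers.MaxPooling2D(pool_size=3, padding="same"))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(hp.Int("dense_units", 16, 64, step=16), activation="relu"))
    model.add(keras.layers.Dense(10, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.BayesianOptimization(build_model, objective="val_accuracy",
                                max_trials=100, directory="bo_tuning",
                                project_name="tomato_cnn")
# tuner.search(train_gen, validation_data=val_gen, epochs=30)
# best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
```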
In addition to the conventional Bayesian implementation, the study considered one of the advanced Bayesian optimization techniques called Tree-structured Parzen Estimator for the optimization of machine learning models.
Tree-structured Parzen Estimator
TPE83, or Tree-structured Parzen Estimator, represents a sophisticated variant of the Bayesian optimization method that employs the Gaussian mixture model for learning model hyperparameters. It considerably outperforms random search techniques regarding optimization efficiency84 and finds better hyperparameters in the same number of trials with low trial time79. Furthermore, with the TPE approach, the computing time for each iteration scales linearly, whereas for the Gaussian process, it scales cubically.
TPE diverges from the conventional Bayesian optimization approach by modeling P(x|y), the probability of the hyperparameters (x) given the value (y) of the objective function. In TPE, the algorithm transforms the configuration space, replacing uniform priors with truncated Gaussian mixtures, log-uniform priors with exponentiated truncated Gaussian mixtures, and categorical variables with re-weighted categorical distributions. By incorporating the observations {x1,…, xk} into these non-parametric densities, the substitutions yield a learning algorithm capable of representing a variety of densities across the configuration space.
TPE maintains two densities (Eq. 1) based on a threshold value (y*). This threshold divides the hyperparameter search space into two distinct categories: the “good” category, denoted as l(x), and the “bad” category, denoted as g(x).
$$\text{P}\left(\text{x} \mid \text{y}\right)=\begin{cases} \text{l}\left(\text{x}\right), & \quad \text{if } \; \text{y} < \text{y}^{*}\\ \text{g}\left(\text{x}\right), & \quad \text{if } \; \text{y} \ge \text{y}^{*}\end{cases}$$
(1)
where l(x) represents the density formed from the observations xi whose loss f(xi) is less than the threshold (y*), and g(x) represents the density formed from the remaining observations. The TPE algorithm chooses a threshold y* that exceeds the best observed f(x) so that some points can be used to form l(x). In this approach, y* is selected as some quantile \(\gamma\) of the observed y values, ensuring that p(y < y*) = \(\gamma\).
The parameterization in the TPE algorithm is specifically designed to streamline the optimization of Expected Improvement (EI). As indicated in the study by Bergstra et al.83, the expected improvement in the TPE algorithm can be represented as
$$\text{EI}_{\text{y}*}(\text{x})\propto {\left(\gamma +\frac{g(x)}{l(x)} (1-\gamma )\right)}^{-1}$$
This expression indicates that, in order to maximize improvement, we aim for points x that have a high probability under ℓ(x) and a low probability under g(x). The tree-like structure of ℓ and g simplifies the process of drawing numerous candidates based on ℓ and assessing them based on g(x)/ℓ(x). During each iteration, the procedure selects the candidate x∗ with the highest expected improvement.
To utilize the TPE, we used Hyperopt85, a hyperparameter optimization library consisting of four distinct components: (a) the search space or domain space, which includes the hyperparameters to be tuned and their corresponding value ranges (categorical, integer, or float); (b) the objective function, that is, a loss function to minimize; (c) the search algorithm; and (d) the trials database. Hyperopt is built on the SMBO methodology adapted to work with the TPE algorithm83.
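A minimal sketch of these four components in Hyperopt is given below. The random forest search space, the synthetic data, and the number of evaluations are illustrative only; the actual search spaces used in this study are reported with the tuned values in Table 2.

```python
# TPE tuning with Hyperopt: search space, objective, TPE algorithm, Trials database.
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)   # stand-in features

space = {
    "n_estimators": hp.quniform("n_estimators", 100, 600, 1),
    "max_depth": hp.quniform("max_depth", 3, 30, 1),
}

def objective(params):
    clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                 max_depth=int(params["max_depth"]),
                                 random_state=0)
    acc = cross_val_score(clf, X, y, cv=3).mean()
    return {"loss": 1.0 - acc, "status": STATUS_OK}    # minimize 1 - accuracy

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)
```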
The configurations of the proposed Bayesian optimization procedure are thoroughly presented in Algorithm 2.
Hyperparameter selection procedure
The hyperparameter selection procedure adopted for this study is divided into several steps. The steps are described below.
- First, we trained the models using the default hyperparameter settings to observe their performance. Default hyperparameters may not exploit the full potential of the model for a given problem. Additionally, an unoptimized learning rate might lead to slow convergence or overshooting. In most cases of this study, the default values of hyperparameters did not provide sufficient regularization, leading to overfitting, and in some cases, they provided too much regularization, causing underfitting.
- Next, we tuned the hyperparameters manually to get an idea of the best possible range of hyperparameters. This manual tuning approach slightly reduced overfitting issues.
- Finally, we applied Bayesian search to tune the hyperparameters. The manual hyperparameter tuning helped in selecting the proper interval of hyperparameters for the Bayesian search space. The Bayesian search then looked for the best set of hyperparameters for the employed models. We achieved the best hyperparameters by combining the Bayesian search with the manual search results.
All hyperparameter tuning strategies were implemented on the training dataset and validated using the validation data. The best set of hyperparameters obtained by the validation was then used for the final training of the models.
Architecture of the proposed Bayesian optimized CNN Model
Convolutional Neural Network86 (CNN) is a prominent deep learning method that utilizes convolution operations to extract distinct features from input images and classifies them using fully connected (dense) layers with activations. Several layers, including convolutional, max pooling, and average pooling layers, are used in feature extraction. An output layer with the requisite number of classes is added as the final layer. The proposed CNN architecture consists of a series of convolution, batch normalization, and max pooling layers. It also includes flatten, dropout, and dense layers. We used the Bayesian optimization technique to tune the hyperparameters of the CNN model (i.e., number of 2D convolutional layers, number of filters, kernel size, activation function, regularization on kernel weights, number of MaxPooling2D layers, pool size, etc.). The Bayesian optimized hyperparameters for the proposed CNN are shown in Table S3.
Convolution2D
A 2D convolution layer, or conv2D, is the most common type of convolutional layer. It performs convolution operations on the 2D input image and passes the convolution result to the succeeding layer. It consists of several filters, or kernels. The kernel convolves horizontally and vertically over the entire image, performs the dot product with the input image, and outputs a scalar value each time. Such a combination of scalar values is called a feature map. In this way, it decreases the input 2D image size and extracts the relevant features. The first conv2D layer extracts low-level features such as edges and textures, whereas the final conv2D layer takes the output from the earlier layers and extracts more task-specific, high-level, and complex features. Stride and padding are the two common elements of the conv2D layer that affect convolution operations. Stride denotes the step size of the conv2D window, and padding affects the spatial dimension of the image. According to Bayesian hyperparameter tuning, our custom CNN model includes four conv2D layers with 32, 64, 16, and 64 filters, respectively. The kernel size is 2 × 2 (3 × 3 for the third conv2D), which corresponds to the width and height of the conv2D frame. We used “same” padding, so after a convolution, the output size remains the same as the input size. The total padding amount on rows and columns can be determined using Eq. (2). We also set the stride value to 1, so after elementwise multiplication of the filter with the image, the filter window moves across the image both vertically and horizontally by 1 pixel. The output size of a CNN layer with “same” padding and a given stride can be found by using Eq. (3).
$$\begin{aligned} {\text{P}}_{\text{r}} &= \left({\text{R}}^{\prime} - 1\right) \times {\text{S}} + \left({\text{K}}_{\text{r}} - 1\right) \times {\text{D}}_{\text{r}} + 1 - {\text{R}}\\ {\text{P}}_{\text{c}} &= \left({\text{C}}^{\prime} - 1\right) \times {\text{S}} + \left({\text{K}}_{\text{c}} - 1\right) \times {\text{D}}_{\text{c}} + 1 - {\text{C}}\end{aligned}$$
(2)
where (\({\text{R}}^{\prime}\), \({\text{C}}^{\prime}\)) denote the output dimensions, (R, C) are the input dimensions, S = stride, (\({\text{D}}_{\text{r}}\), \({\text{D}}_{\text{c}}\)) are the dilations, (\({\text{K}}_{\text{r}}\), \({\text{K}}_{\text{c}}\)) are the filter dimensions. The dilation rate was maintained at its default setting of (1, 1).
Padding (“same”) is applied in the following manner:
⌊Pr/2⌋ rows on top, Pr − ⌊Pr/2⌋ rows on the bottom, ⌊Pc/2⌋ columns on the left, and Pc − ⌊Pc/2⌋ columns on the right. When Pr or Pc is odd, the bottom and right receive more padding than the top and left.
$${\text{Output shape }}\left( {\text{“same” padding}} \right) \, = {\text{ math}}.{\text{floor }}\left( {\left( {{\text{input shape }} – \, 1} \right)/{\text{strides}}} \right) \, + \, 1$$
(3)
Activation function
The activation function of a neural network introduces non-linearity into a neuron’s output. It regulates whether or not a particular neuron should be triggered, making this decision based on the weighted sum and bias terms. Using Bayesian optimization, we identified ‘ReLU’ as the best activation function for the majority of the conv2D layers, and we implemented this using the ‘activation’ argument available in the conv2D layer. It is a well-known activation function for CNNs that outperforms the other activation functions in terms of computational efficiency. ReLU returns the same value if the input is positive; otherwise, it returns zero. However, for the second conv2D layer, the Bayesian search identified ‘Tanh’ as the best fit. Tanh maps the inputs into the range of − 1 to + 1. The mathematical equations for these activations are presented in Eq. (4).
$$\begin{aligned} {\text{f}}_{\text{ReLU}}\left(\text{x}\right) &= \text{max}\left(\text{x}, 0\right)\\ {\text{f}}_{\text{Tanh}}\left(\text{x}\right) &= \frac{{\text{e}}^{\text{x}} - {\text{e}}^{-\text{x}}}{{\text{e}}^{\text{x}} + {\text{e}}^{-\text{x}}}\end{aligned}$$
(4)
MaxPooling2D
A pooling layer is commonly added after a convolutional layer. It performs a downsampling operation on the output of the conv2D layer and passes a lower-dimensional feature map to the next layer. Among the several pooling operations, MaxPooling2D is the most widely implemented option. It requires minimal computing expense to extract the most salient features from the input data. We implemented MaxPooling2D using the layers.MaxPooling2D() class from TensorFlow’s Keras API. The MaxPooling2D frame is defined by the dimensions of the pooling window. For most of the pooling layers, we used a 3 × 3 pooling window, which means that the window first selects the top-left 3 × 3 region of the convolutional output, fetches the maximum value from the 9 elements of that 3 × 3 block, and stores it in the output feature map of the MaxPooling2D. Likewise, the pooling window covers the entire convolutional output until it reaches the bottom-right 3 × 3 region. The window moves along the spatial dimensions (i.e., width and height) of the convolutional output according to the stride size. Like conv2D, MaxPooling2D supports both “valid” and “same” padding. In our proposed study, we employed three MaxPooling2D layers with a 3 × 3 stride and “same” padding, and one with a 2 × 2 stride and “valid” padding. The output shape of a feature map using “valid” padding can be estimated by Eq. (5).
$$\text{Output shape }\left(\text{“valid” padding}\right)\text{= math.floor ((input shape – pool size) / strides) + 1}$$
(5)
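As a quick numeric check of Eqs. (3) and (5), the short sketch below evaluates both formulas for a 256 × 256 input (the pooling and stride values shown are illustrative).

```python
# Output-shape formulas for "same" and "valid" padding.
import math

def out_same(input_size, stride):               # Eq. (3), "same" padding
    return math.floor((input_size - 1) / stride) + 1

def out_valid(input_size, pool_size, stride):   # Eq. (5), "valid" padding
    return math.floor((input_size - pool_size) / stride) + 1

print(out_same(256, 1))        # conv2D, stride 1, "same"  -> 256
print(out_valid(256, 2, 2))    # MaxPooling2D 2x2, "valid" -> 128
```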
Kernel regularizer
During optimization, we can apply penalties to layer parameters using regularizers. By using the kernel regularizer (L2), a penalty was applied to the layer’s kernel. L2 regularization, also known as Tikhonov’s87 or ridge regularization, helps to develop robust models. The penalties are added to the loss function in order to minimize the weight matrix (w) values, which aids in reducing overfitting significantly. We used the most commonly used categorical cross-entropy loss for our proposed multiclass classification. The mathematical equation of the cost function of L2 regularization is shown in Eq. (6).
$${\text{Cost}}\;{\text{function}} = {\text{Loss}} + \frac{\uplambda }{{2{\text{m}}}}\left[ {\sum {{{\left\| {\text{w}} \right\|}^2}} } \right]$$
(6)
\(\lambda\) is the regularization parameter, and m is the number of inputs. According to Bayesian tuning, the values of \(\lambda\) in the 4 conv2D layers are 0.0001282, 0.0004593, 0.01, and 0.0011262, respectively.
Batch normalization
Like input normalization, batch normalization88 normalizes the output or activation of a preceding layer using mini-batches rather than the entire dataset. In the proposed CNN architecture, we used four batch normalization layers, one between each conv2D and MaxPooling2D layer. Batch normalization substantially smooths the optimization landscape, which causes the gradients to behave more predictably and consistently, allowing for faster training and convergence89. It also impacts weight initialization and controls overfitting by inducing minor regularization through the randomness of the batch-dependent mean and variance for each neuron activation. After this process, the batch’s activation values have zero mean and unit standard deviation. The mathematical equation of the batch normalization operation is expressed in Eq. (7). In addition, batch normalization uses two learnable parameters, γ (scale) and β (shift), for each neuron to calibrate the normalized activation, as expressed in Eq. (8).
$${\text{x}}_{\text{normalized}}^{(\text{i})}=\frac{{\text{x}}^{(\text{i})}-{\upmu }_{\text{B}}}{\sqrt{{\upsigma }_{\text{v}}^{2}+\upvarepsilon }}$$
(7)
$${\text{y}}^{(\text{i})}=\upgamma {\text{x}}_{\text{normalized }}^{(\text{i})}+\upbeta$$
(8)
where µB and \({\upsigma }_{\text{v}}^{2}\) are the mini-batch mean and mini-batch variance, respectively; ε is a constant for numerical stability.
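The short NumPy sketch below illustrates the transform in Eqs. (7) and (8) for a single mini-batch; the batch values, γ, and β are illustrative.

```python
# Batch normalization of one mini-batch of activations (Eqs. 7 and 8).
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=0)                        # mini-batch mean
    var = x.var(axis=0)                        # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit variance (Eq. 7)
    return gamma * x_hat + beta                # scale and shift (Eq. 8)

batch = np.random.randn(8, 4) * 3.0 + 2.0      # batch of 8 samples, 4 activations
print(batch_norm(batch).mean(axis=0).round(6))   # ~0 per activation
```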
Flatten layer
The output of the last pooling layer was flattened before passing through the dense layers. The flatten layer reshapes the MaxPooling2D output tensor into a vector (1D array) whose length equals the number of elements in the tensor (excluding the batch dimension). So, after flattening the (5, 5, 64) output tensor into a 1600-element vector, we obtained a total of 1600 features for our classification.
Training the custom CNN model
To employ the proposed CNN as an effective feature extraction tool, we first trained it with a conventional classification head. A typical CNN allocates one or more fully connected layers (and also some regularization layers) for classification tasks. The present study optimized the configuration of such layers using the Bayesian algorithm. We employed a single dense layer with 32 neurons, a kernel regularizer (l2) of 1.35e−05, and “relu” as the activation. Following this layer, we used a batch normalization layer and a dropout layer90. We set a dropout rate of 0.3 (i.e., randomly dropping 30% of the neurons) to prevent overfitting. In the output layer, we used 10 neurons, corresponding to the 10 classes of tomato leaves, and the probabilistic ‘softmax’ function as the activation. Additionally, we added two callback operations: EarlyStopping and ReduceLROnPlateau. EarlyStopping tracked the validation loss and stopped the training process if there was no improvement after 20 epochs, hence reducing overfitting concerns. ReduceLROnPlateau altered the learning rate based on the validation loss, lowering it by a factor of 0.2 if no improvement (i.e., learning stagnation) was seen after 5 epochs. Finally, we compiled the CNN model with the ‘categorical cross-entropy’ loss function and the ‘Adam’ optimizer, with a learning rate of 0.001 and beta_1 and beta_2 values of 0.9 and 0.999, respectively. The model was trained for 96 epochs.
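A minimal sketch assembling the CNN from the hyperparameter values reported above is shown below. The exact ordering of the 2 × 2 “valid” pooling layer within the four blocks is an assumption (any placement yields the (5, 5, 64) flatten output); all other settings follow the text and Table S3.

```python
# Custom CNN with four conv/batch-norm/pooling blocks and the classification head.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(256, 256, 3)),
    layers.Conv2D(32, 2, padding="same", activation="relu",
                  kernel_regularizer=regularizers.l2(0.0001282)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(3, padding="same"),
    layers.Conv2D(64, 2, padding="same", activation="tanh",
                  kernel_regularizer=regularizers.l2(0.0004593)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(3, padding="same"),
    layers.Conv2D(16, 3, padding="same", activation="relu",
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(3, padding="same"),
    layers.Conv2D(64, 2, padding="same", activation="relu",
                  kernel_regularizer=regularizers.l2(0.0011262)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(2, padding="valid"),     # yields a (5, 5, 64) feature map
    layers.Flatten(),                            # 1600 features
    # Classification head used only to train the feature extractor.
    layers.Dense(32, activation="relu", kernel_regularizer=regularizers.l2(1.35e-05)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])

callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=20),
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=5),
]

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
    loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_gen, validation_data=val_gen, epochs=96, callbacks=callbacks)
```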
Assessment of the custom CNN model
The performance of the CNN model considering the number of epochs used during the model’s training phase has been shown in Figs. S4 and S5. The proposed model has effectively learned from the training data, as evidenced by the continuous reduction in training loss (Fig. S4). Additionally, the consistently decreasing validation loss suggests that the model exhibits good generalization capabilities, avoiding overfitting to the training data. Regarding accuracy (Fig. S5), the model’s performance has consistently improved across epochs, both in training accuracy and validation accuracy. The gradually rising validation accuracy underscores the model’s predictive capabilities to enhance its accuracy on unseen data. In essence, the convergence between training and validation metrics in Figures S4 and S5 justifies the proposed CNN model’s streamlined design, as well as its adequacy and reliability for capturing underlying data patterns and its effectiveness as a feature extraction framework.
Feature extraction and collection
The classical CNN flattens the multidimensional features using a flatten layer and feeds them into dense layers for classification. In a dense layer, each neuron is connected to the neurons of the preceding layer. When a model has an excessive number of dense layers and neurons within them, the model’s parameter count rises dramatically, which makes the model complex, computationally expensive, and time-consuming. Moreover, the interpretation of the model becomes complicated due to the complex network. Machine learning (ML), on the other hand, has an extensive range of advanced classifiers with a diverse set of hyperparameters to boost performance. These models offer powerful capabilities and faster execution times, and they are simple to interpret. However, the most important considerations in adopting ML algorithms are their low computational costs and data efficiency. Therefore, in our suggested tomato leaf disease study, we took advantage of these ML methods for the classification of the CNN-generated features. To accomplish this, we first separated the feature extraction framework from the traditional classification framework by eliminating the last four layers of our custom convolutional neural network. After that, our model was left with 14 layers, i.e., 1 input, 4 conv2D, 4 MaxPooling2D, 4 batch normalization, and 1 flatten layer. We then extracted features for the training, validation, and testing datasets using the ‘predict()’ method. The extracted features and corresponding labels were appended using Python’s ‘append()’ method to six Python lists (e.g., x_train_hybrid, y_train_hybrid; x_val_hybrid, y_val_hybrid; and x_test_hybrid, y_test_hybrid).
Finally, using NumPy’s ‘concatenate()’ function, we concatenated all the feature and corresponding label arrays. For a single image, our proposed framework generated 1600 features. Therefore, the shapes of the training, validation, and testing independent and dependent values are (17,920, 1600), (17,920,); (2728, 1600), (2728,); and (2716, 1600), (2716,), respectively. The feature extraction framework required approximately 5.8 min (0.019 s/image), 0.79 min (0.017 s/image), and 0.8 min (0.018 s/image) to extract features for the training, validation, and testing datasets, respectively. Some of the extracted features are shown in Fig. S6.
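A minimal sketch of this feature-extraction step is given below, assuming the trained model from the previous sketch (or an equivalent flat Sequential) and the data generators defined earlier; the helper function and its name are illustrative, not our exact script.

```python
# Drop the four head layers, keep everything up to the flatten layer, and extract features.
import numpy as np
from tensorflow import keras

feature_extractor = keras.Model(inputs=model.inputs,
                                outputs=model.layers[-5].output)  # flatten output (1600 features)

def extract_features(generator):
    feats, labels = [], []
    for _ in range(len(generator)):                    # one pass over the split
        x_batch, y_batch = next(generator)
        feats.append(feature_extractor.predict(x_batch, verbose=0))
        labels.append(np.argmax(y_batch, axis=1))      # integer class labels
    return np.concatenate(feats), np.concatenate(labels)

x_train_hybrid, y_train_hybrid = extract_features(train_gen)
x_val_hybrid, y_val_hybrid = extract_features(val_gen)
x_test_hybrid, y_test_hybrid = extract_features(test_gen)
```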
Feature selection by Boruta
Feature selection in ML and DL is a crucial task for improving model generalization, reducing overfitting, and enhancing computational efficiency by prioritizing relevant features. This process is particularly valuable in addressing model complexity, minimizing noise, and acquiring domain-specific insights. Many prior studies have explored dimensionality reduction techniques and feature selection algorithms to select relevant features. While feature selection algorithms may excel at choosing non-redundant features, they can inadvertently overlook significant redundant ones. Dimensionality reduction techniques also fail to convey the importance of individual features for subsequent analysis91. Therefore, this study considered a wrapper algorithm named Boruta92 for filtering the important features for the classification of diseases. The Boruta algorithm works by creating randomness in the system and iteratively searching for relevant features, along with feature ranking.
The Boruta feature selection procedure is a methodical process that involves the following steps:
1. Duplication of original features: Initially, the original features are duplicated to generate shadow features.
2. Shuffling and merging: The duplicate features undergo shuffling to eliminate correlations with the response variable, and subsequently, they are merged with the original features.
3. Training a tree-based machine learning classifier: The expanded dataset, comprising both original and shadow features, is used to train a tree-based machine learning classifier (e.g., random forest).
4. Z-scores: Then, the computed Z-scores for the features are gathered, and the shadow feature with the maximum Z-score (MZS) is identified.
5. ‘Hit’ assignment: Each feature that scores better than the MZS receives a hit.
6. Two-sided equality test: Each feature with undetermined importance is subjected to a two-sided equality test with the MZS.
7. Identification of important features: Original features with significantly higher importance than the MZS (i.e., Z-scores of original features exceeding the MZS) are designated as important, while those with significantly lower importance are labeled as unimportant and permanently removed.
8. Removing shadow features: Then all shadow features are removed.
9. Iterative evaluation: The above steps are iterated until importance is assigned to all features or a fixed number of iterations is reached, ensuring a comprehensive assessment of the features’ significance relative to the MZS.
We implemented the Boruta feature selection algorithm by leveraging the ‘BorutaPy’ class from the ‘boruta’ Python package. The selection procedure involved instantiating the Boruta feature selector object using the syntax ‘BorutaPy (RandomForestClassifier (n_estimators = 491), n_estimators = ’auto’, perc = 100, alpha = 0.05, max_iter = 100)’ and subsequently fitting it to the training features (x_train_hybrid, y_train_hybrid). BorutaPy provides the ‘support_’ attribute, which yields a boolean array indicating the significance of each corresponding feature, and the ‘ranking_’ attribute, which furnishes the feature ranking based on importance. Through successive iterations, the algorithm identified 1175 (out of 1600) important features with corresponding ranks. Once the relevant features were identified, we extracted them from the training, validation, and testing sets using the ‘transform()’ method and stored them in three distinct NumPy arrays. Afterward, the Bayesian optimized machine learning models were trained using the Boruta-selected features.
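A minimal sketch of this selection step, using the instantiation syntax quoted above and the feature arrays from the previous step, is shown below; the printed diagnostics are illustrative.

```python
# Boruta feature selection on the CNN-extracted features.
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

selector = BorutaPy(RandomForestClassifier(n_estimators=491),
                    n_estimators="auto", perc=100, alpha=0.05, max_iter=100)
selector.fit(x_train_hybrid, y_train_hybrid)

print(selector.support_.sum())      # number of features confirmed as important
print(selector.ranking_[:20])       # feature ranks (1 = confirmed important)

# Keep only the selected features for each split.
x_train_sel = selector.transform(x_train_hybrid)
x_val_sel = selector.transform(x_val_hybrid)
x_test_sel = selector.transform(x_test_hybrid)
```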
The proposed novel feature extraction and selection mechanisms are elaborately depicted in Fig. 3.
Proposed Bayesian optimized classification approaches
In this study, we examined seven Bayesian optimized machine learning models for classification: Multinomial logistic regression (CNN-MLR), Gaussian Naïve Bayes (CNN-GNB), Support vector machine (CNN-SVM), K-Nearest Neighbours (CNN-KNN), XGBoost (CNN-XGBoost), Random Forest (CNN-RF), and Stacking (CNN-Stacking) classifiers. The XGBoost, Random Forest, and Stacking classifiers are commonly known as Ensemble Learning (EL) techniques. EL is a subset of machine learning in which the predictions of different learning algorithms are integrated into the prediction pipeline to achieve improved decisions. It is capable of solving sophisticated problems with higher performance metrics where standalone models may not perform well. Among ensemble learning strategies, bagging (e.g., random forest) reduces variance, boosting (e.g., XGBoost) reduces bias, and stacking improves predictive capability.
Multinomial logistic regression
Multinomial logistic regression (MLR), often known as the softmax classifier, is a probability-based classification technique. MLR is an extension of logistic regression in which there can be more than two outcomes. For k target classes, we feed the model a set of features \(\text{X}\)(x1, x2, x3,…,xn), and it outputs a computed probability vector Y(y1, y2, y3,…,yk) where \(\sum_{\text{i}=1}^{\text{k}}{\text{y}}_{\text{i}}=1.\) The predicted class is the one with the highest probability. In our study, k = 10; therefore, the softmax function can be defined as (Eq. 9):
$$\text{P}\left(\text{y}=\left.\text{i}\right|{\textrm{z}}_{\text{i}}\right)=\frac{{\textrm{e}}^{{\text{z}}_{\text{i}}}}{\sum_{\textrm{j}=1}^{\text{k}}{\textrm{e}}^{{\text{z}}_{\textrm{j}}}}; \quad {\textrm{where}}{\text{ z}}_{\textrm{i}}=\text{logit function}={\textrm{w}}_{\text{i}} \cdot \textrm{x}+{\text{b}}_{\textrm{i}}$$
(9)
where \(\text{P}\left(\text{y}=\left.\text{i}\right|{\text{z}}_{\text{i}}\right)\) is the probability of the target class y being class i given the linear score \({z}_{i}\). The loss function (difference between the predicted probabilities and the actual class labels) for MLR is the cross-entropy loss, or negative log-likelihood. Equation 10 shows the loss function for only the correct class (k).
$$\text{Cross entropy loss }(\text{ypred},\text{y})=-\text{log }(\text{ypred}(\text{k}))= -\text{log}\frac{{e}^{{z}_{k}}}{\sum_{j=1}^{k}{e}^{{z}_{j}}}$$
(10)
KNN
K-Nearest Neighbors (KNN) is a simple but powerful algorithm used for classification and regression problems. It is an instance-based algorithm, meaning it relies on instances (or examples) from the training data to make predictions. For our proposed scenario, we classified 10 types of tomato leaves based on 1175 features. Given these features, we defined each tomato leaf in our dataset as a point in a multi-dimensional space whose axes represent the features. To classify a new tomato leaf, we plotted it in this space and then found the ‘k’ tomato leaves closest to it (where ‘k = 2’ was chosen by Bayesian tuning). The new tomato leaf was then classified according to the majority class of these ‘k’ neighbors. For this, we calculated the ‘euclidean’ distance between the new entry and the rest of the entries in the dataset using Eq. (11).
$$\text{d} = \sqrt{{\left({\text{x}}_{2}^{1}-{\text{x}}_{1}^{1}\right)}^{2}+ {\left({\text{x}}_{2}^{2}-{\text{x}}_{1}^{2}\right)}^{2} + \cdots +{\left({\text{x}}_{2}^{1175}-{\text{x}}_{1}^{1175}\right)}^{2}}$$
(11)
where d is the distance and \({x}_{2}^{1}, {x}_{1}^{1}\) are the first features of the new data point and an existing data point, respectively. Using this formula, we calculated the distances for all the data points, sorted them in ascending order, and picked the first ‘k = 2’ points. Finally, the new tomato leaf data point was assigned to the class that held the majority among the ‘k = 2’ nearest neighbors.
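The equivalent scikit-learn formulation of this classifier is sketched below, using the Boruta-selected feature arrays from the earlier steps and the reported k = 2 with Euclidean distance.

```python
# Bayesian-tuned KNN on the Boruta-selected CNN features.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

cnn_knn = KNeighborsClassifier(n_neighbors=2, metric="euclidean")
cnn_knn.fit(x_train_sel, y_train_hybrid)

val_pred = cnn_knn.predict(x_val_sel)
print(accuracy_score(y_val_hybrid, val_pred))
```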
SVM
Support Vector Machines (SVM) is a supervised machine learning algorithm that is commonly used for classification and regression tasks. The SVM algorithm works by finding the hyperplane that maximizes the margin between the two classes. In 2-D space, this hyperplane is a line. Suppose we have a set of training examples {(x1, y1),…, (xn, yn)}, where each xi ∈ Rd(xi is a d-dimensional real vector), and yi ∈ {−1, 1} (yi represents the class label). The SVM algorithm seeks to find the optimal hyperplane that separates the two classes by solving the following (Eq. 12) optimization problem:
$${\text{Minimize }}\left( {1/2} \right)\left\| {\text{w}} \right\|^{2} {\text{ subject to y}}_{\text{i}} \left({\text{w}} \cdot {\text{x}}_{\text{i}} + {\text{b}}\right) \ge 1{\text{ for all }}1 \le i \le {\text{n}}.$$
(12)
where w is the normal vector to the hyperplane, ||w|| is the Euclidean norm of w, b is the bias term, and ‘⋅’ denotes the dot product. Let’s denote the features by x1, x2,…,x1175. The values y = 1 and y = −1 correspond to the ‘TMV class’ and ‘non-TMV class,’ respectively. Suppose the SVM algorithm has found the optimal hyperplane to be w⋅x + b = 0, where w = [w1, w2,… …,w1175] is the weight vector, x = [x1, x2,… …,x1175] is the input vector, and b is the bias term. We can interpret w₁ as the weight for feature-1 and w₂ as the weight for feature-2. A tomato leaf is classified as ‘TMV class’ if w⋅x + b ≥ 1, and ‘non-TMV class’ otherwise. SVM is effective for linearly separable data. If the data is not linearly separable, kernel tricks such as the polynomial kernel, radial basis function (RBF) kernel, etc., are used to map the data to a higher-dimensional space where it is linearly separable. As SVM works mainly for binary classification (2 groups), we implemented this using the one-vs-rest approach, where a hyperplane separates one class (i.e., TMV) by keeping it in group-1 and the remaining 9 classes (i.e., bacterial spot, early blight, and so on) in group-2. Therefore, the classifier utilizes 10 binary SVMs for the 10 classes. For test data, each binary SVM predicts a probability, and the output class is determined by the maximum probability score.
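A minimal sketch of this one-vs-rest setup is given below; the kernel shown is illustrative, and the tuned hyperparameter values are those listed in Table 2.

```python
# One-vs-rest SVM: ten binary SVMs, one per tomato leaf class.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

cnn_svm = OneVsRestClassifier(SVC(kernel="rbf", probability=True))
cnn_svm.fit(x_train_sel, y_train_hybrid)
test_pred = cnn_svm.predict(x_test_sel)   # class with the highest binary-SVM score
```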
RF
A random forest trains multiple decision trees on random subsets of the training data. The randomness serves two purposes: each decision tree in the random forest is trained on a different set of data points, and each test in a decision tree is chosen from a random subset of features. Random forests make predictions by having each decision tree in the forest predict the class of the input and then choosing the class that was most frequently predicted (majority voting). This is done to increase the diversity of the trees in the forest and decrease the variance of the predictions. Let X = {x1, x2,…, xn} be the set of input features and Y = {y1, y2,…, yn} be the corresponding classes, and assume there are m possible classes. A random forest with T decision trees is trained as follows: for t = 1 to T, choose a random subset (Xt, Yt) of the training data from X and Y, and train the t-th decision tree on it, choosing at each node a random subset of the features and finding the best test that splits Xt into two groups so as to minimize the impurity of Yt. To make a prediction for a new input x, let Yt(x) be the prediction of the t-th decision tree; the prediction of the random forest is argmax_i (sum_t [Yt(x) = i]), which is the class predicted by the most decision trees. Here [Yt(x) = i] is the indicator function, which is 1 if Yt(x) = i and 0 otherwise, and sum_t sums over all trees in the forest. The power of random forests comes from their ability to model complex, non-linear decision boundaries and their robustness to overfitting due to the aggregation of multiple diverse trees.
XGBoost
XGBoost93 is a machine learning algorithm based on the gradient boosting framework, which is used for both regression and classification problems. It is a very flexible model, as we can define our own objective functions and evaluation criteria. The goal of XGBoost is to minimize the following objective function (Eq. 13):
$${\text{Objective function}} = \sum {\text{L}}\left({\text{y}}_{\text{i}}, {\widehat{\text{y}}}_{\text{i}}\right) + \sum \Omega \left({\text{f}}_{\text{i}}\right)$$
(13)
Here, L is the loss function that measures the difference between the actual output yi and the predicted output \({\widehat{y}}_{i}\). Ω is the regularization term that prevents the model from overfitting, and f are the decision trees. The regularization term Ω in XGBoost’s objective function is given by (Eq. 14):
$$\Omega \left( {\text{f}} \right) = \upgamma {\text{T }} + 1/2 \uplambda \sum {\text{w}}^{{2}}$$
(14)
Here, T is the number of leaves in the tree, w is the score on the leaves, γ is the parameter for controlling the complexity of the tree, and λ is the L2 regularization term on the scores to avoid overfitting. XGBoost supports different objective functions, such as ‘multi:softprob’ and the default ‘binary:logistic’, depending on the nature of the data. The former objective function is dedicated to multiclass classification, while the latter is for binary classifiers. For our multiclass classification study, the objective function was ‘multi:softprob’, which outputs the predicted probability for each class. The multiclass softprob loss, also known as the softmax cross-entropy loss, is commonly used for multiclass classification problems. In the context of tomato leaf classification, let us assume we have “C” classes of tomato leaves (e.g., bacterial spot, TMV, TYLCV, etc.). For a single training example with input features x and ground truth label y (represented as a one-hot vector), the softmax function (Eq. 15) computes the predicted probabilities for each class.
$${\text{P}}_{{\text{i}}} = {\text{ exp }}\left( {{\text{Z}}_{{\text{i}}} } \right) \, /{\text{ sum }}\left( {{\text{exp }}\left( {{\text{Z}}_{{\text{j}}} } \right)} \right){\text{ for i }} = {\text{ 1 to C}}$$
(15)
where Zi represents the logits (pre-softmax scores) for class i. The softprob loss is then computed as the cross-entropy between the predicted probabilities and the ground truth label:
L = −sum (y * log(P)), where “*” denotes element-wise multiplication and log is the natural logarithm.
To train a classification model using this loss, we can apply gradient descent or another optimization algorithm to minimize the average loss over a training dataset.
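A minimal sketch of the multiclass XGBoost classifier with the ‘multi:softprob’ objective is shown below; the tree depth, learning rate, and number of estimators are illustrative, and the tuned values are reported in Table 2.

```python
# XGBoost on the Boruta-selected CNN features with the softprob objective.
from xgboost import XGBClassifier

cnn_xgb = XGBClassifier(objective="multi:softprob",
                        max_depth=6, learning_rate=0.1, n_estimators=300)
cnn_xgb.fit(x_train_sel, y_train_hybrid)
proba = cnn_xgb.predict_proba(x_val_sel)    # per-class softmax probabilities
```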
GaussianNB
Gaussian Naïve Bayes (GNB) is a probabilistic classification algorithm that assumes the features are normally distributed. For tomato leaf disease classification, we used GNB to predict the class of a tomato leaf based on its features. Let’s denote the features of a tomato leaf as X = {X1, X2,…, Xn}, where Xi represents the value of the i-th feature. We want to classify the tomato leaf into one of the classes C = {C1, C2, …, Ck}, where Cj represents the j-th class. The Gaussian Naïve Bayes classifier estimates the probability P(Cj|X) using Bayes’ theorem (Eq. 16):
$${\text{P}}\left( {{\text{C}}_{{\text{j}}} |{\text{X}}} \right) \, = \, \left( {{\text{P}}\left( {{\text{C}}_{{\text{j}}} } \right) \, *{\text{ P}}\left( {{\text{X}}|{\text{C}}_{{\text{j}}} } \right)} \right){\text{/P}}\left( {\text{X}} \right)$$
(16)
where P(Cj|X) is the posterior probability of class Cj given the features X; P(Cj) is the prior probability of class Cj; P(X|Cj) is the likelihood of the features X given class Cj; and P(X) is the evidence probability, which acts as a normalizing constant. Now, let’s assume that the features Xi are continuous and follow a Gaussian distribution within each class Cj. We can represent the likelihood P(X|Cj) as a product of individual feature probabilities by Eq. (17). And the individual feature probabilities can be estimated by Eq. (18) assuming a Gaussian distribution.
$${\text{P}}\left( {{\text{X}}|{\text{C}}_{{\text{j}}} } \right) = {\text{P}}\left( {{\text{X}}_{{1}} |{\text{C}}_{{\text{j}}} } \right) \times {\text{P}}\left( {{\text{X}}_{{2}} |{\text{C}}_{{\text{j}}} } \right) \times \cdots \times {\text{P}}\left( {{\text{X}}_{\text{n}} |{\text{C}}_{{\text{j}}} } \right)$$
(17)
$${\text{P}}\left( {{\text{X}}_{{\text{i}}} |{\text{C}}_{{\text{j}}} } \right) \, = \, \left( {{1 }/{\text{ sqrt }}\left( {{2}\uppi \, * \, \upsigma^{2} } \right)} \right) \, *{\text{ exp }}\left( { – \left( {{\text{X}}_{{\text{i}}} \, – \, \upmu } \right)^{2} /\left( {{2 }* \, \upsigma^{2} } \right)} \right)$$
(18)
where μ is the mean of feature Xi within class Cj, σ2 is the variance of feature Xi within class Cj. To classify a tomato leaf, we calculate the posterior probability P(Cj|X) for each class Cj and assign the tomato leaf to the class with the highest probability.
Stacking
Stacking is a type of EL method in which two levels of learners are present: the low-level (level-0) learners and the high-level (level-1) learner. The level-0 learners are also called base models or estimators, and the level-1 learner is called the meta-model or final estimator. In stacking, or stacked generalization, the predictions of each base model are stacked together and fed to the meta-model, which is trained on this output and provides the final prediction. Stacking provides flexibility in model selection and helps to obtain models with robust predictive capabilities. It diminishes the biases94 of the base estimators. In this study, we used KNN, SVM, and RF as base estimators and XGBoost as the final estimator. Such a combination of models yielded an outstanding stacked classifier with higher generalization capabilities. The hyperparameters of the base and final estimators are the same as those of the standalone models, as they were already optimized. Figure S7 shows the hybrid stacking classifier architecture.
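A minimal sketch of this stacked ensemble is given below, with KNN, SVM, and RF as level-0 estimators and XGBoost as the level-1 estimator; the individual hyperparameter values shown are illustrative, and the tuned values are those in Table 2.

```python
# Stacking: KNN, SVM, RF as base estimators; XGBoost as the final estimator.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

estimators = [
    ("knn", KNeighborsClassifier(n_neighbors=2)),
    ("svm", SVC(kernel="rbf", probability=True)),
    ("rf", RandomForestClassifier(n_estimators=491)),
]
cnn_stacking = StackingClassifier(
    estimators=estimators,
    final_estimator=XGBClassifier(objective="multi:softprob"),
    cv=5)
cnn_stacking.fit(x_train_sel, y_train_hybrid)
```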
The Bayesian optimized hyperparameters for all ML classifiers are shown in Table 2.
Implementation
The proposed hybrid models were trained using mini-batch gradient descent with a batch size of 8, which implies that parameters were updated after each batch of 8 samples. Table S4 outlines the specifications of the environmental setup.
The following algorithm 2 provides a comprehensive depiction of the various phases encapsulated within the proposed methodology.
Evaluation metrics
Machine learning models need to be evaluated properly to interpret their performance. We used a wide range of evaluation metrics to assess the performance of our proposed Bayesian optimized hybrid models, namely accuracy, precision, recall, f1-score, and Matthew’s correlation coefficient (MCC). Accuracy is a performance metric that reflects the overall performance of the model across all classes. It is the ratio of the number of correct predictions to the total number of predictions. Precision is measured by the ratio of correctly classified positive samples to the total number of samples classified as positive, either correctly or incorrectly. Recall is calculated by the ratio of correctly classified positive samples to the total number of actual positive (ground truth positive) samples. It measures the capability of the model to identify positive outcomes. Typically, if precision increases, recall decreases. Therefore, another metric, the f1-score, is used to trade off between precision and recall. It is a balanced metric commonly used to analyze the performance of a classification model. The f1-score is the harmonic mean of precision and recall. Precision, recall, and f1-score range between 0 and 1 (0–100%), where 1 means perfect and 0 means worst performance. The mathematical equations for accuracy, precision, recall, and f1-score are presented in Eqs. (19), (20), (21), and (22), respectively. Another quality measurement metric for multiclass classification is the MCC. For multiclass classification, it ranges between − 1 and + 1 or 0 and + 1, where + 1 means perfect classification. A high MCC value denotes that the model performs well across all categories of the confusion matrix (TP, FP, TN, FN). Precision, recall, and f1-score are asymmetric measures, which means that changing the class labels affects the metric values. MCC is a symmetric measure that determines the correlation between the true class and the predicted class. For ‘K’ classes, using the confusion matrix (C), we can define MCC using Eq. (23).
$$\text{Accuracy }=\frac{\sum \text{Correct Predictions}}{\text{Total number of predictions}}=\frac{\text{TP+TN} }{\text{TP+TN+FP+FN}}$$
(19)
$$\text{Precision }= \frac{\text{TP}}{\text{TP+FP}}$$
(20)
$$\text{Recall }=\frac{\text{TP}}{\text{TP+FN}}$$
(21)
$$\text{F1-score} =2 \times \frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}}$$
(22)
where TP = true positive (model correctly classified a positive item as positive), TN = true negative (model correctly classified a negative item as negative), FP = false positive (model incorrectly classified a negative item as positive), and FN = false negative (model incorrectly classified a positive item as negative).
$$\text{MCC}=\frac{{\text{ab}}-{\sum }_{\text{k}}^{\text{K}}{{\text{P}}}_{\text{k}}{{\text{t}}}_{\text{k}}}{\sqrt{\left({\text{b}}^{2}-{\sum }_{\text{k}}^{\text{K}}{{\text{P}}}_{\text{k}}^{2}\right)\left({\text{b}}^{2}-{\sum }_{\text{k}}^{\text{K}}{{\text{t}}}_{\text{k}}^{2}\right)}}$$
(23)
where \(\text{a} = {\sum }_{\text{k}}^{\text{K}}{{\text{C}}}_{\text{kk}}\), the number of samples correctly predicted (total), \(\text{b} = {\sum }_{\text{i}}^{\text{K}}{\sum }_{\text{j}}^{\text{K}}{{\text{C}}}_{\text{ij}}\), the total number of samples, \({\text{t}}_{\text{k}}\text{=}{\sum }_{\text{i}}^{\text{K}}{{\text{C}}}_{\text{ik}}\), the number of times ‘class k’ actually occurred, and \({\text{P}}_{\text{k}}\text{=}{\sum }_{\text{i}}^{\text{K}}{{\text{C}}}_{\text{ki}}\), the number of times ‘class k’ predicted.
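These metrics can be computed with standard scikit-learn implementations, as in the minimal sketch below; it assumes the stacking model and test arrays from the earlier steps, and the macro averaging shown for the multiclass precision, recall, and f1-score is an assumption rather than a setting stated above.

```python
# Evaluation of a fitted hybrid model with the metrics of Eqs. (19)-(23).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

y_pred = cnn_stacking.predict(x_test_sel)

print("Accuracy :", accuracy_score(y_test_hybrid, y_pred))
print("Precision:", precision_score(y_test_hybrid, y_pred, average="macro"))
print("Recall   :", recall_score(y_test_hybrid, y_pred, average="macro"))
print("F1-score :", f1_score(y_test_hybrid, y_pred, average="macro"))
print("MCC      :", matthews_corrcoef(y_test_hybrid, y_pred))
```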