A Novel Approach to Classifying Breast Cancer Histopathology Biopsy Images Using Bilateral Knowledge Distillation and Label Smoothing ...
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Hindawi Computational and Mathematical Methods in Medicine Volume 2021, Article ID 4019358, 11 pages https://doi.org/10.1155/2021/4019358 Research Article A Novel Approach to Classifying Breast Cancer Histopathology Biopsy Images Using Bilateral Knowledge Distillation and Label Smoothing Regularization Sushovan Chaudhury ,1 Nilesh Shelke ,2 Kartik Sau ,1 B. Prasanalakshmi ,3 and Mohammad Shabaz 4 1 University of Engineering and Management, Kolkata, India 2 Priyadarshini Indira Gandhi College of Engineering, Nagpur, India 3 Department of Computer Science, King Khalid University, Abha, Saudi Arabia 4 Arba Minch University, Ethiopia Correspondence should be addressed to B. Prasanalakshmi; prengaraj@kku.edu.sa Received 3 September 2021; Revised 1 October 2021; Accepted 1 October 2021; Published 20 October 2021 Academic Editor: Deepika Koundal Copyright © 2021 Sushovan Chaudhury et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Breast cancer is the most common invasive cancer in women and the second main cause of cancer death in females, which can be classified benign or malignant. Research and prevention on breast cancer have attracted more concern of researchers in recent years. On the other hand, the development of data mining methods provides an effective way to extract more useful information from complex databases, and some prediction, classification, and clustering can be made according to the extracted information. The generic notion of knowledge distillation is that a network of higher capacity acts as a teacher and a network of lower capacity acts as a student. There are different pipelines of knowledge distillation known. However, previous work on knowledge distillation using label smoothing regularization produces experiments and results that break this general notion and prove that knowledge distillation also works when a student model distils a teacher model, i.e., reverse knowledge distillation. Not only this, but it is also proved that a poorly trained teacher model trains a student model to reach equivalent results. Building on the ideas from those works, we propose a novel bilateral knowledge distillation regime that enables multiple interactions between teacher and student models, i.e., teaching and distilling each other, eventually improving each other’s performance and evaluating our results on BACH histopathology image dataset on breast cancer. The pretrained ResNeXt29 and MobileNetV2 models which are already tested on ImageNet dataset are used for “transfer learning” in our dataset, and we obtain a final accuracy of more than 96% using this novel approach of bilateral KD. 1. Introduction though invasive or infiltrating ductal carcinoma may be rapidly progressive, but it is still a treatable condition at least Breast cancer is the most lethal type of tumour in women when identified in its initial stages [1]. Machine learning is a worldwide, second only to lung cancer in terms of preva- branch of artificial intelligence that employs a variety of lence. In the most common type of breast cancer, which is statistical, probabilistic, and optimization techniques that popularly known as invasive ductal carcinoma (IDC), the allow computers to “learn” from past examples and to detect cancer develops in the milk ducts of the breast and then rap- hard-to-discern patterns from large, noisy, or complex data- idly spreads to involve the surrounding structures. Invasive sets. As a result, machine learning is frequently used in ductal carcinoma is so progressive that it can even infiltrate cancer diagnosis and detection. Machine learning and deep the lymph nodes and the circulatory system, which leads to learning approaches may help radiologists to a large extent the cancer spreading to various parts of the body. Even by helping to detect breast cancer at an early stage by
2 Computational and Mathematical Methods in Medicine identifying useful patterns in malignant cells. The learnt be completely different from the convergence of the teacher patterns can be used to classify the unknown images of the network [9]. breast as benign or malignant. Mammography, ultrasonogra- The basic steps of knowledge distillation are given as phy, and histological biopsies are some of the numerous follows. options for BCD. Mammograms can help to understand calcification in human breast, whereas studying histology (1) Train the teacher network: The complex teacher images remains a challenge due to its complexity in pattern network is first trained using the whole dataset on identification. This study analyzes the gene activities of the high-performing GPUs survival vs. deceased for each therapy, and the potential bio- (2) Establish a link of correspondence: The student net- markers will help to identify the best therapy for the patients work is then designed and a correspondence needs to based on their gene expression test. This model has very high be formed between each output of the simple student accuracy levels, and it uses a hierarchical model as a tree that network and the complex teacher network includes one-versus-rest classifications. In 2012, there were roughly 1.7 million new cancer cases (representing 25% of (3) Forward propagation of knowledge through the all malignancies in women) and 0.5 million cancer deaths teacher network: Feedforward data through the (representing 15% of all cancer fatalities in women). In 140 teacher network to get all intermediate outputs; nations, ductal carcinoma is the most prevalent cancer diag- apply augmentation (if any) to the same nosis among females, and it is the primary cause of cancer death in 101 nations. According to McGuire [2], histological (4) Backpropagation through the student network: The study of breast cancers can also reveal the stage of the disease output from the teacher and the correspondence or the extent to which the cancer has spread. This is mea- relation are used to backpropagate the error to the sured in two ways: by the size of the tumour and by assessing student. Therefore, the student network learns to the involvement of regional lymph nodes. Both of these imitate the “knowledge” of the teacher variables provide valuable prognostic data for patient care. Smith et al. [3] recommend screening practices for the early 2. Related Work recognition of breast cancer. Computer-aided diagnosis of ductal carcinoma can advance the performance of radiolo- Koboldt et al. [10] studied key ductal carcinoma by genomic gists in several folds as explained by Filipczuk et al. and DNA copy arrays, DNA methylation, exome sequencing, Kowal et al. [4, 5]. The simple idea of the CAD system is to messenger RNA arrays, micro-RNA sequencing, and take raw descriptions from different modalities like mammo- opposite-segment protein arrays. The different classifica- gram and histology images and preprocess them. This pre- tions of breast cancers are researched in accordance with processing includes amending size, contrast, and brightness WHO standards [11]. They outline the methods required and performing augmentation in case input data which are to examine the histological description of breast cancer, scarce. The next step involves segmentation to critically iden- similar to the one explained by Veta et al. [12]. With the tify the regions of interest in the image so that the system introduction of whole slide imaging (WSI) scanners, which only learns the relevant part of the image and not the noise can be affordable and high-throughput histopathology slide part. The third stage is to extract features from a region of digitalization, this study topic has become particularly rele- interest (ROI), which is usually done using CNN, which han- vant, with the goal of substituting the optical microscope dles automatic feature extraction by producing feature maps as the principal tool used by pathologists. Breast cancer at each convolution layer and regularising the models with classification from histological photos is of considerable dropouts so that they do not fit [6]. The higher-order feature therapeutic value, according to [13], although it is yet to be maps are flattened, and classification is performed using the studied. The multiple classification of ductal carcinoma has class probabilities generated by a softmax layer. These a purpose to find the subcategories of breast cancer (ductal models, however, are massive, containing millions (if not bil- carcinoma, fibroadenoma, lobular carcinoma, etc.). How- lions) of parameters, and so cannot be used on edge devices. ever, multiclassification of breast cancer from histopathol- The technique is being tested on the BACH dataset, a large ogy pictures faces 2 major challenges: (1) sorting of binary histology dataset for breast cancer detection, to reduce com- classes (benign and malignant) and (2) minor variances in putational complexity. The concept is to represent compres- various classes due to the wide diversity of high-resolution sion by having a larger and more sophisticated pretrained picture appearances, strong coherency of malignant cells, network that teaches a smaller network (students) step by and large colour distribution inhomogeneity. As per Sharma step (teacher). Indeed, the theoretical search space for more et al. [14], special attention was given to histopathology complicated models is bigger than that of a smaller system images in the BACH dataset. A model of careful consider- [7, 8]. However, if we suppose that the equivalent (or even ation was proposed where the network emphasizes learning comparative) convergence can be attained using a smaller from patches and records good accuracy in multilevel classi- link, then the convergence space of the teacher system should fication. Kausar et al. [15] experimented with a number of connect with the solution space of the student system. Trag- deep convolutional models, using colour normalisation as a ically, that by itself does not ensure union of the solution normalisation strategy before applying augmentation to the space for the student network and the teacher network in a input data. Praveena Anjelin and Ganesh Kumar [16] similar area. The convergence of the student network may experimented with different transfer learning techniques
Computational and Mathematical Methods in Medicine 3 and pretrained models such as MobileNetV2, AlexNet, each image, these photos are categorized as [N] normal, ResNet, and VGGNet on the BACH dataset. Several efforts [B] benign, [I] in situ carcinoma, or invasive carcinoma, in the subject of knowledge distillation have also been com- with 100 labels each. The dataset was distributed into pleted. Ba and Caruana [17] investigated with the CIFAR10 70 : 30 train-test subgroups for the studies we ran. Because dataset and detected that a large network is capable of train- of the variances in staining methods, scanner colour ing smaller networks in patches that are quite shallow but response, and tissue preparation, the staining process is a are capable to match the accuracy of deep neural nets, and complex process that might result in undesired colour vari- since they need not be necessarily deep, the computational ations among tissue types. We employ a technique called complexity can be reduced. Hinton et al. [18] found startling structure-preserving colour normalisation with sparse strain findings on MNIST, demonstrating that condensing the separation since colour variance across pictures can affect knowledge from an ensemble of models into a single model performance. The approach uses one picture as the target, can greatly improve the acoustic model of a widely used and the other images are normalised by merging their sepa- commercial system [19]. Often the finest outcomes are rate stain density maps with the target image’s stain colour attained by assembling millions of individual classifiers, base, maintaining the morphology. After that, the modified which is a computationally expensive task. In [20], dataset is scaled between 0 and 1, and each channel is nor- researchers offer a strategy for “compressing” big, compli- malised to the ImageNet dataset. ImageNet project is a large cated ensembles into smaller, quicker models with little to visual database used in visual object recognition in software no performance loss. Dipalma et al. [21] utilised knowledge research and is tested and trained on more than 14 million distillation with model compression to address challenging images of various types, both labelled and unlabelled. We problems in whole slide images, such as histology. Knowl- use online data augmentation techniques to provide variety edge distillation has been used in object detection by Dai to the training dataset, allowing the network to generalise et al. [22], face recognition by Ge et al. [23], semantic to new datasets. Because the histological pictures are rota- segmentation by He et al. [24], classification of chest X-ray tionally invariant, horizontal and vertical flips with rotations abnormalities by Ho and Gwak [25], and to regain land sur- ranging from -25 to 25 degrees were also included. As face temperature from MODIS daytime mid-infrared data augmentation methods, random zoom and intensity shifts by Tang and Wang [26]. Knowledge distillation was used were also utilised. to enhance the computational efficiency of breast cancer diagnosis by Garg et al. and Thiagarajan et al. [27, 28]. They 3.2. Knowledge Distillation in Theory. Knowledge distillation stress that in a variety of applications, such as medical image refers to distilling facts from a complex teacher model into a analysis, weakly supervised instance labelling utilising sim- relatively weak and light student model. However, it is can ple picture-level labels rather than costly fine-grained pixel be better summarized as a label smoothing regularization annotations are critical. In contrast to traditional computer technique. The outcome of knowledge distillation depends vision instance segmentation, the challenges we address not only on the similarity information between teacher are characterised by a small number of training photos categories but also on the regularization of soft goals. The and nonlocal patterns that lead to diagnosis. Motivated by softening regularization of the label refers to the model all of these KD applications, as well as the use of KD and training by the exchange of one hot label and soft smoothed computationally efficient neural nets for training large WSI label between the teacher and the student. Loss function of as in BACH images of the breast, we decided to focus on label smoothing regularization (LSR) for a network S is as bilateral knowledge distillation, in which the teacher and follows. For every training instance x, S outputs the proba- student learn from each other and improve each other’s bility of each label as shown in the following equation: performance, as shown in [29–31]. 3. Methodology exp ðT a Þ a ∈ f1::Ag: rðajyÞ = soft max ðT a Þ = , ð1Þ 3.1. Dataset and Preprocessing. For our experiments, we used ∑Af−1 T f ICIAR 2018 Grand Challenge on Breast Cancer Histology dataset [32] as described in Figure 1. The dataset is com- prised of haematoxylin and eosin (H&E) stained breast where S is a neural network to train, T f is the logit of the microscopic anatomy research and whole slide images neural network, and S is given in the following equation. labelled as traditional, benign, in-place cancer, or invasive cancer. The clarification was performed by 2 clinical special- ists and images with conflict were disposed of. A total of 400 microscope images are in the collection, out of which 100 n ′ ðaÞ = ð1 − βÞnðaÞ + βvðaÞ, ð2Þ each are normal images, benign images, in situ carcinoma, and invasive carcinoma. All microscopy images are on.tiff format, and their specifications are given below in Table 1. The BACH dataset comprises of about 400 histological where n ′ ðaÞ is the smoothed labelled distribution, is a mix- microscopic images of the breast stained with haematoxylin ture of nðaÞ, and is a fixed distribution vðaÞ with weight β. and eosin (H&E). Conferring to the major cancer type in Now, vðaÞ is uniform distribution as vðaÞ = ½1/A.
4 Computational and Mathematical Methods in Medicine Normal Benign in situ carcinoma Invasive carcin Figure 1: The labelled images of each type of cell used in classification. Table 1: Specifications of microscopy images. S no. Parameter Details 1 Colour model Red, green, blue 2 Size 2048ðW Þ × 1536ðLÞ in pixels 3 Pixel scale 0:42 μm × 0:42 μm 4 Retention area: 10-20 MB (approx.) 10-20 MB (approx.) The smoothed entropy loss Bðn ′ , rÞ is defined over the normal KD, Re-KD takes teachers’ accuracy as the base- smoothed label as in the following equation. line accuracy, which is much advanced than normal KD. For example, in various cases, Re-KD exceeds the normal A KD. The basis of this concept is that the teacher becomes B n ′ , r = − 〠 n ′ ðaÞ log r ðaÞ = ð1 − βÞBðn, r Þ + βBðv, rÞ the student and the student becomes the teacher. Using Re- f −1 KD in the BACH dataset for cancer histopathology biopsy = ð1 − βÞBðn, rÞ + βðC AM ðv, rÞ + BðvÞÞ, images, we found that Re-KD produces better results. ð3Þ 3.4. Bilateral Knowledge Distillation in Theory. For conven- tional knowledge, distillation of the loss function is mathe- where CAM is the Kullback-Leibler divergence and CðaÞ matically defined as follows: denotes the entropy of v and is a constant for the fixed uni- form distribution vðaÞ. M WAC = δ j, j ′ + 〠 ηh ρh ESh , EXh , ð8Þ The loss function for label smoothing of neural network hεH S can be written as in the following equations: where δðj, j ′ Þ represents loss of task-specific nature, j is the M MO = ð1 − βÞBðn, r Þ + βCAM ðv, r Þ, ð4Þ ground truth, transformed output of the student layer is rep- resented as ESh , transformed output of the teacher network is mAC = ð1 − βÞBðn, r Þ + βC AM ðr xπ , r π Þ, ð5Þ represented as EXh , note that this transformed module is pro- where Bðn, rÞ is the minimized cross-entropy and CAM is the duced from m-th module, ηh is a tunable balance factor Kullback-Leibler divergence (AM divergence). which balances the unlike losses, ρh ð·, · Þ is the m-th distilla- The rxπ ðkÞ in C AM ðv, rÞ is a distribution from the teacher tion loss to reduce the difference between ESh and EXh . model. Consider kf ∈ f0, 1g to be a Bernoulli random variable, Additionally, C AM ðr xπ , rπ Þ = Bðr xπ , rπ Þ − Bðr xπ Þ, where B and f is the f -th hybrid block in the hybrid network. Assum- ðr xπ Þ is a constant entropy. Therefore, Equation (4) can ing (kf = 1) as the student path selected and when the be reframed as the following equation. teacher path is chosen, it is assumed as (kf = 0). Suppose B f represents the input and B f +1 depicts the output of the mAC = ð1 − βÞBðn, r Þ + βðCAM ðrxπ , r π Þ + Bðr xπ , r π ÞÞ f -th hybrid block; hence, the yield f -th hybrid block can ð6Þ = ð1 − βÞBðn, r Þ + ðβBðr xπ , r π ÞÞ: be represented as follows: Setting the temperature π = 1 and mAC = Bð~ nx , rÞ, B f +1 = k f I Sf B f + 1 − β f I Xf B f : ð9Þ where n~ is as follows: x ~x ðaÞ = ð1 − βÞnðaÞ + βrx ðaÞ: n ð7Þ Here, I Sf ð⋅Þ is the functional representation of the student block and I Xf ð⋅Þ is the functional representation of the 3.3. Reverse Knowledge Distillation in Theory. Unlike teacher block.
Computational and Mathematical Methods in Medicine 5 Resnext29 baseline training Teacher KD–200 epochs Student resnext29 mobilenetv2 Phase 1 Teacher ReKD–200 epochs Student mobilenetv2 resnext29 KD–200 epochs Teacher Student resnext29 mobilenetv2 Phase 2 Teacher ReKD–200 epochs Student mobilenetv2 resnext29 Teacher KD–200 epochs Student Phase 3 resnext29 mobilenetv2 Figure 2: The proposed framework of knowledge distillation. Phase 1 Step 1 Knowledge distil mobilenetv2 - Teacher model: Resnext29 - train e epochs Step 2 Reverse Knowledge distil Resnext29 - Teacher model: Mobilenetv2 - train e epochs Phase 2 Step 1 Knowledge distil mobilenetv2 (phase1) - Teacher model: Resnext29 (phase1) - train e epochs Step 2 Reverse Knowledge distil Resnext29 - Teacher model: Mobilenetv2 - train e epochs Phase 3 Step 1 Knowledge distil mobilenetv2 (phase2) - Teacher model: Resnext29 (phase2) - train e epochs Algorithm 1: Bilateral knowledge distillation (base training: ResNeXt29 for e epochs). Taking the value of β f to be 1, Equation (2) takes the ing knowledge. It is already shown in the literature that the following form: teacher and students can distil knowledge to each other. In this novel approach, phase 2 comes into a picture, where the learnings of the teacher and student are fed back to each B f +1 = I Xf B f : ð10Þ other and improved accuracy at each phase. We have stopped in phase 3 where knowledge distillation through This simply means the hybrid block reduces to the orig- MobileNetV2 has produced state-of-the-art accuracy as dis- inal student block. Whereas if we consider β f = 0, Equation cussed in our results. This process can be iterated over several (2) can be formulated as follows: phases until we get the desired accuracy and depending on the computational resources available. The idea is further illus- B f +1 = I Xf B f : ð11Þ trated in Figure 3 with a schematic description. Now suppose δð:, :Þ is the task-specific loss, j represents 4. Results the ground fact ~jy denoting the predicted result of the hybrid network, and dy is the hybrid network, its optimized loss 4.1. Experimental Setup. For our experiments, we use two function can be written as follows: well-known transfer learning networks in the ImageNet dataset, ResNeXt29 and MobileNetV2 pretrained models. BACH 2018 dataset of breast histopathology images was d y = δ j, ~jy : ð12Þ trained on those networks for the training of histopathology images. PyTorch (build 1.9.0) with CUDA 10.2 platform was As per Figure 2 and Algorithm 1, we propose a bilateral used for the experiments. knowledge distillation regime. In phase 1, the more complex teacher model, the ResNeXt29, is used as a teacher model to 4.2. Baseline Teacher Training. Our baseline trainer training train the lightweight MobileNetV2. In step 2, the teacher and on ResNeXt29 (8 × 64 d with 34.53 M parameters) and the student are reversed just being interactive and exchang- MobileNetV2 (including the preliminary completely
6 Computational and Mathematical Methods in Medicine Phase Phase Phase 1 2 3 STEPl STEPl STEPl Knowledge distil Knowledge distil Knowledge distil mobilenetv2-teacher mobilenetv2 (phase1) mobilenetv2 (phase2) model: resnext29- -teacher model: -teacher model: train e epochs. resnext29 (phase1)- resnext29 (phase2)- train e epochs. train e epochs STEP2 STEP2 Reverse Knowledge distil resnext29- Reverse knowledge teacher model: distil resnext29- mobilenetv2-train e teacher model: epochs mobilenetv2-train e epochs Figure 3: Schematic flow diagram of bilateral KD as used in this paper. Train_accuracy Test_accuracy tag: Train_accuracy Test_accuracy 100 tag: Test_accuracy 80 50 60 40 40 0 40 80 120 160 200 0 20 40 60 80 100 120 140 160 180 200 Train_loss Train_loss Test_loss tag: Train_loss Test_loss tag: Test_loss 1.2 2.6 0.8 0.4 2.2 0 1.8 0 40 80 120 160 200 1.4 0 40 80 120 160 200 Figure 4: Train and test set accuracy of MobileNetV2.
Computational and Mathematical Methods in Medicine 7 Train_accuracy Test_accuracy tag: Train_accuracy tag: Test_accuracy 95 65 85 75 55 65 45 55 35 45 0 40 80 120 160 200 0 20 40 60 80 100 120 140 160 180 200 Train_loss Test_loss Train_loss Test_loss tag: Train_loss tag: Test_loss 5 1.2 4 0.8 3 0.4 2 0 1 0 40 80 120 160 200 0 40 80 120 160 200 Figure 5: Train and test set accuracy of ResNeXt29. convolution layer with 32 filters, accompanied by the way of reverse knowledge distillation on the given set of histological 19 residual bottleneck layers) indicates that the validation images. As such, MobileNetV2 now becomes the teacher accuracy for ResNeXt29 is as excessive as 75 percent wherein and ResNeXt29 becomes the student. The results were as that of MobileNetV2 is 60.71 percent as shown in improved as we got a validation accuracy of 65.58 percent. Figures 4 and 5. The results which are obtained through these experiments are shown in Figure 7. 4.3. Tuning Hyperparameters. Several hyperparameters were tuned as follows: Optimizer: SGD, momentum=0.9, weight_ 4.6. Bilateral Knowledge Distillation Results. From the results decay=5e-4, Learning Rate scheduler: learning rate warm up, of the above two experiments, we infer that the normal KD Defined phase as n: 3, Define epochs as e: 200, “alpha”: 0.95, or the reverse KD is not sufficient to give us the best result “temperature”: 20, “learning_rate”: 0.1, “batch_size”: 8, for computationally complex histological images. Our pro- “dropout_rate”: 0.5, “save_summary_steps”: 100, “num_ posed method of bilateral knowledge distillation, which workers”: 4. works on the notion of both teacher and student, interac- tively learns from each other and thereby improves each 4.4. Normal Knowledge Distillation Results. Since ResNeXt29 other’s accuracy. After three phases of interactive learning gave us higher accuracy in validation, we proceed by taking and exchanging information between teachers and the it as a teacher network and taking MobileNetV2 as the ligh- student, the MobileNetV2 reaches 96.3 percent accuracy on ter student network. As per literature, we confirmed that the a validation set, which we claim as a state-of-the-art result teacher model was able to distil the lightweight student for histological images of the breast. Results of training and model. However, since histological images are computation- validation using bilateral knowledge distillation are shown ally complex, the results were no more than 60.71 percent on in Figure 8. the validation set. The loss was considerable. The normal KD results are shown in Figure 6. 5. Discussion of Results 4.5. Reverse Knowledge Distillation Results. Again, we In bilateral cognitive distillation, we found that students and reversed the teacher and student networks to validate the teachers interact with each other. In noninteractive cognitive
8 Computational and Mathematical Methods in Medicine Train_accuracy Test_accuracy tag: Train_accuracy tag: Test_accuracy 50 54 46 50 42 46 38 42 34 38 30 34 0 20 40 60 80 100 120 140 160 180 0 20 40 60 80 100 120 140 160 180 Figure 6: Normal knowledge distillation results. Train_accuracy Test_accuracy tag: Train_accuracy tag: Test_accuracy 75 60 56 65 52 55 48 44 45 40 35 36 0 20 40 60 80 100 120 140 160 180 200 0 40 80 120 160 200 Figure 7: Reverse knowledge distillation result. distillation such as traditional cognitive distillation, the accuracies were limited. In normal KD, due to higher valida- teacher network sets only one goal to mimic the student net- tion accuracy, ResNeXt29, the more complex model has work, ignoring the communication of the student network. been chosen as the teacher network and the simple Mobile- However, since the characteristic transformation potential NetV2 as the corresponding student network and the of the student network is much lower than that of the teacher and student networks were reversed for Re-KD. teacher, it is impossible to mimic the exact knowledge that After three phases of interactive learning and exchanging the teacher has transferred to it. Therefore, there is a differ- information among teacher and the student, the Mobile- ence between the knowledge taken by the student and the NetV2 reaches 96.3 percent accuracy on validation set which knowledge transferred by the teacher, which limits KD per- we claim as a state-of-the-art result for histological images of formance. In interactive, bilateral cognitive distillation, the the breast. The limited accuracy obtained through these con- student takes action based on the problem they are facing, ventional methods prompted us to run the experiment with then the teacher gives feedback based on the student’s bilateral-interactive KD. As knowledge distillation is nothing actions, and finally, the student takes action according to but a regularization technique, the student network is the teacher feedback. In particular, we propose that the seldom overfitted and it can backpropagate the errors. teacher and students be interactive and that knowledge be Therefore, there is a constant feedforward and backpropaga- transferred from teacher to student and from student to tion across the different phases of the bilateral KD model. In student to teacher until the student who assesses the classes each phase, the labels are smoothed with regularization and reaches the allowable accuracy. We used two pretrained hence the forth accuracy improves with each phase. All the models trained on ImageNet datasets for our baseline train- results we discussed can be summarized in Table 2 below. ing, namely, ResNeXt29 and MobileNetV2, for validation of We also conclude that with the increase of temperature, theories of knowledge distillation on breast histological the distribution of soft targets in the teacher network mimics images. The histology biopsy images are too complex and the uniform distribution of label smoothing. The other have a similar colour combination when viewed under a important observation in this experiment is the fact that microscope after staining. As such even with robust bilateral KD does not need additional distillation losses networks like ResNeXt29 and MobileNetV2, the base line and is in accordance with the conventional KD method.
Computational and Mathematical Methods in Medicine 9 Train_accuracy 100 90 80 70 60 50 40 5 10 15 20 25 30 35 40 45 50 55 60 Test_loss 1.5 1 0.5 0 5 10 15 20 25 30 35 40 45 50 55 60 Figure 8: Results of training and validation using bilateral knowledge distillation. correspondence improves the accuracy in each subsequent Table 2: Results of all experiments performed. phase. In normal KD, due to higher validation accuracy Experiment Validation ResNeXt29, being the more complex model has been chosen Model as the teacher network and the simple MobileNetV2 as the performed accuracy corresponding student network and the teacher and the Baseline teacher ResNeXt29 75% student networks were reversed for Re-KD. The limited training MobileNetV2 60.71% accuracy obtained through these conventional methods ResNeXt29 teacher prompted us to run the experiment with bilateral- Normal KD 60.71% MobileNetV2 student interactive KD. The choice of transfer learning architecture was being validated by a baseline training. The result of MobileNetV2 teacher bilateral KD is that the MobileNetV2 network produced Reverse KD 65.58% ResNeXt29 student excellent accuracy in phase 3 of the validation. Being a light- Phase 1+phase 2 weight model, MobileNetV2 architecture will support Bilateral KD 96.3% deployment of the model in mobile applications also. The +phase 3 challenges of working with large-sized histological images and computational complexity can thus be avoided. This 6. Conclusion approach produces state-of-the-art results with an accuracy of 96.3 percent. It will also boost the performance of the stu- In this paper, we have discussed the results obtained in two dent model on complex and diverse cancer cell images. baseline papers on knowledge distillation and interactive Moreover, the model does not rely on copying the teacher knowledge distillation. We proposed a novel method called behaviour, but rather it can utilize the feature transforma- bilateral knowledge distillation on breast histological images. tional ability of the teacher. Going forward, we believe that In our approach, we took ResNeXt29 and MobileNetV2 as this paper will help researchers to work with histological the baseline pretrained models. The phenomenon that data of other cancers as well. knowledge distillation is nothing but a label smoothing regularization technique was established yet again through Data Availability the experiments conducted on histological image dataset of the breast. Additionally, the idea that label smoothing regu- Data will be made available on request. larization is an ad hoc KD is reestablished. The interactive bilateral KD model which we proposed conforms to the fact Conflicts of Interest that the teacher and student blocks interact with each other and is capable of transferring knowledge to each other The authors declare that there is no conflict of interest through the correspondence being established and that this regarding the publication of this paper.
10 Computational and Mathematical Methods in Medicine Acknowledgments hyperchaotic encryption,” Security and Communication Net- works, vol. 2021, Article ID 5536170, 2021. This research is self-funded. [15] T. Kausar, M. A. Ashraf, A. Kausar, and I. Riaz, “Convolution Neural Network based Approach for Breast Cancer Classifica- tion. Proceedings of 18th International Bhurban Conference References on Applied Sciences and Technologies,” in 2021 International Bhurban Conference on Applied Sciences and Technologies [1] P. Ratta, A. Kaur, S. Sharma, M. Shabaz, and G. Dhiman, (IBCAST), pp. 407–413, Islamabad, Pakistan, 2021. “Application of blockchain and Internet of Things in health- [16] D. Praveena Anjelin and S. Ganesh Kumar, “Blockchain tech- care and medical sector: applications, challenges, and future nology for data sharing in decentralized storage system,” in perspectives,” Journal of Food Quality, vol. 2021, Article ID Intelligent Computing and Applications (vol. 1172), Springer, 7608296, 2021. Singapore, 2021. [2] S. McGuire, “World Cancer Report 2014. Geneva, Switzerland: [17] L. J. Ba and R. Caruana, “Do deep nets really need to be deep?,” World Health Organization, International Agency for Advances in Neural Information Processing Systems, vol. 3, Research on Cancer, WHO Press, 2015,” Advances in Nutri- pp. 2654–2662, 2014. tion, vol. 7, no. 2, pp. 418-419, 2016. [18] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge [3] R. A. Smith, V. Cokkinides, and H. J. Eyre, “American Cancer in a neural network,” 2015, http://arxiv.org/abs/1503.02531. Society Guidelines for the Early Detection of Cancer, 2006,” CA: a Cancer Journal for Clinicians, vol. 56, no. 1, pp. 11–25, [19] M. K. Sharma, T. Perumal, and N. Dey, “Innovations in 2006. computational intelligence and computer vision,” in Advances in Intelligent Systems and Computing, p. 1189, [4] P. Filipczuk, T. Fevens, A. Krzyzak, and R. Monczak, “Com- Springer, 2021. puter-aided breast cancer diagnosis based on the analysis of cytological images of fine needle biopsies,” IEEE Transactions [20] C. Bucilǎ, R. Caruana, and A. Niculescu-Mizil, “Model com- on Medical Imaging, vol. 32, no. 12, pp. 2169–2178, 2013. pression,” in Proceedings of the 12th ACM SIGKDD Interna- tional Conference on Knowledge Discovery and Data Mining [5] M. Kowal, P. Filipczuk, A. Obuchowicz, J. Korbicz, and - KDD '06, pp. 535–541, New York, 2006. R. Monczak, “Computer-aided diagnosis of breast cancer based on fine needle biopsy microscopic images,” Computers [21] J. Dipalma, A. A. Suriawinata, L. J. Tafe, L. Torresani, and in Biology and Medicine, vol. 43, no. 10, pp. 1563–1572, 2013. S. Hassanpour, “Resolution-based distillation for efficient histol- ogy image classification,” 2021, http://arxiv.org/abs/2101.04170. [6] B. Wang, X. Yao, Y. Jiang, C. Sun, and M. Shabaz, “Design of a real-time monitoring system for smoke and dust in thermal [22] X. Dai, Z. Jiang, Z. Wu et al., “General instance distillation for power plants based on improved genetic algorithm,” Journal object detection,” 2021, http://arxiv.org/abs/2103.02340. of Healthcare Engineering, vol. 2021, Article ID 7212567, 2021. [23] S. Ge, S. Zhao, C. Li, and J. Li, “Low-resolution face recognition [7] X. Huang, V. Jagota, E. Espinoza-Muñoz, and J. Flores- in the wild via selective knowledge distillation,” IEEE Transac- Albornoz, “Tourist hot spots prediction model based on tions on Image Processing, vol. 28, no. 4, pp. 2051–2062, 2019. optimized neural network algorithm,” International Journal [24] T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y. Yan, of System Assurance Engineering and Management, vol. 12, “Knowledge adaptation for efficient semantic segmentation,” 2021. in 2019 IEEE/CVF Conference on Computer Vision and Pat- [8] V. Jagota, A. P. S. Sethi, and K. Kumar, “Finite element tern Recognition (CVPR), pp. 578–587, Long Beach, CA, method: an overview,” Walailak Journal of Science & Technol- USA, 2019. ogy, vol. 10, no. 1, pp. 1–8, 2013. [25] T. K. K. Ho and J. Gwak, “Utilizing knowledge distillation in [9] L. Chen, V. Jagota, and A. Kumar, “Research on optimization deep learning for classification of chest X-ray abnormalities,” of scientific research performance management based on BP IEEE Access, vol. 8, pp. 160749–160761, 2020. neural network,” International Journal of System Assurance [26] B. H. Tang and J. Wang, “A physics-based method to retrieve Engineering and Management., vol. 12, 2021. land surface temperature from MODIS daytime midinfrared [10] D. C. Koboldt, R. S. Fulton, M. D. McLellan et al., “Compre- data,” IEEE Transactions on Geoscience and Remote Sensing, hensive molecular portraits of human breast tumours,” vol. 54, no. 8, pp. 4672–4679, 2016. Nature, vol. 490, no. 7418, pp. 61–70, 2012. [27] A. Garg, K. Aggarwal, M. Saxena, and A. Bhat, “Classifying [11] S. R. Lakhani, I. O. Ellis, S. J. Schnitt, P. H. Tan, M. J. van de medical histology images using computationally efficient Vijver, and World Health Organisation, Classification of CNNs through distilling knowledge,” in Emerging Technolo- Tumours of the Breast, WHO Press, 4th edition, 2012. gies in Data Mining and Information Security, J. M. R. S. [12] M. Veta, J. P. W. Pluim, P. J. Van Diest, and M. A. Viergever, Tavares, S. Chakrabarti, A. Bhattacharya, and S. Ghatak, “Breast cancer histopathology image analysis: a review,” IEEE Eds., pp. 713–721, Springer, Singapore, 2021. Transactions on Biomedical Engineering, vol. 61, no. 5, [28] J. J. Thiagarajan, S. Kashyap, and A. Karargyris, “Distill-to- pp. 1400–1411, 2014. label: weakly supervised instance labeling using knowledge dis- [13] Z. Han, B. Wei, Y. Zheng, Y. Yin, K. Li, and S. Li, “Breast can- tillation,” in 2019 18th IEEE International Conference on cer multi-classification from histopathological images with Machine Learning and Applications (ICMLA), pp. 902–907, structured deep learning model,” Scientific Reports, vol. 7, Boca Raton, FL, USA, 2019. no. 1, 2017. [29] S. Fu, Z. Li, J. Xu, M.-M. Cheng, Z. Liu, and X. Yang, “Interactive [14] C. Sharma, B. Amandeep, R. Sobti, T. K. Lohani, and knowledge distillation,” 2020, http://arxiv.org/abs/2007.01476. M. Shabaz, “A secured frame selection-based video water- [30] Y. Li, “Performance evaluation of machine learning methods marking technique to address quality loss of data: combining for breast cancer prediction,” Applied and Computational graph-based transform, singular valued decomposition, and Mathematics, vol. 7, no. 4, p. 212, 2018.
Computational and Mathematical Methods in Medicine 11 [31] L. Yuan, F. E. H. Tay, G. Li, T. Wang, and J. Feng, “Revisiting knowledge distillation via label smoothing regularization,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3902–3910, Seattle, WA, USA, 2020. [32] G. Aresta, T. Araújo, S. Kwok et al., “BACH: grand challenge on breast cancer histology images,” Medical Image Analysis, vol. 56, pp. 122–139, 2019.
You can also read