Edited by: Steven Fernandes, Creighton University, United States
Reviewed by: Reham Reda Mostafa, Mansoura University, Egypt; Geno Peter, University College of Technology Sarawak, Malaysia; Feras Alattar, National University of Science and Technology, Oman
This article was submitted to Digital Public Health, a section of the journal Frontiers in Public Health
†These authors have contributed equally to this work and share first authorship
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Brain tumor diagnosis has been a lengthy process, and automating a step such as brain tumor segmentation speeds up the timeline. U-Nets have been a commonly used solution for semantic segmentation, and they use a downsampling-upsampling approach to segment tumors. U-Nets rely on residual connections to pass information during upsampling; however, an upsampling block only receives information from one downsampling block, which restricts the context and scope of an upsampling block. In this paper, we propose SPP-U-Net, where the residual connections are replaced with a combination of Spatial Pyramid Pooling (SPP) and attention blocks. Here, SPP provides information from various downsampling blocks, increasing the scope of reconstruction, while attention provides the necessary context by incorporating local characteristics with their corresponding global dependencies. Existing literature uses heavy approaches such as nested and dense skip connections and transformers. These approaches increase the number of trainable parameters, which in turn increases the training time and complexity of the model. The proposed approach, on the other hand, attains results comparable to existing literature without changing the number of trainable parameters, even over larger dimensions such as 160 × 192 × 192. All in all, the proposed model scores an average Dice Score of 0.883 and a Hausdorff distance of 7.84 on BraTS 2021 cross-validation.
Brain tumor segmentation using magnetic resonance imaging (MRI) is a vital step in treating tumors present in the brain, and a specialist can use it to find the damage caused by a tumor in a region. Glioblastomas (GBM), also known as gliomas, are the most frequent and severe malignant brain tumors. Automated and exact segmentation of these malignancies from MRI is critical for early diagnosis as well as for administering and monitoring treatment progression. Assessment of tumor presence is the first step in brain tumor diagnosis, and the assessment is done on the basis of segmentation of tumors present in MRI. This process is often performed manually, making it a time- and labor-intensive task. Moreover, tumors exist in different forms and sizes, making it a task requiring expertise. The assessment process can be sped up by automating the segmentation of brain tumors (
The Brain Tumor Segmentation Challenge (BraTS) (
The tumor sub-regions for each patient are the peritumoral edematous/invaded tissue (ED, label 2), the Gd-enhancing tumor (ET, label 4), and the necrotic tumor core (NCR, label 1). ED, the peritumoral edematous and infiltrated tissue, comprises an infiltrative non-enhancing tumor as well as peritumoral vasogenic edema and is linked with an abnormal hyperintense signal envelope on the T2-FLAIR volumes. ET stands for the tumor's enhancing segment and is identified by regions of T1Gd MRI that exhibit enhancement. On T1Gd MRI, NCR, the necrotic core of the tumor, appears substantially less intense.
The different views of brain MRI slices with annotations.
In many vision tasks like segmentation, particularly in the healthcare industry, deep learning-based segmentation systems have shown amazing success, outperforming other traditional methods in brain tumor analysis (
U-Nets consist of residual connections, and these connections are key for reconstruction. These connections pass local and global information to a particular decoder (
Segmentation maps have been formed using a 3D U-Net which consists of three downsampling and upsampling blocks followed by a set of convolutional layers. The authors use a patching approach to train the model (
Wang et al. (
Hence, we were able to identify some research gaps:
As can be seen, existing literature uses heavy approaches such as nested and dense skip connections and transformers. Hence, an approach that keeps the parameter count in mind is needed. Considering applications such as edge computing, which heavily emphasize efficient and accurate predictions, the proposed mechanism fits such a problem statement.
Moreover, skip connections have always been a subject of experimentation. Providing additional information to the decoder layers through mechanisms such as ASPP has yielded performance improvements. Hence, utilizing a similar mechanism on multiple encodings in a 3-dimensional manner motivated this research.
We propose a U-Net with SPP and attention. SPP takes information from three encoder layers and passes it to the decoder in the U-Net. The proposed addition provides the model with additional context and information for better reconstruction by providing scope from neighboring layers. The proposed mechanism adds no training parameters; therefore, the need for computational power remains the same. The resultant model is thus lightweight in nature, aiding faster medical diagnosis and medical workflow in a production environment. To introduce reproducibility, the codebase utilized has been made public:
The dataset used was BraTS 2021. The MRI scans were first brought to a common dimension of 160 × 192 × 192. This size was arrived at through experimentation, and the comparison was done on the basis of Dice Score (further discussed in results). The augmentations used were: image flip, brightness adjustment, rotation, elastic transformation, and intensity shift.
Note that the augmentations applied are chosen at random from this set, which makes the model robust to overfitting. This is achieved by randomly selecting each augmentation on the basis of a threshold.
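The threshold-based augmentation policy described above can be sketched as follows. This is an illustrative reconstruction, not the actual training code: the per-augmentation probability, the specific transforms shown, and their parameters are all assumptions.

```python
import numpy as np

AUG_PROB = 0.5  # assumed per-augmentation threshold

def random_flip(vol, rng):
    # Flip the volume along a randomly chosen spatial axis.
    return np.flip(vol, axis=rng.integers(vol.ndim)).copy()

def intensity_shift(vol, rng, limit=0.1):
    # Shift voxel intensities by a small random offset.
    return vol + rng.uniform(-limit, limit)

def augment(vol, rng):
    # Each augmentation fires only if a random draw passes the threshold,
    # so every sample sees a different subset of transforms.
    for aug in (random_flip, intensity_shift):
        if rng.random() < AUG_PROB:
            vol = aug(vol, rng)
    return vol

rng = np.random.default_rng(0)
out = augment(np.zeros((160, 192, 192), dtype=np.float32), rng)
```

Because the transforms are sampled independently per volume, no two epochs present identical inputs, which is what discourages overfitting.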
Data split for BraTS 2021.
Train split | 1,000 |
Validation split | 250 |
Spatial Pyramid Pooling (
Atrous Spatial Pyramid Pooling was proposed based on SPP and carries the concept of SPP by using parallel Atrous Convolutional layers. ASPP has been extensively used in semantic segmentation, and it serves the purpose of providing context at different levels or views. ASPP has been employed in various studies within brain segmentation. However, as per Tampu et al. (
Attention is a process through which we humans focus on certain tasks. While reading, we capture context by understanding neighboring words within a sentence. This mechanism is applied in the attention layer, whose purpose is to capture context. The attention layer has been extensively employed in deep learning and has contributed to cutting-edge outcomes. Attention is obtained for the model by combining the output of two encoder layers. This is accomplished by feeding the output of the two encoder layers into two different 3D convolutional layers. The outputs of the two layers are combined and then activated with ReLU. The activated output is passed through a 3D convolutional layer and is then normalized and activated. Fusing the outputs of these layers along with activation aids in maintaining context while not compromising on the dimensionality aspect.
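The steps above can be sketched as a small PyTorch module. This is a hedged reconstruction of the described flow (two 1 × 1 × 1 projections, sum, ReLU, a further convolution, normalization, activation): the channel sizes, the use of group normalization, and the final sigmoid gating are our assumptions, not confirmed details of the paper's implementation.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Sketch: two encoder outputs are each projected by a Conv3d,
    summed, ReLU-activated, projected again, normalized, and activated
    to form an attention map that gates the encoder features."""
    def __init__(self, ch_a, ch_b, ch_mid):
        super().__init__()
        self.proj_a = nn.Conv3d(ch_a, ch_mid, kernel_size=1)
        self.proj_b = nn.Conv3d(ch_b, ch_mid, kernel_size=1)
        self.fuse = nn.Conv3d(ch_mid, 1, kernel_size=1)
        self.norm = nn.GroupNorm(1, 1)  # group norm, as used in the paper

    def forward(self, a, b):
        x = torch.relu(self.proj_a(a) + self.proj_b(b))  # combine + ReLU
        x = torch.sigmoid(self.norm(self.fuse(x)))       # normalize + activate
        return a * x  # gate the first encoder's features with the map

att = AttentionBlock(ch_a=8, ch_b=8, ch_mid=4)
a = torch.zeros(1, 8, 4, 4, 4)
b = torch.zeros(1, 8, 4, 4, 4)
out = att(a, b)
```

Note that the gating output keeps the spatial and channel dimensions of the first input, which is what allows the block to slot in where a plain residual connection used to be.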
Hence, SPP is used as a feature aggregator within the model, and to introduce context, attention layers are employed. The SPP layer, along with the attention layer, has been used to replace some residual connections within the U-Net. SPP is typically used at the end of the network, after the feature maps have been flattened, so that fully connected layers can use the maps to predict class(es) or bounding box(es). Here, SPP is modified so that it functions as a residual connection: a 3D convolutional layer with a kernel size of 1 converts the output of the SPP back into a 3D representation. Additionally, by feeding input from multiple encoder levels to each pooling layer, information is gathered over a wide range.
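A minimal sketch of SPP adapted as a residual connection, under stated assumptions: the pyramid pool sizes, the use of average pooling, the trilinear resize back to a common grid, and the channel counts are all illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPConnection(nn.Module):
    """Sketch: features from several encoder levels are pooled at
    multiple scales, resized to a common shape, concatenated, and
    projected back to a 3D feature map by a kernel-size-1 Conv3d."""
    def __init__(self, in_ch, out_ch, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.proj = nn.Conv3d(in_ch * len(pool_sizes), out_ch, kernel_size=1)

    def forward(self, feats):
        # feats: list of encoder outputs; target grid is the first one's.
        target = feats[0].shape[2:]
        pooled = []
        for f, p in zip(feats, self.pool_sizes):
            x = F.adaptive_avg_pool3d(f, p)                 # pyramid pooling
            x = F.interpolate(x, size=target, mode="trilinear",
                              align_corners=False)          # back to 3D grid
            pooled.append(x)
        return self.proj(torch.cat(pooled, dim=1))          # 1x1x1 projection

spp = SPPConnection(in_ch=8, out_ch=8)
feats = [torch.zeros(1, 8, 8, 8, 8),
         torch.zeros(1, 8, 4, 4, 4),
         torch.zeros(1, 8, 2, 2, 2)]
out = spp(feats)
```

The kernel-size-1 projection is what restores a decoder-compatible 3D tensor, and because the inputs come from several encoder levels, the connection carries both fine and coarse features.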
The architectural diagram for SPP.
The U-Net used is based on the NvNet (
In this paper, experimentation is done on 3 model architectures based on
No SPP: The SPP blocks are omitted, therefore boosting the model only in terms of context.
1 SPP: The upper SPP block is removed from the architecture, keeping only the lower block (with the 3rd encoder layer), hence boosting the model with a combination of context and features.
2 SPP: Both SPP blocks are used. This model carries a greater feature boost than the other two models.
The architecture of U-Net used.
Based on our experimentation, we encountered the issue of gradient explosion; hence, group normalization was employed. Batch normalization did not work in our case, as a high batch size could not be used for training. A batch size greater than 1 did not provide the expected results and at times would also result in the GPU running out of memory and the process being killed. Hence, the batch size was kept at 1. An NVIDIA Tesla V100 GPU was utilized for training the models.
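The reason group normalization works at batch size 1 can be shown in a couple of lines: it normalizes over channel groups within each sample rather than over the batch, so its statistics are well-defined even for a single volume. The group count of 8 here is an assumed illustrative value.

```python
import torch
import torch.nn as nn

# GroupNorm computes statistics per sample over channel groups, so it
# behaves identically at batch size 1, unlike BatchNorm, which relies
# on batch-level statistics and degenerates for a single sample.
norm = nn.GroupNorm(num_groups=8, num_channels=32)
x = torch.randn(1, 32, 4, 4, 4)  # batch size 1, as used for training
y = norm(x)
```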
Hyperparameters used for training.
Image size (channels * length * width) | 160 × 192 × 192 |
Epochs | 60 |
Learning rate | 2.50E-04 |
Weight decay | 1.00E-07 |
Scheduler | Cosine annealing LR |
Criterion | Dice loss |
Optimizer | Adam |
Normalization | Group norm |
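The hyperparameters in the table above can be wired together as follows. The `model` here is a stand-in module, not the actual SPP-U-Net, and tying `T_max` to the epoch count is our assumption about how the cosine annealing schedule was configured.

```python
import torch

# Stand-in module; the real network is the SPP-U-Net described above.
model = torch.nn.Conv3d(4, 3, kernel_size=1)

# Adam with lr = 2.50E-04 and weight decay = 1.00E-07, per the table.
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4, weight_decay=1e-7)

# Cosine annealing over the 60 training epochs (assumed T_max).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=60)
```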
The evaluation process of the model has been done on the basis of cross-validation and the model was evaluated on two metrics:
Dice Score (as shown in Equation 2): in short, it is the F1-Score expressed in terms of image pixels, Dice = 2|P ∩ G| / (|P| + |G|), wherein the ground truth G is the set of annotated pixels and P is the prediction. The Dice Score is an efficient metric as it penalizes false positives: if the predicted map has large false positives, they appear in the denominator rather than the numerator, lowering the score.
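The metric can be computed directly from two binary masks; the small epsilon guarding against an empty union is an implementation convenience we add, not part of the definition.

```python
import numpy as np

def dice_score(pred, truth, eps=1e-7):
    """Dice = 2|P ∩ G| / (|P| + |G|) over binary voxel masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum() + eps)

p = np.array([[1, 1, 0], [0, 1, 0]])
g = np.array([[1, 1, 0], [0, 0, 0]])
score = dice_score(p, g)  # 2*2 / (3 + 2) = 0.8
```

The example makes the false-positive penalty visible: the extra predicted pixel inflates |P| in the denominator without adding to the intersection.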
Hausdorff Distance: The Hausdorff distance (
It should be noted that the Hausdorff distance is unconcerned with the size of the image's background. By measuring the maximal distance between the extremes of the two outlines, the Hausdorff distance complements the Dice metric. Because it severely penalizes outliers, a prediction may show nearly voxel-perfect overlap and yet have a large Hausdorff distance if even a single voxel lies far from the reference segmentation. This statistic is quite useful for determining the clinical importance of a segmentation, despite being noisier than the Dice index.
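A symmetric Hausdorff distance between two binary masks can be sketched with SciPy's `directed_hausdorff` applied to the foreground voxel coordinates in both directions. This is a plain (non-percentile) Hausdorff distance for illustration; whether the reported numbers use this exact variant is not stated here.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(pred, truth):
    """Symmetric Hausdorff distance between the foreground voxel
    coordinates of two binary masks."""
    p = np.argwhere(pred)
    g = np.argwhere(truth)
    # directed_hausdorff is one-sided; take the max of both directions.
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

pred = np.zeros((8, 8), dtype=bool)
truth = np.zeros((8, 8), dtype=bool)
pred[2, 2] = True
truth[2, 5] = True
d = hausdorff_distance(pred, truth)  # single points 3 pixels apart -> 3.0
```

The single-pixel example shows the outlier sensitivity: one displaced foreground voxel alone determines the entire distance.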
Two models from the BraTS 2021 dataset, as well as one model each from the BraTS 2020 and BraTS 2019 datasets, are compared with the proposed model.
As per
The model with 1 SPP block performed the best amongst the architectures proposed. Therefore, it can be deduced that the boosting of context and features go hand in hand.
In the case of Hausdorff Distance, all the three models have the lowest metric when the class is Enhancing Tumor.
In the case of Dice Score, both No SPP and 2 SPP models achieve similar results in Whole Tumor and Tumor Core. Both the models outperform 1 SPP in Whole Tumor however lose out to 1 SPP in Tumor Core. All the models achieve the lowest Dice Score in the case of Enhancing Tumor.
When comparing against models trained on BraTS 2019 and BraTS 2020, the proposed work outperforms them; however, this follows a general trend.
With respect to Brats 2021, the proposed work gives comparable results to Hatamizadeh et al. (
Results obtained in standard setting.
No SPP | 13.070 | 11.010 | 10.210 | 11.430 | 0.908 | 0.877 | 0.838 | 0.870 |
1 SPP | 9.430 | 7.780 | 6.300 | 7.840 | 0.899 | 0.899 | 0.850 | 0.883 |
2 SPP | 16.060 | 5.650 | 5.270 | 8.990 | 0.904 | 0.880 | 0.845 | 0.876 |
Hatamizadeh et al. ( |
4.739 | 15.309 | 16.326 | 12.120 | 0.927 | 0.876 | 0.853 | 0.890 |
Jia and Shu ( |
3.000 | 2.236 | 1.414 | 2.220 | 0.926 | 0.935 | 0.887 | 0.920 |
Qamar et al. ( |
– | – | – | – | 0.875 | 0.837 | 0.795 | 0.840 |
Jiang et al. ( |
4.610 | 4.130 | 2.650 | 3.800 | 0.888 | 0.837 | 0.833 | 0.850 |
The model was also trained on an image size of 160 × 160 × 160 to understand the impact of a smaller image size.
Results obtained with image size 160 × 160 × 160.
No SPP | 34.1 | 7.97 | 7.13 | 16.4 | 0.895 | 0.872 | 0.837 | 0.868 |
1 SPP | 18.6 | 6.13 | 4.88 | 9.87 | 0.887 | 0.879 | 0.842 | 0.869 |
2 SPP | 20.12 | 7.42 | 6.22 | 11.25 | 0.886 | 0.876 | 0.843 | 0.868 |
The following inferences can be made: the difference in average Dice Score between the models trained on the two image sizes was only 0.01; however, a significant difference was observed in the Hausdorff distances. Again, the 1 SPP model performed the best, but the margin of difference was next to none. Hence, we can infer that a high image size is a key contributor to increased performance when SPP is utilized, which supports the point of passing higher-resolution features through the residual connections.
The trend in the metrics can be seen as shown in
Lastly, the output maps from the model are analyzed in
Prediction vs. ground truth segmentation mask comparison.
We propose a U-Net with SPP and attention residual connections in this work. The proposed attachment is a lightweight mechanism which boosts information and context in the model by passing high- and low-resolution information to the decoders in the U-Net. The proposed mechanism is applied to the NvNet model at varying frequencies, producing different variants: a model with attention, a model with attention and 1 SPP, and a model with attention and 2 SPP. The model with 1 SPP and attention performs the best and provides comparable results to heavy models with transformer residual attachments. The average Dice Score and Hausdorff distance for the model with 1 SPP and attention are 0.883 and 7.84, respectively. The proposed mechanism is an approach to boost information and context, hence giving considerable performance boosts. This approach plays well in applications such as edge computing, which require a balance of computational efficiency and performance. Such an approach could be utilized in mobile healthcare stations that need immediate diagnosis with limited computational power. However, the performance improvement at times falls a bit short compared to heavy approaches, which boils down to the extra trainable parameters brought by their components, which eventually capture more patterns. In the current work, the mechanism is adapted to only one particular model; in the future, we aim to make the mechanism adaptable to various other 3D U-Net architectures.
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
KS and C-YC conceptualized and supervised the research, carried out the project administration, and validated the results. SV and TG contributed to the development of the model, data processing, training procedures, and the implementation of the model. SV, TG, and KS wrote the manuscript. SV, TG, KS, PV, and C-YC reviewed and edited the manuscript. C-YC carried out the funding acquisition. All authors contributed to the article and approved the submitted version.
This research was partially funded by Intelligent Recognition Industry Service Research Center from the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan and Ministry of Science and Technology in Taiwan (Grant No. MOST 109-2221-E-224-048-MY2).
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.