Cancer Management and Research » Volume 12

Localization of Nuclei in Breast Cancer Using Whole Slide Imaging System Supported by Morphological Features and Shape Formulas

Authors Kumar A, Prateek M

Received 3 February 2020

Accepted for publication 25 May 2020

Published 16 June 2020 Volume 2020:12 Pages 4573–4583


Checked for plagiarism Yes

Review by Single anonymous peer review

Peer reviewer comments 2

Editor who approved publication: Dr Eileen O'Reilly

Anil Kumar, Manish Prateek

School of Computer Science, University of Petroleum and Energy Studies, Dehradun 248007, India

Correspondence: Anil Kumar Email [email protected]

Purpose: Cancer rates are increasing exponentially worldwide, and according to the World Cancer Report over 15 million new cases are expected in 2020. To support the clinical diagnosis of the disease, recent technical advances in digital microscopy aim to reduce the cost and increase the efficiency of the process. The Food and Drug Administration (FDA) has issued guidelines, in particular for the development of digital whole slide image scanning systems, which are very helpful for the computer-aided diagnosis of breast cancer.
Methods: Whole slide imaging (WSI) is supported by fluorescence, immunohistochemistry, and multispectral imaging concepts. Because of the high dimensionality of WSI images and the associated computation, finding the region of interest (ROI) in a malignant sample image is a challenging task. Unsupervised machine learning and quantitative analysis of malignant sample images, supported by morphological features and shape formulas, are used to find the correct ROI. Owing to computational limitations, the method works on small patches, integrates the results, and automatically localizes the ROI. The automated ROI is also compared with the handcrafted ROI provided in the ICIAR 2018 dataset.
Results: A total of 10 hematoxylin and eosin (H&E) stained malignant breast histology whole slide image samples were used, labeled and annotated by two medical experts who are team members of the ICIAR 2018 challenge. The proposed methodology successfully localizes the malignant patches of the WSI sample images and obtains the ROI with an average accuracy of 85.5%.
Conclusion: With the help of the k-means clustering algorithm, morphological features, and shape formulas, it is possible to recognize the region of interest in whole slide images.

Keywords: unsupervised machine learning, morphological features, shape formulas, ROI, WSI, H&E stained images, breast cancer


Introduction

With a million new cases being reported every year, cancer seems to be tightening its grip throughout the world, especially in Africa, Asia, and Central and South America. Breast cancer is one of the major causes of cancer-related death in women of all ages worldwide.1 Early diagnosis and treatment of breast cancer markedly slow the disease's progression and reduce its morbidity rate.2 Whole slide imaging (WSI) is an emerging field of digital pathology based on digital microscopy. It explores methods and applications that enhance clinical care and cancer diagnosis. The first automated high-dimensional WSI system was developed by Wetzel and Gilbertson in 1999. Following many innovations in imaging hardware, methods, and applications, practitioners in digital pathology have adopted these technological advancements, and their use is steadily growing.3 In contrast to the conventional approach to pathology, WSI produces virtual slides at a high resolution of less than 0.5 µm/pixel, which can be examined and explored with interactive software on a good computer screen.4

The WSI concept holds exceptional promise for digital pathology. At the same time, it is limited by several factors, including image quality, the inability to view the entire slide at high resolution, navigation control, the extended time needed to review a slide accurately, and the adaptability of the system. Pathologists and medical specialists face a growing demand to improve quality, patient safety, and diagnostic accuracy with high precision. These factors motivate developers to build systems that can optimize access to expert opinions and highly specialized pathology services. Digital pathology networks based on WSI systems provide a potential solution to all of these challenges and will undoubtedly play a critical role in the future. Current efforts focus primarily on the adoption of such concepts in pathology, and as a result many hospitals and agencies are accepting them. A growing body of research and literature describes the validation of WSI systems. However, because of the gigapixel image size and the high computational cost, finding the ROI in a WSI remains a major challenge. Hou et al5 implemented a patch-level classifier by training a CNN on image patches rather than on whole images.

Much related work has been done in the past few years, most of it using machine learning-driven approaches. Hou et al5 trained a CNN model on patches of size 500×500 extracted from large WSIs, applying patch extraction, segmentation, and a likelihood step in an Expectation-Maximization (EM) based method to identify the appropriate patches for CNN training. Mercan et al6 noted that WSI enables researchers to view digital slides and gain a new understanding of cancer diagnosis and decision-making systems. Using a sliding window and a visual bag-of-words approach, they formulated the detection of relevant ROIs and achieved an accuracy of 74%. Other algorithms, such as patch-based nonlinear image registration, have been applied to human lung cancer detection. Lotz et al7 proposed a two-step solution that first registers the complete image at low resolution with a nonlinear deformation model and then refines this result on high-resolution patches by applying a second nonlinear registration to each patch. The information already computed for the patches in the first step is reused in the second, high-resolution step. Chang et al8 proposed and described two spatial pyramid matching approaches, based on morphometric features and morphometric sparse codes, respectively, for tissue image classification. This extends pixel-level feature extraction to patch-level feature extraction and works across different types of tumors.

Materials and Methods

WSI Samples

The ICIAR 2018 Grand Challenge on breast cancer histology images (BACH) dataset is composed of hematoxylin and eosin (H&E) stained histology microscopy and whole slide images.9 A total of 30 WSIs are provided for training and 10 WSIs for algorithm testing. All 10 malignant WSI samples have pixel-wise annotated regions for the benign, in situ carcinoma, and invasive carcinoma classes, labeled by pathologists. All whole slide images are in .svs format, use the RGB color model, have a pixel scale of 0.467 µm/pixel and a variable size (eg, 42113×62625), and were acquired with a Leica SCN400. The work uses the Python programming language (Python 3.6, 32-bit), supported by the open-source library packages NumPy, sklearn, OpenCV, imutils, scikit-image, OpenSlide, and matplotlib.


Figure 1 shows the major steps of the proposed methodology, which detects an accurate ROI in WSI sample images by using image segmentation and nuclei counting to target the discriminative regions. The process starts with the scanned microscopy WSI images as input. Because of the high resolution of the WSI sample images, it is computationally very difficult to process a whole image at once. The resolution of the sample image is 15368 × 17496 with three color channels (Figure 1), so it is better to split it into a number of patches. To reduce the computation, the sample image is therefore split into 64 individual patches, each with dimensions of 1921 × 2187 (Figure 2A and B). If the patches are still too large, they can be split further, up to the limit set by the scanner. Every patch is then segmented with an efficient unsupervised learning method, the k-means clustering algorithm.10 This suppresses the anomalies and smooths the sample image.
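The patch-splitting step can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code: the function name `split_into_patches` and the 8 × 8 grid layout are assumptions, chosen because 8 × 8 = 64 patches and 15368 / 8 = 1921, 17496 / 8 = 2187. A small stand-in array is used instead of a real slide.

```python
import numpy as np

def split_into_patches(image, rows=8, cols=8):
    """Split an H x W x C array into rows*cols equal patches keyed by (row, col)."""
    height, width = image.shape[:2]
    # Integer division: any remainder pixels at the right/bottom edge are dropped.
    patch_h, patch_w = height // rows, width // cols
    patches = {}
    for r in range(rows):
        for c in range(cols):
            patches[(r, c)] = image[r * patch_h:(r + 1) * patch_h,
                                    c * patch_w:(c + 1) * patch_w]
    return patches

# Stand-in for a slide region; a 15368 x 17496 x 3 sample would yield
# 64 patches of 1921 x 2187 under the same 8 x 8 grid assumption.
demo = np.zeros((160, 240, 3), dtype=np.uint8)
patches = split_into_patches(demo)
print(len(patches), patches[(0, 0)].shape)
```

Because the patches are NumPy views, no pixel data is copied until a patch is modified, which keeps the memory overhead of the split small.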

Algorithm 1: k-means clustering

Input:

k: the number of clusters,

D: a data set containing n objects.

Output: A set of k clusters.

  1. arbitrarily choose the cluster centroids $c_1, \ldots, c_k$ as k objects from D;
  2. repeat
  3. (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
  4. update the cluster means, that is, calculate the mean value of the objects for each cluster;
  5. until no change;
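The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code (which lists sklearn among its packages): the function name `kmeans`, the Euclidean distance, the fixed random seed, and the toy pixel values are all chosen for the example.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Plain k-means on an (n, d) float array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k distinct objects from D as initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # Step 3: (re)assign each object to the nearest centroid.
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 5: stop when the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: update each centroid to the mean of its members.
        for j in range(k):
            members = data[labels == j]
            if len(members):  # keep the old centroid if a cluster becomes empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy "pixels" (made-up RGB values): two bright and two dark-purple entries.
pixels = np.array([[250, 240, 245], [245, 235, 250],
                   [60, 20, 90], [70, 30, 100]], dtype=float)
labels, centroids = kmeans(pixels, k=2)
# Replacing every pixel by its centroid gives the smoothed, segmented result.
segmented = centroids[labels]
```

For an image patch, the same function would be applied to the pixels reshaped to an (n, 3) array, and `segmented` reshaped back to the patch dimensions.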

Figure 1 Workflow of proposed methodology.

Figure 2 Split high resolution WSI sample image (A) into low dimensional patches in (B).

After the anomalies have been removed and the patch has been smoothed, it is converted into an 8-bit gray-level image. In the next step, Otsu's method is applied for thresholding.11,12 It is one of the most efficient methods for obtaining a threshold value and is always applied to the gray-level histogram. Let the pixels of a given picture be represented in L gray levels [1, 2, …, L]. The number of pixels at level i is denoted by $n_i$ and the total number of pixels by $N = n_1 + n_2 + \ldots + n_L$. To simplify the discussion, the gray-level histogram is normalized and regarded as a probability distribution:

$$p_i = \frac{n_i}{N}, \qquad p_i \ge 0, \qquad \sum_{i=1}^{L} p_i = 1$$

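Otsu's method searches iteratively over the candidate thresholds t for the one that minimizes the within-class variance, which is the sum of the two class variances multiplied by their associated weights:

$$\sigma_w^2(t) = w_b(t)\,\sigma_b^2(t) + w_f(t)\,\sigma_f^2(t)$$

where

$w_b(t)$, $w_f(t)$ = the weights of the two classes, that is, the fractions of pixels below and above the threshold

$\sigma_b^2(t)$ = the variance of the pixels in the background (below threshold)

$\sigma_f^2(t)$ = the variance of the pixels in the foreground (above threshold)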
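The threshold search described above can be sketched as an exhaustive scan over the 256 gray levels of an 8-bit image. This is a minimal NumPy illustration, not the authors' implementation: the function name `otsu_threshold`, the convention that pixels with value ≤ t form the lower class, and the synthetic patch values are assumptions made for the example.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t that minimises the within-class variance.

    Pixels with value <= t form the lower class, the rest the upper class.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()                 # normalised histogram p_i
    levels = np.arange(256)
    best_t, best_var = 0, np.inf
    for t in range(255):
        w_lo, w_hi = prob[:t + 1].sum(), prob[t + 1:].sum()
        if w_lo == 0 or w_hi == 0:
            continue                         # one class is empty at this t
        mu_lo = (levels[:t + 1] * prob[:t + 1]).sum() / w_lo
        mu_hi = (levels[t + 1:] * prob[t + 1:]).sum() / w_hi
        var_lo = (((levels[:t + 1] - mu_lo) ** 2) * prob[:t + 1]).sum() / w_lo
        var_hi = (((levels[t + 1:] - mu_hi) ** 2) * prob[t + 1:]).sum() / w_hi
        within = w_lo * var_lo + w_hi * var_hi   # weighted sum of class variances
        if within < best_var:
            best_t, best_var = t, within
    return best_t

# Synthetic bimodal patch (made-up values): dark pixels near 40-50,
# bright pixels near 200-210.
gray = np.array([[40, 45, 50, 200],
                 [42, 205, 210, 48],
                 [201, 44, 207, 46]], dtype=np.uint8)
t = otsu_threshold(gray)
dark_mask = gray <= t    # True for pixels in the darker class
```

In practice a library routine would normally be used; the explicit loop is shown only to make the weighted-variance criterion visible.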

Otsu's method and the Canny edge detector are used to find the global threshold value and the edges in the patches, respectively (Figure 4C). Morphological features and shape formulas then help to find the discriminative region of interest among the different nuclear sections present in the sample data. The area, roundness, solidity, and compactness of a nuclear section play a very important role in distinguishing it from the varied surroundings in the sample images. The approximate size of the nuclear section is first specified by the experts (pathologists); after that, the system handles the process automatically. The result is an effective and approximately accurate count of the nuclei present in each individual patch. Based on this nuclei count, it is easy to define the appropriate ROI among the patches.


Figure 3 (A) Implementation of the k-means clustering algorithm on an image patch at a magnification level of x40 and (B) results after clustering at the same magnification level.

where B represents the dimensions of the image in terms of rows and columns.


where P is the perimeter of the nuclear section.
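As a sketch of how such shape descriptors can be computed from a nuclear contour, the following pure-Python example assumes commonly used definitions: roundness = 4πA/P², compactness = P²/(4πA), and solidity = A / (convex hull area). These definitions, all function names, and the acceptance thresholds in `looks_like_nucleus` are assumptions for illustration and may differ from the formulas used in this work.

```python
import math

def polygon_area(points):
    """Shoelace formula for a simple polygon given as [(x, y), ...]."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_perimeter(points):
    """Sum of edge lengths around the closed contour."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:] + points[:1]))

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def shape_features(contour):
    """Area A, perimeter P and three assumed shape descriptors of a contour."""
    area = polygon_area(contour)
    perimeter = polygon_perimeter(contour)
    hull_area = polygon_area(convex_hull(contour))
    return {
        "area": area,
        "perimeter": perimeter,
        "roundness": 4 * math.pi * area / perimeter ** 2,     # 1.0 for a circle
        "compactness": perimeter ** 2 / (4 * math.pi * area),  # inverse of roundness
        "solidity": area / hull_area,                          # 1.0 for convex shapes
    }

def looks_like_nucleus(f, min_area, max_area, min_roundness=0.6, min_solidity=0.85):
    """Hypothetical filter: area range from the expert, other cut-offs assumed."""
    return (min_area <= f["area"] <= max_area
            and f["roundness"] >= min_roundness
            and f["solidity"] >= min_solidity)

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
l_shape = [(0, 0), (10, 0), (10, 4), (4, 4), (4, 10), (0, 10)]
square_features = shape_features(square)
l_features = shape_features(l_shape)
```

With these toy contours, the convex square passes a filter such as `looks_like_nucleus(square_features, 50, 200)`, while the concave L-shape is rejected because its solidity and roundness fall below the assumed cut-offs.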




Results

The proposed methodology was implemented in the Python programming language, supported by open-source library packages. The data come from the ICIAR 2018 grand challenge on breast cancer histology images.9 All samples are high-resolution H&E stained WSI images. The challenge provides a total of 40 WSI samples and approximately 400 labeled microscopy images. Of the 40 WSIs, 30 are for training and 10 are for testing.

The k-means clustering algorithm is applied to a sample patch (Figure 3A). The anomalies are suppressed and the nuclear sections become more visible (Figure 3B). Next, the output of k-means clustering (Figure 4A) is converted into an 8-bit gray-level image (Figure 4B). Otsu's method is then applied to find the threshold, and the Canny edge detection algorithm to find the edges (Figure 4C). The most important step is to target the nuclear sections in the patch and count them accurately. The segmentation and localization of nuclei are shown in Figure 4D.

Figure 4 Implementation steps and results at magnification level of x40: (A) represents the output of the k-means clustering algorithm, (B) represents the 8-bit gray-level image, (C) represents the output of Otsu's method and the Canny edge detector, and (D) represents the localization of nuclei.

This work is motivated by WSI concepts following the DICOM standard. Jackson et al13 studied 720 high-dimensional pathology images of tumor tissue from a total of 352 patients with breast cancer and quantified 35 biomarkers. In similar work, Mercan et al14 applied a bag-of-words model to predict diagnostically relevant regions in unseen whole slide images and achieved a detection accuracy of 75%. Kumar et al15 dealt with feature extraction methodology, which plays a major role in targeting the points of interest based on defined features such as the area of the nuclear section and its roundness, solidity, and compactness.

Figure 5 shows the steps and corresponding results for targeting and localizing the region of interest (ROI). The input sample image (Figure 5A) is split into 64 equal patches (Figure 5B). The quantitative representation of every low-resolution patch in Figure 5B is given in Figure 5C. The frequency of targeted nuclei reaches its maximum of 375 at patch location (1, 2) (Figure 5D), whereas the frequency in the other patches is relatively very low. Statistical measures such as the mean, standard deviation, and variance are used to select the threshold frequency, so the patch at location (1, 2) is eligible to become the region of interest. To validate the results, this work was performed on the ICIAR 2018 grand challenge on breast cancer histology images. This dataset is composed of hematoxylin and eosin (H&E) stained breast histology microscopy and whole slide images. It contains various data samples, but this work targets only the 10 pixel-wise labeled malignant WSI samples, of which only 6 are shown in the results (Figure 6A). All 10 available samples were labeled and annotated by two medical experts who are team members of the challenge. Each image is labeled as benign, in situ carcinoma, or invasive carcinoma, and the remaining unlabeled part of the image is considered normal. Before the method was applied, the three categories (benign, in situ carcinoma, and invasive carcinoma) were merged into one category called the affected area, or targeted regions of interest (ROIs), represented in red; the rest forms the normal category, represented in black. All targets follow the experts' suggestions (Figure 6B).
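The selection of ROI patches from per-patch nuclei counts can be sketched as below. The text names the mean, standard deviation, and variance but does not give the exact rule, so the rule used here (threshold = mean + k × standard deviation, with k = 1) is an assumption, as are the function name `select_roi_patches` and all counts other than the value 375 at location (1, 2), which are made-up illustration values.

```python
from statistics import mean, pstdev

def select_roi_patches(counts, k=1.0):
    """Return (threshold, locations) for patches whose count exceeds mean + k*std.

    `counts` maps a patch location (row, col) to its nuclei count.
    The mean + k*std rule is an assumed threshold, not taken from the source.
    """
    values = list(counts.values())
    threshold = mean(values) + k * pstdev(values)
    selected = [loc for loc, n in counts.items() if n > threshold]
    return threshold, selected

# Hypothetical counts for six patches; only the 375 at (1, 2) comes from the text.
counts = {(1, 1): 40, (1, 2): 375, (1, 3): 25,
          (2, 1): 30, (2, 2): 55, (2, 3): 15}
threshold, roi = select_roi_patches(counts)
print(round(threshold, 2), roi)
```

Raising k makes the selection stricter, which matters when several patches have moderately high counts rather than one clear maximum.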

Figure 5 Steps and results of targeting ROI (A) high resolution H&E stained sample image, (B) split it into equal low dimension patches, (C) counting of nuclei present in each patch and (D) represents pixel-wise points of interest (nuclei) at magnification level of x40.

Figure 6 Targeted ROI on WSI image samples (A) represents WSI malignant samples of breast cancer in .svs format with a pixel scale of 0.467 µm/pixel, (B) represents labeled and annotated by medical experts of the BACH challenge, and (C) represents automated localized ROI based on counting of nuclei.

After all the steps in Figure 1 have been followed, the automatically localized region of interest is obtained (Figures 6 and 7). A threshold frequency is calculated with statistical measures such as the mean, standard deviation, and variance from the nuclei frequencies of all the patches (Figure 7B and C), which are derived in Figure 7D. The resulting threshold frequency helps to select the patches of interest, so the predicted region of interest (Figure 7F) can be easily localized and compared with the ground truth (Figure 7E). The mathematical measures used to evaluate the accuracy of the semantic segmentation model are listed in Table 1.16 After these measures were applied, the averages over all 10 malignant WSI samples were calculated in percent, except for the Kappa score. The kappa score is at most 1, which indicates perfect agreement, while values at or below 0 indicate no agreement beyond chance. The average values are 2.3%, 1.5%, 88%, 90.04%, 0.69, 86.3%, and 85.5% for MSE, RMSE, SSIM, Pixel Accuracy, Kappa Score, F1 Score, and IoU, respectively (Table 2). The Dice coefficient (F1 Score) and IoU are well-known, efficient measures for evaluating segmentation accuracy. The Dice coefficient is very similar to IoU, and both range from 0 to 1, but here they are reported in percent. Rezatofighi et al17 discussed the importance of intersection over union (IoU), one of the most popular benchmarks for object detection. IoU is the commonly used metric for comparing the similarity between two arbitrary shapes based on their overlap.
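To make the overlap-based measures concrete, the sketch below computes pixel accuracy, IoU, the Dice coefficient (F1 score), and a kappa score for a pair of binary masks. It is an illustration under assumptions: the function name `segmentation_scores` is hypothetical, the kappa is taken to be Cohen's kappa, the two 4 × 4 masks are made-up values, and the definitions in Table 1 may differ in detail.

```python
import numpy as np

def segmentation_scores(truth, pred):
    """Overlap measures for two binary masks of equal shape (True = ROI)."""
    truth, pred = np.asarray(truth, dtype=bool), np.asarray(pred, dtype=bool)
    tp = int(np.sum(truth & pred))     # ROI in both masks
    fp = int(np.sum(~truth & pred))    # predicted ROI, actually normal
    fn = int(np.sum(truth & ~pred))    # missed ROI
    tn = int(np.sum(~truth & ~pred))   # normal in both masks
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    iou = tp / (tp + fp + fn)          # undefined if both masks are empty
    dice = 2 * tp / (2 * tp + fp + fn)
    # Cohen's kappa (assumed): observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {"accuracy": accuracy, "iou": iou, "dice": dice, "kappa": kappa}

# Made-up 4 x 4 masks: expert annotation versus automated localisation.
truth = [[1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
pred = [[0, 1, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
scores = segmentation_scores(truth, pred)
print(scores)
```

For a single pair of masks the two overlap measures are tied by Dice = 2·IoU / (1 + IoU), so Dice is never smaller than IoU; averages over several images need not satisfy this identity exactly.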

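Algorithm 2: Generalized Intersection over Union (IoU)

Input: Two arbitrary shapes: $A, B \subseteq \mathbb{S} \in \mathbb{R}^n$

Output: IoU, GIoU

  1. For A and B, find the smallest enclosing convex object C, where $C \subseteq \mathbb{S} \in \mathbb{R}^n$
  2. $IoU = \frac{|A \cap B|}{|A \cup B|}$
  3. $GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$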
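For the special case of axis-aligned rectangles, Algorithm 2 can be sketched in a few lines of plain Python. This follows the formulation of Rezatofighi et al17 as understood here, with the smallest enclosing box standing in for the smallest enclosing convex object; the function name `iou_and_giou`, the `(x1, y1, x2, y2)` box format, and the example coordinates are assumptions for illustration.

```python
def iou_and_giou(box_a, box_b):
    """IoU and GIoU of two axis-aligned boxes (x1, y1, x2, y2), x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width/height clamp to 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest enclosing box C of A and B.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    iou = inter / union
    giou = iou - (c_area - union) / c_area   # penalise empty space inside C
    return iou, giou

overlapping = iou_and_giou((0, 0, 2, 2), (1, 1, 3, 3))
disjoint = iou_and_giou((0, 0, 1, 1), (2, 0, 3, 1))
print(overlapping, disjoint)
```

The disjoint case shows the motivation for the generalized form: IoU is 0 for any pair of non-overlapping boxes, whereas GIoU becomes more negative as the boxes move further apart.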

Table 1 Mathematical Definition/Formula of Relevant Measures

Table 2 Comparison of Different Accuracy Measure Implemented on Proposed Segmentation Results

Figure 7 Steps implemented on BACH high resolution WSI sample images (A) H&E stained malignant sample image, (B) split it into low dimension, (C) target the patch based on highest counting of nuclei (D) using algorithm, count the number of nuclei for each patch, (E) manually annotated by experts, and (F) automated annotation of target patches.

The manually annotated regions of interest are shown in Figure 6B, and the corresponding automatically localized ROIs in Figure 6C. Algorithm 2 is applied to calculate the IoU on the BACH dataset, which contains 10 annotated WSI samples, giving an average accuracy of 85.5%; this compares well with the other methods listed in Table 3, which gives a quantitative accuracy comparison of the different methods. This result is based on splitting each WSI sample into 64 patches. Splitting the samples further would be expected to increase the accuracy, but the computational cost would increase as well.

Table 3 Quantitative Accuracy Comparison of Different Method vs Paper Results


Discussion

This study will help pathologists to diagnose cancer patients effectively and with more precision, and will reduce the need for multiple opinions. Pathologists spend most of their time on the diagnosis of sample tissues. Because of the complexity of visualization and the presence of multiple regions of interest, this consumes a great deal of time and effort. Mercan et al6 used a visual bag-of-words model with texture and color features to describe these regions and trained probabilistic classifiers to predict similar regions of interest in new whole slide images. Their paper used 240 WSI samples of breast biopsies covering 5 levels of cancer, from normal to malignant, and achieved 79.8% accuracy in finding the correct ROI. Apou et al18 described a fast segmentation method coupled with an intuitive multiclass supervised classification that captures expert knowledge presented as morphological annotations to establish a cartography of a WSI and highlight biological regions of interest. Zhang et al19 also discussed whole slide cancer diagnosis with a deep learning algorithm. They proposed a method that automates the human-like diagnostic reasoning process and translates gigapixel images directly into a series of interpretable predictions, providing second opinions and encouraging consensus in clinical pathology. Guo et al20 applied the v3 DCNN model and obtained an FROC of 83.5%. That study used the Camelyon16 dataset to automatically produce a heatmap of the WSI and extract polygons of lesion regions for doctors.

All of the studies below were carried out on the ICIAR 2018 BACH WSI dataset. Marami et al21 proposed an automated classification method for identifying the micro-structures of tissues using an ensemble of convolutional neural networks, with an accuracy of 55.26%. Aresta et al22 proposed an algorithm for the classification and localization of clinically relevant histopathological classes in the microscopy and annotated WSI datasets. The submitted algorithm was an improved version of state-of-the-art convolutional neural networks and achieved an average accuracy of 69% in automatically identifying and classifying the ROI. Nawaz et al23 tried to reduce the cost of collecting medical data by applying techniques such as mirroring, rotation, and fine-tuning of pre-trained networks. Continuing this work, they fine-tuned a deep convolutional neural network (ALEXNET) and achieved an average accuracy of 75.73%. Golatkar et al24 proposed an algorithm based on fine-tuning the Inception-v3 convolutional neural network. Patches are extracted based on nuclear density, and patches that are not rich in nuclei are rejected. Every patch with high nuclear density is accepted, and the classes are assigned by majority voting with an average accuracy of 79%. Roy et al25 proposed a patch-based classifier using a CNN for automatic classification of the WSI dataset. The patch-based classifier first predicts the class label of each patch, one patch in one decision (OPOD), and then applies majority voting to make the final decision about the class label of the WSI sample image. The average patch-wise classification accuracy of the algorithm is 81.05%. Yan et al26 proposed a new hybrid convolutional and recurrent deep neural network for the classification of breast cancer histopathological images. The algorithm is based on multilevel feature representation and combines the advantages of convolutional and recurrent neural networks. It preserves the short-term and long-term spatial correlations between patches and obtained an average accuracy of 82.1% for the normal class.


Conclusion

One of the major challenges of all neural network-driven learning approaches is the availability of labeled data, which must also be authentic. The neural network classification model has to be tuned again for every different dataset. This study was carried out on the ICIAR 2018 Grand Challenge BACH (breast cancer histology images) dataset. The proposed methodology was applied to a total of 10 annotated malignant WSI test samples and successfully localized the region of interest with an accuracy of 85.5%, measured using IoU. It is based on unsupervised machine learning supported by morphological features and shape formulas. The proposed study focuses on localizing the region of interest so that it helps the pathologist to make correct and timely decisions about the level of malignancy and further treatment. The results could be improved with a sufficient number of diverse annotated WSI sample datasets. Advances in hardware are, of course, equally important.


Abbreviations

WSI, whole slide image; ROI, region of interest; H&E, hematoxylin and eosin; FDA, Food and Drug Administration; BACH, breast cancer histology images; EM, expectation-maximization; IoU, intersection over union; DICOM, digital imaging and communications in medicine; HIC, histopathological image classification; CBHIR, content-based histopathological image retrieval; FROC, free response receiver operating characteristic; DCNN, deep convolution neural network; ICIAR, International Conference on Image Analysis and Recognition.

Ethics Approval and Consent

Not required as the data set is available online and can be used by anyone for research purposes.


Disclosure

The authors report no funding and no conflicts of interest in this work.


References

1. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2017. CA Cancer J Clin. 2017;67(1):7–30.

2. Smith RA, Cokkinides V, Eyre HJ. American Cancer Society guidelines for the early detection of cancer, 2006. CA Cancer J Clin. 2005;55(1):31–44. doi:10.3322/canjclin.55.1.31

3. Ghaznavi F, Evans A, Madabhushi A, Feldman M. Digital imaging in pathology: whole-slide imaging and beyond. Annu Rev Pathol Mech Dis. 2013;8:331–359. doi:10.1146/annurev-pathol-011811-120902

4. Gilbertson JR, Ho J, Anthony L, Jukic DM, Yagi Y, Parwani AV. Primary histologic diagnosis using automated whole slide imaging: a validation study. BMC Clin Pathol. 2006;6:4. doi:10.1186/1472-6890-6-4

5. Hou L, Samaras D, Kurc TM, Gao Y, Davis JE, Saltz JH. Patch-based convolutional neural network for whole slide tissue image classification. Proceedings of the IEEE conference on computer vision and pattern recognition; 2016:2424–2433. arXiv:1504.07947v5 [cs.CV].

6. Mercan E, Aksoy S, Shapiro LG, Weaver DL, Brunyé TT, Elmore JG. Localization of diagnostically relevant regions of interest in whole slide images: a comparative study. J Digit Imaging. 2016;29(4):496–506. doi:10.1007/s10278-016-9873-1

7. Lotz J, Olesch J, Müller B, et al. Patch-based nonlinear image registration for gigapixel whole slide images. IEEE Trans Biomed Eng. 2016;63(9):1812–1819. doi:10.1109/TBME.2015.2503122

8. Chang H, Borowsky A, Spellman P, Parvin B. Classification of tumor histology via morphometric context. IEEE; 2013. doi:10.1109/CVPR.2013.286

9. ICIAR 2018: grand challenge on breast cancer histology images. Available from: Accessed March 11, 2018.

10. Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. 3rd ed. Waltham, MA, USA: Morgan Kaufmann Publishers; 2012.

11. Otsu N. A threshold selection method from gray-level histogram. IEEE Trans Syst Man Cybern. 1979;9(1):62–66. doi:10.1109/TSMC.1979.4310076

12. Gonzalez RC, Woods RE. Digital Image Processing. 3rd ed. Prentice Hall; 2002.

13. Jackson HW, Fischer JR, Zanotelli VRT, et al. The single-cell pathology landscape of breast cancer. Nature. 2020;578(7796):615–620. doi:10.1038/s41586-019-1876-x

14. Mercan E, Selim A, Shapiro LG, Weaver DL, Brunyé TT, Elmore JG. Localization of diagnostically relevant regions of interest in whole slide images: a comparative study. J Digit Imaging. 2014. doi:10.1109/ICPR.2014.212

15. Kumar R, Srivastava R, Srivastava S. Detection and classification of cancer from microscopic biopsy images using clinically significant and biologically interpretable features. J Med Eng. 2015;2015:Article ID 457906, 14 pages. doi:10.1155/2015/457906

16. Li J, Zhang Q. Assessing the accuracy of predictive models for numerical data: not r nor r2, why not? Then what? PLoS One. 2017;12(8):e0183250. doi:10.1371/journal.pone.0183250

17. Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S. Generalized intersection over union: a metric and a loss for bounding box regression. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019:658–666. arXiv:1902.09630 [cs.CV].

18. Apou G, Naegel B, Forestier G, Feuerhake F, Wemmert C. Efficient region-based classification for whole slide images. Computer Vision, Imaging and Computer Graphics - Theory and Applications. VISIGRAPP 2014. Communications in Computer and Information Science. Vol 550. Springer, Cham; 2015. doi:10.1007/978-3-319-25117-2_15

19. Zhang Z, Chen P, McGough M, et al. Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nat Mach Intell. 2019;1:236–245. doi:10.1038/s42256-019-0052-1

20. Guo Z, Liu H, Ni H, et al. A fast and refined cancer regions segmentation framework in whole-slide breast pathological images. Sci Rep. 2019;9:882. doi:10.1038/s41598-018-37492-9

21. Marami B, Prastawa M, Chan M, Donovan M, Fernandez G, Zeineh J. Ensemble Network for Region Identification in Breast Histopathology Slides. Springer International Publishing AG, part of Springer Nature; 2018. doi:10.1007/978-3-319-93000-8_98

22. Aresta G, Araujo T, Kwok S, et al. BACH: grand challenge on breast cancer histology images. Med Image Anal. 2019;56:122–139. doi:10.1016/

23. Nawaz W, Ahmed S, Tahir A, Khan HA. Classification of Breast Cancer Histology Images Using ALEXNET. Springer International Publishing AG, part of Springer Nature; 2018. doi:10.1007/978-3-319-93000-8_99

24. Golatkar A, Anand D, Sethi A. Classification of Breast Cancer Histology Using Deep Learning. Springer International Publishing AG, part of Springer Nature; 2018. doi:10.1007/978-3-319-93000-8_95

25. Roy K, Banik D, Bhattacharjee D, Nasipuri M. Patch-based system for classification of breast histology images using deep learning. Comput Med Imaging Graph. 2018. doi:10.1016/j.compmedimag.2018.11.003

26. Yan R, Ren F, Wang Z, et al. Breast cancer histopathological image classification using a hybrid deep neural network. Methods. 2019. doi:10.1016/j.ymeth.2019.06.014

Creative Commons License © 2020 The Author(s). This work is published and licensed by Dove Medical Press Limited. The full terms of this license are available at and incorporate the Creative Commons Attribution - Non Commercial (unported, v3.0) License. By accessing the work you hereby accept the Terms. Non-commercial uses of the work are permitted without any further permission from Dove Medical Press Limited, provided the work is properly attributed. For permission for commercial use of this work, please see paragraphs 4.2 and 5 of our Terms.