Autoencoders for Clinical Data Analysis: Application of Neural Network-Based Dimensionality Reduction on Fine-Needle Aspiration Breast Data
Selçuk TEKGÖZ
Niğde Provincial Health Directorate, Statistics Unit, Niğde, Türkiye.
Derviş TOPUZ*
Department of Medical Services and Techniques, Niğde Zübeyde Hanım Health Services Vocational School, Niğde Ömer Halisdemir University, 51240, Niğde, Türkiye.
*Author to whom correspondence should be addressed.
Abstract
Objective: Machine learning provides powerful tools for analyzing large datasets; however, it faces challenges such as high computational cost and overfitting. To overcome these issues, techniques that reduce the dimensionality of the data are frequently used. Dimensionality reduction aims to eliminate redundant or unnecessary information in the dataset, thereby reducing computational load and improving the model's ability to generate accurate results. The primary objective of this study is to evaluate the performance of the Autoencoder algorithm, one of the dimensionality reduction methods, in terms of data loss, processing time, and the model’s performance on new data.
Materials and Methods: Breast masses can be effectively analyzed using quantitative features of cell nuclei obtained from fine-needle aspiration (FNA) samples. This study evaluated the performance of the Autoencoder algorithm for dimensionality reduction on these features. The analysis was conducted on 569 cases from the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, accessible via an online repository provided by the University of Wisconsin–Madison. The dataset contains quantitative features for each cell nucleus, including radius, smoothness, compactness, and concavity. The Autoencoder algorithm was applied to the entire dataset to reduce dimensionality while preserving relevant information. To illustrate its operation concretely, four primary features from five randomly selected observations were used, demonstrating the algorithm’s behavior on a small, non-linear subset of the data. For comparison, two commonly used dimensionality reduction techniques, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), were also applied. Results indicate that the flexible architecture of Autoencoders effectively captures the most informative features, supporting their applicability to clinical datasets and potential integration into computer-aided diagnostic workflows. This approach provides a reliable foundation for analyzing complex biomedical data and assessing algorithm performance in real-world clinical contexts.
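As a rough illustration of the procedure described above, the following minimal sketch trains a small autoencoder (four inputs, two hidden units, four outputs) on five observations using only the Python standard library. It is not the authors' implementation: the five rows are hypothetical placeholder values rather than WDBC records, and the tanh hidden layer, linear output layer, latent size of two, learning rate, and number of epochs are all assumptions chosen for readability.

```python
import math
import random

# Hypothetical placeholder rows (radius, smoothness, compactness, concavity).
# These are NOT values taken from the WDBC dataset.
X_raw = [
    [12.0, 0.09, 0.08, 0.05],
    [14.5, 0.10, 0.11, 0.09],
    [17.0, 0.11, 0.15, 0.16],
    [11.0, 0.08, 0.06, 0.03],
    [20.0, 0.12, 0.20, 0.22],
]

def standardize(rows):
    """Scale each column to zero mean and unit (population) variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def init(n_out, n_in, rng):
    """Small random weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def encode(x, W1, b1):
    """Hidden representation h = tanh(W1 x + b1)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, b1)]

def decode(h, W2, b2):
    """Linear reconstruction x_hat = W2 h + b2."""
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]

def mse(X, W1, b1, W2, b2):
    """Mean squared reconstruction error over all rows and features."""
    total = 0.0
    for x in X:
        y = decode(encode(x, W1, b1), W2, b2)
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return total / (len(X) * len(X[0]))

def train(X, n_hidden=2, lr=0.01, epochs=3000, seed=0):
    """Per-sample gradient descent on the squared reconstruction error."""
    rng = random.Random(seed)
    n_in = len(X[0])
    W1, b1 = init(n_hidden, n_in, rng), [0.0] * n_hidden
    W2, b2 = init(n_in, n_hidden, rng), [0.0] * n_in
    for _ in range(epochs):
        for x in X:
            h = encode(x, W1, b1)
            y = decode(h, W2, b2)
            err = [yi - xi for yi, xi in zip(y, x)]
            # Backpropagate through the decoder and tanh (derivative 1 - h^2),
            # using the decoder weights as they were before this update.
            dh = [sum(err[i] * W2[i][j] for i in range(n_in)) * (1 - h[j] ** 2)
                  for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    W2[i][j] -= lr * err[i] * h[j]
                b2[i] -= lr * err[i]
            for j in range(n_hidden):
                for k in range(n_in):
                    W1[j][k] -= lr * dh[j] * x[k]
                b1[j] -= lr * dh[j]
    return W1, b1, W2, b2

X = standardize(X_raw)
initial = train(X, epochs=0)  # untrained weights drawn with the same seed
trained = train(X)
loss_before = mse(X, *initial)
loss_after = mse(X, *trained)
latent = [encode(x, trained[0], trained[1]) for x in X]  # 2-D codes
print(f"MSE before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Standardizing the columns first keeps the radius values, which are two orders of magnitude larger than the other features, from dominating the reconstruction error. In practice a library such as scikit-learn or a deep learning framework would replace the hand-written training loop.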
Results: This study focused on a comparative analysis of four key variables (radius mean, smoothness mean, compactness mean, and concavity mean) using both the original dataset and the dataset reconstructed through an Autoencoder model. In the original dataset, the mean ± standard deviation values of these variables were 14.13 ± 3.52, 0.10 ± 0.01, 0.10 ± 0.05, and 0.09 ± 0.08, respectively. At the output layer, the Autoencoder reconstructed the input features while preserving their mean values, yielding corresponding mean ± standard deviation values of 14.13 ± 2.38, 0.10 ± 0.01, 0.10 ± 0.05, and 0.09 ± 0.07. The reduction in standard deviations in the reconstructed dataset, particularly for the radius mean and concavity mean variables, indicates decreased variability and suggests that the model produced a more compact representation while retaining the essential characteristics of the data. The primary objective of the Autoencoder is to ensure that the output closely resembles the original input by utilizing a hidden layer (h) that captures the essential structure of the data. In line with this purpose, the algorithm compressed the four-dimensional input into a more compact latent representation while preserving key characteristics. The analyses showed that the hidden layer representations were highly consistent with the original data and were optimized successfully. Consequently, the dimensionality of the dataset was reduced from four variables to a lower-dimensional representation, enabling a more efficient and informative encoding of the data.
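The role of the hidden layer (h) described above can be stated compactly. The following is a generic single-hidden-layer formulation, not necessarily the exact architecture used in this study; the activation functions f and g, the weights W_1 and W_2, the biases b_1 and b_2, and the latent size k are notational assumptions.

```latex
\begin{aligned}
h &= f(W_1 x + b_1), & x &\in \mathbb{R}^{4},\; h \in \mathbb{R}^{k},\; k < 4,\\
\hat{x} &= g(W_2 h + b_2), & \hat{x} &\in \mathbb{R}^{4},\\
\mathcal{L} &= \frac{1}{n}\sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^{2}, & n &= 569.
\end{aligned}
```

Training minimizes the reconstruction loss over the weights and biases, so the k-dimensional code h must retain whatever information about the four input variables is needed to rebuild them.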
Conclusion: This study evaluated the performance of the Autoencoder algorithm for dimensionality reduction using quantitative features of cell nuclei obtained from fine-needle aspiration (FNA) samples of breast masses. The analysis was conducted on a dataset of 569 cases, and to illustrate the algorithm’s operation, data from four key features (radius, smoothness, compactness, and concavity) of five randomly selected observations were used as examples. This approach allowed for the demonstration of the Autoencoder’s performance on small and non-linear subsets of the data. The findings indicate that dimensionality reduction plays a significant role in clinical data analysis and that the Autoencoder algorithm also reduces computational costs. These results confirm the potential of Autoencoders as a reliable and effective tool for dimensionality reduction. Consequently, the use of Autoencoders can enable faster, more accurate, and more efficient processing of healthcare data, thereby enhancing the effectiveness of clinical decision support systems.
Keywords: Machine learning, dimensionality reduction, PCA, LDA, autoencoders, breast cancer