Unsupervised clustering
The asthma phenotypes were derived using Deep Embedded Clustering (DEC)13. DEC combines deep learning with clustering, allowing complex patterns in data to be discovered and providing a robust, scalable solution for clustering large datasets without the need for labelled data13. This makes it particularly valuable for applications where the true cluster structure is unknown or hard to define a priori13. DEC has two advantages over traditional clustering methods. First, it learns a lower-dimensional representation (feature space) of the data using deep autoencoders. This feature space is more suitable for clustering because it is compact, allowing DEC to outperform traditional methods that either do not involve feature learning or rely on simpler, linear dimensionality reduction techniques13. Second, DEC's iterative optimization process uses distance metrics to optimize the feature representation and the cluster assignments jointly, which traditional methods such as k-means or spectral clustering cannot do13. These qualities make DEC particularly effective for complex datasets, offering improved clustering accuracy and efficiency in handling large datasets.
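The joint optimization described above can be illustrated with a minimal Python sketch of the quantities defined in the original DEC formulation: a Student's t soft assignment of each embedded point to each cluster centroid, a sharpened target distribution, and the Kullback-Leibler divergence between the two. This is an illustration only, not the implementation used in this study; the autoencoder is omitted, and the embedded points, centroids, and function names are hypothetical.

```python
import math


def soft_assignments(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment q_ij of embedded point i to centroid j."""
    q = []
    for z in embeddings:
        row = []
        for mu in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([value / total for value in row])
    return q


def target_distribution(q):
    """Sharpened targets: p_ij proportional to q_ij^2 / f_j, with f_j = sum_i q_ij."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q), the loss minimised during DEC's refinement phase."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
        if p_ij > 0
    )


# Hypothetical 2-D embedded points (standing in for autoencoder output)
# and two hypothetical centroids.
embeddings = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]

q = soft_assignments(embeddings, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
```

In the full algorithm, the gradient of this loss is propagated to both the centroids and the encoder weights, and the target distribution is recomputed periodically, which is what allows the representation and the cluster assignments to be refined together.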
After the data were processed, the R package NbClust was used to determine the optimal number of clusters using voting consensus methods14. Additionally, the optimal number of clusters was confirmed using a Monte Carlo reference-based consensus clustering approach15, implemented through the M3C R package16. The output was then passed to the DEC algorithm to perform the clustering. The clusters were subsequently validated using the prediction strength approach. The numbers of clusters proposed by these metrics were then evaluated in conjunction with clinical experience before the optimal number of clusters to represent the data was finalized. The resulting clusters were then named according to their distribution across the variables used to derive them. A detailed statistical implementation is presented in the Supplementary file.
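As a rough illustration of the prediction strength idea used for validation, the Python sketch below computes the statistic from two labelings of the same held-out observations: one obtained by clustering the held-out data directly, and one obtained by assigning each held-out observation to the nearest centroid learned from the training data. For each held-out cluster, it measures the proportion of member pairs that the training-based assignment also places together, and reports the minimum over clusters. This is a sketch under stated assumptions and not the procedure described in the Supplementary file; the function name, the handling of single-member clusters, and the example labels are hypothetical.

```python
from itertools import combinations


def prediction_strength(test_labels, train_based_labels):
    """Minimum, over held-out clusters, of the share of co-assigned member pairs."""
    # Group held-out observations by the cluster found on the held-out data.
    clusters = {}
    for index, label in enumerate(test_labels):
        clusters.setdefault(label, []).append(index)

    proportions = []
    for members in clusters.values():
        if len(members) < 2:
            # Assumption: clusters with a single member have no pairs and are skipped.
            continue
        pairs = list(combinations(members, 2))
        agreeing = sum(
            train_based_labels[i] == train_based_labels[j] for i, j in pairs
        )
        proportions.append(agreeing / len(pairs))

    if not proportions:
        raise ValueError("no held-out cluster has at least two members")
    return min(proportions)


# Hypothetical labels for six held-out observations.
test_labels = [0, 0, 0, 1, 1, 1]
train_based_labels = [0, 0, 0, 1, 1, 0]

ps = prediction_strength(test_labels, train_based_labels)
```

Because only pairwise co-membership is compared, the statistic does not depend on how cluster labels are numbered in the two labelings. Values close to 1 indicate that the cluster structure found in the training data reproduces well in the held-out data.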