DNA methylation; machine learning; cancer; disease diagnostic predictive models; algorithms and techniques to speed up the analysis of big medical data; classification.
DNA methylation is a well-studied genetic modification that is crucial for regulating the functioning of the genome. Alterations in DNA methylation play a vital role in tumor generation (tumorigenesis) and tumor suppression. Studying DNA methylation data may therefore help identify biomarkers, that is, molecules or other elements in the body that indicate the presence of cancer. Publicly available DNA methylation data is huge, and given the high number of methylated sites (features) in the genome, technology for efficiently processing such large datasets is essential. Building on big data technologies, we propose an algorithm that applies supervised learning, in the form of classification methods, to datasets with a large number of features. Through iterative deletion of selected features, the algorithm can extract multiple equivalent classification models. The experiments will be run on DNA methylation datasets extracted from The Cancer Genome Atlas, focusing on three types of tumors: breast, kidney, and thyroid carcinomas. Several methylated sites and their associated genes will be extracted and used to perform accurate classification. Thereafter, we will study the performance of our algorithm and compare it with other classifiers and with existing approaches to analyzing this data, namely a widespread DNA methylation analysis method based on network analysis. Finally, we will be able to efficiently compute multiple alternative classification models and extract, from large DNA methylation datasets, a set of candidate genes to be examined further to determine their role in cancer.
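The idea of iterative feature deletion can be illustrated with a minimal sketch. This is not the proposed algorithm itself: it assumes a single-feature decision stump as a stand-in classifier, a tiny invented table of methylation values (rows are samples, columns are methylated sites), and hypothetical names such as `fit_stump`, `iterative_models`, and `min_accuracy`. Each round fits a model, records the feature it relied on, deletes that feature, and refits until accuracy falls below the threshold.

```python
from typing import List, Optional, Sequence, Tuple

# (feature index, threshold, label predicted when value > threshold)
Stump = Tuple[int, float, int]


def fit_stump(X: Sequence[Sequence[float]], y: Sequence[int],
              features: Sequence[int]) -> Tuple[Optional[Stump], float]:
    """Return the one-feature split with the best training accuracy."""
    best, best_acc = None, -1.0
    for f in features:
        values = sorted({row[f] for row in X})
        # Candidate thresholds are midpoints between consecutive values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for label_above in (0, 1):
                correct = sum(
                    (label_above if row[f] > t else 1 - label_above) == label
                    for row, label in zip(X, y)
                )
                acc = correct / len(y)
                if acc > best_acc:
                    best, best_acc = (f, t, label_above), acc
    return best, best_acc


def iterative_models(X: Sequence[Sequence[float]], y: Sequence[int],
                     min_accuracy: float = 0.9,
                     max_models: int = 10) -> List[Tuple[Stump, float]]:
    """Fit a model, delete the feature it used, and repeat.

    Stops when no remaining feature reaches min_accuracy (an assumed
    stopping rule chosen for this illustration).
    """
    remaining = set(range(len(X[0])))
    models: List[Tuple[Stump, float]] = []
    while remaining and len(models) < max_models:
        stump, acc = fit_stump(X, y, sorted(remaining))
        if stump is None or acc < min_accuracy:
            break
        models.append((stump, acc))
        remaining.discard(stump[0])  # iterative deletion of the selected feature
    return models


# Invented toy data: methylation values in [0, 1]; 1 = tumor, 0 = normal.
site_names = ["site_A", "site_B", "site_C", "site_D"]  # hypothetical site labels
X = [
    [0.91, 0.12, 0.85, 0.40],
    [0.88, 0.20, 0.80, 0.55],
    [0.95, 0.15, 0.30, 0.45],
    [0.10, 0.82, 0.25, 0.50],
    [0.15, 0.90, 0.20, 0.42],
    [0.05, 0.78, 0.75, 0.47],
]
y = [1, 1, 1, 0, 0, 0]

models = iterative_models(X, y)
candidate_sites = [site_names[stump[0]] for stump, _ in models]
print(candidate_sites)
```

On this toy table, two sites each separate the classes on their own, so the loop yields two alternative models before the remaining sites fall below the accuracy threshold. In a realistic setting the stump would be replaced by a stronger classifier that selects many features per round, and accuracy would be measured on held-out samples rather than on the training data.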