Identification of novel differentially methylated DNA regions using active learning and imbalanced class learners
Haque, Md Muksitul
Epigenetics refers to changes in gene expression that are caused by mechanisms other than changes in the DNA sequence. Environmental influences can alter epigenetic states in the germ line, and these alterations can be transmitted to future generations. Epigenetic markers can therefore be correlated with exposures and with disease risk. DNA methylation-based biomarkers are very promising, and a large number of potential biomarkers have been identified for diseases such as cancer. Our goal is to identify regions of the genome that are susceptible to differential methylation. Biological datasets come with inherent challenges such as low volume and high dimensionality. Most of the data we are interested in (e.g., positive cases of a disease state) are rare and come with many characteristics, or features. Such computational problems can be approached using machine learning techniques. We address two fundamental challenges: finding the most relevant genomic features for learning, and learning efficiently when the classes are imbalanced. Efficient learning is possible only when the target concepts of both classes, differentially methylated regions (DMRs) and non-DMRs, are learned well enough to distinguish them, using only the relevant features. We propose Generalized Query Based Active Learning (GQAL), which constructs intelligent queries by removing irrelevant features from each query so that an Oracle (e.g., a human expert) can answer it easily. This approach allows the learner to label multiple instances at the same time and makes use of the most relevant features per query, which addresses our first challenge. For the class imbalance problem we use a boosting technique called AdaBoost ("Adaptive Boosting"). This approach creates a learner that learns the target concepts of both classes well, which addresses our second challenge. Currently there are no machine learning approaches applied to epigenetic datasets addressing these problems.
We apply GQAL and TAN+AdaBoost to several datasets and show that our method predicts better on epigenetic datasets than other popular learners. The proposed two-step DMR identification framework will allow the prediction of novel epigenetic biomarkers, which will assist in diagnosing disease susceptibility and provide new therapeutic targets.
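As an illustration of the boosting step mentioned in the abstract, the following is a minimal sketch of standard AdaBoost on an imbalanced toy problem. It is not the thesis implementation. For brevity it assumes one-feature decision stumps as the weak learner instead of the TAN base learner named above, and the data (two rare "DMR" points labelled +1 among eight "non-DMR" points labelled -1) and all function names are hypothetical.

```python
import math

def stump_predict(row, j, thr, polarity):
    """Weak learner: predict `polarity` if feature j >= thr, else -polarity."""
    return polarity if row[j] >= thr else -polarity

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, thr, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost(X, y, rounds):
    """Standard AdaBoost; returns a list of (alpha, feature, threshold, polarity)."""
    n = len(X)
    w = [1.0 / n] * n                      # start from uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, thr, pol))
        # Misclassified examples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, thr, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(row, j, thr, pol) for a, j, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: one feature, positives (+1) are the rare class.
X = [[0], [1], [2], [3], [4], [5], [7], [8], [9], [10]]
y = [-1, -1, -1, -1, 1, 1, -1, -1, -1, -1]

ensemble = adaboost(X, y, rounds=3)
predictions = [predict(ensemble, row) for row in X]
```

On this toy data the first stump simply predicts the majority class and misses both rare positives. The reweighting then increases the weight of those positives, so later stumps focus on them, and the three-stump vote separates the two classes even though no single stump can.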