Classification of Breast Cancer Subtypes by Combining Gene Expression and DNA Methylation Data


Selecting the most promising treatment strategy for breast cancer crucially depends on determining the correct subtype. In recent years, gene expression profiling has been investigated as an alternative to histochemical methods. Since databases like TCGA provide easy and unrestricted access to gene expression data for hundreds of patients, the challenge is to extract a minimal optimal set of genes with good prognostic properties from a large bulk of genes that each make a moderate contribution to classification. Several studies have successfully applied machine learning algorithms to solve this so-called gene selection problem. However, more diverse data from other OMICS technologies are available, including methylation. We hypothesize that combining methylation and gene expression data could lead to a substantially improved classification model, since the resulting model will reflect differences not only on the transcriptomic, but also on the epigenetic level. We compared random forest-derived classification models based on gene expression data alone and on methylation data alone to a model based on the combined features and to a model based on the gold standard PAM50. We obtained bootstrap errors of 10-20% and classification errors of 1-50%, depending on breast cancer subtype and model. The gene expression model was clearly superior to the methylation model. This was also reflected in the combined model, which mainly selected features from the gene expression data. However, the methylation model was able to identify unique features not considered relevant by the gene expression model, which might provide deeper insights into breast cancer subtype differentiation on an epigenetic level.
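Because the reported classification errors vary widely by subtype, it can help to see how a per-subtype error is computed from true and predicted labels. The following is a minimal sketch, not the study's actual evaluation code: the function name `per_class_error` is hypothetical, and the labels are made-up toy data that only borrow the usual PAM50 subtype names for illustration.

```python
from collections import defaultdict


def per_class_error(true_labels, predicted_labels):
    """Return {class: fraction of samples of that class that were misclassified}."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = defaultdict(int)  # number of samples per true class
    wrong = defaultdict(int)   # number of misclassified samples per true class
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth != pred:
            wrong[truth] += 1
    return {cls: wrong[cls] / totals[cls] for cls in totals}


# Toy labels (assumption: invented for illustration, not data from the study).
true_labels = ["LumA", "LumA", "LumA", "LumA", "LumB", "LumB",
               "Basal", "Basal", "Her2", "Her2"]
predicted = ["LumA", "LumA", "LumA", "LumB", "LumB", "LumA",
             "Basal", "Basal", "Her2", "Basal"]

errors = per_class_error(true_labels, predicted)
for subtype, err in sorted(errors.items()):
    print(f"{subtype}: {err:.0%}")
```

In this toy case the error differs by subtype even though the overall error is 30%, which is why a single aggregate figure can hide poorly classified subtypes.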

J. Integr. Bioinform.
Markus List
Head of the Research Group Big Data in Biomedicine