International consortia such as the Genotype-Tissue Expression (GTEx) project, The Cancer Genome Atlas (TCGA) or the International Human Epigenome Consortium (IHEC) have produced a wealth of genomic datasets with the goal of advancing our understanding of cell differentiation and disease mechanisms. However, utilizing all of these data effectively through integrative analysis is hampered by batch effects, large cell type heterogeneity and low replicate numbers. To study whether batch effects across datasets can be observed and adjusted for, we analyze RNA-seq data of 215 samples from ENCODE, Roadmap, BLUEPRINT and DEEP as well as 1336 samples from GTEx and TCGA. While batch effects are a considerable issue, it is non-trivial to determine whether batch adjustment leads to an improvement in data quality, especially in cases of low replicate numbers. We present a novel method for assessing the performance of batch effect adjustment methods on heterogeneous data. Our method borrows information from the Cell Ontology to establish whether batch adjustment leads to a better agreement between observed pairwise similarity and similarity of cell types inferred from the ontology. A comparison of state-of-the-art batch effect adjustment methods suggests that batch effects in heterogeneous datasets with low replicate numbers cannot be adequately adjusted. Better methods need to be developed, which can be assessed objectively in the framework presented here. Our method is available online at https://github.com/SchulzLab/OntologyEval. Supplementary data are available at Bioinformatics online.
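The evaluation idea described above, comparing observed pairwise sample similarity with cell type similarity inferred from an ontology, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy ontology in `PARENTS`, the Jaccard overlap of ancestor sets as the ontology similarity, Pearson correlation as the expression similarity, and the function names. It is not the OntologyEval implementation and does not use real Cell Ontology identifiers or real expression data.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy ontology: term -> parent terms (not real Cell Ontology IDs).
PARENTS = {
    "cell": [],
    "blood cell": ["cell"],
    "T cell": ["blood cell"],
    "B cell": ["blood cell"],
    "hepatocyte": ["cell"],
}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ontology_similarity(a, b):
    """Jaccard overlap of ancestor sets (an assumed similarity measure)."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def agreement(expression, cell_types):
    """Correlate pairwise expression similarity with ontology similarity."""
    pairs = list(combinations(range(len(expression)), 2))
    observed = [pearson(expression[i], expression[j]) for i, j in pairs]
    expected = [ontology_similarity(cell_types[i], cell_types[j]) for i, j in pairs]
    return pearson(observed, expected)


# Made-up expression vectors for three samples (four genes each).
cell_types = ["T cell", "B cell", "hepatocyte"]
# Before adjustment: the T cell and hepatocyte samples look alike (shared batch).
before = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 5]]
# After a hypothetical adjustment: the T cell and B cell samples look alike.
after = [[1, 2, 3, 4], [1, 2, 3, 5], [4, 3, 2, 1]]

print("agreement before:", agreement(before, cell_types))
print("agreement after: ", agreement(after, cell_types))
```

In this toy setting, a higher agreement score after adjustment would indicate that the adjusted data reflect the ontology structure better than the unadjusted data. The choice of similarity measures strongly affects such a score, so the ones used here should be read as placeholders.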