Poster Presentation Australasian RNA Biology and Biotechnology Association 2024 Conference

Evaluating the reproducibility of m6A peak calls across public databases (#172)

Gavin J Sutton 1, Renhua Song 1, Fuyi Li 2,3, Qian Liu 4,5, Justin J-L Wong 1
  1. School of Medical Sciences, Faculty of Medicine and Health, University of Sydney, Sydney, NSW, Australia
  2. College of Information Engineering, Northwest A&F University, Yangling, Shaanxi, China
  3. South Australian immunoGENomics Cancer Institute, The University of Adelaide, Adelaide, South Australia, Australia
  4. Nevada Institute of Personalized Medicine, University of Nevada, Las Vegas, Nevada, USA
  5. School of Life Sciences, University of Nevada, Las Vegas, Nevada, USA

N6-methyladenosine (m6A) is a widely studied messenger RNA modification that has been linked to diverse cellular processes and human diseases. Numerous databases have been developed to reprocess and collate m6A calls across tissues, cell types, and phenotypes, enabling non-expert researchers to mine m6A in their genes of interest. Here, we evaluate the reproducibility and accuracy of nine such databases. Whilst recent work has highlighted low reproducibility across experiments within a cell type, we find that even single experiments are reprocessed across databases to produce highly variable results, including a three-fold difference in the number of m6A peaks called, with <25% of peaks reproduced by more than half of the databases. This variability is driven by the parameter and algorithm choices in each database's processing pipeline. Further, many databases report peaks from less refined m6A-enrichment protocols, which may contribute a higher false positive rate. Ultimately, to ensure that time and resources are allocated to studying real m6A sites, we recommend users confirm that a putative site is reproduced 1) across databases with different processing pipelines, and 2) across studies within each database, including some studies that used refined m6A-enrichment protocols. For the broader bioinformatics community, this study provides clear observational evidence that the same input data will, in the hands of different analysis teams, produce starkly divergent outputs; it suggests that greater efforts should be made to ensure the reproducibility of our analyses.
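The cross-database check recommended above can be sketched as a simple majority vote over interval overlaps. This is a minimal illustration, not the pipeline used in the study: the peak coordinates, the example "databases", the one-base-pair overlap criterion, and the more-than-half threshold are all assumptions chosen for demonstration.

```python
# Hypothetical sketch: for each peak reported by one database, count how many
# databases (including itself) report an overlapping peak, and compute the
# fraction of peaks supported by more than half of the databases.
# Peaks are (chrom, start, end) half-open intervals; all values illustrative.

def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def majority_reproduced(databases):
    """Fraction of the first database's peaks reproduced by a majority
    (> n/2) of all databases, under the simple overlap criterion above."""
    n = len(databases)
    reference = databases[0]
    supported = 0
    for peak in reference:
        votes = sum(any(overlaps(peak, other) for other in db)
                    for db in databases)
        if votes > n / 2:
            supported += 1
    return supported / len(reference)

# Toy example: three "databases" re-processing the same experiment.
db_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
db_b = [("chr1", 120, 210), ("chr2", 60, 140)]
db_c = [("chr1", 900, 950)]

frac = majority_reproduced([db_a, db_b, db_c])  # 2 of db_a's 3 peaks pass
```

In practice one would use an interval-indexed tool (e.g. bedtools intersect) rather than this quadratic scan, and might require reciprocal-fraction overlap rather than a single shared base pair.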