mRNA BS-seq pipeline and NSUN2 | Jianheng Liu (刘健恒)

Schematic of our optimized workflow.

In brief

We optimized the experimental protocol to improve the conversion rates in mRNA BS-seq.
We analyzed the pattern of artifacts in BS-seq and located the source of false positives.
We introduced the Gini index and the signal-to-noise ratio into the analysis to estimate the conversion status of a library.
We introduced C-cutoff and gene-specific conversion rates into our filter to remove false positives.
We set up a new pipeline with high flexibility, robustness and parallelization level for mRNA BS-seq.
We validated that NSUN2 is one of the major m5C methyltransferases acting on mRNA.
We profiled mRNA m5C landscapes in multiple human and mouse tissues.
mRNA m5C might relate to translation control.

Where does the noise come from?

It's clear now that the noise in BS-seq comes from several sources:

Cs artificially introduced by the hexamers.
Regions that are hard to convert, especially some highly structured rRNA regions.
False alignments and bases with poor sequencing quality.

Solutions

The pipeline I designed takes most of these situations into account:

Hexamer-derived bases can be removed by trimming the bases at the end of reads.
We should use gene-specific conversion rates rather than a global conversion rate measured from spike-ins or the whole library. Only annotated regions are analyzed.
Estimate the conversion status with the Gini index and the signal-to-noise ratio, and determine the C-cutoff, which removes reads that contain more than a certain number of unconverted Cs.
Optimize the alignment step to reduce false alignments. Normally, this should not be a problem.
SNPs can only destroy m5C signals; they cannot create m5C in BS-seq.
Statistical tests have little effect on our analysis if we filter on high coverage and methylation level.
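The Gini index and the C-cutoff mentioned above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the function names, the default cutoff of 3, and the choice to compute the Gini index over per-read counts of unconverted Cs are all assumptions made for the example. It also assumes reads are already oriented to the original RNA strand, so that any remaining C is an unconverted C.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini index of non-negative values.

    0 means the values are perfectly even; values close to 1 mean
    the total is concentrated in a few items.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on sorted data (i is 1-based):
    # G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i)
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def count_unconverted_c(read_seq: str) -> int:
    """Number of Cs left in a read (assumed to be unconverted Cs)."""
    return read_seq.upper().count("C")


def passes_c_cutoff(read_seq: str, c_cutoff: int = 3) -> bool:
    """Keep a read only if it has at most `c_cutoff` unconverted Cs.

    The default of 3 is a hypothetical value; in practice the cutoff
    is chosen per library from its conversion status.
    """
    return count_unconverted_c(read_seq) <= c_cutoff


if __name__ == "__main__":
    reads = ["TTTTCTTTGA", "CCCCTTCCGA", "TTTTTTTTGA", "TTGTTATTGA"]
    per_read_c = [count_unconverted_c(r) for r in reads]
    print("unconverted Cs per read:", per_read_c)
    print("Gini index:", gini(per_read_c))
    print("reads kept:", [r for r in reads if passes_c_cutoff(r)])
```

In this toy input, one read carries almost all of the remaining Cs, so the Gini index is high and the C-cutoff removes exactly that read. That is the pattern expected when non-conversion clusters in a few badly converted reads rather than spreading evenly.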

In my hands, sites with as little as 5% methylation could probably be accepted, but we normally use a 10% cutoff so that a called site is more likely to be biologically important. BS-seq is very sensitive and reliable if you work with it correctly.
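As a rough illustration of how a gene-specific conversion rate and a level cutoff could be combined into a site filter, here is a minimal Python sketch. The 10% level cutoff comes from the text above; everything else (the function names, the minimum coverage of 20, the binomial test, and `alpha`) is assumed for the example and is not taken from the pipeline itself.

```python
from math import comb
from typing import Iterable, Tuple


def methylation_level(c_count: int, t_count: int) -> float:
    """Fraction of reads showing C at a site: C / (C + T)."""
    cov = c_count + t_count
    return c_count / cov if cov else 0.0


def gene_nonconversion_rate(sites: Iterable[Tuple[int, int]]) -> float:
    """Pooled non-conversion rate over all C positions of one gene.

    `sites` holds (c_count, t_count) pairs. True m5C sites are pooled in
    as well, so this slightly overestimates the background rate.
    """
    sites = list(sites)
    c_total = sum(c for c, _ in sites)
    cov_total = sum(c + t for c, t in sites)
    return c_total / cov_total if cov_total else 0.0


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def keep_site(c_count: int, t_count: int, gene_rate: float,
              min_cov: int = 20, min_level: float = 0.10,
              alpha: float = 0.05) -> bool:
    """Keep a site that passes coverage, level, and background checks.

    min_cov and alpha are hypothetical defaults for this sketch.
    """
    cov = c_count + t_count
    if cov < min_cov:
        return False
    if methylation_level(c_count, t_count) < min_level:
        return False
    # Is the C count higher than expected from the gene's own background?
    return binomial_tail(c_count, cov, gene_rate) < alpha


if __name__ == "__main__":
    gene_sites = [(1, 99), (0, 100), (2, 98)]
    rate = gene_nonconversion_rate(gene_sites)
    print("gene-specific non-conversion rate:", rate)
    print("keep 10 C / 40 T:", keep_site(10, 40, rate))
    print("keep 2 C / 48 T:", keep_site(2, 48, rate))
```

Because the background rate is estimated per gene, a poorly converted transcript raises its own threshold without affecting sites in well-converted genes, which a single global rate cannot do.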

Left: why gene-specific conversion rates are important. Middle: Gini index vs. signal rates in badly converted and well-converted samples, with some examples of true positives and false positives. Right: our filters work very well.

References

2019

NSMB

Genome-wide identification of mRNA 5-methylcytosine in mammals

Tao Huang, Wanying Chen, Jianheng Liu, and 2 more authors

Nature Structural & Molecular Biology, 2019


Accurate and systematic transcriptome-wide detection of 5-methylcytosine (m5C) has proved challenging, and there are conflicting views about the prevalence of this modification in mRNAs. Here we report an experimental and computational framework that robustly identified mRNA m5C sites and determined sequence motifs and structural features associated with the modification using a set of high-confidence sites. We developed a quantitative atlas of RNA m5C sites in human and mouse tissues based on our framework. In a given tissue, we typically identified several hundred exonic m5C sites. About 62-70% of the sites had low methylation levels (<20% methylation), while 8-10% of the sites were moderately or highly methylated (>40% methylation). Cross-species analysis revealed that species, rather than tissue type, was the primary determinant of methylation levels, indicating strong cis-directed regulation of RNA methylation. Combined, these data provide a valuable resource for identifying the regulation and functions of RNA methylation.
