mRNA BS-seq pipeline and NSUN2
We still don't know the function and mechanism behind it
data:image/s3,"s3://crabby-images/7468a/7468a1baeaa4f9fdba30221524dfa0427b1f6fc4" alt=""
The schema of our optimized workflow.
In brief
- We optimized the experimental protocol to improve the conversion rates in mRNA BS-seq.
- We analyzed the pattern of artifacts in BS-seq and located the source of false positives.
- We introduced the Gini index and signal-to-noise ratio into the analysis to estimate the conversion status of a library.
- We introduced a C-cutoff and gene-specific conversion rates into our filter to remove false positives.
- We set up a new pipeline for mRNA BS-seq with high flexibility, robustness and a high level of parallelization.
- We validated that NSUN2 is one of the major m5C methyltransferases acting on mRNA.
- We profiled mRNA m5C landscapes in multiple human and mouse tissues.
- mRNA m5C might be related to translational control.
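As a rough illustration of the Gini index mentioned in the list, here is a minimal sketch of the standard Gini coefficient. What exactly it is computed over is not stated here, so the input (a list of non-negative counts, for example unconverted-C counts per position) and the function name `gini` are assumptions for illustration, not the pipeline's actual code.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means the values are perfectly even; values close to 1 mean
    almost everything is concentrated in a few entries.
    The choice of input (e.g. per-position unconverted-C counts)
    is an assumption, not taken from the original pipeline.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted data, with ranks i = 1..n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    print(gini([1, 1, 1, 1]))   # evenly spread counts
    print(gini([0, 0, 0, 10]))  # counts concentrated in one position
```

Evenly spread counts give a value near 0, while counts piled up in a few positions push the value towards 1.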
Where does the noise come from?
It’s clear now that the noise in BS-seq comes from many places:
- The Cs introduced artificially by the hexamer primers.
- Regions that are hard to convert, especially some highly structured rRNA regions.
- False alignments and bases with poor sequencing quality.
Solutions
The pipeline I designed takes most of these situations into account:
- Hexamer-derived bases can be removed by trimming the ends of reads.
- We should use gene-specific conversion rates rather than a global conversion rate measured from spike-ins or the whole library. Only analyze annotated regions.
- Estimate the conversion status with the Gini index and signal-to-noise ratio, and determine the C-cutoff, which removes reads that contain more than a certain number of Cs.
- Optimize the alignment step to reduce false alignments. Normally, that should not be a problem.
- SNPs can only destroy m5C signals; they cannot create m5C in BS-seq.
- Statistical tests have little effect on our analysis if we filter on high coverage and methylation level.
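The C-cutoff and gene-specific conversion rates from the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline's code: reads are assumed to be oriented so that any remaining `C` is an unconverted cytosine, the default cutoff of 3 is an arbitrary placeholder, and the names `passes_c_cutoff` and `gene_conversion_rates` are hypothetical.

```python
from collections import defaultdict


def passes_c_cutoff(read_seq, c_cutoff=3):
    """Keep a read only if it carries at most `c_cutoff` Cs.

    Assumes the read is oriented so that a remaining C is an
    unconverted cytosine; the default cutoff is a placeholder.
    """
    return read_seq.upper().count("C") <= c_cutoff


def gene_conversion_rates(records):
    """Per-gene conversion rate from per-read counts.

    records: iterable of (gene_id, converted_c, unconverted_c).
    Returns {gene_id: converted / (converted + unconverted)}.
    """
    converted = defaultdict(int)
    total = defaultdict(int)
    for gene, conv, unconv in records:
        converted[gene] += conv
        total[gene] += conv + unconv
    # Skip genes without any cytosine observations to avoid division by zero.
    return {g: converted[g] / total[g] for g in total if total[g] > 0}


if __name__ == "__main__":
    reads = ["TTTCTTTGGA", "CCCCTCGA"]
    print([r for r in reads if passes_c_cutoff(r)])
    print(gene_conversion_rates([("geneA", 9, 1), ("geneA", 9, 1), ("geneB", 5, 5)]))
```

Computing the rate per gene rather than once per library means a poorly converted transcript does not hide behind a good global average.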
In my hands, sites with as little as 5% methylation can probably be accepted, but we normally use a 10% cutoff so that the sites we keep are biologically meaningful. BS-seq is very sensitive and reliable if you work with it correctly.
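A site-level filter along these lines might look like the sketch below. The 10% level cutoff comes from the text; the minimum coverage of 20 reads, the input layout and the name `call_sites` are assumptions made for the example.

```python
def call_sites(sites, min_coverage=20, min_level=0.10):
    """Keep candidate m5C sites that pass coverage and level cutoffs.

    sites: iterable of (site_id, c_count, t_count), where c_count is the
    number of reads with an unconverted C at the site and t_count the
    number with a converted base (read as T).
    min_coverage is a hypothetical default; min_level is the 10% cutoff.
    """
    kept = []
    for site_id, c_count, t_count in sites:
        coverage = c_count + t_count
        if coverage < min_coverage:
            continue  # too few reads to trust the estimated level
        level = c_count / coverage
        if level >= min_level:
            kept.append((site_id, level))
    return kept


if __name__ == "__main__":
    candidates = [("s1", 10, 40), ("s2", 2, 48), ("s3", 3, 2)]
    print(call_sites(candidates))
```

Lowering `min_level` to 0.05 would admit the 5% sites mentioned above.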
data:image/s3,"s3://crabby-images/2d734/2d7345cfafd4383773c80acecfc97c462d174674" alt=""
data:image/s3,"s3://crabby-images/8fa2a/8fa2a3eaf4aa032c29bebf1296bcb014a9aa89db" alt=""
data:image/s3,"s3://crabby-images/70945/70945ada6687c2e66730df4bbd38cb37930d35e9" alt=""
data:image/s3,"s3://crabby-images/0ddec/0ddecdd20c519212ca583e26810c8f819f8a339f" alt=""
Left: why gene-specific conversion rates are important. Middle: Gini index vs. signal rates in badly converted and well converted samples, with some examples of true positives and false positives. Right: our filters work perfectly!