Aligning bisulfite-converted DNA reads to a genome

Bisulfite is used to detect methylated cytosines. It converts unmethylated Cs to Ts, but it leaves methylated Cs intact. If we then sequence the DNA and align it to a reference genome, we can infer cytosine methylation.

To align the DNA accurately, we should take the C->T conversion into account. Here is how to do it with LAST.

Let's assume we have bisulfite-converted DNA reads in a file called "reads.fastq" (in fastq-sanger format), and the genome is in "mygenome.fa" (in fasta format). We will also assume that all the reads are from the converted strand, and not its reverse-complement (i.e. they have C->T conversions and not G->A conversions).

First, we need to run lastdb twice, for forward-strand and reverse-strand alignments:

lastdb -uBISF my_f mygenome.fa
lastdb -uBISR my_r mygenome.fa

Then, there are several steps:

  1. Convert all Cs in the reads to Ts. (Debatable: this slightly degrades LAST's ability to align the reads, but it avoids a bias, due to unconverted DNA being easier to align than converted DNA.)
  2. Align the reads, one strand at a time.
  3. Merge the alignments, and find a unique best alignment for each part of each read.
  4. Undo the conversion from step 1.

The last-bisulfite script (in the examples directory) carries out steps 1-4. Its usage is:

last-bisulfite.sh my_f my_r reads.fastq > results.maf

You can parallelize it like this:

parallel-fastq "last-bisulfite.sh my_f my_r" < reads.fastq > results.maf

Tips

Paired-end DNA reads

You can align paired-end reads by combining the preceding recipe with the one in last-pair-probs.html. This gets a bit complicated, so we provide a last-bisulfite-paired script in the examples directory. Typical usage:

lastdb -uBISF my_f mygenome.fa
lastdb -uBISR my_r mygenome.fa

last-bisulfite-paired.sh my_f my_r reads1.fastq reads2.fastq > results.maf

This assumes that reads1.fastq are all from the converted strand (i.e. they have C->T conversions) and reads2.fastq are all from the reverse-complement (i.e. they have G->A conversions). It requires GNU parallel to be installed. You are encouraged to customize this script.