Aligning bisulfite-converted DNA reads to a genome ================================================== Bisulfite is used to detect methylated cytosines. It converts unmethylated Cs to Ts, but it leaves methylated Cs intact. If we then sequence the DNA and align it to a reference genome, we can infer cytosine methylation. To align the DNA accurately, we should take the C->T conversion into account. Here is how to do it with LAST. Let's assume we have bisulfite-converted DNA reads in a file called "reads.fastq" (in fastq-sanger format), and the genome is in "mygenome.fa" (in fasta format). We will also assume that all the reads are from the converted strand, and not its reverse-complement (i.e. they have C->T conversions and not G->A conversions). First, we need to run lastdb twice, for forward-strand and reverse-strand alignments:: lastdb -uBISF my_f mygenome.fa lastdb -uBISR my_r mygenome.fa Then, there are several steps: 1. Convert all Cs in the reads to Ts. (Debatable: this slightly degrades LAST's ability to align the reads, but it avoids a bias, due to unconverted DNA being easier to align than converted DNA.) 2. Align the reads, one strand at a time. 3. Merge the alignments, and find a unique best alignment for each part of each read. 4. Undo the conversion from step 1. The last-bisulfite script (in the examples directory) carries out steps 1-4. Its usage is:: last-bisulfite.sh my_f my_r reads.fastq > results.maf You can parallelize it like this:: parallel-fastq "last-bisulfite.sh my_f my_r" < reads.fastq > results.maf Tips ---- * To go faster, try gapless alignment (add -j1 to the lastal options). Often, this is only minusculely less accurate than gapped alignment. * The .tis files created by lastdb (e.g. my_f.tis, my_r.tis) are identical. So you can shave a few GB by linking them:: ln -f my_f.tis my_r.tis Paired-end DNA reads -------------------- You can align paired-end reads by combining the preceding recipe with the one in ``_. This gets a bit complicated, so we provide a last-bisulfite-paired script in the examples directory. Typical usage:: lastdb -uBISF my_f mygenome.fa lastdb -uBISR my_r mygenome.fa last-bisulfite-paired.sh my_f my_r reads1.fastq reads2.fastq > results.maf This assumes that reads1.fastq are all from the converted strand (i.e. they have C->T conversions) and reads2.fastq are all from the reverse-complement (i.e. they have G->A conversions). It requires GNU parallel to be installed. You are encouraged to customize this script.