This script reads alignments of paired DNA reads to a genome, and:
The script takes as input two files of alignments, where one read from each pair is in one file, and the other read is in the other file.
Typical usage:
lastdb -m1111110 humandb human/chr*.fa lastal -Q1 -e120 -i1 humandb reads1.fastq > temp1.maf lastal -Q1 -e120 -i1 humandb reads2.fastq > temp2.maf last-pair-probs temp1.maf temp2.maf > results.maf
If your reads come from potentially-spliced RNA molecules, use the -r option:
last-pair-probs -r temp1.maf temp2.maf > results.maf
Without -r, it assumes the distances between paired reads follow a normal distribution. With -r, it assumes the distances follow a skewed (log-normal) distribution, which is much more appropriate for spliced RNA.
-h, --help Print a help message and exit. -r, --rna Specifies that the fragments are from potentially-spliced RNA. -e, --estdist Just estimate the distribution of distances. -m M, --mismap=M Don't write alignments with mismap probability > M. -f BP, --fraglen=BP The mean distance in bp. (With -r, the mean of ln[distance].) If this is not specified, the script will estimate it from the alignments. -s BP, --sdev=BP The standard deviation of distance in bp. (With -r, the standard deviation of ln[distance].) If this is not specified, the script will estimate it from the alignments. -d PROB, --disjoint=PROB The prior probability that a pair of reads comes from disjoint locations (e.g., different chromosomes). This may arise from real differences between the genome and the source of the reads, or from errors in obtaining the reads or the genome sequence. -c CHROM, --circular=CHROM Specifies that the chromosome named CHROM is circular. You can use this option more than once (e.g., -c chrM -c chrP). As a special case, "." means all chromosomes are circular. If this option is not used, "chrM" is assumed to be circular (but if it is used, only the specified CHROMs are assumed to be circular.)
For greater speed and convenience, first estimate the distance distribution from a sample of your data (using -e), then analyze all your data with this estimate (using -f and -s).
You can avoid temp files by using named pipes:
mkfifo pipe1 pipe2 lastal -Q1 -e120 -i1 humandb reads1.fastq > pipe1 & lastal -Q1 -e120 -i1 humandb reads2.fastq > pipe2 & last-pair-probs -f250 -s30 pipe1 pipe2 > results.maf
This streams the alignments from lastal to last-pair-probs.
To go faster, try a fast version of Python such as PyPy.
To go really fast, try gapless alignment (add -j1 to the lastal options). Often, this is only minusculely less accurate than gapped alignment.
Tabular output (lastal option -f0) is smaller and faster.
With large datasets, it's important to go faster by using multiple CPUs. One way to do that is by using GNU parallel (http://www.gnu.org/software/parallel/), as follows.
Split reads1.fastq into multiple files, with (say) 100000 reads per file:
split -l400000 -a5 reads1.fastq 1
This assumes that a new fastq record begins every 4th line, i.e. no line wrapping. The created files will be called 1aaaaa, 1aaaab, etc.
Split reads2.fastq:
split -l400000 -a5 reads2.fastq 2
Make a file called (say) last-pair.sh, with the following content (or similar - you might want to use -r, -j1, etc):
#! /bin/sh lastal -Q1 -e120 -i1 "$3" "$4" > /tmp/$$.1 lastal -Q1 -e120 -i1 "$3" "$5" > /tmp/$$.2 last-pair-probs -f "$1" -s "$2" /tmp/$$.1 /tmp/$$.2 rm /tmp/$$.*
Set execute permission:
chmod +x last-pair.sh
Run it in parallel on all your CPU cores:
parallel --xapply ./last-pair.sh 250 30 humandb ::: 1* ::: 2* > results.maf
Here we have specified a mean distance of 250 and a standard deviation of 30.
For more information, please see this article:
An approximate Bayesian approach for mapping paired-end DNA reads to a reference genome. Shrestha AM, Frith MC. Bioinformatics 2013 29(8):965-972.