last-pair-probs

This script reads alignments of paired DNA reads to a genome, and:

estimates the distribution of distances between paired reads,
estimates the probability that each alignment represents the genomic source of the read.

The script takes as input two files of alignments, where one read from each pair is in one file, and the other read is in the other file.

Typical usage:

lastdb -m1111110 humandb human/chr*.fa
lastal -Q1 -e120 -i1 humandb reads1.fastq > temp1.maf
lastal -Q1 -e120 -i1 humandb reads2.fastq > temp2.maf
last-pair-probs temp1.maf temp2.maf > results.maf

If your reads come from potentially-spliced RNA molecules, use the -r option:

last-pair-probs -r temp1.maf temp2.maf > results.maf

Without -r, it assumes the distances between paired reads follow a normal distribution. With -r, it assumes the distances follow a skewed (log-normal) distribution, which is much more appropriate for spliced RNA.

Details

The script writes the alignments with "mismap" probabilities, i.e. the probability that the alignment does not represent the genomic source of the read. By default, it discards alignments with mismap probability > 0.01.
It assumes that each pair of reads comes from opposite strands of a DNA molecule. The "distance" between them means the distance between their 5' ends. Positive distance indicates tail-to-tail orientation (with the 5' end being the head and the 3' end being the tail). Negative distance indicates head-to-head orientation. Negative distances are not considered when -r is used, nor for circular chromosomes.
The script reads one batch of alignments at a time from each file (by looking for lines starting with "# batch"). It assumes that each pair of batches has the alignments for one pair of reads. To achieve that, the read pairs should be in the same order in the fastq files, and the lastal -i1 option ensures there is one query per batch.
The input files may be in either format produced by lastal (maf or tabular). They must include header lines (of the kind produced by lastal) describing the alignment parameters.
By default, the script makes two passes over each file. So they must be real files (not pipes). However, if you use option -e, or both -f and -s, it makes just one pass over each file.
If a read name ends in neither "/1" nor "/2", the script appends "/1" if it comes from the first file or "/2" if it comes from the second.

Options

-h, --help Print a help message and exit.

-r, --rna Specifies that the fragments are from potentially-spliced RNA.

-e, --estdist Just estimate the distribution of distances.

-m M, --mismap=M

Don't write alignments with mismap probability > M.

-f BP, --fraglen=BP

The mean distance in bp. (With -r, the mean of ln[distance].) If this is not specified, the script will estimate it from the alignments.

-s BP, --sdev=BP

The standard deviation of distance in bp. (With -r, the standard deviation of ln[distance].) If this is not specified, the script will estimate it from the alignments.

-d PROB, --disjoint=PROB

The prior probability that a pair of reads comes from disjoint locations (e.g., different chromosomes). This may arise from real differences between the genome and the source of the reads, or from errors in obtaining the reads or the genome sequence.

-c CHROM, --circular=CHROM

Specifies that the chromosome named CHROM is circular. You can use this option more than once (e.g., -c chrM -c chrP). As a special case, "." means all chromosomes are circular. If this option is not used, "chrM" is assumed to be circular (but if it is used, only the specified CHROMs are assumed to be circular.)

Tips

For greater speed and convenience, first estimate the distance distribution from a sample of your data (using -e), then analyze all your data with this estimate (using -f and -s).

You can avoid temp files by using named pipes:

mkfifo pipe1 pipe2
lastal -Q1 -e120 -i1 humandb reads1.fastq > pipe1 &
lastal -Q1 -e120 -i1 humandb reads2.fastq > pipe2 &
last-pair-probs -f250 -s30 pipe1 pipe2 > results.maf

This streams the alignments from lastal to last-pair-probs.

To go faster, try a fast version of Python such as PyPy.
To go really fast, try gapless alignment (add -j1 to the lastal options). Often, this is only minusculely less accurate than gapped alignment.
Tabular output (lastal option -f0) is smaller and faster.

Using multiple CPUs

With large datasets, it's important to go faster by using multiple CPUs. One way to do that is by using GNU parallel (http://www.gnu.org/software/parallel/), as follows.

Split reads1.fastq into multiple files, with (say) 100000 reads per file:
```
split -l400000 -a5 reads1.fastq 1
```
This assumes that a new fastq record begins every 4th line, i.e. no line wrapping. The created files will be called 1aaaaa, 1aaaab, etc.
Split reads2.fastq:
```
split -l400000 -a5 reads2.fastq 2
```

Make a file called (say) last-pair.sh, with the following content (or similar - you might want to use -r, -j1, etc):

#! /bin/sh
lastal -Q1 -e120 -i1 "$3" "$4" > /tmp/$$.1
lastal -Q1 -e120 -i1 "$3" "$5" > /tmp/$$.2
last-pair-probs -f "$1" -s "$2" /tmp/$$.1 /tmp/$$.2
rm /tmp/$$.*

Set execute permission:
```
chmod +x last-pair.sh
```
Run it in parallel on all your CPU cores:
```
parallel --xapply ./last-pair.sh 250 30 humandb ::: 1* ::: 2* > results.maf
```
Here we have specified a mean distance of 250 and a standard deviation of 30.

Reference

For more information, please see this article:

An approximate Bayesian approach for mapping paired-end DNA reads to a reference genome. Shrestha AM, Frith MC. Bioinformatics 2013 29(8):965-972.

`-h, --help`	Print a help message and exit.
`-r, --rna`	Specifies that the fragments are from potentially-spliced RNA.
`-e, --estdist`	Just estimate the distribution of distances.
`-m M, --mismap=M`
	Don't write alignments with mismap probability > M.
`-f BP, --fraglen=BP`
	The mean distance in bp. (With -r, the mean of ln[distance].) If this is not specified, the script will estimate it from the alignments.
`-s BP, --sdev=BP`
	The standard deviation of distance in bp. (With -r, the standard deviation of ln[distance].) If this is not specified, the script will estimate it from the alignments.
`-d PROB, --disjoint=PROB`
	The prior probability that a pair of reads comes from disjoint locations (e.g., different chromosomes). This may arise from real differences between the genome and the source of the reads, or from errors in obtaining the reads or the genome sequence.
`-c CHROM, --circular=CHROM`
	Specifies that the chromosome named CHROM is circular. You can use this option more than once (e.g., -c chrM -c chrP). As a special case, "." means all chromosomes are circular. If this option is not used, "chrM" is assumed to be circular (but if it is used, only the specified CHROMs are assumed to be circular.)