last-pair-probs

This script reads alignments of paired DNA reads to a genome, and:

  1. estimates the distribution of distances between paired reads,
  2. estimates the probability that each alignment represents the genomic source of the read.

The script takes as input two files of alignments, where one read from each pair is in one file, and the other read is in the other file.

Typical usage:

lastdb -m1111110 humandb human/chr*.fa
lastal -Q1 -e120 -i1 humandb reads1.fastq > temp1.maf
lastal -Q1 -e120 -i1 humandb reads2.fastq > temp2.maf
last-pair-probs temp1.maf temp2.maf > results.maf

If your reads come from potentially-spliced RNA molecules, use the -r option:

last-pair-probs -r temp1.maf temp2.maf > results.maf

Without -r, it assumes the distances between paired reads follow a normal distribution. With -r, it assumes the distances follow a skewed (log-normal) distribution, which is much more appropriate for spliced RNA.

Details

Options

-h, --help Print a help message and exit.
-r, --rna Specifies that the fragments are from potentially-spliced RNA.
-e, --estdist Just estimate the distribution of distances.
-m M, --mismap=M
 Don't write alignments with mismap probability > M.
-f BP, --fraglen=BP
 The mean distance in bp. (With -r, the mean of ln[distance].) If this is not specified, the script will estimate it from the alignments.
-s BP, --sdev=BP
 The standard deviation of distance in bp. (With -r, the standard deviation of ln[distance].) If this is not specified, the script will estimate it from the alignments.
-d PROB, --disjoint=PROB
 The prior probability that a pair of reads comes from disjoint locations (e.g., different chromosomes). This may arise from real differences between the genome and the source of the reads, or from errors in obtaining the reads or the genome sequence.
-c CHROM, --circular=CHROM
 Specifies that the chromosome named CHROM is circular. You can use this option more than once (e.g., -c chrM -c chrP). As a special case, "." means all chromosomes are circular. If this option is not used, "chrM" is assumed to be circular (but if it is used, only the specified CHROMs are assumed to be circular.)

Tips

Using multiple CPUs

With large datasets, it's important to go faster by using multiple CPUs. One way to do that is by using GNU parallel (http://www.gnu.org/software/parallel/), as follows.

Reference

For more information, please see this article:

An approximate Bayesian approach for mapping paired-end DNA reads to a reference genome. Shrestha AM, Frith MC. Bioinformatics 2013 29(8):965-972.