LAST FAQ

Q: Does it matter which sequence is used as the "reference" (given to lastdb) and which is used as the "query" (given to lastal)?

A: It may do. In short, LAST tries hard to find alignments for every position in the query. When mapping reads to a genome, you probably want the genome to be the reference, and the reads to be the query. That way, for each read, it will search for several most-similar locations in the genome. The other way, for each location in the genome, it will search for several most-similar reads. As another example, if you compare a genome to a library of repeat sequences, you probably want the genome to be the query and the repeat library to be the reference.

Q:	Does it matter which sequence is used as the "reference" (given to lastdb) and which is used as the "query" (given to lastal)?
A:	It may do. In short, LAST tries hard to find alignments for every position in the query. When mapping reads to a genome, you probably want the genome to be the reference, and the reads to be the query. That way, for each read, it will search for several most-similar locations in the genome. The other way, for each location in the genome, it will search for several most-similar reads. As another example, if you compare a genome to a library of repeat sequences, you probably want the genome to be the query and the repeat library to be the reference.

Q:	How can I get the percent-identity of each alignment?
A:	Use maf-convert to convert them to blast or psl format.

Q:	Why is LAST so slow at reading/writing files?
A:	Probably because the files are huge and your disk is slow (or lots of people are using it). Try to use a reasonably fast disk. Typically, after reading a large file once, subsequent reads of the same file are much faster, because the operating system caches it. lastal reads files into shared memory using "mmap", and this occasionally seems to have trouble, notably on "advanced" file systems (Lustre and GlusterFS). On one Linux system, the above-mentioned cache occasionally got into some kind of bad state, making lastal very slow. This was solved by running the following command (as root): echo 1 > /proc/sys/vm/drop_caches

Why is LAST so slow at reading/writing files?

Probably because the files are huge and your disk is slow (or lots of people are using it). Try to use a reasonably fast disk.

Typically, after reading a large file once, subsequent reads of the same file are much faster, because the operating system caches it.

lastal reads files into shared memory using "mmap", and this occasionally seems to have trouble, notably on "advanced" file systems (Lustre and GlusterFS).

On one Linux system, the above-mentioned cache occasionally got into some kind of bad state, making lastal very slow. This was solved by running the following command (as root):

echo 1 > /proc/sys/vm/drop_caches

Q:	I'd like to compare my queries to a database of known proteins. Where can I get a database of known proteins?
A:	You could try UniRef90 or UniRef50 (http://www.uniprot.org/help/uniref), which have reduced redundancy.

Q:	How does LAST get the sequence names? How can I get nice, short, unique names?
A:	The first whitespace-delimited word in the sequence header line is used as the name. You can arbitrarily customise the names using standard Unix tools. For example, this will replace each FASTA name with a unique serial number: awk '/>/ {$0 = ">" ++n} {print}' queries.fasta \| lastal myDb - This will do the same for FASTQ (assuming 4 lines per record, i.e. no line wrapping): awk 'NR % 4 == 1 {$0 = "@" ++n} {print}' queries.fastq \| lastal myDb - Sometimes you can make LAST's output significantly smaller by shortening the names.

How does LAST get the sequence names? How can I get nice, short, unique names?

The first whitespace-delimited word in the sequence header line is used as the name. You can arbitrarily customise the names using standard Unix tools. For example, this will replace each FASTA name with a unique serial number:

awk '/>/ {$0 = ">" ++n} {print}' queries.fasta | lastal myDb -

This will do the same for FASTQ (assuming 4 lines per record, i.e. no line wrapping):

awk 'NR % 4 == 1 {$0 = "@" ++n} {print}' queries.fastq | lastal myDb -

Sometimes you can make LAST's output significantly smaller by shortening the names.

Q:	How can I find alignments with > 95% identity?
A:	One way is to use a scoring scheme like this: +5 for a match, and -95 for a mismatch or a gap. You'll also need to set the alignment score threshold to a reasonable value. In this example we set it to 150, which means that we require at least 30 matches: lastal -r5 -q95 -a0 -b95 -e150

How can I find alignments with > 95% identity?

One way is to use a scoring scheme like this: +5 for a match, and -95 for a mismatch or a gap. You'll also need to set the alignment score threshold to a reasonable value. In this example we set it to 150, which means that we require at least 30 matches:

lastal -r5 -q95 -a0 -b95 -e150