Content:
1. Typical usage
2. Retrieving sequence ranges or deflines
3. Data compression option
4. Development notes
cdbfasta <fasta_file>
The FASTA file can be specified with its full path if it is not in the current directory, e.g.
cdbfasta /usr/local/db/GUDB.human
By default, cdbfasta creates an index file with the same path and name as the database file, with the .cidx suffix appended. In the example above, the file GUDB.human.cidx is created in /usr/local/db/. By default, the key for a FASTA record is the first space-delimited token following the ">" character that starts the definition line. For example, if a FASTA record has a defline like this:
>AA141526
...then we can use the string 'AA141526' with cdbyank to retrieve the full FASTA record associated with that sequence name:
cdbyank -a 'AA141526' /usr/local/db/GUDB.human.cidx
Sometimes all the space-delimited tokens in the defline need to be declared as keys in the index file, all pointing to the same FASTA record. cdbfasta does this when given the "-m" switch.
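As a rough illustration of the key rules described above (this is not cdbfasta's actual code), the following shell sketch extracts the default key and, separately, every space-delimited token of a defline. The function names defline_key and defline_all_keys are made up for this example.

```sh
# Hypothetical helpers illustrating the key rules; not part of cdbfasta.

# Default key: the first space-delimited token after the leading ">".
defline_key() {
    line=${1#>}                     # drop the leading ">"
    printf '%s\n' "${line%% *}"     # keep everything before the first space
}

# All space-delimited tokens, one per line (the candidates that the
# text says become keys when -m is used).
defline_all_keys() {
    line=${1#>}
    printf '%s\n' "$line" | tr -s ' ' '\n'
}
```

For instance, defline_key '>AA141526 clone 7' prints AA141526, while defline_all_keys prints the three tokens on separate lines.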
For long and complex FASTA accessions (for example: EGAD|61|GP|186739|gb|AAA63210.1||M60828), the index file can be created so that cdbyank does not need the full string to retrieve such a sequence. The first "<db>|<accession>" pair (i.e. the substring ending at the second '|' character, EGAD|61 in the example above) is then enough. To enable this feature, cdbfasta provides two alternative options, -c and -C.
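To make the shortcut rule concrete, here is a small shell sketch that derives the "<db>|<accession>" pair from a long accession. It only illustrates which part of the string forms the shortcut; the function name shortcut_key is hypothetical and this is not how cdbfasta is implemented.

```sh
# Hypothetical helper: keep only the first two '|'-separated fields,
# i.e. the text before the second '|' character.
shortcut_key() {
    printf '%s\n' "$1" | cut -d '|' -f 1,2
}
```

With the accession from the text, shortcut_key 'EGAD|61|GP|186739|gb|AAA63210.1||M60828' prints EGAD|61.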
cdbyank -a 'human|Z98492' /usr/local/db/GUDB.human.cidx
If the -a option is not provided, a list of accessions is expected at stdin, e.g.:
cat seq_list | cdbyank /usr/local/db/GUDB.human.cidx
The output is then a series of FASTA records at stdout; redirecting it to a file yields a multi-FASTA file. By default, cdbyank locates the database file by stripping the '.cidx' suffix from the index filename. This can be overridden with the -d option, which tells cdbyank to use a user-provided database file with the given index file. In the example above, if the index file "GUDB.human.cidx" is moved to another directory, a cdbyank command can be issued from that directory like this:
cdbyank -a 'human|Z98492' -d /usr/local/db/GUDB.human GUDB.human.cidx
The index file may appear at any position in cdbyank's argument list. With -a, the exit status returned by cdbyank to the shell is 0 on success and 1 if the given key was not found.
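The two behaviours just described (deriving the database path from the index name, and the exit status of -a) could be used from a script roughly as follows. This is a sketch under the assumptions stated in the text; db_for_index and fetch_record are hypothetical helper names, and only the cdbyank -a invocation shown above is used.

```sh
# Hypothetical helper: the default database path, obtained by
# stripping a trailing ".cidx" from the index file name.
db_for_index() {
    printf '%s\n' "${1%.cidx}"
}

# Hypothetical wrapper: fetch one record and report a miss, relying on
# the exit status described for -a (0 = found, 1 = key not found).
fetch_record() {
    # $1 = key, $2 = index file
    if cdbyank -a "$1" "$2"; then
        return 0
    else
        echo "key '$1' not found in $2 (or cdbyank failed)" >&2
        return 1
    fi
}
```

For example, db_for_index /usr/local/db/GUDB.human.cidx prints /usr/local/db/GUDB.human, and fetch_record 'human|Z98492' GUDB.human.cidx returns a non-zero status when the key is missing.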
The total number of FASTA records indexed and the list of keys stored in a specific cdb index file can be retrieved with cdbyank's -n and -l switches, respectively. This information is read directly from the index file (the database file is not needed). There is also a -s option that displays a summary of the information stored in the index at indexing time: the original name of the FASTA file, its size, how the index was created (e.g. whether the -m (multiple keys) option or the -c or -C (shortcut keys) option was given), the number of keys stored in the file, and the number of FASTA records indexed. The latter is the same value that the -n option returns.
As an extra feature, cdbfasta and cdbyank can also be used for some
special cases where a database has different records with the same
key (non-unique keys). Although performance degrades a little,
cdbfasta is able to index files of this kind, but by default cdbyank only
outputs the first record found. To retrieve and display all the records
sharing the same key (accession), give the -x option to cdbyank.
Two cdbyank options were added for convenience. The -F
option returns only the definition line of each requested FASTA record (the
first line of each record). The -R option is intended
for FASTA files containing actual genetic sequences (nucleotide or protein)
and expects each retrieval command to have the following format
(space delimited):
<key> <left_coordinate> <right_coordinate>
For example, to retrieve only the sequence range 24...178 (letter numbering starts at 1) from the sequence named 'human|Z98492', the cdbyank command would look like this:
cdbyank -a 'human|Z98492 24 178' -R GUDB.human.cidx
Multiple sequence ranges can be extracted by providing a file in which each line follows the format above (key followed by the two coordinates). As before, such a file can be piped into cdbyank with the -R option to pull the specified range for each sequence listed in the input file:
cat seqlistranges | cdbyank -R GUDB.human.cidx
Note that this range option works by parsing and looping through the characters of the retrieved record internally, so performance is poor when a range near the end of a very large record is pulled.
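The coordinate convention used by -R (1-based, both ends included, as in the 24 178 example) can be illustrated with a short shell sketch. It only mimics the convention on a plain sequence string and makes no claim about cdbyank's internals; subrange and range_line are hypothetical helper names.

```sh
# Hypothetical helper: cut a 1-based, inclusive range out of a
# sequence given as a single string.
subrange() {
    # $1 = sequence, $2 = left coordinate, $3 = right coordinate
    printf '%s\n' "$1" | cut -c "$2-$3"
}

# Hypothetical helper: format one retrieval line for a range list
# file: the key followed by the two coordinates, space delimited.
range_line() {
    printf '%s %d %d\n' "$1" "$2" "$3"
}
```

For example, subrange ACGTACGTAC 3 6 prints GTAC, and range_line 'human|Z98492' 24 178 prints a line that could be appended to a file such as seqlistranges.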
cat my_data_files* | cdbfasta - -z mydata.cdbz
This option is especially useful when the total size of the input data files is extremely large (over the file-system limits or over the 4GB internal limit of cdbfasta) while the compressed output is small enough to fall under those limits.
Please let me know if you notice any problems with these tools.
--
Geo Pertea
geo.pertea@gmail.com
06/09/2003