Instructions for generating input files for the Argot2 batch process

If your goal is to annotate a large set of sequences (or entire genomes), you can use the Argot2 batch processing section. In this case we cannot provide the computational resources needed for the BLAST and HMMER analyses, so you need to run them yourself. You will then have to submit the output files from BLAST and HMMER in the correct format (both are recommended, but only one is required).

Below is a list of MANDATORY instructions that MUST be followed carefully to generate the Argot2 input files correctly.

Input files for BLAST and HMMER

Your sequences must be in FASTA format. Each header line must contain a '>' followed by a unique alphanumeric string, which we call the unique ID; it must contain no spaces and be no longer than 20 characters. Any further text in the FASTA header, separated from the unique ID by one or more whitespace characters, is ignored. If you want to perform both BLAST and HMMER searches, you need protein sequences only, and the same input file must be used for both searches.
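
The header rules above can be checked before running any search. The following is a minimal Python sketch, not part of Argot2. The function name check_fasta_ids is hypothetical, and "alphanumeric" is read strictly as ASCII letters and digits, which is an assumption; loosen the pattern if your IDs legitimately contain other characters.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only,
# and the unique ID is 1 to 20 characters long.
ID_PATTERN = re.compile(r"[A-Za-z0-9]{1,20}")


def check_fasta_ids(lines):
    """Return (line_number, problem) tuples for bad FASTA header lines."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, not a header
        # The unique ID is the text after '>' up to the first whitespace.
        unique_id = re.split(r"\s", line[1:], maxsplit=1)[0]
        if not ID_PATTERN.fullmatch(unique_id):
            problems.append((number, f"invalid ID {unique_id!r}"))
        elif unique_id in seen:
            problems.append((number, f"duplicate ID {unique_id!r}"))
        seen.add(unique_id)
    return problems
```

Passing an open file (or any list of lines) to check_fasta_ids returns an empty list when every header satisfies the rules as interpreted here.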

Databases

BLAST and HMMER searches MUST be performed against the UniProt and Pfam databases, respectively.

You can download the UniProt database at http://www.uniprot.org/ (on our server we use both the Swiss-Prot and TrEMBL datasets from UniProt).

You can download the Pfam database at one of the following sites (on our server we use both Pfam-A and Pfam-B):

BLAST

Download and install ncbi-blast-2.2.24+ (http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download).

Once BLAST is working on your machine, you can launch blastp or blastx (for protein or nucleotide sequences, respectively) with the option for custom tabular output (-outfmt "6 qseqid sseqid evalue").

An example command line for launching a protein BLAST search is:

blastp -outfmt "6 qseqid sseqid evalue" -query your_sequences -db Uniprot -out output_file
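
Before submitting, it can be worth checking that every line of the BLAST output really has the three TAB-separated fields requested with -outfmt. The Python sketch below is one way to do that; the function name check_blast_line is hypothetical and is not part of BLAST or Argot2.

```python
def check_blast_line(line):
    """Return True if the line is 'qseqid<TAB>sseqid<TAB>evalue'."""
    fields = line.rstrip("\n").split("\t")
    # Exactly three fields, with non-empty query and subject IDs.
    if len(fields) != 3 or not fields[0] or not fields[1]:
        return False
    try:
        float(fields[2])  # accepts scientific notation such as 2e-45
    except ValueError:
        return False
    return True
```

Applied to each line of output_file, the function flags lines where the columns were separated by spaces instead of TABs or where the e-value is not a number.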

If you want to merge several BLAST results, make sure that the final output file preserves the contiguity property, i.e. that all hits for the same query sequence appear on consecutive lines, by applying a multi-key sort if necessary; for example, using the command sort -s -t " " -k1,1 -k3,3g output_file, where the character between the quotes must be a TAB because the tabular output is TAB-delimited (thanks to Christopher Wheat for his kind hint on sorting numbers in scientific notation).
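
As an alternative to the sort command, the same merge can be sketched in Python. This is an illustration under stated assumptions rather than the procedure used on the Argot2 server: the function name merge_blast_lines is hypothetical, every input line is assumed to have exactly the three TAB-separated fields qseqid, sseqid and evalue, and query IDs are ordered by plain string comparison, which may differ from the locale ordering used by sort while still keeping each query's hits together.

```python
def merge_blast_lines(*line_groups):
    """Merge tabular BLAST lines, sorted by query ID and then by e-value."""
    rows = []
    for lines in line_groups:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            qseqid, sseqid, evalue = line.split("\t")
            # float() parses scientific notation, like the 'g' sort key.
            rows.append((qseqid, float(evalue), line))
    # list.sort is stable, so ties keep their input order (like sort -s).
    rows.sort(key=lambda row: (row[0], row[1]))
    return [line for _, _, line in rows]
```

Each argument can be an open file or a list of lines; the returned lines have no trailing newline, so add one when writing them to the final output file.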

HMMER

Download and install the HMMER3-3.0 package (http://hmmer.janelia.org/).

Once HMMER is working on your machine, you can launch the hmmscan program on your protein dataset with the option for tabular output (--tblout output_file).

An example command line for launching hmmscan is:

hmmscan --tblout output_file P-fam_database your_protein_sequences

If you want to use both Pfam-A and Pfam-B, you have to combine the databases before the search using the following commands:

cat Pfam-A.hmm Pfam-B.hmm > PfamAB.hmm

hmmpress PfamAB.hmm

Argot2 submission

Once you have completed the BLAST and/or HMMER searches, you need to compress the tabular output files in zip format. The resulting .zip files are ready to be submitted to Argot2.
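
The compression step can be done with any zip tool. As a minimal sketch, the Python standard library can create the archive. The function name zip_output is hypothetical, and putting one output file in each archive is an assumption, since the text above does not say how files should be grouped.

```python
import os
import zipfile


def zip_output(result_path, zip_path):
    """Compress one tabular output file into a .zip archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname drops the directory part, so the archive holds only the file.
        archive.write(result_path, arcname=os.path.basename(result_path))
```

For example, calling zip_output with the BLAST output_file and a target name ending in .zip produces an archive containing that single file.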

Examples

BLAST

The spaces represent TAB characters.

HMMER