Instructions for generating input files for Argot2 batch processing
If your goal is to annotate a large set of sequences (or entire genomes), you can use the Argot2 batch processing section. In this case we cannot provide the computational resources needed for the BLAST and HMMER analyses, so you must run them yourself. You then submit the output files from BLAST and HMMER in the correct format (both are recommended, but only one is required).
Below is a list of MANDATORY instructions that MUST be followed carefully to generate Argot2 input files correctly.
Input files for BLAST and HMMER
Your sequences must be in FASTA format. Each header line must consist of a '>' immediately followed by a unique alphanumeric string, which we call the unique ID, containing no spaces and no longer than 20 characters. Any further text in the header, separated from the unique ID by one or more white spaces, is ignored. If you want to perform both BLAST and HMMER searches, you need protein sequences only, and the same input file must be used for both searches.
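The header rules above can be checked before launching any search. The following is only a minimal sketch: the function name check_fasta_ids is hypothetical (it is not part of Argot2), and it assumes that "alphanumeric" means strictly the characters A-Z, a-z and 0-9.

```shell
# check_fasta_ids FILE
# Hypothetical helper, not part of Argot2.
# Assumption: "alphanumeric" means only A-Z, a-z and 0-9.
# Prints one line per offending header and returns non-zero
# if at least one problem was found.
check_fasta_ids() {
    awk '
        /^>/ {
            id = substr($1, 2)   # first token of the header, without ">"
            if (id == "") {
                print "empty ID: " $0; bad = 1
            } else if (id !~ /^[A-Za-z0-9]+$/) {
                print "non-alphanumeric ID: " id; bad = 1
            } else if (length(id) > 20) {
                print "ID longer than 20 characters: " id; bad = 1
            } else if (seen[id]++) {
                print "duplicate ID: " id; bad = 1
            }
        }
        END { exit bad }
    ' "$1"
}
```

It could be called as check_fasta_ids your_sequences, where a silent return with exit status 0 means that no header broke the rules as interpreted here.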
Databases
BLAST and HMMER searches MUST be performed against the UniProt and Pfam databases, respectively.
You can download the UniProt database at http://www.uniprot.org/ (on our server we use both the Swiss-Prot and TrEMBL datasets from UniProt).
You can download the Pfam database at one of the following sites (on our server we use both Pfam-A and Pfam-B):
BLAST
Download and install ncbi-blast-2.2.24+ (http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download).
Once you have BLAST working on your machine, you can launch blastp or blastx (for protein or nucleotide sequences, respectively) with the option for custom tabular output (-outfmt "6 qseqid sseqid evalue").
An example command line for launching a protein BLAST search is:
blastp -outfmt "6 qseqid sseqid evalue" -query your_sequences -db Uniprot -out output_file
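With the custom output format requested above, every line of the BLAST output should hold exactly three TAB-separated fields (query ID, subject ID, e-value). A quick sanity check could look like the sketch below; check_blast_table is a hypothetical name, not an Argot2 or BLAST tool.

```shell
# check_blast_table FILE
# Hypothetical helper, not part of Argot2 or BLAST.
# Reports every line that does not contain exactly three
# TAB-separated fields and returns non-zero if any was found.
check_blast_table() {
    awk -F '\t' '
        NF != 3 {
            printf "line %d: expected 3 TAB-separated fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Running check_blast_table output_file prints nothing and returns 0 when the file has the expected layout.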
If you want to merge several BLAST results, make sure that all hits for the same query sequence remain contiguous in the final output file, applying a multi-key sort if necessary. The field separator passed to -t must be a TAB character, because the tabular output is TAB-delimited. For example:
sort -s -t "$(printf '\t')" -k1,1 -k3,3g output_file
(thanks to Christopher Wheat for his kind hint on sorting numbers in scientific notation).
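As an illustration of the merge step, several partial BLAST outputs can be concatenated and sorted in one pass. This is a sketch under stated assumptions: merge_blast_tables is a hypothetical function name, the inputs are TAB-delimited three-column files as produced by the -outfmt option described earlier, and the -g (general numeric) key option is the one provided by GNU sort.

```shell
# merge_blast_tables OUT IN...
# Hypothetical helper, not part of Argot2 or BLAST.
# Concatenates the input tables and sorts them by query ID
# (field 1) and then by e-value (field 3), so that all hits
# of a query end up on adjacent lines.
merge_blast_tables() {
    out=$1
    shift
    tab=$(printf '\t')
    # LC_ALL=C gives a byte-wise, reproducible ordering and a "." decimal point.
    # -s keeps the original order of lines whose keys compare equal.
    # -k3,3g compares e-values such as 1e-50 as numbers (GNU sort).
    cat "$@" | LC_ALL=C sort -s -t "$tab" -k1,1 -k3,3g > "$out"
}
```

For example, merge_blast_tables merged_output part1_output part2_output would write the sorted result to merged_output; the three file names are placeholders.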
HMMER
Download and install the HMMER3-3.0 package (http://hmmer.janelia.org/).
Once you have HMMER working on your machine, you can launch the hmmscan program on your protein dataset with the option for tabular output (--tblout output_file).
An example command line for launching hmmscan is:
hmmscan --tblout output_file Pfam_database your_protein_sequences
If you want to use both Pfam-A and Pfam-B you have to combine the databases before the search using the following commands:
cat Pfam-A.hmm Pfam-B.hmm > PfamAB.hmm
hmmpress PfamAB.hmm
Argot2 submission
Once you have completed the BLAST and/or HMMER searches, you need to compress the tabular output files in zip format. The resulting .zip files are ready to be submitted to Argot2.
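The compression step can be scripted as in the sketch below. It assumes that the standard zip command-line tool is installed and that one archive per output file is acceptable; pack_for_argot2 and the archive naming (input name plus .zip) are hypothetical choices, not requirements stated by Argot2.

```shell
# pack_for_argot2 FILE...
# Hypothetical helper, not part of Argot2.
# Assumption: one .zip archive per tabular output file.
# Refuses missing or empty files, then writes FILE.zip next to FILE.
pack_for_argot2() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty file: $f" >&2
            return 1
        fi
        # -j stores the file without its directory path, -q suppresses output.
        zip -j -q "$f.zip" "$f" || return 1
    done
}
```

For example, pack_for_argot2 output_file would produce output_file.zip in the same directory.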
Examples
BLAST
The spaces shown in the example stand for TAB characters.