keeSeek is a command-line tool that generates, for a given reference genome, a set of k-mers absent from that genome. We call them neverwords.
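To make the definition concrete, here is a minimal brute-force sketch in Python. It is not keeSeek's implementation: the function names and the toy genome are made up for illustration, it scans the forward strand only, and it is practical only for very small k.

```python
from itertools import product


def kmers_in(genome: str, k: int) -> set[str]:
    """Collect every k-mer that occurs in the genome (forward strand only)."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def neverwords(genome: str, k: int) -> list[str]:
    """Return all k-mers over ACGT that never occur in the genome."""
    present = kmers_in(genome, k)
    candidates = ("".join(p) for p in product("ACGT", repeat=k))
    return [kmer for kmer in candidates if kmer not in present]


# Hypothetical toy genome, used only for this illustration.
toy_genome = "ACGTACGTTGCA"
absent = neverwords(toy_genome, 2)
print(len(absent), absent)
```

Enumerating all 4^k candidates becomes infeasible quickly as k grows, which is why a dedicated tool is needed for realistic sizes.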
keeSeek requires a CPU with the SSE4 instruction set and an NVIDIA graphics card. The -N option disables GPU computation, but this is not recommended because GPGPU acceleration is a strong advantage.
If you want to try keeSeek on a desktop or notebook, you must install the latest NVIDIA drivers; do not install the drivers provided by the Linux repositories. We strongly recommend running keeSeek on professional GPU cards (Tesla or better). Moreover, analyses of large genomes (> 120 MB) require professional cards (the CPU version can be used, but it is slower). Here you can find a list of the GPUs that we were able to test.
The Linux binary requires glibc >= 2.4 and libstdc++ (GLIBCXX) >= 3.4.15.
Filename | Filesize | Last modified |
---|---|---|
keeSeek_0.4.0_linux64.tar.bz2 | 427.7 KiB | 2014/04/16 12:05 |
keeSeek_0.4.0_src_linux64.tar.bz2 | 297.9 KiB | 2014/04/16 11:58 |
Extract the tarball, then type:
./keeSeek.sh <reference_genome.fasta> <outputfile> [options]
FOR THE IMPATIENT: have a look at USAGE EXAMPLES. Otherwise, continue reading the full documentation.
keeSeek offers many command line options. To simplify usage, the most important options are grouped in the “BASIC OPTIONS” section below and are shown when using the “-h” option. Type “-h -h” to see the advanced options as well.
By default, when using the “sequential” and “anagram” modes, candidate k-mers are reshuffled by applying a mask generated from a fixed seed. This behavior is controlled by the option -R (see below for details) and was implemented to reduce the time required to produce neverwords with high sequence complexity.
The algorithm steps are:
Reshuffling is also useful in anagram mode, since the changes between successive steps are mostly local, as in sequential mode.
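The exact mask keeSeek applies is not described here, so the following Python sketch shows only one hypothetical interpretation: each candidate is encoded as a 2k-bit integer and XORed with a mask drawn from a seeded random generator. The names `decode`, `make_mask`, `reshuffled` and the seed value are assumptions made for this example. Because XOR with a constant is a bijection, every k-mer is still visited exactly once, the order is reproducible for a given seed, and the enumeration no longer has to start from low-complexity runs such as AAA.

```python
import random

BASES = "ACGT"


def decode(index: int, k: int) -> str:
    """Turn a 2k-bit integer into a k-mer, two bits per base."""
    return "".join(BASES[(index >> (2 * (k - 1 - i))) & 3] for i in range(k))


def make_mask(k: int, seed: int = 12345) -> int:
    """Derive a reproducible 2k-bit mask from a fixed seed (hypothetical scheme)."""
    return random.Random(seed).getrandbits(2 * k)


def reshuffled(k: int, seed: int = 12345):
    """Yield every k-mer exactly once: the sequential index XORed with the mask."""
    mask = make_mask(k, seed)
    for index in range(4 ** k):
        yield decode(index ^ mask, k)


k = 3
plain = [decode(i, k) for i in range(4 ** k)]   # sequential order, starts at AAA
masked = list(reshuffled(k))                    # same k-mers, masked order
print(plain[:4])
print(masked[:4])
```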
You can also have a look at KEESEEK ADVANCED OPTIONS.
Some command lines to run keeSeek in different conditions are listed below. Note that the sample reference genome we provide is quite small, so these jobs are not computationally demanding. If you want to try larger genomes, please have a look at the “Expected time to see the first results” section.
keeSeek.sh reference_genome.fasta results.tsv -w 20
keeSeek.sh reference_genome.fasta results.tsv -a 5:5:5:5 -t 10
keeSeek.sh reference_genome.fasta -s ACGTACGTGTCTGCTC
keeSeek.sh reference_genome.fasta -i input_sequences.txt
keeSeek.sh reference_genome.fasta results.tsv -w 10 -d 3
keeSeek.sh reference_genome.fasta results.tsv -a 5:5:5:5 -d 3 -K 1000000 -p -t 100
keeSeek.sh reference_genome.fasta results.tsv -a 5:5:5:5 -d 3 -K 1000000 -p -t 100 -r
keeSeek.sh reference_genome.fasta -w 8 -K 10 -I saved_search.ini
keeSeek.sh -I saved_search.ini -r
The longer the k-mer, the higher the number of different k-mers of that size. For example, there are 4^20 (about 1.1 × 10^12) possible k-mers of length 20. Evaluating such a huge number of k-mers and reporting, for each of them, the minimum distance from the reference genome can take a very long time. For this reason keeSeek is an anytime algorithm: it lets you retrieve the results as they are produced. You can stop the computation (Ctrl-C), look at the results in the output file, and optionally restart the computation from the last evaluated k-mer with the -r option.
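The cost described above can be illustrated with a small Python sketch. It assumes that "distance" means Hamming distance (number of mismatching positions), which this page does not state explicitly, and it uses a made-up toy genome; it is a naive reference computation, not keeSeek's GPU code.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(candidate: str, genome: str) -> int:
    """Smallest Hamming distance between the candidate and any genome window."""
    k = len(candidate)
    windows = (genome[i:i + k] for i in range(len(genome) - k + 1))
    return min(hamming(candidate, window) for window in windows)


# Size of the search space for k = 20: 4^20 candidates.
print(4 ** 20)  # 1099511627776

# Hypothetical toy genome, used only for this illustration.
toy_genome = "ACGTACGTTGCA"
print(min_distance("ACGT", toy_genome))  # occurs in the genome, so distance 0
print(min_distance("GGGG", toy_genome))  # absent; distance to the closest window
```

Each candidate requires a pass over the whole genome, so the total work grows with both 4^k and the genome length.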
Note: k-mers are processed in blocks (option -k) that cannot be interrupted. If you press Ctrl-C, keeSeek will stop after the block under evaluation is finished; therefore the bigger the block the longer you will have to wait to stop keeSeek.
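The stop-after-the-current-block behavior can be sketched as follows. This is an assumed design in Python, not keeSeek's code: the Ctrl-C handler only sets a flag, the flag is checked between blocks, and the returned position is where a later run would resume. `evaluate_block` is a hypothetical placeholder for the real per-block work.

```python
import signal

stop_requested = False


def request_stop(signum, frame):
    """SIGINT handler: only set a flag, so the block in progress can finish."""
    global stop_requested
    stop_requested = True


def evaluate_block(start: int, end: int) -> list[int]:
    """Placeholder for the real per-block evaluation (hypothetical)."""
    return list(range(start, end))


def run(total: int, block_size: int, start: int = 0) -> int:
    """Process candidates block by block; return the index to resume from."""
    position = start
    while position < total and not stop_requested:
        end = min(position + block_size, total)
        evaluate_block(position, end)  # never interrupted midway
        position = end                 # progress is recorded at block boundaries
    return position


if __name__ == "__main__":
    signal.signal(signal.SIGINT, request_stop)
    resume_from = run(total=1000, block_size=256)
    print(resume_from)
```

With this structure the waiting time after Ctrl-C is bounded by the time needed for one block, which is why larger blocks mean a longer wait.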
Marco Falda, Paolo Fontana, Luisa Barzon, Stefano Toppo and Enrico Lavezzo. keeSeek: searching distant non-existing words in genomes for PCR-based applications. Bioinformatics (2014) DOI
keeSeek by M. Falda, E. Lavezzo and S. Toppo is licensed under the Q Public Licence 1.0.