keeSeek

keeSeek is a command line tool that generates, for a given reference genome, a set of k-mers absent from that genome. We call them neverwords.

Downloads

keeSeek requires a CPU with the SSE4 instruction set and an Nvidia graphics card. The -N option disables GPU computation, but this is not recommended because GPGPU acceleration is a major advantage.

If you want to try keeSeek on a desktop (or notebook), you must install the latest Nvidia drivers; do not install the drivers provided by Linux repositories. We strongly recommend running keeSeek on professional GPU cards (Tesla or better). Moreover, professional cards are required for analyses of large genomes (> 120 MB); the CPU version can be used but is slower. Here you can find a list of the GPUs that we were able to test.

The Linux binary requires glibc >= 2.4 and libstdc++ with GLIBCXX >= 3.4.15.


Filename                            Filesize    Last modified
keeSeek_0.4.0_linux64.tar.bz2       427.7 KiB   2014/04/16 12:05
keeSeek_0.4.0_src_linux64.tar.bz2   297.9 KiB   2014/04/16 11:58

Quick start


Extract the tarball, then type:

./keeSeek.sh <reference_genome.fasta> <outputfile> [options]


For the impatient: have a look at the usage examples. Otherwise, continue reading the full documentation.


keeSeek offers many command line options. For ease of use, the most important ones are grouped in the “BASIC OPTIONS” section below and are shown by the “-h” option. Type “-h -h” to also see the advanced options.


BASIC OPTIONS

Running modes (select only one of the following 4 options)
  1. Sequential mode
    • -w [ --word-length ] arg (=20)        sequentially analyze all possible k-mers of length w, starting from a homopolymer of ‘A’ ([8, 31], default: 20)
  2. Anagram mode
    • -a [ --anagram ] arg        set the initial nucleotide composition in terms of A, C, G, T characters. Separate the numbers with ':' (num_A:num_C:num_G:num_T) and keep the sum of the 4 values in the range [8, 31]; keeSeek will analyze all anagrams of the specified composition
  3. Input one candidate k-mer from command line
    • -s [ --single-sequence ] arg        search a specific k-mer (in [ACGT])
  4. Input candidate k-mers from a file
    • -i [ --input-sequences ] arg        search a specific set of k-mers, reading the sequences from a list in a file (one per line, same length)
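
The composition string accepted by -a can be checked before launching a long run. The following Python sketch is not part of keeSeek: the function names are hypothetical, and it only assumes the documented format (num_A:num_C:num_G:num_T with a sum in [8, 31]). It also counts how many distinct anagrams a composition has, which gives the size of the search in anagram mode.

```python
from math import factorial


def parse_composition(spec):
    """Parse 'num_A:num_C:num_G:num_T' and enforce the documented [8, 31] range."""
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError("expected four counts separated by ':'")
    counts = tuple(int(p) for p in parts)
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    if not 8 <= sum(counts) <= 31:
        raise ValueError("the sum of the four counts must be in [8, 31]")
    return counts


def count_anagrams(counts):
    """Number of distinct k-mers with this composition (multinomial coefficient)."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total
```

For instance, count_anagrams(parse_composition("5:5:5:5")) evaluates to 11732745024, so even a fixed composition of length 20 is a large search.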


By default, in “sequential” and “anagram” modes, candidate k-mers are reshuffled by applying a mask generated from a fixed seed. This behavior is controlled by the -R option (see below for details) and was implemented to reduce the time required to produce neverwords with high sequence complexity.

The algorithm steps are:

  • k-mer generation: in “sequential mode” k-mers are generated in lexicographical order starting from a homopolymer of 'A' (e.g. AAA, AAC, AAG, AAT, ACA, etc.); in “anagram mode” k-mers are explored exhaustively starting from the sequence that has the chosen composition with its symbols in alphabetical order (e.g. -a 2:3:4:5 → AACCCGGGGTTTTT);
  • k-mer reshuffling: once a k-mer is generated, the order of its nucleotides is mixed by applying a reshuffling mask created from a seed. The seed can be fixed (-R 2 by default; any value in the range 1 to 2e9 is accepted) or random (-R 0), or the step can be disabled (-R -1);
  • filtering: each k-mer undergoes a filtering step that evaluates its primer-like features (see advanced options for details). This step can be disabled with the -z option.

Reshuffling is also useful in anagram mode because, as in sequential mode, the changes between successive k-mers are mostly local.
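
The generation and reshuffling steps above can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not keeSeek's implementation: all function names are hypothetical, and the mask is modelled as a seeded permutation of positions, which is one plausible reading of "a mask created from a seed". The filtering step is not shown because its criteria are described only in the advanced options.

```python
import itertools
import random

ALPHABET = "ACGT"


def sequential_kmers(k):
    """Yield all k-mers in lexicographic order, starting from 'A' * k."""
    for letters in itertools.product(ALPHABET, repeat=k):
        yield "".join(letters)


def anagram_kmers(counts):
    """Yield each distinct anagram of the composition once, in lexicographic order."""
    remaining = dict(zip(ALPHABET, counts))
    k = sum(counts)

    def extend(prefix):
        if len(prefix) == k:
            yield "".join(prefix)
            return
        for base in ALPHABET:
            if remaining[base] > 0:
                remaining[base] -= 1
                prefix.append(base)
                yield from extend(prefix)
                prefix.pop()
                remaining[base] += 1

    yield from extend([])


def make_mask(k, seed):
    """Assumed mask: a permutation of the k positions derived from the seed."""
    mask = list(range(k))
    random.Random(seed).shuffle(mask)
    return mask


def reshuffle(kmer, mask):
    """Reorder the nucleotides of kmer according to the position mask."""
    return "".join(kmer[i] for i in mask)
```

Because the same permutation is applied to every k-mer, reshuffling changes the order in which candidates are visited without skipping or duplicating any of them. This is also why resuming a run only stays exhaustive if the seed is unchanged.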


Output options
  • -K [ --max-kmers ] arg (=1024)         set the maximum number of k-mers to search before stopping the computation ([1, 1e6], default: 1024)
  • -t [ --top-results ] arg (=1024)        set the number of best results to show ([1, 4e9], default: 1024), in order to limit the size of the output file
  • -p [ --percentage ]        show the progress percentage (this could slow down performance)
  • -v [ --verbosity ] arg (=0)        set the verbosity level ([0, 3], default: 0); (this could slow down performance)


Other useful options
  • -d [ --min-distance ] arg (=1)        set the minimum distance from the reference to be sought ([0, w], default: 1), results under this threshold will not be reported
  • -h [ --help ]        show the short help
  • -h -h [ --help --help]        show the full help
  • -N [ --no-GPU ]        Do not use GPU
  • -R [ --random-shuffle ] arg (=2)        set the seed for reshuffling (-1 to turn off, 0 for random, 1 - 2e9 to set a fixed seed; default: 2 for -w and -a, -1 for -s and -i). To increase complexity of neverwords, keeSeek reshuffles the nucleotides positions of the k-mers after their generation (or permutations when using -a) using a mask created from this seed
  • -r [ --resume ]        force the resume from the last saved sequence; be sure to keep the same seed for reshuffling, otherwise the exhaustiveness of the search will be lost (don't use random seed -R 0)
  • -y [ --yield ] arg (=3600)         set the interval, in seconds, at which intermediate results are saved ([600, 30879000], default: 3600)
  • -I [ --ini-file ] arg        read the parameters from an INI file
  • -V [ --version ]        print information about the software and its current limitations


You can also have a look at the keeSeek advanced options.

Examples

Some command lines for running keeSeek under different conditions are listed below. Note that the sample reference genome we provide is quite small, so these jobs are not computationally demanding. If you want to try larger genomes, please have a look at the "Expected time to see the first results" section.

  1. Simple search for 20-neverwords with primer-like features, with results written to a file:
    keeSeek.sh reference_genome.fasta results.tsv -w 20
  2. Search for 20-neverwords with primer-like features, starting from a defined nucleotide composition (5 A, 5 C, 5 G, and 5 T) and evaluating all possible anagrams; at most the top 10 results are written to a file:
    keeSeek.sh reference_genome.fasta results.tsv -a 5:5:5:5 -t 10
  3. Search for a specific sequence:
    keeSeek.sh reference_genome.fasta -s ACGTACGTGTCTGCTC
  4. Search from a file containing a list of sequences:
    keeSeek.sh reference_genome.fasta -i input_sequences.txt
  5. Search for 10-neverwords with at least 3 differences, with results written to a file:
    keeSeek.sh reference_genome.fasta results.tsv -w 10 -d 3
  6. Search for 20-neverwords of defined composition with at least 3 differences, stop after evaluating 1 million k-mers, write at most the 100 best results to a file, and show the progress of the computation (this will take a long time; you can stop with Ctrl-C and look at the partial results):
    keeSeek.sh reference_genome.fasta results.tsv -a 5:5:5:5 -d 3 -K 1000000 -p -t 100
  7. Resume the computation of example 6 from the last evaluated k-mer (keep the parameters unchanged; use Ctrl-C to stop):
    keeSeek.sh reference_genome.fasta results.tsv -a 5:5:5:5 -d 3 -K 1000000 -p -t 100 -r
  8. Save the parameters in an INI file (which must not already exist):
    keeSeek.sh reference_genome.fasta -w 8 -K 10 -I saved_search.ini
  9. Load the parameters from an INI file (which must already exist) and override a parameter:
    keeSeek.sh -I saved_search.ini -r

Exhaustiveness of the search

The number of distinct k-mers grows exponentially with the k-mer length. For example, there are 4^20 (about 1.1 × 10^12) possible k-mers of length 20. Evaluating such a huge number of k-mers and reporting, for each of them, the minimum distance from the reference genome can take a very long time. For this reason keeSeek is an anytime algorithm: it lets you retrieve the results as they are produced. You can stop the computation (Ctrl-C), look at the results in the output file, and optionally restart the computation from the last evaluated k-mer with the -r option.
Note: k-mers are processed in blocks (option -k) that cannot be interrupted. If you press Ctrl-C, keeSeek will stop after the block under evaluation is finished; therefore the bigger the block the longer you will have to wait to stop keeSeek.
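
The size of the sequential search space follows directly from the four-letter alphabet. The small Python sketch below (hypothetical helper names, not part of keeSeek) computes it for the supported word lengths and converts it into a running time for a throughput that you supply. No throughput figure is given on this page, so the rate is a placeholder to be measured on your own hardware.

```python
def search_space_size(k):
    """Number of distinct k-mers over the alphabet {A, C, G, T}."""
    if not 8 <= k <= 31:
        raise ValueError("keeSeek documents word lengths in [8, 31]")
    return 4 ** k


def hours_to_exhaust(k, kmers_per_second):
    """Hours needed to evaluate every k-mer at a user-measured rate."""
    return search_space_size(k) / kmers_per_second / 3600.0
```

For example, search_space_size(20) is 1099511627776, and each additional nucleotide multiplies the count by four.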

Citing this work

Marco Falda, Paolo Fontana, Luisa Barzon, Stefano Toppo and Enrico Lavezzo. keeSeek: searching distant non-existing words in genomes for PCR-based applications. Bioinformatics (2014) DOI




keeSeek by M. Falda, E. Lavezzo and S. Toppo is licensed under the Q Public Licence 1.0.

keeseek.txt · Last modified: 2014/06/05 16:50 by admin