Filters applied to k-mers before searching the reference sequence

The goal of keeSeek is to generate neverwords suitable for use as artificial linkers and PCR primers; hence, not all possible combinations of nucleotides are useful. Many k-mers can be discarded a priori, before they are searched in the reference genome, which saves a large part of the required computational time. The filters applied after k-mer generation are listed and detailed in the following sections.

GC content

A balanced GC content is essential for a primer to be functional. For this reason, only k-mers with a GC content between 40% and 60% are retained. In addition, the user can select the exact nucleotide composition of the k-mers to be generated. Starting from the values supplied by the user (number of A, C, T, and G), keeSeek produces all the possible anagrams containing those amounts of each nucleotide. If the specified nucleotide composition falls outside the default GC content range, no results are produced and a warning appears. To force the production of results, the user needs to adjust the GC content range (-g and -G options) or, alternatively, specify a different nucleotide composition.
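The GC window and the anagram generation described above can be sketched in a few lines of Python. This is only an illustration, not keeSeek's own code: the function names (gc_fraction, passes_gc, anagrams) and the gc_min/gc_max parameters are hypothetical, and the 40%-60% defaults are taken from the text.

```python
def gc_fraction(kmer):
    """Fraction of G and C bases in the k-mer."""
    return sum(base in "GC" for base in kmer) / len(kmer)


def passes_gc(kmer, gc_min=0.40, gc_max=0.60):
    """Keep a k-mer only if its GC fraction lies inside the window."""
    return gc_min <= gc_fraction(kmer) <= gc_max


def anagrams(counts):
    """Yield every distinct sequence with the given base counts.

    counts maps each base to how many times it must appear,
    e.g. {"A": 2, "C": 1} yields AAC, ACA, CAA.
    """
    remaining = dict(counts)
    length = sum(remaining.values())

    def extend(prefix):
        if len(prefix) == length:
            yield "".join(prefix)
            return
        for base in sorted(remaining):
            if remaining[base] > 0:
                # Use one copy of this base, recurse, then put it back.
                remaining[base] -= 1
                prefix.append(base)
                yield from extend(prefix)
                prefix.pop()
                remaining[base] += 1

    yield from extend([])


if __name__ == "__main__":
    composition = {"A": 2, "C": 1, "G": 1, "T": 2}
    kept = [k for k in anagrams(composition) if passes_gc(k)]
    print(len(kept), "k-mers pass the GC window")
```

Because every anagram of one composition has the same GC fraction, the window either accepts or rejects the whole set, which matches the behaviour described above where an out-of-range composition yields no results.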

3’-end filtering

The 3’-end boundary of a PCR primer is a crucial region which largely determines its specificity. For this reason, a k-mer is discarded if:

It ends with two or more consecutive “weak” bases (that is, with -AA, -TT, -AT, or -TA), to prevent “breathing” of the ends;
It contains more than three Cs and/or Gs in the last five nucleotides, to avoid mispriming at GC-rich regions of the reference.
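The two 3’-end rules above translate into a short check. The sketch below is an illustration under stated assumptions, not the keeSeek implementation: the function name passes_three_prime and its parameters are hypothetical, and the input is assumed to be an uppercase A/C/G/T string of at least five bases.

```python
def passes_three_prime(kmer, window=5, max_gc_in_window=3):
    """Return False if the 3' end violates either rule from the text."""
    # Rule 1: the last two bases must not both be weak (A or T).
    if all(base in "AT" for base in kmer[-2:]):
        return False
    # Rule 2: no more than three C/G bases in the last five positions.
    gc_in_tail = sum(base in "GC" for base in kmer[-window:])
    if gc_in_tail > max_gc_in_window:
        return False
    return True


if __name__ == "__main__":
    for candidate in ("ACGTACGTAA", "AAAAACGCGG", "ATATAGCATG"):
        print(candidate, passes_three_prime(candidate))
```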

Homopolymers

The presence of five or more consecutive identical nucleotides could lead to alignment shifts (“mispriming”), so k-mers containing these homopolymers are discarded.
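A minimal sketch of the homopolymer rule, assuming the threshold of five stated above; the names longest_run and passes_homopolymer are hypothetical and not taken from keeSeek.

```python
from itertools import groupby


def longest_run(kmer):
    """Length of the longest stretch of identical consecutive bases."""
    return max((len(list(group)) for _, group in groupby(kmer)), default=0)


def passes_homopolymer(kmer, max_run=4):
    """Reject k-mers that contain a run of five or more identical bases."""
    return longest_run(kmer) <= max_run


if __name__ == "__main__":
    print(passes_homopolymer("ACGTTTTAC"))   # run of four Ts
    print(passes_homopolymer("ACGTTTTTAC"))  # run of five Ts
```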

Palindromes and self-dimerization

If a primer forms hairpins or self-dimers, because of self-complementary regions within its sequence or between two copies of the same sequence, respectively, the PCR could be inhibited or the production of primer dimers could be favored. To avoid this issue, we have implemented a “longest common subsequence” algorithm (LCS): we compute all possible gapless alignments between a sequence and itself, each time evaluating the number of matches (see Figures below). If this value exceeds a fixed threshold (e.g. eight for 20-mers), the sequence is discarded. An additional check is performed at the 3’ end of the sequences: if the last position in the alignment is a match, and at least four out of the last five positions are matches, the sequence is discarded.

The candidate sequence is aligned to its reverse complement and a matrix is built, assigning a score of 1 to each match and 0 to each mismatch. The scores in the last column and in the last row, highlighted in green, are the sums of all the values in the corresponding diagonals, as shown by the blue boxes. If one of these sums is greater than the selected threshold (set to eight for 20-mers), the candidate is discarded.

This last figure describes the evaluation of the self-priming ability of the 3’-ends: in the same matrix, all the scores resulting from the alignment of the last five nucleotides of both forward (blue boxes) and reverse complement (orange boxes) sequences are calculated (from the region highlighted in yellow). Then, if the last base of the alignment (red circles) is a match, we discard sequences with scores higher than three.
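The diagonal sums described in this section can be sketched without building the full matrix, because each gapless alignment corresponds to one diagonal. The code below is one possible reading of the text, not keeSeek's implementation: all names are hypothetical, the thresholds (eight matches overall, more than three matches in the last five positions) come from the text, and the 3’ check is interpreted here as applying only to alignments in which the 3’-terminal base of the candidate is paired and matched.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def diagonal_flags(seq):
    """For every gapless offset, yield (position in seq, 0/1 match) pairs
    between seq and its reverse complement (one diagonal of the matrix)."""
    n = len(seq)
    rc = reverse_complement(seq)
    for offset in range(-(n - 1), n):
        rows = range(max(0, offset), min(n, n + offset))
        yield [(i, int(seq[i] == rc[i - offset])) for i in rows]


def passes_self_complementarity(seq, max_matches=8, tail=5, max_tail_matches=3):
    n = len(seq)
    for flags in diagonal_flags(seq):
        # Whole-diagonal check: too many matches in one gapless alignment.
        if sum(flag for _, flag in flags) > max_matches:
            return False
        # 3'-end check (assumed reading): the terminal base must be matched
        # and more than max_tail_matches of the last `tail` bases must match.
        last_pos, last_flag = flags[-1]
        tail_matches = sum(flag for i, flag in flags if i >= n - tail)
        if last_pos == n - 1 and last_flag == 1 and tail_matches > max_tail_matches:
            return False
    return True


if __name__ == "__main__":
    for candidate in ("ACGT" * 5, "AC" * 10, "A" * 14 + "GGATCC"):
        print(candidate, passes_self_complementarity(candidate))
```

In this sketch the perfectly self-complementary "ACGT" repeat fails the whole-diagonal check, while the third candidate stays under the overall threshold and is rejected only because its 3’ end (GGATCC) can pair with itself.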

Sequence complexity

The presence of repeated motifs within a primer reduces sequence complexity and, as a consequence, increases the probability of binding to low-complexity regions along the reference genome. We filter such sequences by looking at the occurrence of small motifs (up to a length of 6) and assigning a score each time a repeat is found. An intuitive representation of one such filter (searching for repeats up to 3-mers) is depicted below.
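The text does not give the exact scoring scheme, so the following is only a rough sketch of the idea under loud assumptions: a "repeat" is taken to mean a motif immediately followed by an identical copy (a tandem repeat), each occurrence adds one point, and motif lengths run from 2 to 6 (single-base repeats are left out on the assumption that they are covered by the homopolymer filter). The function name repeat_score and the scoring rule are hypothetical and may differ from what keeSeek does.

```python
def repeat_score(kmer, max_motif=6):
    """Count tandem repeats of motifs of length 2..max_motif.

    One point is added for every position where a motif is
    immediately followed by an identical copy of itself.
    """
    score = 0
    for m in range(2, max_motif + 1):
        for i in range(len(kmer) - 2 * m + 1):
            if kmer[i:i + m] == kmer[i + m:i + 2 * m]:
                score += 1
    return score


if __name__ == "__main__":
    for candidate in ("ACACACAC", "ACGTTGCA"):
        print(candidate, repeat_score(candidate))
```

A filter built on this would discard candidates whose score exceeds some cutoff; the cutoff is not stated in the text and would have to be chosen by the user.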

Melting temperature

The melting temperature of a candidate k-mer is evaluated and a filter is applied to reject neverwords that are largely unsuitable as PCR primers. Melting temperature is calculated with the nearest-neighbor method and the SantaLucia table of DNA/DNA thermodynamic parameters (SantaLucia et al., 1996). The formula is the following:
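A form of the equation consistent with the symbols defined below and with the standard two-state nearest-neighbor model is sketched here. It is an assumption rather than a quotation of the keeSeek formula: in particular, the concentration term is written as ln C_t, whereas some nearest-neighbor formulations use C_t/4 for non-self-complementary duplexes, and any salt correction is omitted. T_m is in kelvin.

```latex
T_m = \frac{\sum \Delta H_d + \Delta H_i}
           {\sum \Delta S_d + \Delta S_{self} + \Delta S_i + R \,\ln C_t}
```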

where sums of the enthalpy (ΔH_d) and entropy (ΔS_d) are calculated for all internal nearest-neighbor doublets and ΔS_self is the entropic penalty for self-complementary sequences. ΔH_i and ΔS_i are the sums of initiation enthalpies and entropies (from SantaLucia table), R is the gas constant, and C_t is the molar concentration.

Note about filters

All the filters implemented in keeSeek are optimized for the production of primer-like k-mers in the range of lengths 16-26. Word lengths outside these boundaries (keeSeek lower bound is eight, upper bound is 31) are hardly compatible with primer design and, as a consequence, the filters are not useful for such sequences. If the aim is the exhaustive generation of all neverwords at a defined word size, without any specific interest in PCR primer generation, the filters need to be switched off (-z option, applied by default for word lengths lower than 18 and higher than 24).