NeSSie algorithms and features

NeSSie algorithms and features

Global alignment algorithm

Imperfection-tolerant palindromes and mirrors may contain mismatches but also insertions/deletions that impair the symmetry of the motifs. This poses the problem of finding the best symmetry point that divides the sequence in two self-complementary arms. To accomplish this task, we implemented a strategy based on dynamic programming applied to sliding windows as follows.

1) The input sequence is scanned with a sliding window, whose length is defined by the user, shifting along the sequence one base at a time.

2) For each sliding window a global alignment approach based on Needleman-Wunsch algorithm [1] is applied to search for potential symmetries with the following different modalities:

Motifs of length equal to sliding window size and satisfying the selected scoring cutoffs (-k N parameter of NeSSie)
The entire DNA sequence captured by the sliding window (size defined by -k) is tested with the global alignment algorithm and reported if satisfies the selected scoring cutoffs (degeneracy parameters).

All motifs falling in the range of min/max lengths and satisfying the selected scoring cutoffs (-k N, -K N parameters of NeSSie)
All the sequences from max to min length (window size defined by –K, max length) are tested with the global alignment algorithm and only those that satisfy the selected scoring thresholds (degeneracy parameters) are reported at each position.

Longest max scoring motif falling in the range of min/max lengths and satisfying the selected scoring cutoffs (-MAX, -k N, -K N parameters of NeSSie)
All the sequences from max to min length (window size defined by –K, max length) are tested with the global alignment algorithm and only the longest sequence that satisfies also the selected scoring thresholds (degeneracy parameters) is reported at each position.

3) To manage insertions and deletions that impair the symmetry, we implemented a modified version of the Needleman-Wunsch algorithm for global alignment, which can identify the best symmetry point and generate the optimal alignment between the two arms of the symmetric motif. The algorithm steps are the following:

The sequence and its inverted copy are placed in the horizontal and vertical axes of the matrix respectively.
The alignment matrix is filled according to Needleman-Wunsch algorithm.
Two different scoring matrices are used to test for the mirror and palindromic symmetry; gap opening score is -1.
The backtracking to retrieve the optimal alignment starts from the highest scoring cell found along the diagonal connecting top-right cell and bottom-left cell. This boundary defines a constraint for the entire sequence to be aligned.

This example shows the result of a mirror search in the sequence “ATCAAGTTGCCA”. The best possible alignment is obtained following the optimal path as shown by arrows where the green cell containing the highest score along the diagonal is the starting point of backtracking.

Shannon entropy

The Shannon entropy [2] is an index derived from the information theory that measures the amount of non-compressible information contained in a message. It can be applied to any symbolic sequence and allows to measure the order state (or entropy) of the sequence by the analysis of the symbols distribution. Practically, when applied to a DNA sequence [3], it allows to detect unbalances in base composition. The entropy is calculated according to the formula:

where p(x_i) represents the probability for the symbol x_i to occur at any position of the sequence and n is the size of the alphabet x (n = 4 for DNA).

Linguistic complexity

The Linguistic Sequence Complexity [4] is an index that measures the vocabulary richness of a sequence. It can be applied to any symbolic sequence and is calculated as the level of repetition of the k-mers (words of length k) in the sequence. The more complex the sequence is, the richer is the vocabulary it contains; vice versa, the lower the complexity is, the more repetitive is the sequence. The complexity is calculated according to the formula [5]:

where V_k is the actual number of different words of length k in the sequence, V_maxk is the maximum number of possible words of length k in the sequence, and N is the maximum length of the words considered in the score calculation. V_maxk is calculated as the min(A^k, L - k + 1), where A is the alphabet size and L is the length of the sequence.

References

[1] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” J. Mol. Biol., vol. 48, no. 3, pp. 443–453, Mar. 1970.

[2] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, Jul. 1948.

[3] T. Machado and J. A, “Shannon Entropy Analysis of the Genome Code,” Mathematical Problems in Engineering, 2012. [Online]. Available: https://www.hindawi.com/journals/mpe/2012/132625/. [Accessed: 25-Oct-2017].

[4] E. N. Trifonov, “Making sense of the human genome,” Struct. Methods, vol. 1, Human Genome Initiative and DNA Recombination, pp. 69–77, 1990.

[5] Y. L. Orlov and V. N. Potapov, “Complexity: an internet resource for analysis of DNA sequence complexity,” Nucleic Acids Res., vol. 32, no. Web Server issue, pp. W628-633, Jul. 2004.