NeSSie: nucleic-acids elements of sequence symmetry identification

NeSSie is a 64-bit C/C++ program for fast pattern searches and sequence analyses on DNA strings, built on the NeSSie library.

The tool currently offers the following types of analyses:

For more details on the implemented algorithms and features see NeSSie algorithms and features.

Downloads

Filename                         Filesize     Last modified
NeSSie_Library.zip               228.3 KiB    2018/06/26 17:00
NeSSie_igv.tar.gz                2.3 MiB      2018/01/25 18:12
NeSSie_materials.tar.gz          67.9 MiB     2018/01/26 14:44
nessie_CentOS_6.9                1.1 MiB      2018/01/25 18:04
nessie_Debian_6.0.10             1.2 MiB      2018/01/25 18:01
nessie_OSX_HighSierra_10.13.2    662.1 KiB    2018/01/26 14:53

  1. NeSSie_igv.tar.gz: Mycobacterium bovis data for genome browser visualization.
  2. NeSSie_materials.tar.gz: supplementary materials.
  3. NeSSie_Library.zip: NeSSie Library (source code).
  4. nessie_CentOS_6.9: binary file compiled on CentOS 6.9
  5. nessie_OSX_HighSierra_10.13.2: binary file compiled on OSX High Sierra 10.13.2
  6. nessie_Debian_6.0.10: binary file compiled on Debian 6.0.10

Contacts

Michele Berselli, berselli.michele@gmail.com

github: https://github.com/B3rse/nessie


Publication: Michele Berselli, Enrico Lavezzo, Stefano Toppo; NeSSie: a tool for the identification of approximate DNA sequence symmetries, Bioinformatics, bty142, https://doi.org/10.1093/bioinformatics/bty142

Install from source

A Makefile is available in the NeSSie folder to compile the program. To set things up:

This will generate the binary file nessie inside the folder.

Quick guide

This is a quick reference guide. For more details on specific analyses, see the documentation provided with the download.


Input

The program accepts both fasta and multi-fasta files as input.

note: the program handles uppercase and lowercase letters, as well as the presence of N in the input sequence. It cannot handle IUPAC symbols other than the canonical bases A, C, T, G.
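
Because other IUPAC symbols are not supported, it can be useful to check a file before running an analysis. The following is a minimal sketch of such a check, not part of NeSSie; the function name unsupported_symbols is hypothetical and the code is written for Python 3.

```python
# Symbols the note above describes as supported: canonical bases and N, any case.
SUPPORTED = set("ACGTNacgtn")


def unsupported_symbols(fasta_text):
    """Return {record name: sorted unsupported symbols} for a fasta/multi-fasta string.

    Records that contain only supported symbols are omitted from the result.
    """
    problems = {}
    name = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: remember the record name for the sequence lines below.
            name = line[1:]
            continue
        bad = {ch for ch in line if ch not in SUPPORTED}
        if bad:
            problems.setdefault(name, set()).update(bad)
    return {rec: sorted(symbols) for rec, symbols in problems.items()}
```

For example, a record containing R or Y would be listed in the returned dictionary, while a record made only of A, C, G, T and N would not appear.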


Basic command lines

where -k is the length of the motifs or k-mers to be searched and N is a positive integer.

note: the arguments shown are required and must be given in the correct order; the additional arguments can be specified in any order.


Additional arguments for all searches


Additional arguments for -P/-M/-A/-L/-T

note: a maximum length is required if the -MAX parameter is used; a minimum length is required for -P/-M/-A/-T searches.


Additional arguments for -P/-M/-T


Additional arguments for -E/-L


Additional arguments for -T


Output format

Different output files are generated depending on the type of analysis.

  >SEQUENCE_1_NAME
  $|12|AGAAGAAGAAGA
  @counts: 6
  @indexes: 2|5|8|11|14|17|
  $|10|TCTTCCTTCT
  @counts: 2
  @indexes: 30|42
  >SEQUENCE_1_NAME
  !MOTIF_1_NAME
  $|3|AAA
  @counts: 3
  @indexes: 3|4|34|
  !MOTIF_2_NAME
  $|3|CCC
  @counts: 2
  @indexes: 12|20|

where $|12|AGAAGAAGAAGA reports the retrieved motif and its length, @counts: 6 reports the number of occurrences for the motif and @indexes: 2|5|8|11|14|17| reports the indexes at which the motif was found (i.e. positions in the sequence). A new block starting with >SEQUENCE_NAME is created for each of the target sequences if a multi-fasta is provided as input. !MOTIF_NAME is the name of the motif to be searched as provided in the fasta/multi-fasta file with motifs.
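
The block structure described above can be read with a few lines of Python. This is a minimal sketch based only on the sample output shown here, not the parser distributed with NeSSie; the function name parse_motif_output and the dictionary keys are hypothetical, and the code is written for Python 3.

```python
def parse_motif_output(text):
    """Parse motif-search output into a list of hit dictionaries."""
    hits = []
    sequence = None      # name from the last ">" line
    motif_name = None    # name from the last "!" line, if any
    current = None       # hit currently being filled in
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            sequence, motif_name = line[1:], None
        elif line.startswith("!"):
            motif_name = line[1:]
        elif line.startswith("$"):
            # "$|<length>|<motif>" (any further fields are ignored here)
            fields = line.split("|")
            current = {
                "sequence": sequence,
                "motif_name": motif_name,
                "length": int(fields[1]),
                "motif": fields[2],
                "counts": None,
                "indexes": [],
            }
            hits.append(current)
        elif line.startswith("@counts:") and current is not None:
            current["counts"] = int(line.split(":", 1)[1])
        elif line.startswith("@indexes:") and current is not None:
            # A trailing "|" leaves an empty last item, which is skipped.
            values = line.split(":", 1)[1].split("|")
            current["indexes"] = [int(v) for v in values if v.strip()]
    return hits
```

Each hit keeps the name of the sequence it belongs to, so the blocks of a multi-fasta run can be grouped afterwards.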

  >SEQUENCE_NAME
  $|21|AAAAAAAATAGATCAAATAAA|0101011001010110010111
  @counts: 1
  @indexes: 61577|
  $|10|AAAAAAAATA|0110010101
  @counts: 2
  @indexes: 133061|805355|

If degenerate motifs are searched, an additional field (0101011001010110010111 in the example above) is reported. This field encodes the best alignment retrieved for the corresponding sequence, and it is used by the NeSSie output parser (see next section) to explicitly print the alignment if desired. The encoding is read two digits at a time: 00 and 11 represent indels, 01 represents a match, 10 represents a mismatch.
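
As an illustration of the two-digit code, the sketch below turns an encoding string into a list of operations. It is not the routine used by NessieOutParser.py; the function name decode_alignment and the labels it returns are hypothetical, and the code is written for Python 3.

```python
# Meaning of each two-digit pair, as described in the text.
PAIR_MEANING = {"01": "match", "10": "mismatch", "00": "indel", "11": "indel"}


def decode_alignment(encoding):
    """Translate an alignment encoding into a list of operation labels."""
    if len(encoding) % 2 != 0:
        raise ValueError("encoding must contain an even number of digits")
    operations = []
    for i in range(0, len(encoding), 2):
        pair = encoding[i:i + 2]
        if pair not in PAIR_MEANING:
            raise ValueError("unexpected pair: %r" % pair)
        operations.append(PAIR_MEANING[pair])
    return operations
```

Applied to the second hit above, decode_alignment("0110010101") gives one mismatch in second position and four matches.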

  >SEQUENCE_1_NAME
  @0-91: 0.891170431211499
  >SEQUENCE_2_NAME
  @0-30: 0.9459459459459459

where @0-91 is the interval on which the score 0.891170431211499 is calculated.

  >SEQUENCE_1_NAME
  @0-30
  0       0.7755102040816326
  5       0.9795918367346939
  10      0.8775510204081632
  15      0.9387755102040817
  20      0.8775510204081632

where @0-30 is the interval on which the scores are calculated. 0 0.7755102040816326 reports the relative starting index of the window and the corresponding calculated score. A new block starting with >SEQUENCE_NAME is created for each of the target sequences if a multi-fasta is provided as input.

  >SEQUENCE_1_NAME
  @0-30
  0       0.4854752972273343      A:6     C:0     G:4     T:0
  5       0.8427376486136672      A:5     C:1     G:3     T:1
  10      0.9609640474436811      A:2     C:2     G:4     T:2
  15      0.7609640474436811      A:0     C:4     G:4     T:2
  20      0.6804820237218405      A:0     C:4     G:1     T:5

where @0-30 is the interval on which the scores are calculated. 0 0.4854752972273343 A:6 C:0 G:4 T:0 reports the relative starting index of the window, the corresponding calculated score and the base composition of the sequence. A new block starting with >SEQUENCE_NAME is created for each of the target sequences if a multi-fasta is provided as input.
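
The windowed output, with or without the base composition columns, can be loaded as follows. This is a minimal sketch based only on the samples shown above, not a script distributed with NeSSie; the function name parse_window_scores and the dictionary keys are hypothetical, and the code is written for Python 3. It does not handle the single-score format (@0-91: 0.891170431211499).

```python
def parse_window_scores(text):
    """Parse windowed score output into {sequence name: block}."""
    results = {}
    sequence = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            sequence = line[1:]
            results[sequence] = {"interval": None, "windows": []}
        elif line.startswith("@") and sequence is not None:
            # "@<start>-<end>": interval on which the scores are calculated.
            start, end = line[1:].split("-")
            results[sequence]["interval"] = (int(start), int(end))
        elif sequence is not None:
            # "<window start> <score> [A:n C:n G:n T:n]"
            fields = line.split()
            window = {
                "start": int(fields[0]),
                "score": float(fields[1]),
                "composition": {},
            }
            for item in fields[2:]:
                base, count = item.split(":")
                window["composition"][base] = int(count)
            results[sequence]["windows"].append(window)
    return results
```

When the composition columns are absent, the composition dictionary of each window is simply left empty.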

To reduce the output file, the -c flag can be used to report only counts while the -i flag can be used to report only indexes.

In addition to the output, a log file containing information on errors that occurred during the analysis is produced in the working directory.

NeSSie output parser (NessieOutParser.py)

Together with NeSSie, a python script is also provided that can be used to better organize the raw output obtained for the search of mirror and palindromic motifs, as well as motifs with a DNA-triplex forming potential.

In the presence of N, the sequence is split into blocks and each block is analysed separately. If the same motif is detected in different blocks, the hit is reported for every block, with the indexes at which it is found in that block. This can lead to redundant hits in the results. The parser joins these redundant motifs under one hit while ordering the results. The results can be ordered:

note: the score is calculated as the length of the motif minus the number of mismatches and gaps. Mismatch or gap opening scores -2, while mismatch or gap extension scores -1.
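
The note above leaves some room for interpretation. The sketch below shows one possible reading, stated here as an assumption: the score starts from the motif length, the first position of each run of consecutive mismatches or gaps subtracts 2, and every further position in the same run subtracts 1. It works on the two-digit alignment encoding described in the output format section (01 is a match, anything else is a mismatch or an indel). The function name alignment_score is hypothetical and this is not the code used by the parser; check the result against actual parser output before relying on it.

```python
def alignment_score(motif_length, encoding):
    """Score an alignment under the ASSUMED reading described in the lead-in.

    Opening a run of mismatches/gaps costs 2, extending the run costs 1.
    """
    penalty = 0
    in_run = False  # True while inside a run of mismatches or gaps
    for i in range(0, len(encoding), 2):
        pair = encoding[i:i + 2]
        if pair == "01":
            # A match closes any open run.
            in_run = False
        else:
            penalty += 1 if in_run else 2
            in_run = True
    return motif_length - penalty
```

Under this reading, a motif of length 10 with a single isolated mismatch scores 8, and the same motif with two adjacent mismatches scores 7.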

With the -a flag, the parser also generates an output in which the retrieved best alignments are printed explicitly. Finally, the -g flag generates a GFF format file to simplify the visualization of the results in a genome browser.

NessieOutParser.py -i path/input/file -o path/output/file [-c] [-a] [-g] [-s]

note: Python 2.7 is required.

To wig (to_wig.py)

Together with NeSSie, a python script is also provided that can be used to better visualize the raw output obtained for the analysis of the sequence entropy and complexity. The script generates from the output a WIG format file that can be visualized using a genome browser.

to_wig.py -i path/input/file -o path/output/file

note: Python 2.7 is required.
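
To give an idea of what a WIG track for windowed scores looks like, here is a minimal sketch that writes variableStep WIG text from (window start, score) pairs. It is not the implementation of to_wig.py. The function name windows_to_wig and its parameters are hypothetical, and it assumes that the window indexes in the NeSSie output are 0-based, so 1 is added to obtain the 1-based positions used by WIG. The code is written for Python 3.

```python
def windows_to_wig(chrom, windows, offset=0, span=1):
    """Build variableStep WIG text from (window_start, score) pairs.

    chrom   -- sequence name to use in the track declaration
    windows -- iterable of (relative window start, score), in ascending order
    offset  -- start of the interval the relative indexes refer to
    span    -- number of bases each value covers in the browser
    """
    lines = ["variableStep chrom={} span={}".format(chrom, span)]
    for start, score in windows:
        # ASSUMPTION: NeSSie indexes are 0-based; WIG positions are 1-based.
        lines.append("{} {}".format(offset + start + 1, score))
    return "\n".join(lines) + "\n"
```

The variableStep layout is used here because it does not require the windows to be evenly spaced; if the span is set larger than the distance between two windows, the covered regions overlap and a browser may reject the file.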

To tab formatted file (to_tabformat.py)

Together with NeSSie, a python script is also provided that can be used to organize the raw output obtained for the search of mirror, palindromic and DNA-triplex forming motifs in a tab formatted file. The results can be ordered by indexes (default), by score (-s) or by counts (-c).

to_tabformat.py -i path/input/file -o path/output/file [-c] [-s]

note: Python 2.7 is required.

Genome browser visualization example

To view an example of how data are visualized using a genome browser:

Alternatively, the reference genome (fasta) and the track are stored in the files directory and can be loaded in a normal IGV session.

The example shows data from the Mycobacterium bovis genome:

note: when generating the GFF, a color code is assigned to motifs based on their score. From lower to higher scores, the colors are, in order, red, yellow, blue and green.

License

Copyright (C) 2017 Michele Berselli

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

See http://www.gnu.org/licenses/ for more information.

Libraries

NeSSie uses the libraries: