Human T-lymphotropic virus 2


Conservation of G-quadruplexes and virus genomes

A conserved G-quadruplex (G4) pattern within a less conserved genomic environment may indicate a G4 with a biological function. Here, we report in graphical form the relationship between these two features, calculated across the available strains. To allow a fast evaluation of G4 conservation in the local genomic context, the “G4 scaffold conservation index” (G4_SCI) was computed. This value measures the degree of conservation of the G-islands that are necessary and sufficient to form a G4: the higher the score, the higher the conservation of the G4 pattern. All G4s are plotted as vertical bars, whose height represents the G4_SCI (y-axis) and whose position represents the genome coordinates (x-axis). In addition, the local sequence conservation (LSC) of the viral genome, the sequence complexity and the sequence entropy are reported alongside.

  • Local sequence conservation: the sequence conservation among all strains available for each virus, calculated with a sliding window approach.
  • Sequence entropy: the Shannon entropy, which measures imbalances in the base composition of the sequence, calculated with a sliding window approach.
  • Sequence complexity: a measure of the “vocabulary richness” of a sequence, calculated from the level of repetition of its k-mers (words of length k) in a given window. The more complex the sequence, the richer the oligonucleotide vocabulary it contains.
  • Combined score: the mean of sequence entropy and sequence complexity.
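The three sequence-based tracks can be illustrated with a minimal Python sketch. It is not the code used to produce the charts: the window size, the step, the value of k, the rescaling of the entropy to the 0-1 range and the definition of complexity as the fraction of distinct k-mers are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the base composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def kmer_complexity(window, k=3):
    """Fraction of distinct k-mers over the maximum possible in the window.

    Assumed definition: 1.0 means no k-mer is repeated (rich vocabulary),
    values close to 0 mean the window is highly repetitive.
    """
    n_kmers = len(window) - k + 1
    if n_kmers <= 0:
        return 0.0
    distinct = {window[i:i + k] for i in range(n_kmers)}
    return len(distinct) / min(n_kmers, 4 ** k)


def sliding_scores(seq, window=20, step=1, k=3):
    """Return (start, entropy, complexity, combined) for each window."""
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        # Assumption: divide by 2 bits (the maximum for a 4-letter alphabet)
        # so that entropy and complexity are on the same 0-1 scale.
        entropy = shannon_entropy(w) / 2.0
        complexity = kmer_complexity(w, k)
        rows.append((start, entropy, complexity, (entropy + complexity) / 2))
    return rows
```

For example, `sliding_scores(genome, window=100, step=10)` would return one tuple per window, which can then be plotted against the window start coordinate.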


note 1: the graphs are interactive. Clicking on the legend adds or removes tracks, while hovering over the curves returns point values. It is also possible to zoom in by selecting a specific area of the graphs.

note 2: when more than four G-islands are found complying with the maximum distance allowed between consecutive islands, only one bar is plotted in the middle of the region. Nevertheless, multiple distinct G4s or isoforms could form.

note 3: further links are available at the top of each chart, redirecting to interactive pages for the visualization of G-quadruplexes in the multiple alignments and in the annotated reference sequences. Please note that the boundaries of G4 regions are determined from the multiple alignments, so some of the islands might not be present in the reference genome itself. Consider this when visualizing the annotated reference.
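The grouping of G-islands described in note 2 can be sketched as follows. This is a simplified illustration on a single sequence and a single strand, not the procedure used for the charts: the maximum distance between consecutive islands (`max_gap=7`) is an assumed value, loops of length 0 are allowed for simplicity, and the function names are hypothetical.

```python
import re


def find_islands(seq, island_len=2):
    """Start positions of non-overlapping runs of `island_len` Gs."""
    pattern = re.compile("G{%d}" % island_len)
    return [m.start() for m in pattern.finditer(seq.upper())]


def g4_regions(seq, island_len=2, max_gap=7):
    """Group consecutive islands into putative G4 regions.

    A region is reported when at least four islands follow each other
    with no more than `max_gap` bases between one island and the next.
    Returns (start, end, number_of_islands) tuples; `end` is exclusive.
    """
    regions, chain = [], []
    for start in find_islands(seq, island_len):
        # Close the current chain when the next island is too far away.
        if chain and start - (chain[-1] + island_len) > max_gap:
            if len(chain) >= 4:
                regions.append((chain[0], chain[-1] + island_len, len(chain)))
            chain = []
        chain.append(start)
    if len(chain) >= 4:
        regions.append((chain[0], chain[-1] + island_len, len(chain)))
    return regions
```

A region with more than four islands is returned as a single tuple, mirroring the single bar plotted in the middle of the region; the negative strand could be handled by running the same search on the reverse complement.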


GG islands

G-Quadruplexes on the MULTIPLE ALIGNMENT:
        - full sequence

G-Quadruplexes on the ANNOTATED REFERENCE:
        - full sequence



GGG islands

G-Quadruplexes on the MULTIPLE ALIGNMENT:
        - full sequence

G-Quadruplexes on the ANNOTATED REFERENCE:
        - full sequence



GGGG islands

G-Quadruplexes on the MULTIPLE ALIGNMENT:
        - full sequence

G-Quadruplexes on the ANNOTATED REFERENCE:
        - full sequence




Statistical evidence for the calculated G4s in the human virus genomes

The presence of G4 patterns may be largely affected by G/C content, which greatly varies in viral genomes. To check whether the presence of putative G4s is potentially relevant or whether it occurs by pure chance, different simulated datasets were generated for each virus as follows:

  • 10,000 sequences, with the same length as the reference, randomized at the single nucleotide level (same composition but different order of nucleotides)
  • 10,000 sequences, with the same length as the reference, where GG island positions in the positive strand were randomized
  • 10,000 sequences, with the same length as the reference, where GG island positions in the negative strand were randomized
  • 10,000 sequences, with the same length as the reference, where GGG island positions in the positive strand were randomized
  • 10,000 sequences, with the same length as the reference, where GGG island positions in the negative strand were randomized
  • 10,000 sequences, with the same length as the reference, where GGGG island positions in the positive strand were randomized
  • 10,000 sequences, with the same length as the reference, where GGGG island positions in the negative strand were randomized
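The first dataset in the list (randomization at the single nucleotide level) can be sketched in a few lines of Python. The function name and the fixed seed are assumptions made for reproducibility of the example. The island-position randomizations are not sketched here, because the exact procedure is not described on this page.

```python
import random


def shuffled_sequences(reference, n=10_000, seed=0):
    """Yield `n` sequences with the same length and base composition as
    `reference`, but with the nucleotides in random order."""
    rng = random.Random(seed)  # assumed fixed seed, for reproducibility
    bases = list(reference)
    for _ in range(n):
        rng.shuffle(bases)  # in-place permutation of the nucleotides
        yield "".join(bases)
```

Each yielded sequence can then be scanned for putative G4s in the same way as the real genome, giving one simulated G4 count per sequence.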

The presence of G4s was evaluated in all of these simulated sequences and the results were compared with those obtained from real data. The significance of the number of G4s observed in the real viral genome, with respect to the simulated sequences, is calculated with the mid-P value. The results are presented as segment diagrams, where each segment refers to one of the three G-island types (GG, GGG, GGGG) considered in the positive and negative strands, while its length corresponds to 1 minus the mid-P value. Full segments indicate highly significant G4s, whereas null segments indicate non-significant G4s. The significance threshold is set to 0.1 (0.05 for each tail) and is shown as a grey area within the charts (slices that exceed this area are significant). The color indicates the tail of the distribution to which the mid-P value refers: green means that G4s are enriched in the real sequences, red that they are enriched in the simulated sequences.
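As an illustration, an empirical mid-P value can be computed from the simulated G4 counts as sketched below. The sketch uses the standard definition of the mid-P value (the probability of a more extreme count plus half the probability of an equal count); whether the charts were produced exactly this way is an assumption, and the function names are hypothetical.

```python
def mid_p_values(observed, simulated_counts):
    """Empirical mid-P values for both tails.

    upper: small when the real genome has more G4s than the simulations.
    lower: small when the real genome has fewer G4s than the simulations.
    """
    n = len(simulated_counts)
    greater = sum(c > observed for c in simulated_counts)
    equal = sum(c == observed for c in simulated_counts)
    less = n - greater - equal
    upper = (greater + 0.5 * equal) / n
    lower = (less + 0.5 * equal) / n
    return upper, lower


def classify(observed, simulated_counts, alpha_per_tail=0.05):
    """Return (tail, segment_length, is_significant) for one segment."""
    upper, lower = mid_p_values(observed, simulated_counts)
    if upper <= lower:
        tail, p = "enriched in real", upper        # green segment
    else:
        tail, p = "enriched in simulated", lower   # red segment
    return tail, 1 - p, p < alpha_per_tail
```

With 10,000 simulated counts per dataset, `classify(observed, counts)` would give the color, the length (1 minus the mid-P value) and the significance of the corresponding segment.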


Randomization at the single nucleotide level



Randomization of GG, GGG and GGGG islands





Dept. Molecular Medicine, University of Padova

contact: enrico.lavezzo@unipd.it, berselli.michele@gmail.com

g4virus/htlv_2.txt · Last modified: 2018/07/31 12:43 (external edit)