This folder contains the main scripts used to generate the results presented in the paper "G-quadruplex forming sequences in the genome of all known human viruses: a comprehensive guide" by Lavezzo et al.
Each script comes with a short help guide, which can be displayed by running the script without arguments. The scripts were not intended to be released as a stand-alone package, so they are largely uncommented and not user-friendly.
Example command lines are reported below.

STEPS TO FIND THE CONSENSUS OF G4s ON A VIRUS (examples of command lines)

1) G4 PATTERN PREDICTION (two G4_coords files are generated, one for Fw and one for Rw patterns)
Example command line:

findPattern_multifasta.pl -i multiple_alignment.fasta -l 3 -n 4 -d 12 -m 45 -o output_prefix
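
As an illustration of what this step looks for, the following Python sketch searches a single ungapped sequence for G4 patterns with a regular expression. It is not the code of findPattern_multifasta.pl. It assumes that -l is the minimum G-island length, -n the number of islands, -d the maximum loop length and -m the maximum pattern length; these meanings are inferred from the example values and are not documented in this file. The function name find_g4 and its parameters are hypothetical.

```python
import re


def find_g4(seq, island_len=3, n_islands=4, max_loop=12, max_len=45, base="G"):
    """Return 1-based inclusive (start, end) pairs of putative G4 patterns.

    Hypothetical helper: use base="G" for forward patterns and base="C"
    for patterns on the reverse strand, reported in forward coordinates.
    """
    island = "%s{%d,}" % (base, island_len)      # e.g. G{3,}
    loop = "[ACGTUN]{1,%d}?" % max_loop          # short loop, matched lazily
    pattern = re.compile(
        "%s(?:%s%s){%d}" % (island, loop, island, n_islands - 1)
    )
    hits = []
    # finditer reports non-overlapping matches only (a simplification).
    for match in pattern.finditer(seq.upper()):
        if match.end() - match.start() <= max_len:
            hits.append((match.start() + 1, match.end()))
    return hits


if __name__ == "__main__":
    print(find_g4("TTGGGAGGGAGGGAGGGTT"))
```

Because only non-overlapping matches are reported and over-long matches are simply discarded, this sketch may return fewer patterns than an exhaustive search.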

2) CONSERVATION OF G4s (takes the coords generated at the previous step and generates a file containing the G4s that are conserved above the selected threshold; the file has 5 columns: start, end, max_conservation %, num of viral sequences with the G4, num of total viral sequences considered)

FROM THIS SCRIPT, THE COORDINATES OF G4 PATTERNS IN MULTIPLE ALIGNMENTS ARE OBTAINED

*In some cases the "num of viral sequences with the G4" might be greater than the "num of total viral sequences considered". This is due to the boundaries of the multiple sequence alignments; the "max_conservation %" value is still correct.

matrixScanner_g4.pl -f multiple_alignment.fasta -c coords.G -t 10 -o output_directory_Fw
matrixScanner_g4.pl -f multiple_alignment.fasta -c coords.C -t 10 -o output_directory_Rw
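
The idea of a conservation scan can be sketched in Python as follows. This is only one plausible reading of the step, not the logic of matrixScanner_g4.pl: it counts, for every alignment column, how many sequences have a G4 covering that column, and reports runs of columns whose percentage reaches the threshold. The function name conserved_regions, the input layout (a dict of per-sequence G4 intervals in alignment coordinates) and the use of ">=" for the threshold are all assumptions.

```python
def conserved_regions(g4_by_seq, aln_len, threshold=10.0):
    """Return (start, end, max_conservation_pct, n_with_g4, n_total) tuples.

    g4_by_seq: hypothetical dict {sequence_id: [(start, end), ...]} with
    1-based inclusive alignment columns; sequences without G4s map to [].
    """
    total = len(g4_by_seq)
    if total == 0:
        return []
    counts = [0] * (aln_len + 1)  # index 0 unused, columns are 1-based
    for intervals in g4_by_seq.values():
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end + 1))
        for col in covered:       # each sequence counts once per column
            counts[col] += 1
    regions = []
    run_start = None
    best = 0
    for col in range(1, aln_len + 1):
        if 100.0 * counts[col] / total >= threshold:
            if run_start is None:
                run_start, best = col, counts[col]
            else:
                best = max(best, counts[col])
        elif run_start is not None:
            regions.append((run_start, col - 1, 100.0 * best / total, best, total))
            run_start = None
    if run_start is not None:     # close a run that reaches the last column
        regions.append((run_start, aln_len, 100.0 * best / total, best, total))
    return regions
```

The returned tuples mirror the five output columns described above (start, end, max_conservation %, sequences with the G4, total sequences).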

3) REMOVE THE GAPS FROM THE COORDINATES OF G4s IN THE RESULTS, SO THAT THE NEW COORDINATES ARE RELATIVE TO THE REPRESENTATIVE SEQUENCE FOR EACH VIRUS AS REPORTED IN THE "Virus.txt" FILE (takes the conservation file generated at the previous step)

FROM THIS SCRIPT, THE COORDINATES OF G4 PATTERNS IN THE REFERENCE SEQUENCE ARE OBTAINED (Please remember that some G4 patterns might not be present in the reference sequence, especially poorly conserved G4s)
Example command lines (they require the G4 file produced in the previous step):

convert_G4_coords_2_no_gaps.pl -m Virus.txt -g4 conservation_G -a alignment_directory -o output_file
convert_G4_coords_2_no_gaps.pl -m Virus.txt -g4 conservation_C -a alignment_directory -o output_file
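
The coordinate conversion performed in this step can be illustrated with a short Python sketch. It is not the code of convert_G4_coords_2_no_gaps.pl; the function name ungapped_coords is hypothetical, and the sketch assumes that gaps are written as "-" and that coordinates are 1-based and inclusive.

```python
def ungapped_coords(aligned_ref, start, end, gap="-"):
    """Map alignment columns [start, end] onto the ungapped reference.

    Returns a 1-based inclusive (start, end) pair on the reference
    sequence, or None when the interval covers only gap characters.
    """
    pos = 0                       # position on the ungapped reference
    new_start = new_end = None
    for col, char in enumerate(aligned_ref, start=1):
        if char != gap:
            pos += 1
            if start <= col <= end:
                if new_start is None:
                    new_start = pos
                new_end = pos
        if col >= end:
            break
    if new_start is None:
        return None
    return new_start, new_end


if __name__ == "__main__":
    # Columns 5-7 of the aligned reference are bases 3-5 once gaps are removed.
    print(ungapped_coords("AC--GGGT-A", 5, 7))
```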

4) VERIFY THE OVERLAP BETWEEN G4 AND GENOMIC FEATURES (takes the output from the previous step and returns a file of results and one of statistics)
Example command line (it requires the G4 coordinates relative to the reference sequence and a file containing the number of features for each virus, provided in the folder "Feature_numbers"):

g4_on_genomic_features.py G4_coords ./Feature_numbers/ output_file output_file_stats -i 60
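
The overlap test at the core of this step can be sketched in Python as follows. This is not the code of g4_on_genomic_features.py: the function name overlapping_features and the (start, end, name) feature layout are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
def overlapping_features(g4, features):
    """Return the names of the features that overlap a G4 interval.

    g4: (start, end) on the reference sequence.
    features: hypothetical list of (start, end, name) tuples.
    """
    g4_start, g4_end = g4
    return [
        name
        for feat_start, feat_end, name in features
        # Two closed intervals overlap when each starts before the other ends.
        if feat_start <= g4_end and g4_start <= feat_end
    ]


if __name__ == "__main__":
    example = [(1, 100, "5'UTR"), (101, 900, "CDS"), (901, 950, "3'UTR")]
    print(overlapping_features((95, 110), example))
```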

---------------------------------------------------------------

R SCRIPT TO GENERATE SIMULATED VIRAL GENOMES RESHUFFLED AT THE NUCLEOTIDE LEVEL

simulate Human viruses list_conservation G4.R
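
For readers who want to see the principle, reshuffling at the nucleotide level can be sketched in a few lines. The sketch below is written in Python rather than R and is not taken from the script above; the function name shuffle_nucleotides is hypothetical. It keeps the base composition and length of the genome and randomises only the order of the nucleotides.

```python
import random


def shuffle_nucleotides(seq, rng=None):
    """Return a random permutation of the nucleotides in seq."""
    rng = rng or random.Random()  # pass a seeded Random for reproducibility
    bases = list(seq)
    rng.shuffle(bases)            # in-place Fisher-Yates shuffle
    return "".join(bases)


if __name__ == "__main__":
    print(shuffle_nucleotides("ACGGGTTGGGGA", random.Random(1)))
```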

PYTHON SCRIPT TO GENERATE SIMULATED VIRAL GENOMES BY RESHUFFLING THE G-ISLANDS (DIMERS, TRIMERS AND TETRAMERS)
Example command line (it requires the multiple alignment file in fasta format and a file with the calculated number of G-islands in the reference sequence, provided in this folder):

reshuffle_G_islands.py -s multiple_alignment.fas -l 3 -n 10000 -r Stats_TRIMERS_reference.txt
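
One simple way to reshuffle a sequence while keeping G-islands intact is sketched below. It is not the algorithm of reshuffle_G_islands.py, whose details are not described in this file: the function name reshuffle_g_islands is hypothetical, and the sketch assumes that -l is the island length (3 for trimers). The sequence is cut into tokens, each token being either a run of exactly island_len Gs or a single nucleotide, and the tokens are then shuffled.

```python
import random
import re


def reshuffle_g_islands(seq, island_len=3, rng=None):
    """Shuffle a sequence treating each G-island as an indivisible unit."""
    rng = rng or random.Random()  # pass a seeded Random for reproducibility
    ungapped = seq.upper().replace("-", "")
    # Each token is an island of island_len Gs, or else a single character.
    tokens = re.findall("G{%d}|." % island_len, ungapped)
    rng.shuffle(tokens)
    return "".join(tokens)


if __name__ == "__main__":
    print(reshuffle_g_islands("ACGGGTTGGGGA", 3, random.Random(0)))
```

Note that shuffled tokens can end up next to each other and form longer G runs, so this sketch preserves the base composition and the original island tokens but does not guarantee an exact island count.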