FunTaxIS-lite working pipeline

**Figure 1a:** Schematic representation of the FunTaxIS-lite pipeline. The picture shows the different steps necessary to generate the taxon constraints.

1 GOA database cleaning
The first step cleans the raw Gene Ontology Annotation (GOA) database, which the Gene Ontology Consortium (GOC) provides in GO Annotation File (GAF) format. Cleaning removes data that are either uninformative or peculiar. Specifically, the following annotations are removed from the database: those with the curator evidence code “ND” (No biological Data available), those with the “NOT” qualifier (the protein does NOT perform a specific function), those to the root GO terms (GO:0005575, GO:0008150 and GO:0003674), and those tagged “RNAcentral” or “environmental samples”.
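
The cleaning rules above can be sketched as a line filter over a GAF file. This is a minimal illustration, not the FunTaxIS-lite implementation: it assumes the standard tab-separated GAF 2.x column order (DB in column 1, Qualifier in column 4, GO ID in column 5, Evidence Code in column 7, Taxon in column 13) and assumes that the “RNAcentral” tag appears in the DB column. How “environmental samples” entries are detected is not stated here, so the sketch takes a hypothetical set of taxon IDs to exclude (`excluded_taxa`).

```python
# Root terms named in the text: cellular_component, biological_process, molecular_function.
ROOT_TERMS = {"GO:0005575", "GO:0008150", "GO:0003674"}


def keep_annotation(line, excluded_taxa=frozenset()):
    """Return True if a GAF line survives the cleaning rules.

    `excluded_taxa` is a hypothetical set of taxon ID strings standing in
    for the "environmental samples" filter.
    """
    if line.startswith("!"):            # GAF header / comment lines
        return False
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 15:                  # malformed row (GAF has at least 15 columns)
        return False
    db, qualifier, go_id, evidence, taxon = (
        cols[0], cols[3], cols[4], cols[6], cols[12]
    )
    if evidence == "ND":                # No biological Data available
        return False
    if "NOT" in qualifier.split("|"):   # negated annotation, e.g. "NOT|enables"
        return False
    if go_id in ROOT_TERMS:             # uninformative root terms
        return False
    if db == "RNAcentral":              # assumption: tag is stored in the DB column
        return False
    # Taxon column looks like "taxon:9606" or "taxon:9606|taxon:10090".
    taxa = {t.replace("taxon:", "") for t in taxon.split("|")}
    if taxa & set(excluded_taxa):
        return False
    return True
```

A cleaned file could then be produced with something like `cleaned = [l for l in open(path) if keep_annotation(l)]`, where `path` points to a local GAF file.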

2 Taxonomic Reference Nodes determination
A pivotal step in the FunTaxIS-lite pipeline is determining a list of taxonomic reference nodes, which are used to group organisms that share common biological features, together with their respective annotation contributions. For each reference taxonomic node, a list of taxonomic constraints is then produced and inherited by all organisms subsumed by that node. Our approach to identifying reference taxonomic nodes for functional constraints balances two objectives:
1) having a reliable set of constraints for each species subsumed by the corresponding reference taxon;
2) covering a broad range of the taxonomy hierarchy.
This resulted in a total of 240 reference taxonomic nodes, representative of all the metabolic/signaling pathways performed by the organisms they subsume. For each of them it is crucial to have a solid list of disallowed functions (e.g. photosynthesis for animals or nervous system for plants). However, because annotations are unevenly distributed in the GOA, annotation coverage varies widely among taxonomic ranks, leading to two categories of reference taxonomic nodes:
1) "Reliable taxonomic nodes" are highly representative nodes in the taxonomy hierarchy that group well-annotated branches thanks to the presence of model organisms that are extensively studied and rich in functional features.
2) "Unreliable taxonomic nodes" are generic nodes in the taxonomy hierarchy with the purpose of grouping poorly annotated branches with limited available knowledge to generate a strong set of constraints.

3 Grouping GOs and cumulative frequencies calculation
The annotations present in the GOA database are grouped by organism, and for each organism the list of all associated GO terms is produced, including details about frequency, evidence code, and ontology. All species and their annotations are then traced back to their “closest” reference parent node. Subsequently, for each GO term of the reference node, the cumulative frequency over its descendants is calculated.
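
The grouping and accumulation described above can be sketched as follows. This is an illustrative reading rather than the actual FunTaxIS-lite code: it assumes that the “closest” reference node is the first reference node met when walking up the taxonomy, and that “descendants” refers to the descendants of a GO term in the GO graph. The function names, the `parent` and `children` mappings, the placeholder GO IDs and the simplified parent links are all hypothetical.

```python
from collections import Counter, defaultdict


def closest_reference(taxid, parent, reference_nodes):
    """Walk up the taxonomy until the first reference node is reached."""
    node = taxid
    while node not in reference_nodes:
        up = parent.get(node)
        if up is None or up == node:    # ran off the top of the tree
            return None
        node = up
    return node


def direct_counts(annotations, parent, reference_nodes):
    """Count (taxid, go_id) annotations per closest reference node."""
    counts = defaultdict(Counter)
    for taxid, go_id in annotations:
        ref = closest_reference(taxid, parent, reference_nodes)
        if ref is not None:
            counts[ref][go_id] += 1
    return counts


def go_descendants(term, children):
    """All descendants of a GO term, each visited once (the GO is a DAG)."""
    seen, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def cumulative_counts(counts, children):
    """Cumulative frequency = own count + counts of all GO descendants."""
    terms = set(counts) | set(children)
    for kids in children.values():
        terms.update(kids)
    return {
        t: counts.get(t, 0)
        + sum(counts.get(d, 0) for d in go_descendants(t, children))
        for t in terms
    }


# Toy data: simplified parent links, placeholder GO IDs.
parent = {9606: 9443, 9443: 1, 10090: 9989, 9989: 1, 1: 1}
reference_nodes = {1, 9443, 9989}
children = {"GO:A": ["GO:B", "GO:C"], "GO:B": ["GO:C"]}
annotations = [(9606, "GO:B"), (9606, "GO:C"), (10090, "GO:C")]

per_node = direct_counts(annotations, parent, reference_nodes)
primates = cumulative_counts(per_node[9443], children)
# primates == {"GO:A": 2, "GO:B": 2, "GO:C": 1}
```

Collecting descendants into a set before summing avoids double counting a term that is reachable through more than one parent.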

4 Creation of “never-in” GO Taxon Constraints
Once the cumulative frequency of each GO term is obtained for every reference taxon node, the “never-in” constraints are generated following two main steps:

● Cut-off 500: Only GO terms whose cumulative frequency in the whole GOA is >= 500 are considered. The cut-off of 500 was chosen after testing a range of thresholds (see Supplement 3); it guarantees that a GO term is reasonably widespread in the GOA and can therefore be considered a candidate for which a constraint can be defined for certain species. A GO term whose cumulative frequency in a reference node is 0 is then tagged as “never-in” for that reference node. Figure S1b illustrates how the cumulative frequency of each GO term for each species contributes to its taxonomic reference node.

● Propagation: the “never-in” constraints generated in the previous step are then propagated down the GO graph following the “true path rule” that governs the graph, i.e. if a parent node is “false” then its child nodes inherit the “false” property. As a result, GO terms that may have been discarded by the aforementioned cut-off of 500 can be recovered as “never-in” for that taxonomic reference node (Figure S1c).
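
The two steps above can be sketched for a single reference node as follows. This is a minimal illustration under stated assumptions, not the tool's own code: `global_cum` and `node_cum` are hypothetical dictionaries holding the cumulative frequency of each GO term in the whole GOA and in the reference node respectively, `children` is a hypothetical mapping from a GO term to its direct child terms, and the GO IDs and counts are placeholders.

```python
CUTOFF = 500  # threshold stated in the text


def never_in(global_cum, node_cum, children, cutoff=CUTOFF):
    """Return the set of GO terms tagged "never-in" for one reference node."""
    # Step 1 (cut-off): terms widespread in the whole GOA (>= cutoff)
    # whose cumulative frequency in this reference node is 0.
    seeds = {
        term
        for term, total in global_cum.items()
        if total >= cutoff and node_cum.get(term, 0) == 0
    }
    # Step 2 (propagation): every descendant of a "never-in" term
    # inherits the tag, even if it was below the cut-off itself.
    result, stack = set(seeds), list(seeds)
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


# Toy example with placeholder terms.
global_cum = {"GO:A": 800, "GO:B": 120, "GO:C": 900}
node_cum = {"GO:C": 15}              # GO:A and GO:B never seen in this node
children = {"GO:A": ["GO:B"]}

constraints = never_in(global_cum, node_cum, children)
# constraints == {"GO:A", "GO:B"}; GO:B is recovered by propagation
```

In this toy case GO:B falls below the cut-off on its own, but is still tagged because its parent GO:A is “never-in” for the node.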

5 Merging automatic, consortium and manual constraints
In the final step of FunTaxIS-lite, the automatic constraints generated by the program are combined with those provided by the Gene Ontology Consortium (referred to as "consortium constraints"). The consortium constraints take precedence over the automatic ones and can override them as needed. Whereas FunTaxIS-lite generates only "never-in" constraints, the consortium also provides "only-in" constraints. An "only-in" constraint states that a GO term can be used to annotate only organisms from a specific taxon, which automatically makes that GO term "never-in" for all other species in the taxonomy tree. The "only-in" constraints are therefore converted into "never-in" constraints and added to those generated by FunTaxIS-lite, increasing the total number of constraints. Additionally, to correct errors caused by incorrect annotations in the GOA, a brief list of "manual constraints" is created based on first-hand observation of annotation issues. These constraints are given the highest priority. In summary, the tool assigns priorities to the different types of constraints as follows: "manual constraints" have the highest priority, followed by "consortium constraints", and finally "automatic constraints" have the lowest priority.
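
The conversion of "only-in" constraints and the priority-based merge can be sketched as below. The data representation is an assumption made for illustration: each source is a hypothetical dictionary mapping a `(go_id, taxid)` pair to `True` ("never-in") or `False` (explicitly allowed), so that a higher-priority source can override a lower-priority one. The sketch also assumes that a taxon lying above the "only-in" taxon is left unconstrained, since it contains the allowed subtree. The GO IDs are placeholders and the parent links are simplified.

```python
def lineage(taxid, parent):
    """Return taxid followed by all of its ancestors up to the root."""
    out = [taxid]
    while parent.get(out[-1]) not in (None, out[-1]):
        out.append(parent[out[-1]])
    return out


def only_in_to_never_in(go_id, only_taxon, taxa, parent):
    """Convert one "only-in" constraint into "never-in" pairs."""
    ancestors = set(lineage(only_taxon, parent))
    blocked = set()
    for taxid in taxa:
        if only_taxon in lineage(taxid, parent):   # inside the allowed subtree
            continue
        if taxid in ancestors:                     # contains the allowed subtree
            continue
        blocked.add((go_id, taxid))
    return blocked


def merge(automatic, consortium, manual):
    """Merge the three sources; later (higher-priority) sources win."""
    merged = {}
    for source in (automatic, consortium, manual):  # lowest to highest priority
        merged.update(source)
    return {key for key, is_never_in in merged.items() if is_never_in}


# Toy taxonomy (simplified parent links).
parent = {3193: 2759, 33208: 2759, 2759: 1, 2: 1, 1: 1}
converted = only_in_to_never_in("GO:X", 3193, [3193, 33208, 2, 2759], parent)
# converted == {("GO:X", 33208), ("GO:X", 2)}

automatic = {("GO:Y", 2): True, ("GO:W", 2): True}
consortium = {("GO:Y", 2): False, **{pair: True for pair in converted}}
manual = {("GO:Z", 33208): True}
final = merge(automatic, consortium, manual)
```

Here the consortium entry for `("GO:Y", 2)` overrides the automatic one, so that pair is absent from `final`, while the manual entry is always retained.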

Taxonomic Reference Nodes

All the 171 Taxonomic Reference Nodes are reported in the table below. "Reliable taxonomic nodes" are highly representative nodes in the taxonomy hierarchy that group well-annotated branches thanks to the presence of model organisms that are extensively studied and rich in functional features. "Unreliable taxonomic nodes" are generic nodes in the taxonomy hierarchy with the purpose of grouping poorly annotated branches with limited available knowledge to generate a strong set of constraints.

Taxon Reference Node ID	Taxon Reference Name	Node type
1	root	unreliable
10239	Viruses	unreliable
74201	Verrucomicrobia	reliable
1890424	Synechococcales	reliable
1150	Oscillatoriales	reliable
2726947	Mycosphaerellales	unreliable
2607030	Caprimulgimorphae	unreliable
204455	Rhodobacterales	unreliable
1385	Bacillales	reliable
8783	Palaeognathae	unreliable
1489892	Anabantaria	reliable
91561	Artiodactyla	reliable
84998	Coriobacteriia	unreliable
2759	Eukaryota	unreliable
69541	Desulfuromonadales	unreliable
5125	Hypocreales	reliable
3524	Caryophyllales	unreliable
85012	Streptosporangiales	reliable
33630	Alveolata	unreliable
85006	Micrococcales	reliable
85011	Streptomycetales	reliable
72041	Eumalacostraca	unreliable
1644060	Natrialbales	reliable
6448	Gastropoda	unreliable
2	Bacteria	unreliable
85008	Micromonosporales	reliable
2499411	Articulavirales	unreliable
1521262	Polypodiidae	unreliable
6072	Eumetazoa	unreliable
3502	Fagales	unreliable
206351	Neisseriales	unreliable
85004	Bifidobacteriales	reliable
1489904	Carangaria	unreliable
6447	Mollusca	reliable
1644055	Haloferacales	reliable
6157	Platyhelminthes	unreliable
8825	Neognathae	reliable
2691354	Pirellulales	unreliable
1853229	Chitinophagales	reliable
135614	Xanthomonadales	unreliable
33554	Carnivora	reliable
33341	Polyneoptera	unreliable
29	Myxococcales	unreliable
766	Rickettsiales	unreliable
5819	Haemosporida	reliable
28211	Alphaproteobacteria	unreliable
186802	Eubacteriales	reliable
222544	Sordariomycetidae	reliable
4069	Solanales	reliable
7147	Diptera	reliable
6935	Ixodida	reliable
526525	Erysipelotrichales	unreliable
8976	Galliformes	unreliable
9126	Passeriformes	reliable
5178	Helotiales	unreliable
92860	Pleosporales	unreliable
9205	Pelecaniformes	unreliable
9443	Primates	reliable
200644	Flavobacteriales	reliable
135622	Alteromonadales	unreliable
118964	Deinococcales	reliable
1970189	Marinilabiliales	reliable
8459	Testudines	unreliable
2836	Bacillariophyta	unreliable
72274	Pseudomonadales	reliable
9397	Chiroptera	reliable
4892	Saccharomycetales	reliable
1643688	Leptospirales	unreliable
4734	commelinids	unreliable
186628	Characiphysae	unreliable
1118	Chroococcales	reliable
117571	Euteleostomi	reliable
1338369	Dipnotetrapodomorpha	unreliable
72273	Thiotrichales	unreliable
3699	Brassicales	reliable
4055	Gentianales	unreliable
41705	Protacanthopterygii	reliable
7399	Hymenoptera	reliable
94695	Methanosarcinales	unreliable
356	Hyphomicrobiales	unreliable
75739	Eucoccidiorida	reliable
32003	Nitrosomonadales	unreliable
1737405	Tissierellales	unreliable
135613	Chromatiales	unreliable
9989	Rodentia	reliable
80840	Burkholderiales	unreliable
41944	Myrtales	unreliable
1161	Nostocales	reliable
1437197	Petrosaviidae	unreliable
29000	Pucciniomycotina	unreliable
768507	Cytophagales	reliable
452284	Ustilaginomycotina	unreliable
5338	Agaricales	reliable
6236	Rhabditida	reliable
118969	Legionellales	unreliable
183924	Thermoprotei	unreliable
2235	Halobacteriales	reliable
314145	Laurasiatheria	unreliable
85009	Propionibacteriales	reliable
34395	Chaetothyriales	unreliable
213118	Desulfobacterales	unreliable
41938	Malvales	reliable
1760	Actinomycetia	reliable
186826	Lactobacillales	reliable
203491	Fusobacteriales	reliable
5234	Tremellales	unreliable
2283794	Methanomada group	unreliable
1643682	Geodermatophilales	reliable
2085	Mycoplasmatales	reliable
135624	Aeromonadales	unreliable
41768	Ranunculales	unreliable
204458	Caulobacterales	unreliable
68295	Thermoanaerobacterales	reliable
1706369	Cellvibrionales	unreliable
3744	Rosales	unreliable
3313	Pinidae	unreliable
7524	Hemiptera	reliable
28738	Cyprinodontiformes	reliable
136	Spirochaetales	unreliable
8906	Charadriiformes	unreliable
72025	Fabales	reliable
2731360	Heunggongvirae	unreliable
41937	Sapindales	unreliable
9263	Metatheria	unreliable
38820	Poales	reliable
200795	Chloroflexi	reliable
73496	Asparagales	unreliable
4209	Asterales	unreliable
71239	Cucurbitales	unreliable
5042	Eurotiales	unreliable
311790	Afrotheria	reliable
171549	Bacteroidales	reliable
33183	Onygenales	unreliable
2045261	Rhodymeniophycidae	unreliable
213115	Desulfovibrionales	unreliable
909929	Selenomonadales	unreliable
7088	Lepidoptera	reliable
2037	Actinomycetales	reliable
8111	Perciformes	reliable
2692248	core chlorophytes	unreliable
85007	Corynebacteriales	reliable
1489922	Eupercaria	unreliable
1028384	Glomerellales	reliable
135625	Pasteurellales	unreliable
3193	Embryophyta	reliable
135619	Oceanospirillales	unreliable
1489911	Cichliformes	unreliable
232347	Magnoliidae	unreliable
135623	Vibrionales	unreliable
2732396	Orthornavirae	unreliable
91347	Enterobacterales	unreliable
2704949	Trypanosomatida	reliable
203683	Planctomycetia	unreliable
7041	Coleoptera	reliable
8342	Anura	unreliable
41945	Ericales	unreliable
4036	Apiales	unreliable
2887326	Moraxellales	unreliable
204441	Rhodospirillales	unreliable
213849	Campylobacterales	unreliable
85010	Pseudonocardiales	reliable
8509	Squamata	reliable
9108	Gruiformes	unreliable
3646	Malpighiales	unreliable
200666	Sphingobacteriales	reliable
8826	Anseriformes	unreliable
4143	Lamiales	unreliable
1843489	Veillonellales	unreliable
204457	Sphingomonadales	unreliable
206389	Rhodocyclales	unreliable
1236	Gammaproteobacteria	unreliable
4776	Peronosporales	unreliable

Efficacy of the taxonomic constraints: some examples

The output generated by FunTaxIS-lite is useful for both curators and biologists i) to investigate specific functions oddly absent in some taxa, ii) to spot and then remove possible errors in the database, and iii) to refine the output of automatic protein function prediction tools. For the latter, FunTaxIS-lite has been assessed using Argot2.5 (Lavezzo et al.), our in-house tool for automated protein function prediction. We chose 4 different species (Latimeria chalumnae, Amborella trichopoda, Pseudomonas fluorescens SBW25 and Saccharomyces kudriavzevii IFO 1802) and performed whole-proteome annotation for each of them while simulating knowledge loss for these organisms. The predicted GO terms were then filtered using the FunTaxIS-lite constraints.
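
The filtering step can be pictured with a short sketch. It does not reflect the Argot2.5 output format, which is not described here: predictions are assumed to be hypothetical `(protein_id, go_id, score)` tuples, and `never_in_terms` is assumed to be the set of "never-in" GO terms that applies to the species being annotated.

```python
def filter_predictions(predictions, never_in_terms):
    """Split predictions into those kept and those removed by the constraints."""
    kept, removed = [], []
    for pred in predictions:
        go_id = pred[1]
        (removed if go_id in never_in_terms else kept).append(pred)
    return kept, removed


# Toy example: protein IDs and scores are made up for illustration.
never_in_terms = {"GO:0015979", "GO:0009579"}
predictions = [
    ("protein_1", "GO:0015979", 0.8),   # hits a "never-in" term, removed
    ("protein_1", "GO:0006412", 0.9),   # not constrained, kept
    ("protein_2", "GO:0009579", 0.4),   # hits a "never-in" term, removed
]
kept, removed = filter_predictions(predictions, never_in_terms)
```

The length of `removed` corresponds to the kind of count reported for each species in the examples.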

For Latimeria chalumnae (NCBI taxid 7897), FunTaxIS-lite generated 5314 “never-in” constraints; 452 of them appeared in Argot2.5’s predictions and were therefore filtered out. Some examples:
- GO:0006696, ergosterol biosynthetic process
- GO:0044823, retroviral integrase activity
- GO:0033072, vancomycin biosynthetic process
- GO:0039654, fusion of virus membrane with host endosome membrane
- GO:0009011, starch synthase activity
- GO:1900231, regulation of single-species biofilm formation on inanimate substrate
- GO:0039693, viral DNA genome replication
- GO:0009372, quorum sensing
- GO:0030115, S-layer
- GO:0030153, bacteriocin immunity
- GO:0009061, anaerobic respiration

For Amborella trichopoda (NCBI taxid 13333), FunTaxIS-lite generated 22279 “never-in” constraints; 887 of them appeared in Argot2.5’s predictions and were therefore filtered out. Some examples:
- GO:0030573, bile acid catabolic process
- GO:0006695, cholesterol biosynthetic process
- GO:0042953, lipoprotein transport
- GO:0009246, enterobacterial common antigen biosynthetic process
- GO:0015948, methanogenesis
- GO:0030283, testosterone dehydrogenase [NAD(P)] activity
- GO:0044010, single-species biofilm formation
- GO:0019062, virion attachment to host cell
- GO:0019083, viral transcription
- GO:0005902, microvillus

For Pseudomonas fluorescens SBW25 (NCBI taxid 216595), FunTaxIS-lite generated 26917 “never-in” constraints; 3170 of them appeared in Argot2.5’s predictions and were therefore filtered out. Some examples:
- GO:0042383, sarcolemma
- GO:0001772, immunological synapse
- GO:0042476, odontogenesis
- GO:0060250, germ-line stem-cell niche homeostasis
- GO:0050873, brown fat cell differentiation
- GO:0043005, neuron projection
- GO:0009887, animal organ morphogenesis
- GO:0030220, platelet formation
- GO:0070997, neuron death
- GO:0035994, response to muscle stretch
- GO:0060325, face morphogenesis
- GO:0043588, skin development
- GO:0030097, hemopoiesis

For Saccharomyces kudriavzevii IFO 1802 (NCBI taxid 226230), FunTaxIS-lite generated 22009 “never-in” constraints; 389 of them appeared in Argot2.5’s predictions and were therefore filtered out. Some examples:
- GO:0015979, photosynthesis
- GO:0001568, blood vessel development
- GO:0009579, thylakoid
- GO:0006703, estrogen biosynthetic process
- GO:0042953, lipoprotein transport
- GO:0009740, gibberellic acid mediated signaling pathway
- GO:0004687, myosin light chain kinase activity
- GO:0047696, beta-adrenergic receptor kinase
- GO:0106277, biliverdin reductase (NADP+) activity
- GO:0009736, cytokinin-activated signaling pathway

Last updates

All the taxonomic constraints have been generated starting from the latest available releases of GO, GOA and Taxonomy Tree (Last update: 2024-01-19 11:40:20).