GOAnno - HELP and Supplementary information

Contents

1- Program overview

2- Query protein functional subfamily determination using a MACS

3- GO annotation
3.1- Initial Protein gene Ontology, IPO
3.2- Proximal Protein gene Ontology, PPO
3.3- Mean Subfamily gene Ontology, MSO
3.4- Global Protein gene Ontology, GPO

4- The light version of GOAnno
4.1- System requirement
4.2- Input format
4.3- Command line

1- Program overview


Recent efforts in high-throughput sequencing have rapidly increased the number of sequences available in the public databases. Systematic characterization and annotation of these data are typically performed using the Gene Ontology (GO), a hierarchical and standardized vocabulary developed by the GO Consortium. GOAnno is a web tool for the automated GO annotation of predicted proteins. The method is original in two respects. First, relationships to already annotated proteins are determined from Multiple Alignments of Complete Sequences (MACS) organized into distinct functional subfamilies. The members of such subfamilies are conserved enough to filter, enrich and propagate GO terms using the GOAnno algorithm. Second, the user does not have to choose an arbitrary GO level.


2- Query protein functional subfamily determination using a MACS


This preliminary step uses the strategy implemented in PipeAlign, a toolkit for protein family analysis. The cascade of the five PipeAlign analysis programs yields a hierarchised MACS of protein homologues clustered into potential functional subfamilies.

In GOAnno, the following parameters are used: BlastP searches are performed against the UniProt database with an expect value of 100.0 and a maximum of 2000 matching proteins. The top 200 Ballast hits are retained in the final MACS.

3- GO annotation


GOAnno uses a four-step process, applied independently to each of the three GO categories: cellular component, molecular function and biological process. At the end of each step, duplicated and parent GO terms are systematically removed.
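
The clean-up performed at the end of each step can be sketched as follows. This is only an illustration, not the GOAnno code: it assumes that a "parent" term is any ancestor of another term in the same set, and that the GO hierarchy is supplied as a hypothetical `parents` dictionary mapping each term to its direct parents. The term names are placeholders, not real GO identifiers.

```python
def remove_parents_and_duplicates(terms, parents):
    """Return the terms with duplicates and ancestor terms removed.

    terms   -- iterable of GO term identifiers (may contain duplicates)
    parents -- dict: term -> set of direct parent terms (assumed input)
    """
    unique = set(terms)  # a set drops the duplicates

    def ancestors(term):
        # Walk up the hierarchy and collect every ancestor of `term`.
        seen = set()
        stack = list(parents.get(term, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents.get(current, ()))
        return seen

    redundant = set()
    for term in unique:
        redundant |= ancestors(term)
    return unique - redundant


# Placeholder hierarchy: root <- mid <- child, and an unrelated term "other".
toy_parents = {"child": {"mid"}, "mid": {"root"}}
cleaned = remove_parents_and_duplicates(
    ["root", "mid", "child", "child", "other"], toy_parents
)
print(sorted(cleaned))
```

Only the most specific terms survive; the more general terms remain implied by the hierarchy.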

3.1- Initial Protein gene Ontology, IPO


The Initial Protein gene Ontology (IPO) combines the native GO annotation of the protein with the GO terms deduced from the conversion tables available from the GO Consortium (InterPro, Pfam, Prints, PRODOM, Prosite, SMART protein motifs, Enzyme Commission numbers and SWISS-PROT keywords to GO nodes).

3.2- Proximal Protein gene Ontology, PPO


The construction of the MACS permits the identification of the Proximal Proteins (proteins sharing at least 98 percent identity with the input protein).

All the IPO terms of these proximal proteins constitute the Proximal Protein gene Ontology (PPO).
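
A minimal sketch of this step, under stated assumptions: the percent identity of each MACS member to the query is already known (the hypothetical `identity_to_query` dictionary) and the IPO terms of every protein are available in a hypothetical `ipo` dictionary. Accession and term names are placeholders.

```python
def proximal_protein_ontology(query, identity_to_query, ipo, threshold=98.0):
    """Collect the IPO terms of the proteins proximal to the query.

    query             -- accession of the input protein
    identity_to_query -- dict: accession -> percent identity with the query
    ipo               -- dict: accession -> set of IPO terms
    threshold         -- minimum percent identity (98 in the text above)
    """
    ppo = set()
    for accession, identity in identity_to_query.items():
        if accession == query:
            continue  # the query itself is not one of its proximal proteins
        if identity >= threshold:
            ppo |= ipo.get(accession, set())
    return ppo


identities = {"Acc1": 100.0, "Acc2": 99.1, "Acc3": 98.0, "Acc5": 62.4}
ipo_terms = {"Acc2": {"termA"}, "Acc3": {"termA", "termB"}, "Acc5": {"termC"}}
print(sorted(proximal_protein_ontology("Acc1", identities, ipo_terms)))
```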

3.3- Mean Subfamily gene Ontology, MSO


If the alignment within the query subfamily is of high quality (NorMD value > 0.3), the Mean Subfamily gene Ontology (MSO) is defined by the GO terms that can reasonably be propagated to all the proteins of the query subfamily.

All IPO terms of the proteins are collected to build the corresponding GO tree.

For each IPO term, all the paths to the root are decomposed into linear branches.

For the jth node of the ith branch, a score V(i,j) is calculated, starting from the highest GO level of the branch (the most specific term) and moving towards the root (GO level 0).

V(i,j) is the number of proteins annotated with this term, added to the score of the previous node, that is, the child term, V(i,j+1). For a branch i with n(i) nodes, the maximum branch score Vmax(i) and the maximum score over all I branches, VMAX, are given by:

\[ \mathrm{Vmax}(i) = \max_{1 \le j \le n(i)} V(i,j), \qquad \mathrm{VMAX} = \max_{1 \le i \le I} \mathrm{Vmax}(i) \]

The whole branch i is removed if it contains too few of the subfamily proteins (fewer than f percent), according to the following condition:

\[ \frac{\mathrm{Vmax}(i)}{\mathrm{VMAX}} < f \]

A node j of a branch i is eliminated if it contains fewer than p percent of all the proteins of branch i, according to the following condition:

\[ \frac{V(i,j)}{\mathrm{Vmax}(i)} < p \]

The cut-off values, f and p (GPO parameters), are set to 50% and 80% by default.
GO terms which pass these selections define the MSO.
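
The scoring and filtering described above can be sketched as follows. This is an illustrative reading, not the GOAnno implementation: it assumes each branch is given as a list of terms ordered from the root to the most specific term, that a hypothetical `counts` dictionary gives the number of subfamily proteins annotated with each term, that a branch is dropped when Vmax(i) < f x VMAX, and that a node is dropped when V(i,j) < p x Vmax(i), with f and p expressed as fractions. Term names are placeholders.

```python
def branch_scores(branch, counts):
    """Cumulative scores V(i, j) for one branch listed from root to leaf.

    Each score is the protein count of the term plus the score of its
    child term on the branch, so scores grow towards the root.
    """
    scores = [0] * len(branch)
    running = 0
    for j in range(len(branch) - 1, -1, -1):  # leaf first, root last
        running += counts.get(branch[j], 0)
        scores[j] = running
    return scores


def mean_subfamily_ontology(branches, counts, f=0.5, p=0.8):
    """Return the terms that pass the branch (f) and node (p) cut-offs."""
    all_scores = [branch_scores(branch, counts) for branch in branches]
    vmax = [max(scores) if scores else 0 for scores in all_scores]
    vmax_all = max(vmax, default=0)  # VMAX over all branches
    kept = set()
    if vmax_all == 0:
        return kept
    for branch, scores, branch_max in zip(branches, all_scores, vmax):
        if branch_max < f * vmax_all:
            continue  # the whole branch is removed
        for term, score in zip(branch, scores):
            if score >= p * branch_max:
                kept.add(term)  # the node survives the p cut-off
    return kept


toy_branches = [["root", "a", "a1"], ["root", "b"]]
toy_counts = {"a": 2, "a1": 3, "b": 2}
print(sorted(mean_subfamily_ontology(toy_branches, toy_counts)))
```

In this toy case the first branch scores 3, 5 and 5 from leaf to root, so "a1" falls below the p cut-off, and the second branch (maximum score 2) falls below the f cut-off. Removal of duplicated and parent terms is deliberately left out of this sketch.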


3.4- Global Protein gene Ontology, GPO


Finally, the previously determined IPO, PPO and MSO terms are collected to define the final GPO (Global Protein gene Ontology) that is assigned to the input protein.

4- The light version of GOAnno


A light version of GOAnno is available for local use. The program allows batch processing of a gene list, which is of particular interest in interpreting high-throughput experiments such as microarray transcription profiling.

4.1- System requirement


If you wish to download GOAnno, you need the following:
- Windows or Linux.
- Web access (needed to query the UniProt database through SRS at the IGBMC web site).
- The "molecular function", "biological process" and "cellular component" flat files, available at the GO Consortium web site.
- All the conversion table flat files (external2go), available at the GO Consortium FTP site.

4.2- Input format


The user must determine the relationship of the query entry to homologous annotated proteins beforehand (e.g. from a BlastP search) in order to define the proximal and subfamily proteins.

Input flat file format:

IPO Acc1
PPO Acc2 Acc3 Acc4
MSO Acc1 Acc2 Acc3 Acc4 Acc5 Acc6 ...

Where:
Acc1 is the UniProt accession number of the input query protein.
Acc2 Acc3 Acc4 are the UniProt accession numbers of the proteins proximal to the query protein.
Acc1 Acc2 Acc3 Acc4 Acc5 Acc6 ... are the UniProt accession numbers of all the proteins of the query subfamily.
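
To check an input file before running the program, the three-line layout can be read with a short script. This is a hypothetical helper, not part of GOAnno: it assumes one query per file, whitespace-separated fields, and exactly the IPO, PPO and MSO keywords shown above.

```python
def parse_goanno_input(text):
    """Parse the IPO/PPO/MSO flat-file layout into a dictionary.

    Returns a dict mapping each keyword to its list of accession numbers.
    Raises ValueError on an unknown or repeated keyword.
    """
    allowed = ("IPO", "PPO", "MSO")
    records = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        keyword, accessions = fields[0], fields[1:]
        if keyword not in allowed:
            raise ValueError(f"unknown keyword: {keyword}")
        if keyword in records:
            raise ValueError(f"keyword given twice: {keyword}")
        records[keyword] = accessions
    return records


example = """IPO Acc1
PPO Acc2 Acc3 Acc4
MSO Acc1 Acc2 Acc3 Acc4 Acc5 Acc6
"""
parsed = parse_goanno_input(example)
print(parsed["IPO"], len(parsed["MSO"]))
```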

4.3- Command line


GOAnno -i [FileIn]

Other command line options:
-i [FileIn] (Input file)
-o [FileOut] (XML report output file Optional -> Default="[FileIn].xml")
-f (Threshold for f parameter [0-1]  Optional -> Default=0.5)
-p (Threshold for p parameter [0-1]  Optional -> Default=0.8)
-H (Help)