Princeton Protein Orthology Database (P-POD): Help

About P-POD

The Princeton Protein Orthology Database (P-POD), developed by the Genome Databases Group at Princeton, displays families of predicted orthologs from S. cerevisiae, H. sapiens, M. musculus, D. rerio, D. melanogaster, C. elegans, A. thaliana, and P. falciparum, with an emphasis on providing information about disease-related genes. Disease-related information is collected from the Online Mendelian Inheritance in Man (OMIM) database, the Saccharomyces Genome Database (SGD), and manual literature curation. For more information, please refer to the paper describing P-POD.

Querying the web interface with a protein from one of the eight model organisms retrieves a phylogenetic tree of putative orthologous proteins, a list of diseases associated with the human ortholog(s), a list of papers associated with the yeast ortholog(s) and labeled as "disease-related" at SGD, and a manually curated and annotated list of papers with cross-complementation experiments involving the yeast ortholog(s). You may also search or browse the results by OMIM disease ID numbers.

Results from two types of comparative genomics analysis, OrthoMCL and Jaccard clustering, are provided as query options.

Each family generated using either the OrthoMCL or Jaccard Coefficient method is then analyzed by ClustalW and PHYLIP to generate the corresponding sequence alignments and dendrograms as indicated below:

[Pipeline Flow Chart]

All the data within the database are freely and publicly available through the web and by downloading the entire database system (see the "Download data and software" section for download information). Currently, the analysis pipeline uses OrthoMCL v1.5 to generate ortholog families, Jaccard clustering to generate "super families" (large families of related sequences), ClustalW v1.83 to generate sequence alignments, and PHYLIP v3.65 to determine the phylogenetic relationship among the family members. Note that the phylogenetic tree is arbitrarily rooted by the PHYLIP program. The system is modular, so that different components can be plugged into (or removed from) the analysis pipeline. For example, two alternative methods, OrthoMCL and Jaccard clustering, are used to generate different types of sequence families, though the analysis pipeline shares downstream components. The Generic Model Organism Database (GMOD) schema is used as the backend database.
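
For readers unfamiliar with the term, the Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below illustrates the measure only. Representing each protein by a set of sequence-similarity hits, and the protein names and hit sets shown, are assumptions made for illustration and are not a description of the P-POD clustering code.

```python
def jaccard(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b|; two empty sets are given a coefficient of 0."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical hit sets: each protein is mapped to the proteins it is similar to.
hits = {
    "protA": {"protA", "protB", "protC"},
    "protB": {"protA", "protB", "protC", "protD"},
    "protE": {"protE", "protF"},
}

# protA and protB share 3 of 4 distinct hits; protA and protE share none.
similar = jaccard(hits["protA"], hits["protB"])
unrelated = jaccard(hits["protA"], hits["protE"])
```

Pairs whose coefficient exceeds a chosen cutoff would be candidates for the same cluster; the cutoff itself is a tuning parameter not stated on this page.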

We gratefully acknowledge Mike Cherry (SGD), Shuai Weng (SGD), Eurie Hong (SGD), Sam Angiuoli (TIGR), Don Gilbert (Indiana University), Chris Stoeckert (UPenn), Feng Chen (UPenn), Scott Cain (CSHL), Laurie Kramer (Princeton) and John Matese (Princeton) for valuable discussions.

Help with using this tool

A. Search by gene/protein name
B. Search by disease
C. Browse OMIM disease families
D. Browse families by organism

A. Search by gene/protein name

Search option 1 allows you to query for predicted ortholog or super family results using a gene/protein name or accession identifier. The query options and results are described below.

Query options:
1) Enter a gene/protein identifier in the text box and select the organism of the query protein using the pull-down menu. Valid search entries for each organism included in the analysis are listed in the table below:

Organism | Source Database | Valid gene/protein identifier(s) | Examples
P. falciparum | PlasmoDB | PlasmoDB ID | PF08_0034
H. sapiens | ENSEMBL | ENSEMBL peptide ID, peptide name | ENSP00000266970, CDK2
D. melanogaster | FlyBase | FlyBase ID | CG17520-PA, CkIIalpha-PA
M. musculus | ENSEMBL | ENSEMBL peptide ID | ENSMUSP00000068896
A. thaliana | TAIR | TAIR identifier or gene name | AT1G25490.1, PAB4
C. elegans | WormBase | WormBase identifier or gene name | C09G4.1, dbr-1
D. rerio | ENSEMBL | ENSEMBL peptide ID, ZFIN ID | ENSDARP00000007117, ZDB-GENE-040808-60
S. cerevisiae | SGD | ORF name or gene name | YNL098C, DPM1

2) Select "OrthoMCL" from the pull-down menu to view a family that contains only putative orthologs for the gene/protein of interest, or "Jaccard clustering" to view a larger super family of related sequences. (Please see the section "About P-POD" for more details on each method used.)
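
If you prepare queries in bulk, it can help to check which organism an accession most likely belongs to before choosing from the pull-down menu. The sketch below is not part of P-POD: the function name and the regular expressions are assumptions inferred from the example identifiers in the table above, and they cover systematic accessions only. Gene and peptide names (such as CDK2 or DPM1) and WormBase sequence names are deliberately left out because they do not follow a single simple pattern.

```python
import re
from typing import Optional

# Patterns inferred from the example identifiers; assumptions, not an official specification.
ID_PATTERNS = [
    ("H. sapiens", r"ENSP\d{11}"),
    ("M. musculus", r"ENSMUSP\d{11}"),
    ("D. rerio", r"ENSDARP\d{11}"),
    ("D. rerio", r"ZDB-GENE-\d{6}-\d+"),
    ("S. cerevisiae", r"Y[A-P][LR]\d{3}[WC](-[A-Z])?"),
    ("A. thaliana", r"AT[1-5CM]G\d{5}(\.\d+)?"),
    ("D. melanogaster", r"CG\d+-P[A-Z]"),
    ("P. falciparum", r"PF\d{2}_\d{4}"),
]


def guess_organism(identifier: str) -> Optional[str]:
    """Return the organism whose accession pattern matches, or None if none does."""
    for organism, pattern in ID_PATTERNS:
        if re.fullmatch(pattern, identifier):
            return organism
    return None
```

A result of None does not mean the identifier is invalid; it only means the identifier is not one of the systematic forms sketched here.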

Query results:
1) If the OrthoMCL option was selected, then a family of putative orthologs is shown in a phylogenetic tree display with direct links to the source database for each gene/protein in the family. Please note that the phylogenetic tree is arbitrarily rooted by the PHYLIP program. A link is also provided at the top of the page to view the corresponding Jaccard Coefficient results that contain a larger family of sequences related to the query gene/protein.

2) Disease information obtained from the Online Mendelian Inheritance in Man (OMIM) database is provided if a human gene displayed in the results has a corresponding gene or phenotype OMIM record. A link to the OMIM entry is provided via the OMIM record number in the disease information table. Please note that the OMIM records were obtained from Ensembl BioMart and may not reflect all current changes to the OMIM database.

3) The results also include a list of papers associated with yeast protein(s) in the family that address the topics "Disease-related" or "Cross-species Expression" in the Saccharomyces Genome Database (SGD) Literature Guide. Papers that address cross-species expression were manually curated to find experimental evidence that confirms or refutes the prediction of orthology calculated using the OrthoMCL method. If a paper shows cross-species complementation in which a gene/protein from one species complements the corresponding mutation in another species, then this is considered experimental evidence of orthology. The curator notes indicate whether orthology was directly tested via cross-species complementation, or whether only heterologous expression was carried out.

4) A ClustalW alignment of the protein sequences in the family is also provided, with the gene/protein identifiers linked to their corresponding source databases. The symbols and color-coding indicate either strong similarity (:), weak similarity (.), or identical (*) residues between sequences. The sequence alignment (.aln file) or the actual protein sequences in FASTA format may be downloaded from the links provided.
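
After downloading an .aln file, the conservation symbols can be tallied to get a rough sense of how well conserved a family is. The sketch below assumes the standard ClustalW layout, in which each block of sequence lines is followed by a symbol line that starts with whitespace. The function name and the two short sequences are made up for illustration.

```python
from collections import Counter


def conservation_counts(aln_text: str) -> Counter:
    """Count '*', ':' and '.' symbols on the conservation lines of a ClustalW alignment."""
    counts = Counter()
    seq_start = None  # column at which residues begin in the current block
    for line in aln_text.splitlines():
        if not line.strip() or line.startswith("CLUSTAL"):
            continue
        if line[0].isspace():
            # Conservation line: only look at the columns under the residues.
            if seq_start is not None:
                counts.update(ch for ch in line[seq_start:] if ch in "*:.")
        else:
            parts = line.split()
            if len(parts) < 2:
                continue
            name, residues = parts[0], parts[1]
            seq_start = line.index(residues, len(name))
    return counts


# Hypothetical two-sequence alignment used only to exercise the function.
EXAMPLE_ALN = "\n".join([
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "seqA      MKTAYIAK",
    "seqB      MKSAYLGK",
    "          **:**:.*",
])
```

Dividing the count of `*` by the alignment length gives the fraction of fully identical columns.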

B. Search by disease

Search option 2 allows you to query for predicted ortholog or super family results using an OMIM ID. To find an OMIM ID that matches a disease of interest, you can 1) browse OMIM disease families from this analysis to find their OMIM IDs, or 2) search the OMIM database itself, which contains OMIM IDs that are in this database and many other OMIM IDs. Query results are provided for those families where a human gene/protein has a corresponding gene or phenotype OMIM record. A link to the OMIM entry is provided via the OMIM record number in the disease information table. Please note that the OMIM records were obtained from Ensembl BioMart and may not reflect all current changes to the OMIM database.

C. Browse OMIM disease families

At the bottom of the homepage, a link is provided for browsing OMIM disease families. You may browse a list of OMIM IDs corresponding to human disease genes that encode proteins found in sequence families from this analysis. Clicking on one of the links in the OMIM value index displays the OMIM gene or phenotype description, along with the relevant human Ensembl peptide ID and links to its corresponding OrthoMCL or Jaccard cluster results.

D. Browse families by organism

At the bottom of the homepage, a link is also provided for browsing the OrthoMCL family and Jaccard clustering results by organism. The table gives the distribution of families based on the organism(s) they include and provides links to the OrthoMCL families and Jaccard clusters containing the different subgroups of organisms. Clicking on the number of OrthoMCL families or Jaccard clusters for each subset of organisms displays links to the actual results. Direct links for the sequence families are listed if there are fewer than ten for a particular subset of organisms.
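
The distribution shown in that table can be thought of as a count of families keyed by the set of organisms each family contains. A minimal sketch, assuming family membership is available as (organism, protein ID) pairs; the function name, family IDs, and protein IDs are hypothetical.

```python
from collections import Counter


def organism_distribution(families):
    """Count families by the set of organisms they contain.

    `families` maps a family ID to a list of (organism, protein ID) pairs.
    """
    counts = Counter()
    for members in families.values():
        organisms = frozenset(organism for organism, _ in members)
        counts[organisms] += 1
    return counts


# Hypothetical families used only for illustration.
example_families = {
    "family_1": [("S. cerevisiae", "prot_1"), ("H. sapiens", "prot_2")],
    "family_2": [("S. cerevisiae", "prot_3"), ("H. sapiens", "prot_4"),
                 ("H. sapiens", "prot_5")],
    "family_3": [("D. rerio", "prot_6"), ("M. musculus", "prot_7")],
}

distribution = organism_distribution(example_families)
```

Using a frozenset as the key means a family with several proteins from the same organism is still counted once under that organism subset.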

Data sources

Protein data sets

The sources of each protein data set and the numbers of sequences analyzed are listed in the table below. All files were downloaded November 14, 2005. Note that some files may have been replaced with a more recent version at the source database. Clicking on a file name will link you to the directory containing the file. You can also download the FASTA sequence files from us; see the Download data and software section below for more information.

Organism | Number of proteins | Database | File name
S. cerevisiae | 6704 | SGD | orf_trans_all.fasta.gz
H. sapiens | 33869 | ENSEMBL | Homo_sapiens.NCBI35.nov.pep.fa.gz
M. musculus | 36471 | ENSEMBL | Mus_musculus.NCBIM34.nov.pep.fa
D. rerio | 32143 | ENSEMBL | Danio_rerio.ZFISH5.nov.pep.fa
D. melanogaster | 19178 | FlyBase | dmel-all-translation-r4.2.1.fa
C. elegans | 22858 | WormBase | wormpep150.fa
A. thaliana | 30690 | TAIR | TAIR6_pep_20051108.fa
P. falciparum | 5363 | PlasmoDB | Pfa3D7_WholeGenome_Annotated_PEP_2005.2.11.fa

OMIM diseases

The OMIM diseases and their associated ENSEMBL peptide IDs were downloaded on April 24, 2006 from two sources:

These files were parsed and combined into one file (diseasegenesBiomart_Mim.txt) and used to load the GMOD database.

Literature

Papers flagged as "Disease-related" or "Cross-species expression" were downloaded from SGD on January 13, 2006. The Disease-related and Cross-species expression paper lists can each be viewed or downloaded. Both files are in the format: ORF[tab]PMID
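
Because each line holds one ORF and one PubMed ID separated by a tab, the files are easy to load into a lookup table. The sketch below assumes one pair per line with no header row; the function name is hypothetical and the PMIDs in the example are placeholders, not real records.

```python
from collections import defaultdict


def load_orf_pmids(lines):
    """Map each ORF name to the set of PubMed IDs listed for it.

    `lines` is any iterable of 'ORF<tab>PMID' strings, such as an open file.
    """
    papers = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        orf, pmid = line.split("\t")
        papers[orf.strip()].add(pmid.strip())
    return dict(papers)


# Placeholder rows in the ORF[tab]PMID format; the PMIDs are made up.
example_rows = [
    "YNL098C\t0000001\n",
    "YNL098C\t0000002\n",
    "YPR183W\t0000003\n",
]
orf_to_pmids = load_orf_pmids(example_rows)
```

With a downloaded file, the same function can be called as `load_orf_pmids(open(path))`, where `path` is wherever you saved the file.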

Database schema

This tool utilizes the GMOD database schema, implemented in a similar way as the Sybil package provided by TIGR. The table below lists the main tables utilized in our implementation; contact us if you need more detailed information.

Note that in our next release, rather than use the Featureloc table, we will use the Phylogeny module, which seems to be a better fit for these data, although the original GMOD documentation does describe how to use Feature and Featureloc for these types of analyses (see http://www.gmod.org/schema-cvs/chado/doc/Chado_Schema_Documentation.doc for details).

Type of Data | GMOD Module | GMOD Table(s) | Notes
Pipeline run | Companalysis, Controlled vocabulary | Analysis, Analysis_Feature, Analysisprop, CV | Pipeline runs are grouped together using the Analysis table, with features (protein sequences and ortholog families) generated from a particular run grouped together by the Analysis_Feature linking table (similar to the Sybil implementation). Different runs (for example, Jaccard Clustering without an alignment constraint vs. Jaccard Clustering with an alignment constraint) are distinguished by different cvterms (i.e., different type_ids) in the Analysisprop table.
FASTA files | Sequence | Feature, Dbxref | FASTA files are loaded into the Feature table, and IDs parsed from the header are loaded into the Dbxref table.
Ortholog families | Sequence | Feature, Featureloc, Dbxref | Each ortholog family is inserted as a feature (type is OrthoMCL family). Proteins that comprise the family are grouped with the OrthoMCL family using the Featureloc table. The Dbxref_id column for these OrthoMCL families is used to refer to the file name of the png image of the phylogenetic tree.
ClustalW alignment | Sequence | Feature | The ClustalW alignment of the sequences within the ortholog family is stored in the residues column in the row for the OrthoMCL family feature.
Cross-species expression and disease-related literature | Pub, Sequence | Pub, Featureprop, Featureprop_pub | Paper info is stored in the Pub table, then the paper, topic, and curated information is linked to the appropriate feature through the Featureprop and Featureprop_pub tables.
Disease info from OMIM | Sequence | Dbxref, Feature_dbxref | OMIM disease record IDs are stored in the Dbxref table. Features that are associated with OMIM disease records are linked to the relevant OMIM IDs through the Feature_dbxref table.
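
To make the family-to-protein grouping concrete, the sketch below builds a heavily simplified, in-memory stand-in for the Feature and Featureloc tables and looks up the members of one family. It is an illustration under stated assumptions, not the P-POD schema: the real GMOD tables have many more columns and store the feature type as a cvterm ID rather than text, and the direction of the link (the family in feature_id, the member protein in srcfeature_id) is assumed here. All names and rows are hypothetical.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create a simplified stand-in for the Feature and Featureloc tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE feature (
            feature_id INTEGER PRIMARY KEY,
            uniquename TEXT,
            type       TEXT      -- simplified; the real schema uses a cvterm type_id
        );
        CREATE TABLE featureloc (
            featureloc_id INTEGER PRIMARY KEY,
            feature_id    INTEGER,   -- assumed: the family feature
            srcfeature_id INTEGER    -- assumed: the member protein
        );
    """)
    conn.executemany(
        "INSERT INTO feature (feature_id, uniquename, type) VALUES (?, ?, ?)",
        [
            (1, "family_1", "OrthoMCL family"),
            (2, "protein_a", "protein"),
            (3, "protein_b", "protein"),
            (4, "protein_c", "protein"),
        ],
    )
    conn.executemany(
        "INSERT INTO featureloc (feature_id, srcfeature_id) VALUES (?, ?)",
        [(1, 2), (1, 3)],
    )
    return conn


def family_members(conn: sqlite3.Connection, family_name: str) -> list:
    """Return the names of the proteins grouped with the given family."""
    rows = conn.execute(
        """
        SELECT protein.uniquename
        FROM feature AS family
        JOIN featureloc ON featureloc.feature_id = family.feature_id
        JOIN feature AS protein ON protein.feature_id = featureloc.srcfeature_id
        WHERE family.uniquename = ? AND family.type = 'OrthoMCL family'
        ORDER BY protein.uniquename
        """,
        (family_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `family_members(build_toy_db(), "family_1")` returns the two proteins linked to the family and omits `protein_c`, which has no Featureloc row.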

Download data and software

Data

Software

Some code is available from the p-pod project on SourceForge. If you are interested in other parts of the code base, please contact us so that we can provide the code that you need in a convenient way.

How to cite P-POD

Please cite the following paper (PubMed ID: 17712414):

Heinicke S., Livstone M.S., Lu C., Oughtred R., Kang F., Angiuoli S.V., White O., Botstein D., Dolinski K. (2007) The Princeton Protein Orthology Database (P-POD): A Comparative Genomics Analysis Tool for Biologists. PLoS ONE 2:e766.