Contents
- About P-POD
- Help with using the tool
- Data sources
- Database schema
- Download data and software
- How to cite P-POD
- Contact us
About P-POD
The Princeton Protein Orthology Database (P-POD), developed by the Genome Databases Group at Princeton, displays families of predicted orthologs from S. cerevisiae, H. sapiens, M. musculus, D. rerio, D. melanogaster, C. elegans, A. thaliana, and P. falciparum, with an emphasis on providing information about disease-related genes. Disease-related information is collected from the Online Mendelian Inheritance in Man (OMIM) database, the Saccharomyces Genome Database (SGD), and manual literature curation. For more information, please refer to the paper describing P-POD.
Querying the web interface with a protein from one of the eight model organisms retrieves a phylogenetic tree of putative orthologous proteins, a list of diseases associated with the human ortholog(s), a list of papers associated with the yeast ortholog(s) and labeled as "disease-related" at SGD, and a manually curated and annotated list of papers with cross-complementation experiments involving the yeast ortholog(s). You may also search or browse the results by OMIM disease ID numbers.
Results from two types of comparative genomics analysis are provided as query options:
- OrthoMCL analysis (UPenn): generates predicted ortholog groups by first seeding the group with a reciprocal best hit ortholog pair (the main ortholog pair), then building the group by adding in-paralogs. These in-paralogs arise when duplication occurs after speciation; the duplicated gene often still retains the function of the ortholog. OrthoMCL excludes out-paralogs from the ortholog groups; these are genes that duplicated before speciation and are likely to be functionally diverged. In-paralogs are clustered around the main ortholog from each species independently; the criterion for adding in-paralogs is that all in-paralogs from the same species are more similar to each other than to any sequence in the other species. Thus, users should recognize that in families where in-paralogs are found, it is not possible to simply identify the true ortholog. See Li et al. (2003) for details.
- Jaccard Coefficient clustering analysis (TIGR): generates large families of related sequences. In the Jaccard clustering analysis, two peptides are grouped into the same family if they share a significant number of homologs, calculated as follows. First, a list of homologs is generated for each peptide, consisting of those peptides whose relative BLAST scores are less than 1e-5 over a total of at least 50% of the length of both peptides. Then the Jaccard index for each pair of peptides is calculated; this is the ratio of the size of the intersection of their homolog sets to the size of their union, or |A∩B| / |A∪B|. After experimenting with a range of values, we chose a minimum Jaccard index of 0.5 as the cutoff for including two peptides in the same cluster. See the Jaccard-clustering analysis section of the documentation page provided by TIGR for more information.
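The Jaccard step described above can be sketched in Python. This is only an illustration, not the TIGR implementation: the peptide names and homolog lists are made up, each peptide is assumed to appear in its own homolog list, the 0.5 cutoff is treated as inclusive, and clusters are formed by simple single linkage (any pair at or above the cutoff is merged), which may differ from how the actual pipeline builds clusters.

```python
from itertools import combinations


def jaccard_index(a: set, b: set) -> float:
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_jaccard(homologs: dict, cutoff: float = 0.5) -> list:
    """Merge peptides whose Jaccard index is >= cutoff (single linkage, an assumption)."""
    parent = {p: p for p in homologs}

    def find(p):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p, q in combinations(homologs, 2):
        if jaccard_index(homologs[p], homologs[q]) >= cutoff:
            parent[find(p)] = find(q)

    clusters = {}
    for p in homologs:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical homolog lists, as they might look after the BLAST filtering step.
homologs = {
    "pepA": {"pepA", "pepB", "pepC"},
    "pepB": {"pepA", "pepB", "pepC"},
    "pepC": {"pepA", "pepB", "pepC", "pepD"},
    "pepD": {"pepC", "pepD", "pepE"},
    "pepE": {"pepD", "pepE"},
}

for cluster in cluster_by_jaccard(homologs):
    print(sorted(cluster))
```

In this toy input, pepC and pepD share 2 of 5 homologs (index 0.4), so they fall below the cutoff and end up in different clusters.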
Each family generated using either the OrthoMCL or Jaccard Coefficient method is then analyzed by ClustalW and PHYLIP to generate the corresponding sequence alignments and dendrograms.
All the data within the database are freely and publicly available through the web and by downloading the entire database system (contact us for download information). Currently, the analysis pipeline uses OrthoMCL v1.5 to generate ortholog families, Jaccard clustering to generate "super families" (large families of related sequences), ClustalW v1.83 to generate sequence alignments, and PHYLIP v3.65 to determine the phylogenetic relationship among the family members. Note that the phylogenetic tree is arbitrarily rooted by the PHYLIP program. The system is designed in a modular way so that different components can be plugged into (or removed from) the analysis pipeline. For example, two alternative methods, OrthoMCL and Jaccard clustering, are used to generate different types of sequence families, though the analysis pipeline shares downstream components. The Generic Model Organism Database (GMOD) schema is used as the backend database.
We gratefully acknowledge Mike Cherry (SGD), Shuai Weng (SGD), Eurie Hong (SGD), Sam Angiuoli (TIGR), Don Gilbert (Indiana University), Chris Stoeckert (UPenn), Feng Chen (UPenn), Scott Cain (CSHL), Laurie Kramer (Princeton) and John Matese (Princeton) for valuable discussions.
Help with using this tool
A. Search by gene/protein name
B. Search by disease
C. Browse OMIM disease families
D. Browse families by organism
A. Search by gene/protein name
Search option 1 allows you to query for predicted ortholog or super family results using a gene/protein name or accession identifier. The query options and results are described below.
Query options:
1) Enter a gene/protein identifier in the text box and select the organism of the query protein using the pull-down menu. Valid search entries for each organism included in the analysis are listed in the table below:
| Organism | Source Database | Valid gene/protein identifier(s) | Examples |
|---|---|---|---|
| P. falciparum | PlasmoDB | PlasmoDB ID | PF08_0034 |
| H. sapiens | ENSEMBL | ENSEMBL peptide ID, peptide name | ENSP00000266970, CDK2 |
| D. melanogaster | FlyBase | FlyBase ID | CG17520-PA, CkIIalpha-PA |
| M. musculus | ENSEMBL | ENSEMBL peptide ID | ENSMUSP00000068896 |
| A. thaliana | TAIR | TAIR identifier or gene name | AT1G25490.1, PAB4 |
| C. elegans | WormBase | WormBase identifier or gene name | C09G4.1, dbr-1 |
| D. rerio | ENSEMBL | ENSEMBL peptide ID, ZFIN ID | ENSDARP00000007117, ZDB-GENE-040808-60 |
| S. cerevisiae | SGD | ORF name or gene name | YNL098C, DPM1 |
2) Select either "OrthoMCL" from the pull-down menu to view a family that contains only putative orthologs for the gene/protein of interest, or "Jaccard clustering" to view a larger super family of related sequences. (Please see the section "About P-POD" for more details on each method used.)
Query results:
1) If the OrthoMCL option was selected, then a family of putative orthologs is shown in a phylogenetic tree display with direct links to the source database for each gene/protein in the family. Please note that the phylogenetic tree is arbitrarily rooted by the PHYLIP program. A link is also provided at the top of the page to view the corresponding Jaccard Coefficient results that contain a larger family of sequences related to the query gene/protein.
2) Disease information obtained from the Online Mendelian Inheritance in Man (OMIM) database is provided if a human gene displayed in the results has a corresponding gene or phenotype OMIM record. A link to the OMIM entry is provided via the OMIM record number in the disease information table. Please note that the OMIM records were obtained from Ensembl BioMart and may not reflect all current changes to the OMIM database.
3) The results also include a list of papers associated with yeast protein(s) in the family that address the topics "Disease-related" or "Cross-species Expression" in the Saccharomyces Genome Database (SGD) Literature Guide. Papers that address cross-species expression were manually curated to find experimental evidence that confirms or refutes the prediction of orthology calculated using the OrthoMCL method. If a paper shows cross-species complementation in which a gene/protein from one species complements the corresponding mutation in another species, then this is considered experimental evidence of orthology. The curator notes indicate whether orthology was directly tested via cross-species complementation, or whether only heterologous expression was carried out.
4) A ClustalW alignment of the protein sequences in the family is also provided with the gene/protein identifiers linked to their corresponding source databases. The symbols and color-coding indicate either strong similarity (:), weak similarity (.), or identical (*) residues between sequences. The sequence alignment (.aln file) or the actual protein sequences in FASTA format may be downloaded from the links provided.
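As a small illustration of the consensus symbols, the Python sketch below tallies how many alignment columns carry each symbol. It assumes you have already extracted the consensus line(s) from a downloaded .aln file into a single string; the function name and the sample string are hypothetical, and a space is taken to mean that no consensus symbol was assigned to that column.

```python
from collections import Counter

# Meaning of the ClustalW consensus symbols, as described above.
SYMBOLS = {"*": "identical", ":": "strong similarity", ".": "weak similarity"}


def summarize_consensus(consensus: str) -> dict:
    """Count alignment columns per consensus category."""
    counts = Counter(consensus)
    summary = {label: counts.get(symbol, 0) for symbol, label in SYMBOLS.items()}
    # Any remaining column (typically a space) has no consensus symbol.
    summary["no consensus"] = len(consensus) - sum(summary.values())
    return summary


# Made-up consensus line covering 12 alignment columns.
consensus = "**:*. *:. **"
print(summarize_consensus(consensus))
```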
B. Search by disease
Search option 2 allows you to query for predicted ortholog or super family results using an OMIM ID. To find an OMIM ID that matches a disease of interest, you can 1) browse OMIM disease families from this analysis to find their OMIM IDs, or 2) search the OMIM database itself, which contains OMIM IDs that are in this database and many other OMIM IDs. Query results are provided for those families where a human gene/protein has a corresponding gene or phenotype OMIM record. A link to the OMIM entry is provided via the OMIM record number in the disease information table. Please note that the OMIM records were obtained from Ensembl BioMart and may not reflect all current changes to the OMIM database.
C. Browse OMIM disease families
At the bottom of the homepage, a link is provided for browsing OMIM disease families. You may browse a list of OMIM IDs corresponding to human disease genes that encode proteins found in sequence families from this analysis. Clicking on one of the links in the OMIM value index displays the OMIM gene or phenotype description, along with the relevant human Ensembl peptide ID and links to its corresponding OrthoMCL or Jaccard cluster results.
D. Browse families by organism
At the bottom of the homepage, a link is also provided for browsing the OrthoMCL family and Jaccard clustering results by organism. The table gives the distribution of families based on the organism(s) they include and provides links to the OrthoMCL families and Jaccard clusters containing the different subgroups of organisms. Clicking on the number of OrthoMCL families or Jaccard clusters for each subset of organisms displays links to the actual results. Direct links for the sequence families are listed if there are fewer than ten for a particular subset of organisms.
Data sources
Protein data sets
The sources of each protein data set and the numbers of sequences analyzed are listed in the table below. All files were downloaded November 14, 2005. Note that some files may have been replaced with a more recent version at the source database. Clicking on a file name will link you to the directory containing the file. You can also download the fasta sequence files from us; see the Download data and software section below for more information.
| Organism | Number of proteins | Database | File name |
|---|---|---|---|
| S. cerevisiae | 6704 | SGD | orf_trans_all.fasta.gz |
| H. sapiens | 33869 | ENSEMBL | Homo_sapiens.NCBI35.nov.pep.fa.gz |
| M. musculus | 36471 | ENSEMBL | Mus_musculus.NCBIM34.nov.pep.fa |
| D. rerio | 32143 | ENSEMBL | Danio_rerio.ZFISH5.nov.pep.fa |
| D. melanogaster | 19178 | FlyBase | dmel-all-translation-r4.2.1.fa |
| C. elegans | 22858 | WormBase | wormpep150.fa |
| A. thaliana | 30690 | TAIR | TAIR6_pep_20051108.fa |
| P. falciparum | 5363 | PlasmoDB | Pfa3D7_WholeGenome_Annotated_PEP_2005.2.11.fa |
OMIM diseases
The OMIM diseases and their associated ENSEMBL peptide IDs were downloaded on April 24, 2006 from two sources:
- NCBI: downloaded mim2gene.txt file from the NCBI ftp site, and used the Batch Entrez tool to retrieve disease descriptions.
- ENSEMBL BioMart: retrieved Ensembl peptide IDs associated with MIM IDs in ENSEMBL, then used the NCBI Batch Entrez tool to retrieve disease descriptions.
These files were parsed and combined into one file (diseasegenesBiomart_Mim.txt) and used to load the GMOD database.
Literature
Papers flagged as "Disease-related" or "Cross-species expression" were downloaded from SGD on January 13, 2006: view/download Disease-related or Cross-species expression papers. Both files are in the format: ORF[tab]PMID
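Files in the ORF[tab]PMID format can be read with a few lines of Python. The sketch below groups PubMed IDs by yeast ORF; the function name is hypothetical and the sample rows use placeholder PMIDs rather than real records from the SGD files.

```python
import csv
import io
from collections import defaultdict


def papers_by_orf(handle) -> dict:
    """Group PubMed IDs by yeast ORF from ORF<tab>PMID lines."""
    papers = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        orf, pmid = row[0].strip(), row[1].strip()
        papers[orf].append(pmid)
    return dict(papers)


# Made-up sample in the same two-column layout; the PMIDs are placeholders.
sample = io.StringIO("YNL098C\t1000001\nYPR113W\t1000002\n\nYNL098C\t1000003\n")
print(papers_by_orf(sample))
```

To read a downloaded file instead of the in-memory sample, pass an open file handle, for example `open(path, newline="")`, where `path` is wherever you saved the file.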
Database schema
This tool utilizes the GMOD database schema, implemented in a way similar to the Sybil package provided by TIGR. The table below lists the main tables utilized in our implementation; contact us if you need more detailed information.
Note that in our next release, rather than use the Featureloc table, we will use the Phylogeny module, which seems to be a better fit for these data, although the original GMOD documentation does describe how to use Feature and Featureloc for these types of analyses (see http://www.gmod.org/schema-cvs/chado/doc/Chado_Schema_Documentation.doc for details).
| Type of Data | GMOD Module | GMOD Table(s) | Notes |
|---|---|---|---|
| Pipeline run | Companalysis, Controlled vocabulary | Analysis, Analysis_Feature, Analysisprop, CV | Pipeline runs are grouped together using the Analysis table, with features (protein sequences and ortholog families) generated from a particular run grouped together by the Analysis_Feature linking table (similar to the Sybil implementation). Different runs (for example, Jaccard Clustering without an alignment constraint vs. Jaccard Clustering with an alignment constraint) are distinguished by different cvterms (i.e. different type_ids) in the Analysisprop table. |
| Fasta files | Sequence | Feature, Dbxref | Fasta files are loaded into the feature table, and IDs parsed from the header are loaded into the Dbxref table. |
| Ortholog families | Sequence | Feature, Featureloc, Dbxref | Each ortholog family is inserted as a feature (type is OrthoMCL family). Proteins that comprise the family are grouped with the OrthoMCL family using the featureloc table. The Dbxref_id column for these OrthoMCL families is used to refer to the file name of the png image of the phylogenetic tree. |
| ClustalW alignment | Sequence | Feature | The ClustalW alignment of the sequences within the ortholog family is stored in the residues column in the row for the OrthoMCL family feature. |
| Cross-species expression and disease-related literature | Pub, Sequence | Pub, Featureprop, Featureprop_pub | Paper info is stored in the pub table, then the paper, topic, and curated information is linked to the appropriate feature through the Featureprop and Featureprop_pub tables. |
| Disease info from OMIM | Sequence | Dbxref, Feature_dbxref | OMIM disease record IDs are stored in the Dbxref table. Features that are associated with OMIM disease records are linked to the relevant OMIM IDs through the Feature_dbxref table. |
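To make the Feature/Featureloc grouping more concrete, here is a heavily simplified Python sketch using an in-memory SQLite database. It is not the P-POD schema: the real GMOD tables have many more columns and store the feature type as a cvterm reference, whereas this sketch uses a plain text `type` column. It also assumes that the family is the `srcfeature_id` and the member protein is the `feature_id` in featureloc; the direction used in P-POD is not stated above. All names and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins for the GMOD feature and featureloc tables.
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE featureloc (feature_id INTEGER, srcfeature_id INTEGER);
    """
)
conn.executemany(
    "INSERT INTO feature VALUES (?, ?, ?)",
    [
        (1, "FAMILY_1", "OrthoMCL family"),
        (2, "yeast_prot_1", "protein"),
        (3, "human_prot_1", "protein"),
        (4, "fly_prot_9", "protein"),  # not assigned to FAMILY_1
    ],
)
# Assumed direction: member protein (feature_id) points at its family (srcfeature_id).
conn.executemany("INSERT INTO featureloc VALUES (?, ?)", [(2, 1), (3, 1)])


def family_members(conn, family_name):
    """Return the names of proteins grouped with the given family feature."""
    rows = conn.execute(
        """
        SELECT protein.name
        FROM feature AS family
        JOIN featureloc AS loc ON loc.srcfeature_id = family.feature_id
        JOIN feature AS protein ON protein.feature_id = loc.feature_id
        WHERE family.name = ? AND family.type = 'OrthoMCL family'
        ORDER BY protein.name
        """,
        (family_name,),
    ).fetchall()
    return [name for (name,) in rows]


print(family_members(conn, "FAMILY_1"))
```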
Download data and software
Data
- ppod_db.sql.gz (147 MB): PostgreSQL sql dump of all the data in P-POD.
- fasta.tar.gz (47 MB): fasta records of all input sequences, tarred and gzipped.
- results from all vs. all BLAST: because a file containing all of the results is too large (about 20 GB) for most to conveniently download, please contact us if you would like these data and we will send them to you on DVD.
- taxman.psv (5.9 MB): all the ORTHOMCL families and their members. This file contains a column header, and the pipe (|)-separated columns are:
  - ORTHOMCL family
  - Protein ID
  - Species
- Conservation of essential genes
  - Essentiality conservation across species: analyzed families with unambiguous (1:1:1) orthologs for yeast, human, and mouse:
    - HTML summary page
    - essential_yeast_mouse_human.txt: tab-separated columns are: phenotype, OrthoMCL family (family name, yeast ORF, human ENSEMBL ID), Yeast ORF, Yeast gene description, Mouse gene, Mouse gene (MGI ID) description
  - Conservation of essential genes vs. non-essential genes vs. disease-related genes: used data downloaded from SGD on 10-10-2006 for yeast phenotypes and parsed the following files:
    - taxman.psv: list of all OrthoMCL families
    - essentialyeast.txt: list of all essential yeast genes
    - nonessentialyeast.txt: list of all non-essential yeast genes
    - diseasegenes.txt (see below)
- diseasegenes.txt: contains proteins from all species found in a family that contains a disease-related protein (i.e. proteins associated with a phenotype in the OMIM database). The tab-separated columns are: protein identifier, common name, organism.
- orphans.txt: proteins not found in either a Jaccard Clustering or OrthoMCL family. Tab-separated columns are: Gene name, Protein identifier, organism.
- curation.txt: experimental results from cross-species expression experiments collected from the literature. Tab-separated columns are:
  - Yeast ORF
  - Yeast gene name
  - PubMed ID
  - Ortholog tested: the gene that was transformed from one organism into another.
  - Other organism: the organism whose gene is being expressed in the test organism.
  - Test organism: the organism in which the complementation experiment was performed.
  - Cross complementation experiment: indicates whether cross complementation was demonstrated (yes, no, no expt).
  - Note
  - Mutated gene: used only in rare cases where the "Other organism" is SACCHAROMYCES_CEREVISIAE; the name of the gene mutated in the test organism.
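As an example of working with the downloads, the pipe-separated layout of taxman.psv (family, protein ID, species, with one header row) lends itself to picking out families that have exactly one protein from each of a chosen set of species, in the spirit of the 1:1:1 yeast, human, and mouse analysis listed above. The Python sketch below is an illustration only: the header text, family names, protein IDs, and species labels in the sample are made up, so check the real file for the exact values before reusing it.

```python
import csv
import io
from collections import defaultdict


def one_to_one_families(handle, species_required):
    """Return families with exactly one protein from each required species."""
    members = defaultdict(lambda: defaultdict(list))
    reader = csv.reader(handle, delimiter="|")
    next(reader, None)  # skip the column header row
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        family, protein, species = (field.strip() for field in row[:3])
        members[family][species].append(protein)

    result = {}
    for family, by_species in members.items():
        if all(len(by_species.get(sp, [])) == 1 for sp in species_required):
            result[family] = {sp: by_species[sp][0] for sp in species_required}
    return result


# Made-up sample; real species labels in taxman.psv may differ.
sample = io.StringIO(
    "ORTHOMCL family|Protein ID|Species\n"
    "FAM1|yeastP1|yeast\n"
    "FAM1|humanP1|human\n"
    "FAM1|mouseP1|mouse\n"
    "FAM2|yeastP2|yeast\n"
    "FAM2|humanP2|human\n"
    "FAM2|humanP3|human\n"
    "FAM2|mouseP2|mouse\n"
)
print(one_to_one_families(sample, ["yeast", "human", "mouse"]))
```

Here FAM2 is excluded because it contains two human proteins. Proteins from species outside the required list are ignored by this sketch, which may or may not match the criteria used in the original analysis.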
Software
Some code is available from the p-pod project on SourceForge. If you are interested in other parts of the code base, please contact us so that we can provide the code that you need in a convenient way.
How to cite P-POD
Please cite the following paper (PubMed ID: 17712414):
Heinicke S., Livstone M.S., Lu C., Oughtred R., Kang F., Angiuoli S.V., White O., Botstein D., Dolinski K. (2007) The Princeton Protein Orthology Database (P-POD): A Comparative Genomics Analysis Tool for Biologists. PLoS ONE 2:e766.