The information comes from several databases and literature sources.
1) Functional Information from GO Data
The GO (Gene Ontology) Consortium produces a controlled vocabulary that can be applied to all organisms. Information on a protein's Biological Process, Molecular Function, and Cellular Component is collected and described in GO ( http://geneontology.org/ontology/gene_ontology_edit.obo ). PDBMLplus at PDBj includes GO annotations for protein chains. The correspondence between PDB chains and GO IDs is extracted from the ID mapping file provided by UniProt ( ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz ).
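As an illustration of the chain-to-GO lookup described above, the following is a minimal sketch rather than the PDBj implementation. It assumes that idmapping_selected.tab is tab-separated, that the PDB cross-references sit in the sixth column as "ID:chain" entries, and that the GO IDs sit in the seventh column, both lists separated by "; ". The function name chain_to_go and the sample rows (including the accessions) are made up for the example.

```python
# Assumed 0-based column positions in idmapping_selected.tab
# (UniProtKB-AC, UniProtKB-ID, GeneID, RefSeq, GI, PDB, GO, ...).
COL_PDB = 5
COL_GO = 6


def split_field(field):
    """Split a '; '-separated field into a list of non-empty items."""
    return [item.strip() for item in field.split(";") if item.strip()]


def chain_to_go(lines):
    """Return a dict mapping 'PDBID:chain' to the set of GO IDs linked to it."""
    mapping = {}
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) <= COL_GO:
            continue  # row too short to carry PDB and GO columns
        go_ids = split_field(row[COL_GO])
        for chain in split_field(row[COL_PDB]):
            mapping.setdefault(chain, set()).update(go_ids)
    return mapping


# Made-up rows in the assumed layout (not real UniProt entries).
SAMPLE_ROWS = [
    "X00001\tEXMP_HUMAN\t\t\t\t1ABC:A; 1ABC:B\tGO:0005524; GO:0004672\n",
    "X00002\tEXMQ_HUMAN\t\t\t\t\tGO:0005515\n",
]

go_by_chain = chain_to_go(SAMPLE_ROWS)
```

For the real file, the same function could be fed from gzip.open(path, "rt"), which yields text lines without unpacking the archive first.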
2) Functional Information from PDB Data
The SITE information in the original PDB data is displayed with the type name "SITE". The chain and residue names are those defined by the authors, as they appear in the original PDB flat files.
3) Functional Information from PDB atom coordinates for the "HETATM" binding sites
Ligand binding site information is extracted from every PDB structure as the set of residues that have at least one atom closer than 5.0 Angstrom to an atom recorded as "HETATM". "HETATM" residues named "HOH", "WAT", "PO4", "SO4", "MSE", "TPO", "SEP", "PTR", "HIP", "PAS", "ASQ", "NA ", or "CL " are excluded, and "HETATM" residues composed of fewer than three atoms are also not counted. Every functional site is displayed with the type name "binding site". The chain names and residue numbers are those defined by the authors, as they appear in the original PDB flat files.
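The distance criterion above can be sketched as follows. This is an illustrative reading of the rule, not the code used by PDBj. It assumes standard fixed-column PDB records, treats only "ATOM" records as protein residues, and uses hypothetical function names (parse_atoms, binding_site). It needs Python 3.8 or later for math.dist.

```python
import math
from collections import defaultdict

# HETATM residue names that are ignored, as listed in the text above.
EXCLUDED = {"HOH", "WAT", "PO4", "SO4", "MSE", "TPO", "SEP",
            "PTR", "HIP", "PAS", "ASQ", "NA", "CL"}
CUTOFF = 5.0      # Angstrom
MIN_ATOMS = 3     # HETATM residues with fewer atoms are not counted


def parse_atoms(pdb_lines):
    """Return (protein atoms, ligand atom coordinates) from PDB-format lines."""
    protein = []                 # list of (residue key, xyz)
    hetero = defaultdict(list)   # residue key -> list of xyz
    for line in pdb_lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        res_name = line[17:20].strip()
        # Author-defined chain ID and residue number (with insertion code).
        residue = (line[21], line[22:27].strip(), res_name)
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        if record == "ATOM":
            protein.append((residue, xyz))
        elif res_name not in EXCLUDED:
            hetero[residue].append(xyz)
    ligand_atoms = [xyz for atoms in hetero.values()
                    if len(atoms) >= MIN_ATOMS for xyz in atoms]
    return protein, ligand_atoms


def binding_site(pdb_lines, cutoff=CUTOFF):
    """Return sorted residue keys with any atom closer than cutoff to a ligand atom."""
    protein, ligand_atoms = parse_atoms(pdb_lines)
    site = set()
    for residue, xyz in protein:
        if any(math.dist(xyz, lig) < cutoff for lig in ligand_atoms):
            site.add(residue)
    return sorted(site)
```

The all-against-all loop is quadratic in the number of atoms. That is acceptable for a single entry, but a spatial grid or k-d tree would be the usual choice when scanning the whole archive.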
4) Functional Information from PROSITE/UniProt
The amino acid residues identified as motif sequences by PROSITE/UniProt are annotated using ScanProsite ( http://www.expasy.org/tools/scanprosite/ ) and displayed with the type name "PROSITE".
5) Functional Information from
The Feature Table (FT) information in the corresponding amino acid sequence entry in SwissProt/UniProt that relates to the molecular function of each protein is collected and displayed with the following type names, respectively:
6) Catalytic Information from CSA
Catalytic information on enzymes is collected and distributed as the CSA (Catalytic Site Atlas) database, in which the catalytic sites of representative enzymes are annotated by C.T. Porter, G.J. Bartlett, and J.M. Thornton ( http://www.ebi.ac.uk/thornton-srv/databases/CSA/ ). CSA information is attached to each protein with the site_id "CSA#" and the type name "catalytic site", where # is the ID number. The chain names and residue numbers are those defined by the authors, as they appear in the original PDB flat files.
7) Catalytic Information from CATRES
Catalytic information on enzymes is also collected and distributed as the CATRES (Catalytic Residue Dataset) database, in which the catalytic sites of representative enzymes are manually annotated by G.J. Bartlett, C.T. Porter, N. Borkakoti, and J.M. Thornton ( http://www.ebi.ac.uk/thornton-srv/databases/CATRES/ ). CATRES information is attached to each protein with the site_id "CATRES#" and the type name "catalytic site", where # is the ID number. In addition, proteins homologous to the representative enzymes in the original CATRES database are automatically extracted from all PDB entries, and the corresponding chain and residue IDs are identified by sequence alignment. This extended CATRES information is attached to each protein with the site_id "extCATRES#" and the type name "catalytic site", where # is the ID number. The chain names and residue numbers are those defined by the authors, as they appear in the original PDB flat files. The procedure for extracting the catalytic residues from homologous enzymes is as follows:
1. Extract the PDB sequence (the query) from the original CATRES file. (If the catalytic residues span more than one chain, the entry is currently skipped.)
2. Align the query sequence against all the PDB sequences using BLAST.
3. If all catalytic residues are contained within the BLAST alignment and all catalytic residues are conserved (100% identity), the function of the new (template) sequence is likely to be the same as that in the CATRES file.
4. The final determination is based on the structural similarity of the active-site residues. When the distance RMSD value is less than 3 Angstrom, the active site is considered to have the same catalytic function, and the extCATRES file, written in XML, is stored. The distance RMSD of the active-site atoms is computed as follows:
For each pair of residues (i) in the query, find the pair of atoms with the smallest distance (dq_i). Compute the distance between the same pair of atoms in the template (dt_i). Compute the RMSD as sqrt( sum_i( (dq_i - dt_i)^2 ) / Npair ), where Npair is the number of residue pairs.
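The distance RMSD defined above can be sketched as follows. This is one reading of the description, not the original code. It assumes that each catalytic residue is given as a dict from atom name to coordinates, that the query and template lists hold equivalent residues in the same order, and that equivalent atoms share the same atom name (reasonable here, since the catalytic residues are required to be identical). The names closest_atom_pair, distance_rmsd, and same_function are hypothetical.

```python
import math
from itertools import combinations

DRMSD_CUTOFF = 3.0  # Angstrom, threshold quoted in step 4


def closest_atom_pair(res_a, res_b):
    """Return (atom name in res_a, atom name in res_b, distance) for the closest atoms."""
    return min(
        ((name_a, name_b, math.dist(xyz_a, xyz_b))
         for name_a, xyz_a in res_a.items()
         for name_b, xyz_b in res_b.items()),
        key=lambda item: item[2],
    )


def distance_rmsd(query, template):
    """Distance RMSD over all residue pairs of two equivalent residue lists."""
    if len(query) != len(template) or len(query) < 2:
        raise ValueError("need two equally long lists of at least two residues")
    squared = []
    for i, j in combinations(range(len(query)), 2):
        # Closest atom pair is chosen in the query ...
        name_a, name_b, dq = closest_atom_pair(query[i], query[j])
        # ... and the same two atoms are measured in the template.
        dt = math.dist(template[i][name_a], template[j][name_b])
        squared.append((dq - dt) ** 2)
    return math.sqrt(sum(squared) / len(squared))


def same_function(query, template, cutoff=DRMSD_CUTOFF):
    """Apply the dRMSD threshold from step 4."""
    return distance_rmsd(query, template) < cutoff
```

Because only internal distances are compared, no superposition of the two structures is needed, which is presumably why a distance RMSD rather than a coordinate RMSD is used here.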
Please send questions and comments about the Miner by e-mail to pdbj-master.