Genome-based peptide fingerprint scanning
Genome-based peptide fingerprint scanning (GFS) is a bioinformatics method that attempts to identify the genomic origin of sample proteins (that is, the locus in the genome that encodes them) by scanning their peptide-mass fingerprint against the theoretical translation and proteolytic digest of an entire genome.[1] The method improves on earlier approaches because it compares the peptide fingerprints to the entire genome sequence rather than only to regions that have already been annotated.[2] It therefore has the potential to improve genome annotation and to identify proteins with incorrect or missing annotations.
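The idea described above (translate the genome, digest it in silico, and compare theoretical peptide masses with observed ones) can be illustrated with a minimal Python sketch. This is not the published GFS implementation. The function names, the trypsin cleavage rule (cut after K or R unless followed by P), the minimum peptide length, and the mass tolerance are all assumptions chosen for illustration, and a real tool would also score and rank candidate loci rather than simply list matches.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (last base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Monoisotopic residue masses in daltons; one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna):
    """Yield (strand, frame, protein) for all six reading frames."""
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
            # Codons containing unknown bases are translated as "X".
            yield strand, frame, "".join(CODON_TABLE.get(c, "X") for c in codons)


def tryptic_peptides(protein, min_len=6):
    """Split on stop codons, then cleave after K/R unless the next residue is P."""
    for orf in protein.split("*"):
        start = 0
        for i, aa in enumerate(orf):
            at_end = i == len(orf) - 1
            if at_end or (aa in "KR" and orf[i + 1] != "P"):
                peptide = orf[start:i + 1]
                if len(peptide) >= min_len and "X" not in peptide:
                    yield peptide
                start = i + 1


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def scan_genome(dna, observed_masses, tol=0.01):
    """List theoretical peptides whose mass lies within tol Da of an observed mass."""
    hits = []
    for strand, frame, protein in six_frame_translations(dna):
        for peptide in tryptic_peptides(protein):
            mass = peptide_mass(peptide)
            if any(abs(mass - obs) <= tol for obs in observed_masses):
                hits.append((strand, frame, peptide, round(mass, 4)))
    return hits


if __name__ == "__main__":
    # Toy sequence (invented for this example) encoding the peptide SAMPLER plus a stop codon.
    toy_dna = "TCTGCTATGCCTCTTGAACGTTAA"
    print(scan_genome(toy_dna, [peptide_mass("SAMPLER")]))
```

Because every reading frame of both strands is digested, a sketch like this needs no gene annotation at all, which is the property the method relies on. The cost is a much larger search space than a protein database search.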
History and background
GFS was designed by Michael C. Giddings (University of North Carolina, Chapel Hill) et al., and released in 2003. Giddings expanded the algorithms for GFS from earlier ideas. Two papers published in 1993 explained techniques for identifying proteins in sequence databases. These methods determined the mass of peptides using mass spectrometry, and then used the masses to search protein databases to identify the proteins.[3][4] In 1999 a more complex program called Mascot was released that integrated three types of protein database search: peptide molecular weights, tandem mass spectrometry data from one or more peptides, and mass data combined with amino acid sequence.[5] The drawback of this widely used program is that it cannot detect alternative splice sites that are not currently annotated, and it is usually unable to find proteins that have not been annotated. Giddings built upon these sources to create GFS, which compares peptide mass data to entire genomes to identify the proteins. Giddings's system is able to find features that are missing from existing annotations, such as undocumented genes and undocumented alternative splice sites.
Research examples
In 2012, research was published in which genes and proteins were found in a model organism that could not have been found without GFS because they had not been previously annotated. The planarian Schmidtea mediterranea has been used in research for over 100 years. This planarian is capable of regenerating missing body parts and is therefore emerging as a potential model organism for stem cell research. Planarians are covered in mucus, which aids in locomotion, protects them from predation, and supports their immune system. The genome of Schmidtea mediterranea is sequenced but mostly unannotated, making it a prime candidate for genome-based peptide fingerprint scanning. When the proteins were analyzed with GFS, 1,604 proteins were identified, most of which had not been annotated before. The researchers were also able to identify the mucous subproteome (the set of proteins associated with mucus production). They found that this subproteome was conserved in the related parasitic flatworm Schistosoma mansoni. The mucous subproteome is so conserved that 119 of the planarian proteins have orthologs in humans. Because of this similarity, the planarian can now be used as a model to study mucous protein function in humans. This is relevant for infections and diseases related to mucous abnormalities, such as cystic fibrosis, asthma, and other lung diseases.[6]
In February 2013, proteogenomic mapping research was published that used ENCODE cell line data to identify translated regions in the human genome. The researchers applied peptide fingerprint scanning and Mascot to the protein data to find regions that had not previously been annotated as translated in the human genome. The search against the whole genome revealed that approximately 4% of the unique peptides found were outside of previously annotated regions. The whole-genome search also revealed 15% more hits than a protein database search (such as Mascot) alone. GFS can be used as a complementary method for annotation because it can find new genes or splice sites that have not been annotated before. However, the whole-genome approach used by GFS can be less sensitive than programs that look only at annotated regions.[7]
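The comparison described above, in which peptides mapped to the genome are checked against existing annotation, can be sketched as a simple interval-overlap test. The coordinate format (chromosome, start, end, with half-open intervals), the function names, and the example coordinates below are hypothetical and are not taken from the ENCODE study.

```python
def overlaps_annotation(hit, annotations):
    """Return True if a (chrom, start, end) hit overlaps any annotated interval."""
    chrom, start, end = hit
    return any(
        a_chrom == chrom and start < a_end and a_start < end
        for a_chrom, a_start, a_end in annotations
    )


def novel_hits(hits, annotations):
    """Return the hits lying entirely outside annotated regions and their fraction."""
    novel = [h for h in hits if not overlaps_annotation(h, annotations)]
    fraction = len(novel) / len(hits) if hits else 0.0
    return novel, fraction


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    annotated = [("chr1", 100, 200)]
    peptide_hits = [
        ("chr1", 150, 180),  # inside an annotated region
        ("chr1", 300, 330),  # same chromosome, unannotated
        ("chr2", 150, 180),  # different chromosome, unannotated
        ("chr1", 190, 210),  # partially overlaps the annotated region
    ]
    novel, fraction = novel_hits(peptide_hits, annotated)
    print(novel, fraction)
```

Hits that fall outside every annotated interval are the candidates for new genes or splice variants. In practice they would need further validation, since a whole-genome search is more prone to chance matches.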
References
- Giddings, M. C.; Shah, A. A.; Gesteland, R.; Moore, B. (2003). "Genome-based peptide fingerprint scanning". PNAS. 100 (1): 20–25. doi:10.1073/pnas.0136893100. PMC 140871. PMID 12518051.
- Shinoda, Kosaku; Nozomu Yachie; Takeshi Masuda; Naoyuki Sugiyama; Masahiro Sugimoto; Tomoyoshi Soga; Masaru Tomita (29 October 2006). "HybGFS: a hybrid method for genome-fingerprint scanning". BMC Bioinformatics. 7: 479. doi:10.1186/1471-2105-7-479. PMC 1643838. PMID 17069662.
- Henzel, W J; T M Billeci; J T Stults; S C Wong; C Grimley; C Watanabe (1 June 1993). "Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases". PNAS. 90 (11): 5011–5015. Bibcode:1993PNAS...90.5011H. doi:10.1073/pnas.90.11.5011. PMC 46643. PMID 8506346.
- Mann, Matthias; Peter Højrup; Peter Roepstorff (June 1993). "Use of mass spectrometric molecular weight information to identify proteins in sequence databases". Biological Mass Spectrometry. 22 (6): 338–345. doi:10.1002/bms.1200220605. PMID 8329463.
- Perkins, David N.; Darryl J. C. Pappin; David M. Creasy; John S. Cottrell (1 December 1999). "Probability-based protein identification by searching sequence databases using mass spectrometry data". Electrophoresis. 20 (18): 3551–3567. doi:10.1002/(sici)1522-2683(19991201)20:18<3551::aid-elps3551>3.0.co;2-2. PMID 10612281.
- Bocchinfuso, Donald G. (September 2012). "Proteomic Profiling of the Planarian Schmidtea mediterranea and Its Mucous Reveals Similarities with Human Secretions and Those Predicted for Parasitic Flatworms". Molecular & Cellular Proteomics. 11 (9): 681–91. doi:10.1074/mcp.M112.019026. PMC 3434776. PMID 22653920.
- Khatun, Jainab (February 2013). "Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions". BMC Genomics. 14: 141. doi:10.1186/1471-2164-14-141. PMC 3607840. PMID 23448259.