Bioinformatics. Group of authors


      This chapter was written by Dr. Andreas D. Baxevanis in his private capacity. No official support or endorsement by the National Institutes of Health or the United States Department of Health and Human Services is intended or should be inferred.

       Andreas D. Baxevanis

      On April 14, 2003, the biological community celebrated the achievement of the Human Genome Project's major goal: the complete, accurate, and high-quality sequencing of the human genome (International Human Genome Sequencing Consortium 2001; Schmutz et al. 2004). The attainment of this goal, which many have compared to landing a person on the moon, has had a profound effect on how biological and biomedical research is conducted and will undoubtedly continue to shape its direction in the future. The availability of not just human genome data, but also human sequence variation data, model organism sequence data, and information on gene structure and function provides fertile ground for biologists to better design and interpret their experiments in the laboratory, fulfilling the promise of bioinformatics in advancing and accelerating biological discovery.

      One of the most important databases available to biologists is GenBank, the annotated collection of all publicly available DNA and protein sequences (Benson et al. 2017; see Chapter 1). This database, maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH), represents a collaborative effort between NCBI, the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ). At the time of this writing, GenBank contained over 200 million sequences and over 300 trillion nucleotide bases. The completion of human genome sequencing and the sequencing of an ever-expanding number of model organism genomes, as well as the existence of a gargantuan number of sequences in general, provide a golden opportunity for biological scientists, owing to the inherent value of these data. At the same time, however, the sheer magnitude of data presents a conundrum to the inexperienced user, resulting not just from the size of the “sequence information space” but from the fact that the information space continues to grow by leaps and bounds, at a pace that will continue to accelerate, even though human genome sequencing has long been “completed.”

[Figure: graph depicting the exponential growth of GenBank in terms of number of nucleotides and number of sequences submitted.]

      One of the most widely used interfaces for the retrieval of information from biological databases is the NCBI Entrez system. Entrez capitalizes on the fact that there are pre-existing, logical relationships between the individual entries found in numerous public databases. For example, a paper in PubMed may describe the sequencing of a gene whose sequence appears in GenBank. The nucleotide sequence, in turn, may code for a protein product whose sequence is stored in NCBI's Protein database. The three-dimensional structure of that protein may be known, and the coordinates for that structure may appear in NCBI's Structure database. Finally, there may be allelic or structural variants documented for the gene of interest, cataloged in databases such as the Single Nucleotide Polymorphism Database (called dbSNP) or the Database of Genomic Structural Variation (called dbVar), respectively. The existence of such natural connections,

