Bioinformatics. Group of authors
This chapter was written by Dr. Andreas D. Baxevanis in his private capacity. No official support or endorsement by the National Institutes of Health or the United States Department of Health and Human Services is intended or should be inferred.
4 Genome Browsers
Tyra G. Wolfsberg
Introduction
The first complete sequence of a eukaryotic genome – that of Saccharomyces cerevisiae – was published in 1996 (Goffeau et al. 1996). The chromosomes of this organism, which range in size from 270 to 1500 kb, presented an immediate challenge in data management, as the upper limit for single database entries in GenBank at the time was 350 kb. To better manage the yeast genome sequence, as well as other chromosome- and genome-length sequences being deposited into GenBank around that time, the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) established the Genomes division of Entrez (Benson et al. 1997). Entries in this division were organized around a reference sequence onto which all other sequences from that organism were aligned. Because these reference sequences had no size limit, “virtual” reference sequences of large genomes or chromosomes could be assembled from shorter GenBank sequences. For partially sequenced chromosomes, NCBI developed methods to integrate genetic, physical, and cytogenetic maps onto the framework of the whole chromosome. Thus, Entrez Genomes was able to provide the first graphical views of large-scale genomic sequence data.
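The idea of a “virtual” reference sequence assembled from shorter records can be illustrated with a minimal Python sketch. It is not NCBI's actual method, which is not described here. The `Segment` class, the accession names, the coordinates, and the use of `N` as a gap character are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    accession: str  # hypothetical identifier of a shorter database record
    start: int      # 0-based start position of the record on the reference
    sequence: str   # bases contributed by this record


def build_virtual_reference(segments, length, gap_char="N"):
    """Lay shorter records onto a reference coordinate system.

    Positions not covered by any record remain as gap characters.
    Where records overlap, the record with the later start overwrites.
    """
    reference = [gap_char] * length
    for seg in sorted(segments, key=lambda s: s.start):
        end = seg.start + len(seg.sequence)
        if seg.start < 0 or end > length:
            raise ValueError(f"{seg.accession} falls outside the reference")
        reference[seg.start:end] = seg.sequence
    return "".join(reference)


# Two toy records placed on a 12-base reference (assumed values).
segments = [
    Segment("REC_B", 6, "GGCC"),
    Segment("REC_A", 0, "ACGT"),
]
virtual = build_virtual_reference(segments, 12)
print(virtual)
```

In this toy case, the two four-base records are placed at positions 0 and 6, and the uncovered positions are left as gaps. This loosely mirrors how a reference longer than any single database entry could be represented.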
The working draft of the human genome, completed in February 2001 (Lander et al. 2001), generated virtual reference sequences for each human chromosome, ranging in size from 46 to 246 Mb. NCBI created the first version of its human Map Viewer (Wheeler et al. 2001) shortly thereafter to display these longer sequences. Around the same time, the University of California, Santa Cruz (UCSC) Genome Bioinformatics Group was developing its own human genome browser, based on software originally designed for displaying the much smaller Caenorhabditis elegans genome (Kent and Zahler 2000). Similarly, the Ensembl project at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) was producing a system to automatically annotate the human genome sequence, as well as store and visualize the data (Hubbard et al. 2002). The three genome browsers all came online at about the same time, and researchers began using them to help navigate the human genome (Wolfsberg et al. 2002). Today, each site provides free access not only to human sequence data but also to a myriad of other assembled genomic sequences, from commonly used model organisms such as mouse to more recently released assemblies such as those of the domesticated turkey. Although NCBI's Map Viewer is no longer being developed and will be replaced by its new Genome Data Viewer (Sayers et al. 2019), the UCSC and Ensembl Genome Browsers continue to be popular resources, used by most members of the bioinformatics and genomics communities. This chapter will focus on the latter two genome browsers.
The reference human genome was sequenced using a clone-by-clone shotgun strategy and was declared complete in April 2003, although sequencing of selected regions is still continuing. This strategy involved constructing a bacterial artificial chromosome (BAC) tiling map for each human chromosome, then sequencing each BAC using a shotgun sequencing approach (reviewed in Green 2001). The sequences of individual BACs were deposited into the High Throughput Genomic (HTG) division of GenBank as they became available. UCSC began assembling these BAC sequences into longer contigs in May 2000 (Kent and Haussler 2001), followed by assembly efforts undertaken at NCBI (Kitts 2003). These contigs, which contained gaps and regions of uncertain order, became the basis for the development of the genome browsers. Over time, as the genome sequence was finished, the human genome assembly was updated every few months. After UCSC stopped producing