Bioinformatics. Группа авторов

Читать онлайн книгу.

Bioinformatics - Группа авторов


Скачать книгу
is represented only once, RefSeq is, by definition, non-redundant. Nucleotide and protein sequences in RefSeq are explicitly linked to one another. Most importantly, RefSeq entries undergo ongoing curation, assuring that the RefSeq entry represents the most up-to-date state of knowledge regarding a particular DNA, mRNA, or protein sequence.

      RefSeq entries are distinguished from other entries in GenBank through the use of a distinct accession number series. RefSeq accession numbers follow a “2 + 6” format: a two-letter code indicating the type of reference sequence, followed by an underscore and a six-digit number. Experimentally determined sequence data are denoted as follows:

NT_123456 Genomic contigs (DNA)
NM_123456 mRNAs
NP_123456 Proteins

      Reference sequences derived through genome annotation efforts are denoted as follows:

XM_123456 Model mRNAs
XM_123456 Model proteins

      It is important to understand the distinction between the “N” numbers and “X” numbers – the former represent actual, experimentally determined sequences, while the latter represent computational predictions derived from the raw DNA sequence.

      Additional types of RefSeq entries, along with more information on the RefSeq project, can be found on the NCBI RefSeq web site.

      As most sequence data in protein databases are derived from the translation of nucleotide sequences, they can be, in large part, thought of as “secondary databases.” Universal protein sequence databases cover proteins from all species, whereas specialized protein sequence databases concentrate on particular protein families, groups of proteins, or those from a specific organism. Representative model organism databases include the Mouse Genome Database (MGD; Smith et al. 2018) and WormBase (Lee et al. 2018), among others (Baxevanis and Bateman 2015; Rigden and Fernández 2018). Organismal sequence databases are discussed in greater detail in Chapter 2.

      Universal protein databases can be divided further into two broad categories: sequence repositories, where the data are stored with little or no manual intervention, and curated databases, in which experts enhance the original data through expert biocuration. The importance of ensuring interoperability, creating and implementing standards, and adopting best practices aimed at accurately representing the biological knowledge found within the sequence databases is absolutely paramount. Indeed, these curation goals are so important that there is an organization called the International Society for Biocuration, the primary mission of which is to advance these central tenets.

      The NCBI Protein Database

      UniProt

Snapshot depicts the results of a search for the human heterogeneous nuclear ribosomal protein A1 record within UniProtKB, using the accession number P09651 as the search term.
Скачать книгу