Bioinformatics. Group of authors
supporting the annotation. The lower part of Figure 1.3 also shows information regarding the protein's involvement in disease, documenting variants that have been implicated in early onset Paget disease and amyotrophic lateral sclerosis (Kim et al. 2013; Liu et al. 2016).
Figure 1.3 The Subcellular location and Pathology & Biotech sections of the record for the human heterogeneous nuclear ribonucleoprotein A1 within UniProtKB. These sections can be accessed by clicking on the blue tiles in the left-hand column of the window. See text for details.
In the upper left corner of the UniProtKB window are display options that are quite useful for visualizing the significant amount of data found in this entry's feature table. Clicking on Feature viewer produces the view shown in Figure 1.4, which neatly summarizes the annotations for this sequence in a coordinate-based fashion. Any of the sections can be expanded by clicking on the labels in the blue boxes to the left of the graphic. Here, the post-translational modification (PTM) section has been expanded, showing the positions of modified residues in this protein; clicking on any of the markers in the track will produce a pop-up with additional information on the PTM, along with relevant links to the literature. In Figure 1.5, the Structural features and Variants sections have also been expanded, showing the positions of all alpha helices, beta strands, and beta turns within the protein, as well as the locations of putative clinically relevant point mutations. Here, a variant at position 351 is highlighted: a proline-to-leucine substitution, identified as part of the ClinVar project (Landrum et al. 2016), that has a possible association with relapsing–remitting multiple sclerosis. By examining different sections of this graphical display, the user can start to see how various features overlap with one another, for example whether a known or predicted disease-causing variant falls within a structured region of the protein. These annotations and observations can provide important insights with respect to experimental design and the interpretation of experimental data.
Figure 1.4 The Feature viewer rendering of the record for the human heterogeneous nuclear ribonucleoprotein A1 within UniProtKB. Clicking the Display link, found in the upper left portion of the window, provides access to the Feature viewer. Any of the sections can be expanded by clicking on the labels in the blue boxes to the left of the graphic. See text for details.
Figure 1.5 Expanding the PTM, Structural features, and Variants sections within the Feature viewer display shows the positions of all post-translational modifications (PTMs), alpha helices, beta strands, and beta turns within the human heterogeneous nuclear ribonucleoprotein A1, as well as the locations of putative clinically relevant point mutations. Clicking on any of the variants produces a pop-up window with additional information; here, the pop-up window provides disease association data for the proline-to-leucine variant at position 351 of the sequence. See text for details.
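The kind of overlap that the Feature viewer reveals visually (for example, a variant falling within a structured region or coinciding with a modified residue) can also be checked programmatically once feature coordinates are in hand. What follows is a minimal sketch of that idea, not a description of how UniProtKB itself works: the Feature class, the features_at helper, and every coordinate shown are hypothetical and chosen purely for illustration; none of them is taken from the actual record.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Feature:
    """A coordinate-based annotation (hypothetical structure for illustration)."""
    kind: str    # e.g. "helix", "beta strand", "modified residue"
    start: int   # first residue covered, 1-based, inclusive
    end: int     # last residue covered, 1-based, inclusive


def features_at(position: int, features: List[Feature]) -> List[Feature]:
    """Return every feature whose residue range includes the given position."""
    return [f for f in features if f.start <= position <= f.end]


# Hypothetical feature table; these coordinates are invented for the example.
features = [
    Feature("helix", 10, 25),
    Feature("beta strand", 40, 48),
    Feature("modified residue", 44, 44),
]

# Hypothetical variant position to test against the feature table.
variant_position = 44

hits = features_at(variant_position, features)
if hits:
    for f in hits:
        print(f"Position {variant_position} overlaps {f.kind} ({f.start}-{f.end})")
else:
    print(f"Position {variant_position} overlaps no annotated feature")
```

Representing each annotation as a closed residue interval keeps the test to a single comparison per feature, which is adequate for the few dozen to few hundred features in a typical protein record; an interval tree would only be worth considering for much larger feature sets.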
Summary
The rapid pace of discovery in the genomic and proteomic arenas requires that databases be built in a way that facilitates not just the storage of these data but also the efficient handling and retrieval of information. Over the decades, many lessons about database design and content have been learned, often the hard way. Thus, the continued development of existing databases, as well as the conceptualization and creation of new types of databases, will be a critical focal point for the advancement of biological discovery. As should be clear from this chapter, keeping databases up to date and accurate is a task that requires the active involvement of the biological community (Box 1.3). It is therefore incumbent upon all users to take an active part in ensuring the accuracy of these data, engaging the curators in a continuous dialog so that these widely used resources remain valuable to biologists worldwide.
Box 1.3 Ensuring the Continued Quality of Data in Public Sequence Databases
Given the roles of DDBJ, EMBL, and GenBank in maintaining the archive of all publicly available DNA, RNA, and protein sequences, the continued usefulness of this resource is highly dependent on the quality of the data found within it. Despite the high degree of both manual and automated checking that takes place before a record becomes public, errors still find their way into the databases. These errors may be trivial and have no biological consequence (e.g. an incorrect postal code), may be misleading (e.g. an organism having the correct genus but the wrong species name), or may be downright incorrect (e.g. a full-length mRNA not having a CDS annotated on it). Sometimes, records may have incorrect reference blocks, preventing researchers from linking to the correct publication describing the sequence. Over time, many have taken an active role in reporting these errors but, more often than not, the errors are left uncorrected.
While the individual INSDC members have the responsibility for hosting and disseminating the data found within their databases, keep in mind that the ownership of the data rests with the original submitter – and these original submitters (or their designees) are the only ones who can make updates to their database records. To keep these community resources as accurate and up to date as possible, users are actively encouraged to report any errors found when using the databases in the course of their work so that the database administrators can follow up with the original submitters as appropriate.
Given below are the current e-mail addresses for submitting information regarding errors to the three major sequence databases. As all the databases share information with each other nightly, it is only necessary to report the error to one of the three members of the consortium. Authors are actively encouraged to check their own records periodically to ensure that the information they previously submitted is still accurate. Even though this charge to the community is discussed here in the context of the three major sequence databases, all databases provide similar mechanisms through which incorrect information can be brought to the attention of the database administrators.
DDBJ | [email protected] |
EMBL | [email protected] |
GenBank | [email protected] |
As alluded to above, the range of publicly available data obviously goes well beyond human