Bioinformatics. Group of authors

the development of a method through which all of the information about a particular biological entity could be found without having to sequentially visit and query individual databases, one by one.

      Relationships Between Database Entries: Neighboring

      The concept of neighboring enables entries within a given database to be connected to one another. If a user is looking at a particular PubMed entry, the user can then “ask” Entrez to find all of the other papers in PubMed that are similar in subject matter to the original paper. Likewise, if a user is looking at a sequence entry, Entrez can return a list of all other sequences that bear similarity to the original sequence. The establishment of neighboring relationships within a database is based on statistical measures of similarity, some of which are described in more detail below. While the term “neighboring” has traditionally been used to describe these connections, the terminology on the NCBI web site denotes neighbors as “related data.”
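
      As a rough illustration of asking Entrez for neighbors programmatically, the sketch below builds a request URL for the NCBI E-utilities ELink service, which exposes related-record lookups. It only constructs the URL and does not contact NCBI. The helper name `neighbor_query_url` and the PubMed identifier are illustrative assumptions, not part of the original text.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ELink service.
ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def neighbor_query_url(pmid: str) -> str:
    """Build an ELink URL asking for PubMed records related to one PubMed ID.

    Hypothetical helper for illustration; it does not send the request.
    """
    params = {
        "dbfrom": "pubmed",       # database holding the starting record
        "db": "pubmed",           # same database, so the links are neighbors
        "id": pmid,               # identifier of the starting record
        "cmd": "neighbor_score",  # ask for neighbors along with their scores
    }
    return f"{ELINK_URL}?{urlencode(params)}"


# Arbitrary example identifier, used only to show the shape of the URL.
url = neighbor_query_url("20210808")
print(url)
```

      Setting `dbfrom` and `db` to the same database is what makes this a neighboring query rather than a link between different databases.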

      BLAST. Biological sequence similarities are detected and sequence data are compared with one another using the Basic Local Alignment Search Tool, or BLAST (Altschul et al. 1990). This algorithm attempts to find high-scoring segment pairs – pairs of sequences that can be aligned with one another and, when aligned, meet certain scoring and statistical criteria. Chapter 3 discusses the family of BLAST algorithms and their application at length.
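
      To make the idea of a high-scoring segment pair concrete, here is a minimal sketch that finds the best-scoring ungapped segment pair between two short sequences by exhaustively scanning every diagonal. This is not how BLAST itself searches (BLAST uses a seeded heuristic and statistical thresholds); the function name and the +1/-1 match/mismatch scores are assumptions chosen for illustration.

```python
def best_segment_pair(a: str, b: str, match: int = 1, mismatch: int = -1):
    """Return (score, start_a, start_b, length) of the best ungapped segment pair.

    Exhaustive scan over all diagonals; scores are illustrative assumptions.
    """
    best = (0, 0, 0, 0)
    # Each diagonal fixes one relative offset between the two sequences.
    for offset in range(-(len(b) - 1), len(a)):
        i = max(offset, 0)   # starting index in a for this diagonal
        j = max(-offset, 0)  # starting index in b for this diagonal
        run_score, run_start = 0, 0
        k = 0
        while i + k < len(a) and j + k < len(b):
            if run_score <= 0:
                # A non-positive running score cannot help: restart here.
                run_score, run_start = 0, k
            run_score += match if a[i + k] == b[j + k] else mismatch
            if run_score > best[0]:
                best = (run_score, i + run_start, j + run_start, k - run_start + 1)
            k += 1
    return best


score, start_a, start_b, length = best_segment_pair("ACGTTGCA", "TTACGTAA")
print(score, start_a, start_b, length)
```

      In this toy input the best segment is the shared run `ACGT`, which starts at position 0 in the first sequence and position 2 in the second.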

      VAST. Molecular structure similarities are detected and sets of coordinate data are compared using a vector-based method known as VAST (the Vector Alignment Search Tool; Gibrat et al. 1996). This methodology uses geometric criteria to assess similarity between three-dimensional domains. A VAST comparison proceeds in three major steps:

       First, based on known three-dimensional coordinate data, the alpha helices and beta strands that constitute the structural core of each protein are identified. Straight-line vectors are then calculated based on the position of these secondary structural elements. VAST keeps track of how one vector is connected to the next (that is, how the C-terminal end of one vector connects to the N-terminal end of the next vector), as well as whether each vector represents an alpha helix or a beta strand. Subsequent comparison steps use only these vectors in assessing structural similarity to other proteins – so, in effect, most of the painstakingly deduced atomic coordinate data are discarded at this step. This apparent oversimplification is a consequence of the scale of the problem at hand: with the 150 000 structures in the Molecular Modeling Database (MMDB; Madej et al. 2014) available at the time of this writing, an in-depth comparison of every structure with all of the other structures in MMDB would take so long that the calculations would be impractical.

       Next, the algorithm attempts to optimally align these sets of vectors, looking for pairs of structural elements that are of the same type and relative orientation, with consistent connectivity between the individual elements. The objective is to identify highly similar “core substructures,” pairs whose match is statistically significant relative to what would be obtained by comparing randomly chosen proteins with one another.

       Finally, a refinement is done using Monte Carlo (random search) methods at each residue position to optimize the structural alignment. The resultant alignment need not be global, as matches may be between individual structural domains of the proteins being compared.
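
      The first two steps above can be sketched in a few lines. The code below reduces each secondary structural element to a typed vector and then counts pairs of consecutive elements that agree in type and in relative orientation. It is a toy illustration only: the `Element` class, the 30-degree tolerance, and the coordinates are assumptions made for this example, element endpoints are taken as given rather than fitted to atomic coordinates, and neither the statistical significance test nor the Monte Carlo refinement is attempted.

```python
import math
from dataclasses import dataclass


@dataclass
class Element:
    """One secondary structural element reduced to a straight-line vector."""
    kind: str     # "helix" or "strand"
    start: tuple  # (x, y, z) of the N-terminal end
    end: tuple    # (x, y, z) of the C-terminal end

    def vector(self):
        return tuple(e - s for s, e in zip(self.start, self.end))


def angle_deg(u, v):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def compatible(pair_a, pair_b, tolerance=30.0):
    """Consecutive-element pairs match if types agree and angles are close."""
    (a1, a2), (b1, b2) = pair_a, pair_b
    if (a1.kind, a2.kind) != (b1.kind, b2.kind):
        return False
    angle_a = angle_deg(a1.vector(), a2.vector())
    angle_b = angle_deg(b1.vector(), b2.vector())
    return abs(angle_a - angle_b) <= tolerance


def count_compatible_pairs(protein_a, protein_b):
    """Count matching pairs of connected elements between two proteins."""
    pairs_a = list(zip(protein_a, protein_a[1:]))
    pairs_b = list(zip(protein_b, protein_b[1:]))
    return sum(compatible(pa, pb) for pa in pairs_a for pb in pairs_b)


# Made-up coordinates: protein_b is protein_a in a different orientation.
protein_a = [Element("helix", (0, 0, 0), (10, 0, 0)),
             Element("strand", (10, 0, 0), (10, 8, 0))]
protein_b = [Element("helix", (0, 0, 0), (0, 0, 10)),
             Element("strand", (0, 0, 10), (8, 0, 10))]
protein_c = [Element("helix", (0, 0, 0), (10, 0, 0)),
             Element("helix", (10, 0, 0), (10, 8, 0))]

print(count_compatible_pairs(protein_a, protein_b))
print(count_compatible_pairs(protein_a, protein_c))
```

      Because only angles between connected vectors are compared, the result does not depend on how each protein is oriented in space, which is one reason a vector representation is convenient for this kind of comparison.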

      By using approaches such as VAST and VAST+, it is possible to find structural relationships between proteins in cases where simply looking at sequence similarity may not suggest relatedness – information that could, with additional data and insights, be used to help inform the question of functional similarity. More information on additional structure prediction methods based on X-ray or nuclear magnetic resonance (NMR) coordinate data can be found in Chapter 12.

      Weighted Key Terms. The problem of comparing sequence or structure data somewhat pales next to that of comparing PubMed entries, which consist of free text whose rules of syntax are not necessarily fixed. Given that no two people's writing styles are exactly the same, finding a way to compare seemingly disparate blocks of text poses a substantial problem. Entrez employs a method known as the relevance pairs model of retrieval to make such comparisons, relying on weighted key terms (Wilbur and Coffee 1994; Wilbur and Yang 1996). This concept is best described by example. Consider two manuscripts with the following titles:

       BRCA1 as a Genetic Marker for Breast Cancer

       Genetic Factors in the Familial Transmission of the Breast Cancer BRCA1 Gene

      Both titles contain the terms BRCA1, Breast, and Cancer, and the presence of these common terms may indicate that the manuscripts are similar in subject matter. The proximity between the words is also considered, so that words common to two records that are closer together are scored higher than common words that are farther apart. In the example, the terms Breast and Cancer are always next to each other, so that pair would score higher on proximity than either word paired with BRCA1. Common words found in a title score higher than those found in an abstract, since title words are presumed to be “more important” than those found in the body of an abstract. Overall, weighting depends inversely on the frequency of a given word among all the entries in PubMed, with words that occur infrequently in the database assigned a higher weight while common words are down-weighted.

      Hard Links

      The hard link concept is simpler

