Bioinformatics. Group of authors.


3.1) that considers how often a particular residue is observed, in nature, to replace another residue. The odds ratio also considers how often a particular residue would be replaced by another if replacements occurred in a random fashion (purely by chance). Given this, a positive score indicates two residues that are seen to replace each other more often than by chance, and a negative score indicates two residues that are seen to replace each other less frequently than would be expected by chance. Put more simply, frequently observed substitutions have positive scores and infrequently observed substitutions have negative scores.

Box 3.1 Scoring Matrices and the Log Odds Ratio

Protein scoring matrices are derived from the observed replacement frequencies of amino acids for one another. Based on these probabilities, the scoring matrices are generated by applying the following equation:

S_i,j = log( q_i,j / (p_i p_j) )

where p_i is the probability with which residue i occurs among all proteins and p_j is the probability with which residue j occurs among all proteins. The quantity q_i,j represents how often the two amino acids i and j are seen to align with one another in multiple sequence alignments of protein families or in sequences that are known to have a biological relationship. Therefore, the log odds ratio S_i,j (or “lod score”) represents the ratio of observed vs. random frequency for the substitution of residue i by residue j. For commonly observed substitutions, S_i,j will be greater than zero. For substitutions that occur less frequently than would be expected by chance, S_i,j will be less than zero. If the observed frequency and the random frequency are the same, S_i,j will be zero.
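The sign behavior described in this box can be checked with a few lines of Python. This is a minimal sketch, assuming the score is the logarithm of q_i,j divided by the product p_i p_j; the function name `log_odds`, the base-2 logarithm, and all of the frequencies are hypothetical choices made for illustration and are not taken from any published matrix.

```python
import math

def log_odds(q_ij, p_i, p_j, base=2.0):
    """Log of the observed pair frequency over the frequency expected by chance."""
    expected = p_i * p_j  # how often i and j would pair if residues aligned at random
    return math.log(q_ij / expected, base)

# Hypothetical background frequencies; the chance expectation is 0.05 * 0.04 = 0.002.
p_i, p_j = 0.05, 0.04

print(log_odds(0.008, p_i, p_j))   # observed more often than chance: positive score
print(log_odds(0.002, p_i, p_j))   # observed as often as chance: score of about zero
print(log_odds(0.0005, p_i, p_j))  # observed less often than chance: negative score
```

Published matrices typically scale and round such values to integers; that step is omitted here because the text does not describe it.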

To explain the meaning of the numbers in the matrix more fully, imagine that two sequences have been aligned with one another, and it is now necessary to assess how well a residue in sequence A matches to a residue in sequence B at any given position of the alignment. Using the scoring matrix in Figure 3.1 as our starting point, consider the following cases:

The values on the diagonal represent the score that would be conferred for an exact match at a given position, and these numbers are always positive. So, if a tryptophan residue (W) in sequence A is aligned with a tryptophan residue in sequence B, this match would be conferred 11 points, the value where the row marked W intersects the column marked W. Also notice that 11 is the highest value on the diagonal, so the high number of points assigned to a W:W alignment reflects not only the exact match but also the fact that tryptophan is the rarest of amino acids found in proteins. Put otherwise, a W:W pairing is much less likely to occur by chance and, in turn, is more likely to be correct when it is observed.

Moving off the diagonal, consider the case of a conservative substitution: a tyrosine (Y) for a tryptophan. The intersection of the row marked Y with the column marked W yields a value of 2. The positive value implies that the substitution is observed to occur more often in an alignment than it would by chance, but the replacement is not as good as if the tryptophan residue had been preserved (2 < 11) or if the tyrosine residue had been preserved (2 < 7).

Finally, consider the case of a non-conservative substitution: a valine (V) for a tryptophan. The intersection of the row marked V with the column marked W yields a value of −3. The negative value implies that the substitution is observed less often than would be expected by chance; where it does appear in an alignment, it is more likely to have arisen by chance than to reflect a true relationship.
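The three cases above can be turned into a small position-by-position scoring routine. The sketch below uses only the four values quoted in the text (W:W = 11, Y:Y = 7, Y:W = 2, V:W = −3); the names `SCORES`, `pair_score`, and `alignment_score` are hypothetical, gaps are not handled, and the lookup assumes the matrix is symmetric, so that the row-Y, column-W entry equals the row-W, column-Y entry.

```python
# Only the four scores quoted in the text; a full matrix covers all 20 x 20 pairs.
SCORES = {
    ("W", "W"): 11,  # exact match of the rarest residue
    ("Y", "Y"): 7,   # exact match
    ("Y", "W"): 2,   # conservative substitution
    ("V", "W"): -3,  # non-conservative substitution
}

def pair_score(a, b):
    """Look up one residue pair, trying both orders (symmetry is assumed)."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]  # raises KeyError for pairs not quoted in the text

def alignment_score(seq_a, seq_b):
    """Sum the per-position scores of two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))

print(alignment_score("WW", "WW"))  # 11 + 11
print(alignment_score("WY", "WW"))  # 11 + 2
print(alignment_score("WV", "WW"))  # 11 + (-3)
```

Replacing the second tryptophan with tyrosine lowers the total less than replacing it with valine, which mirrors the ordering 11 > 2 > −3 discussed above.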

Although the meaning of the numbers and relationships within the scoring matrices seems straightforward enough, some value judgments need to be made as to what actually constitutes a conservative or non-conservative substitution and how to assess the frequency of either of those events in nature. This is the major factor that differentiates scoring matrices from one another. To help the reader make an intelligent choice, a discussion of the approach, advantages, and disadvantages of the various available matrices is in order.

PAM Matrices

The first useful matrices for protein sequence analysis were developed by Dayhoff et al. (1978). The basis for these matrices was the examination of substitution patterns in a group of proteins that shared more than 85% sequence identity. The analysis yielded 1572 changes in the 71 groups of closely related proteins that were examined. Using these results, tables were constructed that indicated the frequency of a given amino acid substituting for another amino acid at a given position.

As the sequences examined shared such a high degree of similarity, the resulting frequencies represent what would be expected over short evolutionary distances. Further, given the close evolutionary relationship between these proteins, one would expect that the observed mutations would not significantly change the function of the protein. This is termed acceptance: changes that can be accommodated through natural selection and result in a protein with the same or similar function as the original. As individual point mutations were considered, the unit of measure resulting from this analysis is the point accepted mutation or PAM unit. One PAM unit corresponds to one amino acid change per 100 residues, or roughly 1% divergence.

Several assumptions went into the construction of the PAM matrices. One of the most important assumptions was that the replacement of an amino acid is independent of previous mutations at the same position. Based on this assumption, the original matrix was extrapolated to come up with predicted substitution frequencies at longer evolutionary distances. For example, the PAM1 matrix could be raised to the 100th power (that is, 100 copies of the matrix multiplied together) to yield the PAM100 matrix, which would represent what one would expect if there were 100 amino acid changes per 100 residues. (This does not imply that each of the 100 residues has changed, only that there were 100 total changes; some positions could conceivably change and then change back to the original residue.) As the matrices representing longer evolutionary distances are an extrapolation of the original matrix derived from the 1572 observed changes described above, it is important to remember that these matrices are, indeed, predictions and are not based on direct observation. Any errors in the original matrix would be exaggerated in the extrapolated matrices, as repeated multiplication compounds these errors significantly.
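The extrapolation step, and the point that 100 changes per 100 residues does not mean every residue has changed, can be illustrated with a toy model. The sketch below is not the Dayhoff data: it assumes a hypothetical three-letter alphabet in which each residue is retained with probability 0.99 per step and changes to each of the other two letters with probability 0.005. The names `mat_mul`, `mat_pow`, `toy_pam1`, and `toy_pam100` are illustrative only.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    for _ in range(power):
        result = mat_mul(result, m)
    return result

# Hypothetical one-step matrix: rows are the original residue, columns the
# replacement, and every row sums to 1 (1% chance of change per step).
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam100 = mat_pow(toy_pam1, 100)

# Probability that a position shows its original residue after 100 steps.
# It stays well above zero because some positions never change and others
# change and then change back.
print(round(toy_pam100[0][0], 3))
print([round(sum(row), 6) for row in toy_pam100])  # rows still sum to 1
```

With only three letters, changes back to the original residue are far more common than they would be with 20 amino acids, so the printed value is illustrative rather than a statement about real proteins.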

There are additional assumptions that the reader should be aware of regarding the construction of these PAM matrices. All sites have been assumed to be equally mutable, replacement has been assumed to be independent of surrounding residues, and there is no consideration of conserved blocks or motifs. The sequences being compared here are of average composition based on the small number of protein sequences available in 1978, so there is a bias toward small, globular proteins, even though efforts have been made to bring in additional sequence data over time (Gonnet et al. 1992; Jones et al. 1992). Finally, there is an implicit assumption that the forces responsible for sequence evolution over shorter time spans are the same as those for longer evolutionary time spans. Although there are significant drawbacks to the PAM matrices, it is important to remember that, given the information available in 1978, the development of these matrices marked an important advance in our ability to quantify the relationships between sequences. As these matrices are still available for use with numerous bioinformatic tools, the reader should keep these potential drawbacks in mind and use them judiciously.

BLOSUM Matrices

In 1992, Steve and Jorja Henikoff took a slightly different approach to the one described above, one that addressed many of the drawbacks of the PAM matrices. The groundwork for the development of new matrices was a study aimed at identifying conserved motifs within families of proteins (Henikoff and Henikoff 1991, 1992). This study led to the creation of the BLOCKS database, which used the concept of a block to identify a family of proteins. The idea of a block is derived from the more familiar notion of a motif, which usually refers to a conserved stretch of amino acids that confers a specific function or structure to a protein. When these individual motifs from proteins in the same family can be aligned without introducing a gap, the result is a block, with the term block referring to the alignment, not the individual
