Bioinformatics. Group of authors

themselves. Obviously, any given protein can contain one or more blocks, corresponding to each of its structural or functional motifs. With these protein blocks in hand, it was then possible to look for substitution patterns only in the most conserved regions of a protein, the regions that (presumably) were least prone to change. Two thousand blocks representing more than 500 groups of related proteins were examined and, based on the substitution patterns in those conserved blocks, blocks substitution matrices (or BLOSUMs, for short) were generated.
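As a rough illustration of how substitution patterns in a block can be turned into scores, the sketch below counts residue pairs column by column in an ungapped block and converts the observed pair frequencies into rounded log-odds scores in half-bit units. The function names and the toy block are hypothetical, and the sketch leaves out the sequence clustering step, so it should be read as an outline of the general log-odds idea rather than the exact published procedure.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_counts(block):
    """Count unordered residue pairs, column by column, in an ungapped block.

    `block` is a list of equal-length strings (hypothetical toy input).
    """
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(block):
    """Turn observed pair frequencies into rounded log-odds scores (half-bits)."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # Expected frequency of the pair if residues paired up at random.
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


# Toy block of three aligned segments (made-up data for illustration only).
toy_block = ["AC", "AC", "SC"]
toy_scores = log_odds(toy_block)
```

Pairs that occur more often than chance would predict receive positive scores, and pairs that occur less often receive negative scores. A three-sequence toy block is far too small to give biologically meaningful values.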

      Returning to the point of directly deriving the various matrices, each BLOSUM matrix is assigned a number (BLOSUMn), and that number represents the conservation level of the sequences that were used to derive that particular matrix. For example, the BLOSUM62 matrix is calculated from sequences sharing no more than 62% identity; sequences with more than 62% identity are clustered, and each cluster contributes with the weight of a single sequence. The clustering reduces the contribution of closely related sequences, meaning that there is less bias toward substitutions that occur (and may be over-represented) in the most closely related members of a family. Lower values of n therefore correspond to matrices derived from more distantly related sequences.
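The clustering step can be sketched as follows. This is a minimal illustration, assuming ungapped segments of equal length, a strict "more than n%" cutoff, and single-linkage grouping (a segment joins a cluster if it exceeds the cutoff against any member); the function names are hypothetical. Each segment receives a weight of one divided by the size of its cluster, so every cluster contributes a total weight of 1.

```python
from collections import Counter


def percent_identity(a, b):
    """Percent of positions that are identical in two equal-length segments."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_weights(segments, threshold=62.0):
    """Weight each segment so that every identity cluster sums to 1.

    Assumption: single-linkage clustering with a strict `>` cutoff.
    """
    parent = list(range(len(segments)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if percent_identity(segments[i], segments[j]) > threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(len(segments))]
    sizes = Counter(roots)
    return [1.0 / sizes[root] for root in roots]


# The first two toy segments are 90% identical, so they share one cluster.
weights = cluster_weights(["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"])
# weights == [0.5, 0.5, 1.0]
```

Lowering `threshold` merges more segments into each cluster, which is why a smaller n reflects substitutions between more distantly related sequences.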

      Which Matrices Should Be Used When?

       PAM250 is equivalent to BLOSUM45

       PAM160 is equivalent to BLOSUM62

       PAM120 is equivalent to BLOSUM80

      In addition to the protein matrices discussed here, there are numerous specialized matrices that are either specific to a particular species, concentrate on particular classes of proteins (e.g. transmembrane proteins), focus on structural substitutions, or use hydrophobicity measures in attempting to assess similarity (see Wheeler 2003). Given this landscape, the most important take-home message for the reader is that no single matrix is the complete answer for all sequence comparisons. A thorough understanding of what each matrix represents is critical to performing proper sequence-based analyses.

Matrix      Best use                                               Similarity
PAM40       Short alignments that are highly similar               70–90%
PAM160      Detecting members of a protein family                  50–60%
PAM250      Longer alignments of more divergent sequences          ~30%
BLOSUM90    Short alignments that are highly similar               70–90%
BLOSUM80    Detecting members of a protein family                  50–60%
BLOSUM62    Most effective in finding all potential similarities   30–40%
BLOSUM30    Longer alignments of more divergent sequences          <30%

      The Similarity column gives the range of similarities that the matrix is able to best detect (Wheeler 2003).
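The table can be treated as a simple lookup. The sketch below is illustrative only: the function name is hypothetical, range boundaries are treated as inclusive for simplicity, and the numeric bounds chosen for "~30%" (25–35) and "<30%" (0–30) are assumptions made so that the table can be encoded, not values taken from the source.

```python
# (matrix, low %, high %) rows transcribed from the table.
# ASSUMPTION: "~30%" is encoded as 25-35 and "<30%" as 0-30.
MATRIX_RANGES = [
    ("PAM40", 70, 90),
    ("PAM160", 50, 60),
    ("PAM250", 25, 35),
    ("BLOSUM90", 70, 90),
    ("BLOSUM80", 50, 60),
    ("BLOSUM62", 30, 40),
    ("BLOSUM30", 0, 30),
]


def suggest_matrices(similarity):
    """Return the matrices whose listed range contains `similarity` (percent)."""
    return [
        name
        for name, low, high in MATRIX_RANGES
        if low <= similarity <= high
    ]


highly_similar = suggest_matrices(80)  # ["PAM40", "BLOSUM90"]
family_level = suggest_matrices(55)    # ["PAM160", "BLOSUM80"]
uncovered = suggest_matrices(45)       # [] (no row in the table lists 45%)
```

An empty result reflects gaps between the listed ranges rather than the absence of a suitable matrix, so a lookup like this is only a starting point for choosing one.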

      Nucleotide Scoring Matrices

      Gaps and Gap Penalties

      Gaps are often introduced to improve the alignment between two nucleotide or protein sequences. These gaps compensate for insertions and deletions between the sequences being studied; in essence, they represent biological events. As such, the number of gaps introduced into a pairwise sequence alignment needs to be kept reasonably small so that the alignment does not imply a biologically implausible scenario.
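One common way to keep the number of gaps in check is to charge a penalty for each gap when scoring an alignment. The sketch below scores an already-gapped pairwise alignment using an affine scheme, in which opening a gap costs more than extending one. The function name, the match and mismatch scores, and the penalty values are all hypothetical choices for illustration, and the convention used here (the first gap position costs `gap_open`, each further position costs `gap_extend`) is only one of several in use.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned sequences of equal length, with '-' marking gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")

    score = 0
    previous_gap = None  # which sequence ("a" or "b") was gapped in the last column
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            current_gap = "a" if x == "-" else "b"
            # Extending an existing gap is cheaper than opening a new one.
            score += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            score += match if x == y else mismatch
            previous_gap = None
    return score


one_long_gap = score_alignment("ACGTACGT", "ACG--CGT")    # 6 matches - 5 - 1 = 0
two_short_gaps = score_alignment("ACGTACGT", "AC-TA-GT")  # 6 matches - 5 - 5 = -4
```

Both toy alignments contain two gap positions, but the one that groups them into a single insertion or deletion event scores higher, which matches the idea that each gap should correspond to a plausible biological event.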

Figure: A nucleotide scoring table. The scores for the four nucleotide bases are shown in the upper left of the figure; the remaining one-letter codes are the IUPAC/IUBMB codes for ambiguities or chemical similarities.