Bioinformatics. Group of authors
Figure 3.20 The FASTA search strategy. (a) Once FASTA determines words of length ktup common to the query sequence and the target sequence, it connects words that are close to each other, and these are represented by the diagonals. (b) After an initial round of scoring, the top 10 diagonals are selected for further analysis. (c) The Smith–Waterman algorithm is applied to yield the optimal pairwise alignment between the two sequences being considered. See text for details.
In the fourth and final step, FASTA assesses the significance of the alignments by estimating what the anticipated distribution of scores would be for randomly generated sequences having the same overall composition (i.e. sequence length and distribution of amino acids or nucleotides). Based on this randomization procedure and on the results from the original query, FASTA calculates an expectation value E (similar to the BLAST E value), which, as before, represents the number of hits with an equal or better score that would be expected to occur purely by chance.
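The idea behind this randomization step can be illustrated with a small sketch. It is not FASTA's actual statistical procedure: it assumes a toy Smith–Waterman scorer with hypothetical match, mismatch, and linear gap values in place of a real scoring matrix, and it reports only the fraction of composition-preserving shuffles that score at least as well as the real subject sequence, rather than a true E value. The function names are illustrative.

```python
import random


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman with a linear gap penalty).

    The match/mismatch/gap values are illustrative assumptions, not a real
    scoring matrix such as BLOSUM62.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffled_score_fraction(query, subject, n_shuffles=200, seed=0):
    """Return (observed score, fraction of shuffles scoring >= observed).

    Shuffling the subject keeps its length and residue composition fixed,
    which mimics the 'same overall composition' condition in the text.
    """
    rng = random.Random(seed)
    observed = sw_score(query, subject)
    residues = list(subject)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if sw_score(query, "".join(residues)) >= observed:
            hits += 1
    return observed, hits / n_shuffles
```

A small fraction suggests that the observed score is unlikely to arise from composition alone. A production tool would instead fit a statistical distribution to the scores obtained across the whole database.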
Running a FASTA Search
The University of Virginia provides a web front-end for issuing FASTA queries. Various protein and nucleotide databases are available, and up to two databases can be selected for use in a single run. From this page, the user can also specify the scoring matrix to be used, gap and extension penalties, and the value for ktup. The default values for ktup are 2 for protein-based searches and 6 for nucleotide-based searches; lowering the value of ktup increases the sensitivity of the run, at the expense of speed. The user can also limit the results returned to particular E values.
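The effect of ktup on sensitivity can be seen in a minimal sketch of the word-matching step. The code below is an illustration under simplifying assumptions, not FASTA's implementation: it indexes every word of length ktup in the query, looks up each word of the target, and counts exact word matches per diagonal (query position minus target position). The function name is hypothetical.

```python
from collections import Counter, defaultdict


def word_hits_by_diagonal(query, target, ktup):
    """Count exact ktup-word matches on each diagonal (i - j)."""
    # Index the start positions of every ktup-length word in the query.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Scan the target; each shared word adds one hit to its diagonal.
    diagonals = Counter()
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals


if __name__ == "__main__":
    # Two short sequences that differ at a single position.
    for k in (2, 3, 4):
        print(k, dict(word_hits_by_diagonal("ABCDEF", "ABXDEF", k)))
```

With a single mismatch in the middle of a six-residue sequence, fewer words survive as ktup grows, and at ktup = 4 no exact word is shared at all. This is why a smaller ktup finds weaker similarities, at the cost of examining more word hits.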
The results returned by a FASTA query are in a significantly different format from those returned by BLAST. Consider a FASTA search using the sequence of histone H2B.3 from the highly regenerative cnidarian Hydractinia, one of four novel H2B variants used in place of protamines to compact sperm DNA (KX622131.1; Török et al. 2016), as the query. The first part of the FASTA output resulting from a search using BLOSUM62 as the scoring matrix and Swiss-Prot as the target database is shown in Figure 3.21, summarizing the results as a histogram. The histogram is intended to convey the distribution of all similarity scores computed in the course of this particular search. The first column represents bins of similarity scores, with the scores increasing as one moves down the page. The second column gives the actual number of sequences observed to fall into each one of these bins. This count is also represented by the length of each of the lines in the histogram, with each of the equals signs representing a certain number of sequences; in the figure, each equals sign corresponds to 130 sequences from UniProtKB/Swiss-Prot. The third column of numbers represents how many sequences would be expected to fall into each one of the bins; this is indicated by the asterisks in the histogram. The hit list would immediately follow, and a portion of the hit list for this search is shown in Figure 3.22. Here, the accession number and partial definition line for each hit are given, along with its optimal similarity score (opt), a normalized score (bit), the expectation value (E), percent identity and similarity figures, and the aligned length. Not shown here are the individual alignments of each hit to the original query sequence, which would be found by further scrolling down in the output. In the pairwise alignments, exact matches are indicated by a colon, while conservative substitutions are indicated by a dot.
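To make the histogram layout concrete, the following sketch renders one row in the style described above: a score bin, the observed count, the expected count, a run of equals signs for the observed count, and an asterisk at the position of the expected count. The column widths, the rounding up to whole symbols, and the function name are assumptions chosen for illustration and may not match the exact formatting used by FASTA.

```python
import math


def histogram_row(score, observed, expected, per_symbol):
    """Render one histogram row; each '=' stands for per_symbol sequences."""
    n_obs = math.ceil(observed / per_symbol)   # symbols for the observed count
    n_exp = math.ceil(expected / per_symbol)   # column of the expected count
    chars = ["="] * n_obs
    if n_exp > 0:
        # Pad with spaces if the expected count lies beyond the observed bar,
        # then mark the expected position with an asterisk.
        chars.extend(" " * (n_exp - len(chars)))
        chars[n_exp - 1] = "*"
    return f"{score:>4} {observed:>6} {expected:>6} :{''.join(chars)}"


if __name__ == "__main__":
    # Hypothetical bins (score, observed, expected), 130 sequences per symbol.
    for row in [(40, 260, 390), (50, 390, 260), (60, 130, 130)]:
        print(histogram_row(*row, per_symbol=130))
```

Rows where the equals signs extend past the asterisk correspond to bins holding more sequences than expected by chance, which is where true hits would be expected to appear.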
Statistical Significance of Results
As before, the E values from a FASTA search represent the number of hits with an equal or better score expected to occur purely by chance. Pearson (2016) puts forth the following guidelines for inferring homology from protein-based searches, which are slightly different from those previously described for BLAST: an E value < 10⁻⁶ almost certainly implies homology. When E < 10⁻³, the query and found sequences are almost always homologous, but the user should verify that the highest scoring unrelated sequence has an E value near 1.
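These guidelines can be expressed as a small helper for annotating a hit list. This is only a sketch: the function name and the wording of the returned labels are illustrative, and the final fallback label is an assumption, since the guidelines as quoted say nothing about larger E values.

```python
def interpret_e_value(e_value):
    """Map a protein-search E value to the guideline categories quoted above."""
    if e_value < 1e-6:
        return "almost certainly homologous"
    if e_value < 1e-3:
        # The guideline adds a sanity check for this range.
        return ("almost always homologous; verify that the highest scoring "
                "unrelated sequence has an E value near 1")
    # Assumption: the quoted guidelines do not cover this range.
    return "homology not established by these guidelines"


if __name__ == "__main__":
    for e in (1e-20, 5e-5, 0.3):
        print(f"{e:g}: {interpret_e_value(e)}")
```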
Comparing FASTA and BLAST
Since both FASTA and BLAST employ rigorous algorithms to find sequences that are statistically (and hopefully biologically) relevant, it is logical to ask which one of the methods is the better choice. There actually is no good answer to the question, since both of the methods bring significant strengths to the table. Summarized below are some of the fine points that distinguish the two methods from one another.
Figure 3.21 Search summary from a protein–protein FASTA search, using the sequence of histone H2B.3 from Hydractinia echinata (KX622131.1; Török et al. 2016) as the query and BLOSUM62 as the scoring matrix. The header indicates that the query is against the Swiss-Prot database. The histogram indicates the distribution of all similarity scores computed for this search. The left-most column provides a normalized similarity score, and the column marked opt gives the number of sequences with that score. The column marked E() gives the number of sequences expected to achieve the score in the first column. In this case, each equals sign in the histogram represents 130 sequences in Swiss-Prot. The asterisks in each row indicate the expected, random distribution of hits. The inset is a magnified version of the histogram in that region.
Figure 3.22 Hit list for the protein–protein FASTA search described in Figure 3.21. Only the first 18 hits are shown. For each hit, the accession number and partial definition line for the hit is provided. The column marked opt gives the raw similarity score, the column marked bits gives a normalized bit score (a measure of similarity between the two sequences), and the column marked E gives the expectation value. The percentage columns indicate percent identity and percent similarity, respectively. The alen column gives the total aligned length for each hit. The +- characters shown at the beginning of some lines indicate that more than one alignment was found between the query and subject; in the case of the first hit (Q7Z5P9), four alignments were returned. The align link at the end of each row takes the user to the alignment for that hit (not shown).
FASTA begins the search by looking for exact matches of words, while BLAST allows for conservative substitutions in the first step.
BLAST allows for automatic masking of sequences, while FASTA does not.
FASTA will return one and only one alignment for a sequence in the hit list, while BLAST can return multiple results for the same sequence, each result representing a distinct HSP.
Since FASTA uses a version of the more rigorous Smith–Waterman alignment method, it generally produces better final alignments and is more apt to find distantly related sequences than BLAST. For highly similar sequences, their performance is fairly similar.
When comparing translated DNA sequences with protein sequences or vice versa, FASTA (specifically, FASTX/FASTY for translated DNA → protein and TFASTX/TFASTY for protein → translated DNA) allows for frameshifts.
BLAST runs faster than FASTA,