Bioinformatics. Группа авторов
Читать онлайн книгу.of hits having E values better than a certain set threshold. These hits, along with the initial, single-query sequence, are used to construct a PSSM in an automated fashion. As soon as the PSSM is constructed, the PSSM then serves as the query for doing a new search against the target database, using the collective characteristics of the identified sequences to find new, related sequences. The process continues, round by round, either until the search converges (meaning that no new sequences were found in the last round) or until the limit on the number of iterations is reached.
Performing a PSI-BLAST Search
PSI-BLAST searches can be initiated by following the Protein BLAST link on the BLAST landing page (Figure 3.5). The search page shown in Figure 3.14 is identical to the one shown in the BLASTP example discussed earlier in this chapter. Here, the sequence of the human sex-determining protein SRY from UniProtKB/Swiss-Prot (Q05066) will be used as the query, using UniProtKB/Swiss-Prot as the target database and limiting returned results to human sequences. PSI-BLAST is selected in the Program Selection section and, as before, selected changes will be made to the default parameters (Figure 3.15). The maximum number of target sequences has been raised from 500 to 1000, as a safeguard in case a large number of sequences in UniProtKB/Swiss-Prot match the query. In addition, both the E value threshold and the PSI-BLAST threshold have been changed to 0.001, and filtering of low-complexity regions has been enabled. The query can now be issued as before by clicking on the blue “BLAST” button at the bottom of the page.
Figure 3.13 Constructing a position-specific scoring matrix (PSSM). In the upper portion of the figure is a multiple sequence alignment of length 10. Using the criteria described in the text, the PSSM corresponding to this multiple sequence alignment is shown in the lower portion of the figure. Each row of the PSSM corresponds to a column in the multiple sequence alignment. Note that position 8 of the alignment always contains a threonine residue (T), whereas position 10 always contains a glycine (G). Looking at the corresponding scores in the matrix, in row 8, the threonine scores 150 points; in row 10, the glycine also scores 150 points. These are the highest values in the row, corresponding to the fact that the multiple sequence alignment shows absolute conservation at those positions. Now, consider position 9, where most of the sequences have a proline (P) at that position. In row 9 of the PSSM, the proline scores 89 points – still the highest value in the row, but not as high a score as would have been conferred if the proline residue was absolutely conserved across all sequences. The first column of the PSSM provides the deduced consensus sequence.
The results of the first round of the search are shown in Figure 3.16, with 31 sequences found in the first round (at the time of this writing). The structure of the hit list table is exactly as before, now containing two additional columns that are specific to PSI-BLAST. The first shows a column of check boxes that are all selected; this instructs the algorithm to use all the sequences to construct the first PSSM for this particular search. Keeping in mind that the first round of any PSI-BLAST search is simply a BLASTP search and that no PSSM has yet been constructed, the second column is blank. To run the next iteration of PSI-BLAST, simply click the “Go” button at the bottom of this section. At this point, the first PSSM is constructed based on a multiple sequence alignment of the sequences selected for inclusion, and the matrix is now used as the query against Swiss-Prot. The results of this second round are shown in Figure 3.17, with the final two columns indicating which sequences are to be used in constructing the new PSSM for the next round of searches, as well as which sequences were used to build the PSSM for the current round. Also note that a good number of the sequences are highlighted in yellow; here, 26 additional sequences that scored below the PSI-BLAST threshold in the first round have now been pulled into the search results. This provides an excellent example of how PSSMs can be used to discover new relationships during each PSI-BLAST iteration, thereby making it possible to identify additional homologs that may not have been found using the standard BLASTP approach. Of course, the user should always check the E values and percent identities for all returned results before passing them through to the next round, unchecking inclusion boxes as needed. There may also be cases where prior knowledge would argue for removing some of the found sequences based on the descriptors. As with all computational methods, it is always important to keep biology in mind when reviewing the results.
Figure 3.14 Performing a PSI-BLAST search. See text for details.
BLAT
In response to the assembly needs of the Human Genome Project, a new nucleotide sequence alignment program called BLAT (for BLAST-Like Alignment Tool) was introduced (Kent 2002). BLAT is most similar to the MegaBLAST version of BLAST in that it is designed to rapidly align longer nucleotide sequences having more than 95% similarity. However, the BLAT algorithm uses a slightly different strategy than BLAST to achieve faster speeds. Before any searches are performed, the target databases are pre-indexed, keeping track of all non-overlapping 11-mers; this index is then used to find regions similar to the query sequence. BLAT is often used to find the position of a sequence of interest within a genome or to perform cross-species analyses.
Figure 3.15 Selecting algorithm parameters for a PSI-BLAST search. See text for details.
As an example, consider a case where an investigator wishes to map a cDNA clone coming from the Cancer Genome Anatomy Project (CGAP) to the rat genome. The BLAT query page is shown in Figure 3.18, and the sequence of the clone of interest has been pasted into the sequence box. Above the sequence box are several pull-down menus that can be used to specify which genome should be searched (organism), which assembly should be used (usually, the most recent), and the query type (DNA, protein, translated DNA, or translated RNA). Once the appropriate choices have been made, the search is commenced by pressing the “Submit” button. The results of the query are shown in the upper panel of Figure 3.19; here, the hit with the highest score is shown at the top of the list, a match having 98.1% identity with the query sequence. More details on this hit can be found by clicking the “details” hyperlink, to the left of the entry. A long web page is then returned, providing information on the original query, the genomic sequence, and an alignment of the query against the found genomic sequence (Figure 3.19, bottom panel). The genomic sequence here is labeled chr5, meaning that the query corresponds to a region of rat chromosome 5. Matching bases in the cDNA and genomic sequences are colored in dark blue and are capitalized. Lighter blue uppercase bases mark the boundaries of aligned regions and often signify splice sites. Gaps and unaligned regions are indicated by lower case black type. In the Side by Side Alignment, exact matches are indicated by the vertical line between the two sequences. Clicking on the “browser” hyperlink in the upper panel of Figure 3.19 would take the user to the UCSC Genome Browser,