Bioinformatics. Group of authors. MREADZ.NET

information about the genomic assembly in this region of rat chromosome 5 (specifically, at 5q31) can be obtained (cf. Chapter 4).

Figure 3.16 Results of the first round of a PSI-BLAST search. For each sequence found, the user is presented with the definition line from the corresponding UniProtKB/Swiss-Prot entry, the score value for the best high-scoring segment pair (HSP) alignment, the total of all scores across all HSP alignments, the percentage of the query covered by the HSPs, and the E value and percent identity for the best HSP alignment. The hyperlinked accession number allows for direct access to the source database record for that hit. Sequences whose “Select for PSI blast” box is checked will be used to calculate a position-specific scoring matrix (PSSM), and that PSSM then serves as the new “query” for the next round, the results of which are shown in Figure 3.17.

Figure 3.17 Results of the second round of a PSI-BLAST search. New sequences identified through the use of the position-specific scoring matrix (PSSM) calculated based on the results shown in Figure 3.16 are highlighted in yellow. Check marks in the right-most column indicate which sequences were used to build the PSSM producing these results.

Figure 3.18 Submitting a BLAT query. A rat clone from the Cancer Genome Anatomy Project Tumor Gene Index (CB312815) is the query. The pull-down menus at the top of the page can be used to specify which genome should be searched (organism), which assembly should be used (usually, the most recent), and the query type (DNA, protein, translated DNA, or translated RNA). The “I'm feeling lucky” button returns only the highest scoring alignment and provides a direct path to the UCSC Genome Browser.

FASTA

While the most commonly used technique for detecting similarity between sequences is BLAST, it is not the only heuristic method that can be used to rapidly and accurately compare sequences with one another. In fact, the first widely used program designed for database similarity searching was FASTA (Lipman and Pearson 1985; Pearson and Lipman 1988; Pearson 2000). Like BLAST, FASTA enables the user to rapidly compare a query sequence against large databases, and various versions of the program are available (Table 3.3). In addition to the main implementations, a variety of specialized FASTA versions are available, described in detail in Pearson (2016). An interesting historical note is that the FASTA format for representing nucleotide and protein sequences originated with the development of the FASTA algorithm.

Figure 3.19 Results of a BLAT query. Based on the query submitted in Figure 3.18, the highest scoring hit is to a sequence on chromosome 5 of the rat genome having 98.1% sequence identity. Clicking on the “details” hyperlink brings the user to additional information on the found sequence, shown in the lower panel. Matching bases in the cDNA and genomic sequences are colored in dark blue and are capitalized. Lighter blue uppercase bases mark the boundaries of aligned regions and often signify splice sites. Gaps are indicated by lowercase black type. In the side-by-side alignment, exact matches are indicated by the vertical line between the sequences.

Table 3.3 Main FASTA algorithms.

Program	Query	Database	Corresponding BLAST Program
FASTA	Nucleotide	Nucleotide	BLASTN
	Protein	Protein	BLASTP
FASTX/FASTY	DNA	Protein	BLASTX
TFASTX/TFASTY	Protein	Translated DNA	TBLASTN

The Method

The FASTA algorithm can be divided into four major steps. In the first step, FASTA determines all overlapping words of a certain length both in the query sequence and in each of the sequences in the target database, creating two lists in the process. Here, the word length parameter is called ktup, which is the equivalent of W in BLAST. These lists of overlapping words are compared with one another in order to identify any words that are common to the two lists. The method then looks for word matches that are in close proximity to one another and connects them to each other (intervening sequence included), without introducing any gaps. This can be represented using a dotplot format (Figure 3.20a). Once this initial round of connections is made, an initial score (init₁) is calculated for each of the regions of similarity.
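The word-matching part of step 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the FASTA implementation: the function names `ktup_hits` and `hits_by_diagonal` are hypothetical, the example sequences are arbitrary, and the sketch stops at grouping word matches by diagonal without computing init₁ scores.

```python
from collections import defaultdict

def ktup_hits(query, target, ktup=2):
    """Return (query_pos, target_pos) pairs where a word of length ktup matches.

    Hypothetical helper for illustration; not part of the FASTA package.
    """
    # Index every overlapping word of the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Scan the overlapping words of the target and look each one up.
    hits = []
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], []):
            hits.append((i, j))
    return hits

def hits_by_diagonal(hits):
    """Group hits by diagonal (target_pos - query_pos).

    Hits on the same diagonal can be connected without introducing gaps.
    """
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[j - i].append((i, j))
    return dict(diagonals)

hits = ktup_hits("HEAGAWGHEE", "PAWHEAE", ktup=2)
diagonals = hits_by_diagonal(hits)
```

In this toy example, the words HE and EA match at adjacent positions on the same diagonal, so they would be candidates for being connected into a single ungapped region.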

In step 2, only the 10 best regions for a given pairwise alignment are considered for further analysis (Figure 3.20b). FASTA now tries to join together regions of similarity that are close to each other in the dotplot but that do not lie on the same diagonal, with the goal of extending the overall length of the alignment (Figure 3.20c). This means that insertions and deletions are now allowed, but there is a joining penalty for each of the diagonals that are connected. The net score for any two diagonals that have been connected is the sum of the scores of the original diagonals, less the joining penalty. This new score is referred to as initₙ.
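The joining step can be illustrated with a small dynamic-programming sketch. The region representation, the function name `best_chain_score`, and the default penalty of 20 are assumptions made for illustration only; the actual penalty and joining rules used by FASTA are not specified here.

```python
def best_chain_score(regions, join_penalty=20):
    """Best total score of a chain of colinear, non-overlapping regions.

    regions: list of (q_start, q_end, t_start, t_end, score) tuples, one per
    ungapped region. Each join between two regions costs join_penalty
    (an assumed, illustrative value).
    """
    regions = sorted(regions)  # order by query start position
    best = []  # best[k] = best score of a chain ending with regions[k]
    for k, (qs, qe, ts, te, score) in enumerate(regions):
        best_k = score  # the region on its own
        for m in range(k):
            _, prev_qe, _, prev_te, _ = regions[m]
            # A region can follow another only if it starts after the
            # earlier one ends in both the query and the target.
            if prev_qe < qs and prev_te < ts:
                best_k = max(best_k, best[m] + score - join_penalty)
        best.append(best_k)
    return max(best, default=0)

regions = [(0, 9, 0, 9, 50), (15, 24, 18, 27, 40), (12, 20, 2, 8, 30)]
joined = best_chain_score(regions, join_penalty=20)
```

Here the first two regions are joined (50 + 40 - 20 = 70), while the third is left out because it is not colinear with the first. With a larger penalty, joining stops paying off and the best single region is kept instead.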

In step 3, FASTA ranks all of the resulting diagonals, and then further considers only the “best” (highest-scoring) diagonals in the list. For each of the best diagonals, FASTA uses a modification of the Smith–Waterman algorithm (1981) to compute the optimal pairwise alignment between the two sequences being considered. A final, optimal score (opt) is calculated on this pairwise alignment.
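For orientation, the sketch below computes a plain Smith–Waterman local alignment score. It is not the modified version that FASTA applies: it fills the whole matrix rather than restricting the search to the neighborhood of a diagonal, and the match, mismatch, and linear gap values are illustrative assumptions, not FASTA defaults.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Plain Smith-Waterman with a linear gap penalty; the scoring values are
    assumed for illustration.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the
            # alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

score = smith_waterman_score("ACGTT", "ACTT")
```

In this example, the best local alignment matches AC, skips the G with a single gap, and then matches TT, which scores higher than any ungapped alignment of the two sequences.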

Figure 3.20 Schematic illustrations of the FASTA search strategy. (a) Once FASTA determines words of length ktup common to the query sequence and the target sequence, it connects words that are close to each other, and these are represented by the diagonals. (b) After an initial round of scoring, the top ten diagonals are selected for further analysis. (c) The Smith–Waterman algorithm is applied to yield the optimal pairwise alignment.
