Bioinformatics. Group of authors
and coding-non-synonymous SNPs (red). In addition, change the Display mode of the track from dense to pack so that the individual SNPs can be seen. By default, the function of each variant is defined by its position within transcripts in the GENCODE track. However, the track used for annotation can be changed in the settings called Use Gene Tracks for Functional Annotation.
Figure 4.9 The genomic context of the human HIF1A gene, after changing the colors and display mode of the Common SNPs(150) track as shown in Figure 4.8. The SNPs in the 5′ and 3′ untranslated regions of the HIF1A GENCODE transcripts are now colored blue, while the coding-synonymous SNP is colored green.
Two types of Expression tracks display data from the NIH Genotype-Tissue Expression (GTEx) project (GTEx Consortium 2015). The GTEx Gene track displays gene expression levels in 51 tissues and two cell lines, based on RNA-seq data from 8555 samples. The GTEx Transcript track provides additional analysis of the same data and displays median transcript expression levels. By default, the GTEx Gene track is shown in pack mode, while the GTEx Transcript track is hidden. Figure 4.10 shows the Gene track in pack display mode, in the region of the phenylalanine hydroxylase (PAH) gene. The height of each bar in the bar graph represents the median expression level of the gene across all samples for a tissue, and the bar color indicates the tissue. The PAH gene is highly expressed in kidney and liver (the two brown bars). The expression is more clearly visible in the details page for the GTEx track (Figure 4.10, inset, purple box). The GTEx Transcript track is similar, but depicts expression for individual transcripts rather than an average for the gene.
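The bar heights described above are per-tissue medians over samples. As a minimal sketch of that summary step, assuming sample-level expression values are available as (tissue, value) pairs, the calculation might look like the following. The function name `median_by_tissue` and all numbers are hypothetical and chosen only for illustration; they are not real GTEx measurements and this is not the GTEx project's own pipeline.

```python
from collections import defaultdict
from statistics import median


def median_by_tissue(samples):
    """Group (tissue, expression) pairs by tissue and return each group's median."""
    by_tissue = defaultdict(list)
    for tissue, value in samples:
        by_tissue[tissue].append(value)
    return {tissue: median(values) for tissue, values in by_tissue.items()}


# Hypothetical expression values for one gene; NOT real GTEx data.
toy_samples = [
    ("Liver", 40.0), ("Liver", 55.0), ("Liver", 70.0),
    ("Kidney", 30.0), ("Kidney", 50.0),
    ("Lung", 0.0), ("Lung", 0.5), ("Lung", 1.0),
]

# One bar per tissue; the bar height is the median across that tissue's samples.
bar_heights = median_by_tissue(toy_samples)
for tissue, height in bar_heights.items():
    print(f"{tissue}: {height}")
```

Using the median rather than the mean keeps a single outlying sample from dominating the bar for a tissue.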
An alternate entry point to the UCSC Genome Browser is via a BLAT search (see Chapter 3), where a user can input a nucleotide or protein sequence to find an aligned region in a selected genome. BLAT excels at quickly identifying a matching sequence in the same or a highly similar organism. We will attempt to use BLAT to find a lizard homolog of the human gene disintegrin and metalloproteinase domain-containing protein 18 (ADAM18). The ADAM18 protein sequence is copied in FASTA format from the NCBI view of accession number NP_001307242.1 and pasted into the BLAT Search box that can be accessed from the Tools pull-down menu; the method for retrieving this sequence in the correct format is described in Chapter 2. Select the lizard genome and assembly AnoCar2.0/anoCar2. BLAT will automatically determine that the query sequence is a protein and will compare it with the lizard genome translated in all six reading frames. A single result is returned (Figure 4.11a). The alignment between the ADAM18 protein sequence and lizard chromosome Un_GL343418 runs from amino acid 368 to amino acid 383, with 81.3% identity. The browser link depicts the genomic context of this 48 nt hit (Figure 4.11b). Although the ADAM18 protein sequence aligns to a region in which other human ADAM genes have also been aligned, the other human genes are represented by a thin line, indicating a gap in their alignment. The details link shown in Figure 4.11a produces the alignment between the ADAM18 protein and lizard chromosome Un_GL343418 (Figure 4.11c). The top section of the results shows the protein query sequence, with the blue letters indicating the short region of alignment with the genome. The bottom section shows the pairwise alignment between the protein and genomic sequence translated in six frames. Vertical black lines indicate identical sequences. Taken together, the BLAT results show that only 16 amino acids of the 715 amino acid ADAM18 protein align to the lizard genome (Figure 4.11c).
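To make concrete what "translated in all six reading frames" means, here is a small sketch that translates a DNA string in three frames on the forward strand and three on the reverse complement, using the standard genetic code. It only illustrates the concept. It is not how BLAT is implemented, and the function names and the toy sequence are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return translations for frames +1, +2, +3, -1, -2 and -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Toy sequence for illustration only; not taken from the lizard genome.
for frame, peptide in six_frame_translation("ATGGCC").items():
    print(frame, peptide)
```

A protein query can match in any of these six frames, which is why the genomic strand and coordinates of a translated hit are reported alongside the query coordinates.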
This alignment is short and likely does not represent a homologous region between the ADAM18 protein and the lizard genome. Thus, the BLAT algorithm, although fast, is not always sensitive enough to detect cross-species orthologs. The BLAST algorithm, described in the Ensembl Genome Browser section, is more sensitive, and is a better choice for identifying such homologs.
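The numbers quoted for this hit can be cross-checked with a few lines of arithmetic. The sketch below uses only values given in the text (query positions 368 to 383, a 715 amino acid query, 81.3% identity, a 48 nt hit). The match count of 13 is an inference from the reported identity rather than a figure stated in the text, and the 48 nt check assumes an ungapped alignment with three nucleotides per amino acid.

```python
def aligned_length(query_start, query_end):
    """Number of query residues in an alignment with inclusive coordinates."""
    return query_end - query_start + 1


def query_coverage(aligned, query_size):
    """Fraction of the query sequence covered by the alignment."""
    return aligned / query_size


# Values taken from the BLAT result described in the text.
QUERY_START, QUERY_END, QUERY_SIZE = 368, 383, 715

aligned = aligned_length(QUERY_START, QUERY_END)   # residues aligned
genomic_span_nt = aligned * 3                      # assumes no gaps
coverage = query_coverage(aligned, QUERY_SIZE)

# Assumption: 13 identical residues, inferred from the reported 81.3% identity.
inferred_matches = 13
identity_pct = 100 * inferred_matches / aligned

print(f"aligned residues: {aligned}")
print(f"genomic span (nt): {genomic_span_nt}")
print(f"query coverage: {coverage:.1%}")
print(f"identity if {inferred_matches} residues match: {identity_pct:.2f}%")
```

Seen this way, the hit covers only about 2% of the query, which supports the conclusion that it is too short to be taken as evidence of homology.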
Figure 4.10 The GTEx Gene track, which depicts median gene expression levels in 51 tissues and two cell lines, based on RNA-seq data from 8555 tissue samples from the GTEx project. The main browser window depicts the GTEx Gene track for the human PAH gene, showing high expression in the two tissues colored brown (liver and kidney) but low or no expression in others. Clicking on the GTEx track opens it in a larger window, shown in the inset.
Figure 4.11 BLAT search at the UCSC Genome Browser. (a) This page shows the results of running a BLAT search against the lizard genome, using as a query the human protein sequence of the gene ADAM18, accession NP_001307242.1. The ADAM18 protein sequence is available from NCBI at www.ncbi.nlm.nih.gov/protein/NP_001307242.1?report=fasta. At the UCSC Genome Browser, the web interface to the BLAT search is in the Tools menu at the top of each page. The BLAT search was run against the lizard genome assembly from May 2010, also called anoCar2. The columns on the results page are as follows: ACTIONS, links to the browser (Figure 4.11b) and details (Figure 4.11c); QUERY, the name of the query sequence; SCORE, the BLAT score, determined by the number of matches vs. mismatches in the final alignment of the query to the genome; START, the start coordinate of the alignment, on the query sequence; END, the end coordinate of the alignment, on the query sequence; QSIZE, the length of the query; IDENTITY, the percent identity between the query and the genomic sequences; CHRO, the chromosome to which the query sequence aligns; STRAND, the chromosome strand to which the query sequence aligns; START, the start coordinate of the alignment, on the genomic sequence; END, the end coordinate of the alignment, on the genomic sequence; and SPAN, the length of the alignment, on the genomic sequence. Note that, in this example, there is a single alignment; searches with other sequences may result in many alignments, each shown on a separate