Bioinformatics. Group of authors
in context with other genome-scale data. User data must be formatted in a commonly used data structure in order to be interpreted correctly by the browser.
Browser Extensible Data (BED) format is a tab-delimited format that is flexible enough to display many types of data. It can be used to display fairly simple features like the location of transcription factor binding sites, as well as more complex ones like transcripts and their exons.
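To make the difference between a simple feature and a transcript with exons concrete, the following minimal Python sketch parses one BED line. It assumes the standard BED column order (chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts) and zero-based, half-open coordinates. The names `BedFeature` and `parse_bed_line` and the two example records are hypothetical and not real annotations.

```python
from typing import List, NamedTuple, Tuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str
    strand: str
    blocks: List[Tuple[int, int]]  # absolute (start, end) of each block, e.g. exons


def parse_bed_line(line: str) -> BedFeature:
    """Parse one tab-delimited BED line with 3 to 12 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, chromStart, chromEnd")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else "."
    strand = fields[5] if len(fields) > 5 else "."
    blocks = [(start, end)]  # a simple feature is treated as a single block
    if len(fields) >= 12:
        # Columns 11 and 12 hold comma-separated block sizes and block
        # starts relative to chromStart; a trailing comma is tolerated.
        sizes = [int(x) for x in fields[10].split(",") if x]
        offsets = [int(x) for x in fields[11].split(",") if x]
        blocks = [(start + o, start + o + s) for o, s in zip(offsets, sizes)]
    return BedFeature(chrom, start, end, name, strand, blocks)


# Hypothetical records: a three-column-plus-name binding site and a
# twelve-column transcript with three exons.
site = parse_bed_line("chr14\t1000\t1020\tTF_site_1")
transcript = parse_bed_line(
    "chr14\t2000\t5000\ttx1\t0\t+\t2100\t4900\t0\t3\t200,300,500,\t0,1000,2500,"
)
```

The same parser handles both cases because the exon structure lives entirely in the optional last columns; a file that only describes binding sites can stop after the first three or four.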
Binary Alignment/Map (BAM) format is the compressed binary version of the Sequence Alignment/Map (SAM) format. It is a compact format designed for use with very large files of nucleotide sequence alignments. Because it can be indexed, only the portion of the file that is needed for display is transferred to the browser. Many tools for next generation sequence analysis use BAM format as output or input.
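BAM files are binary and are normally read with dedicated tools, but the underlying SAM record is plain tab-delimited text. As a rough illustration of what one alignment record carries, the sketch below reads the mandatory SAM columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR) and derives the reference interval covered by the read from its CIGAR string. The function names and the example read are hypothetical, and the sketch does not attempt to read BAM or its index.

```python
import re

# CIGAR operations that consume reference bases according to the SAM format:
# M (match/mismatch), D (deletion), N (skipped region), = and X.
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(pos: int, cigar: str) -> tuple:
    """Return the 1-based, inclusive (start, end) covered on the reference."""
    length = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos, pos + length - 1


def parse_sam_line(line: str) -> dict:
    """Extract a few mandatory fields from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    start, end = reference_span(int(fields[3]), fields[5])
    return {
        "read": fields[0],
        "flag": int(fields[1]),
        "chrom": fields[2],
        "start": start,
        "end": end,
        "mapq": int(fields[4]),
    }


# Hypothetical alignment: 5 soft-clipped bases, then matches interrupted by
# a 2-base insertion and a 3-base deletion.
record = parse_sam_line("read1\t0\tchr14\t1000\t60\t5S20M2I10M3D15M\t*\t0\t0\t*\t*")
```

Here the read covers 20 + 10 + 3 + 15 = 48 reference bases, so it spans positions 1000 to 1047. Intervals of this kind are what an index over a coordinate-sorted file makes it possible to look up by region.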
Variant Call Format (VCF) is a flexible format for large files of variation data including single-nucleotide variants, insertions/deletions, copy number variants, and structural variants. Like BAM format, it is compressed and indexed, and only the portion of the file that is needed for display is transferred to the browser. Many tools for variant analysis use VCF format as output or input.
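The variant types mentioned above can often be told apart from the REF and ALT columns of a VCF data line alone. The sketch below is a simplified illustration that assumes the standard column order (CHROM, POS, ID, REF, ALT, ...) and one-based positions. The function names, the category labels, and the example lines are hypothetical, and real files contain cases that this classification does not cover.

```python
def classify_allele(ref: str, alt: str) -> str:
    """Give a coarse label to one REF/ALT pair."""
    if alt in (".", "*"):
        return "no alternate allele"
    # Symbolic alleles such as <DEL> and breakend notation with brackets
    # are commonly used for copy number and structural variants.
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic/structural"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "multi-base substitution"


def parse_vcf_line(line: str) -> list:
    """Return one (chrom, pos, ref, alt, label) tuple per ALT allele."""
    chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
    return [
        (chrom, int(pos), ref, alt, classify_allele(ref, alt))
        for alt in alts.split(",")
    ]


# Hypothetical data lines, not real variants.
multi = parse_vcf_line("chr14\t5000\t.\tA\tG,AT\t50\tPASS\t.")
deletion = parse_vcf_line("chr14\t6000\t.\tCTG\tC\t.\t.\t.")
structural = parse_vcf_line("chr14\t7000\t.\tN\t<DEL>\t.\t.\tSVTYPE=DEL;END=9000")
```

A single line can list several ALT alleles separated by commas, which is why the parser returns a list rather than a single record.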
The UCSC Genome Browser home page lists commonly accessed tools, as well as a frequently updated news section that highlights major data and software updates. To reach the Genome Browser Gateway, the main entry point for text-based searches, click on the Gateway link on the home page (Figure 4.1). The default assembly is the most recent human assembly, GRCh38, from December 2013. The genomes of other species can be selected from the phylogenetic tree on the left side of the Gateway page, or by typing their name in the selection box. On the human Gateway page, there is also the option to select one of four older human genome assemblies. Details about the GRCh38 assembly and instructions for searching are available on the Gateway page.
To perform a search, enter text into the Position/Search Term box. If the query maps to a unique position in the genome, such as a search for a particular chromosome and position, the Go button links directly to the Genome Browser. However, if there is more than one hit for the query, such as a search for the term metalloprotease, the resulting page will contain a list of results that all contain that term. For some species, the terms have been indexed, and typing a gene symbol into the search box will bring up a list of possible matches. In this example, we will search for the human hypoxia inducible factor 1 alpha subunit (HIF1A) gene (Figure 4.1), which produces a single hit on GRCh38.
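The distinction between a position query and a free-text query can be sketched in a few lines of Python. The example below assumes that a position is written as `chrom:start-end` with optional thousands separators, and that the browser can be opened with the `db` and `position` parameters of the `hgTracks` URL; both should be checked against the current UCSC documentation. The function names and the coordinates are illustrative only.

```python
import re
from urllib.parse import urlencode

# chrom:start-end, where start and end may contain commas.
POSITION_RE = re.compile(r"^(\w+):([\d,]+)-([\d,]+)$")


def parse_position(text: str):
    """Return (chrom, start, end) for a position query, else None."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        return None  # treat as a free-text term such as a gene symbol
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


def browser_url(db: str, query: str) -> str:
    """Build a Genome Browser link for an assembly and a query string."""
    base = "https://genome.ucsc.edu/cgi-bin/hgTracks"
    return base + "?" + urlencode({"db": db, "position": query})


# Illustrative coordinates, not the actual location of any gene.
region = parse_position("chr14:1,000-2,000")
term = parse_position("HIF1A")
link = browser_url("hg38", "HIF1A")
```

A query that does not parse as a position is passed through unchanged, leaving it to the browser to resolve the term or to return a list of matches.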
The default Genome Browser view showing the genomic context of the HIF1A gene is shown in Figure 4.2. The navigation controls are presented across the top of the display. The arrows move the window to the left and right along the chromosome. Alternatively, the user can move the display left and right by holding down the mouse button and dragging the window. To zoom in and out, use the buttons at the top of the display. The base button zooms in so far that individual nucleotides are displayed, while the zoom out 100× button will show the entire chromosome if it is pressed a few times. The current genomic position and the length of the window (in nucleotides) are shown above a schematic of chromosome 14, where the current genomic position is highlighted with a red box. A new search term can be entered into the search box.
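The arithmetic behind the zoom buttons can be approximated with a short sketch. It assumes one-based, inclusive window coordinates, zooming around the window center, and clipping at the chromosome ends; the browser's own rounding may differ. The chromosome length and the starting window are hypothetical round numbers, not the real values for chromosome 14.

```python
def zoom(start: int, end: int, factor: float, chrom_len: int) -> tuple:
    """Scale a 1-based inclusive window by factor (>1 zooms out, <1 zooms in)."""
    width = end - start + 1
    new_width = max(1, min(chrom_len, round(width * factor)))
    center = (start + end) // 2
    new_start = center - new_width // 2
    # Shift the window back inside the chromosome if it overhangs either end.
    new_start = max(1, min(new_start, chrom_len - new_width + 1))
    return new_start, new_start + new_width - 1


CHROM_LEN = 100_000_000            # hypothetical chromosome length
window = (50_000_001, 50_050_000)  # hypothetical 50 kb starting window

# Count how many 100x zoom-out steps are needed to see the whole chromosome.
presses = 0
while window != (1, CHROM_LEN):
    window = zoom(*window, 100, CHROM_LEN)
    presses += 1
```

With these numbers the first step widens the view from 50 kb to 5 Mb, and the second step would exceed the chromosome length and is clipped to the full chromosome, so two presses suffice.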
Figure 4.1 The home page of the UCSC Genome Browser, showing a query for the gene HIF1A on the human GRCh38 genome assembly. The organism can be selected by clicking on its name in the phylogenetic tree. For many organisms, more than one genome assembly is available. Typing a term into the Position/Search Term box returns a list of matching gene symbols.
Below the browser window illustrated in Figure 4.2, one would find a list of tracks that are available for display on the assembly. The tracks are separated into nine categories: Mapping and Sequencing, Genes and Gene Predictions, Phenotype and Literature, mRNA and Expressed Sequence Tag (EST), Expression, Regulation, Comparative Genomics, Variation, and Repeats. Clicking on a track name opens the Track Settings page for that track, providing a description of the data displayed in that track. Most tracks can be displayed in one of the following five modes.
1 Hide: the track is not displayed at all.
2 Dense: all features are collapsed into a single line; features are not labeled.
3 Squish: each feature is shown separately, but at 50% the height of full mode; features are not labeled.
4 Pack: each feature is shown separately, but not necessarily on separate lines; features are labeled.
5 Full: each feature is labeled and displayed on a separate line.
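One way to picture the difference between these modes is as a rule for assigning features to display rows. The sketch below is a simplified illustration and is not the layout algorithm used by the UCSC Genome Browser: it ignores label widths, and it treats squish like pack because the two differ only in feature height and labeling. The function name `assign_rows` and the example features are hypothetical.

```python
def assign_rows(features, mode):
    """Map each feature name to a display row.

    features: list of (start, end, name) with half-open coordinates.
    mode: one of "hide", "dense", "squish", "pack", "full".
    """
    ordered = sorted(features)
    if mode == "hide":
        return {}
    if mode == "dense":
        # Everything is collapsed onto a single line.
        return {name: 0 for _, _, name in ordered}
    if mode == "full":
        # One line per feature.
        return {name: row for row, (_, _, name) in enumerate(ordered)}
    if mode in ("pack", "squish"):
        # Greedy packing: reuse the first row whose last feature has ended.
        row_ends = []
        rows = {}
        for start, end, name in ordered:
            for row, row_end in enumerate(row_ends):
                if start >= row_end:
                    row_ends[row] = end
                    rows[name] = row
                    break
            else:
                row_ends.append(end)
                rows[name] = len(row_ends) - 1
        return rows
    raise ValueError("unknown display mode: " + mode)


# Hypothetical features on one track.
features = [(0, 100, "a"), (50, 150, "b"), (120, 200, "c"), (160, 300, "d")]
packed = assign_rows(features, "pack")
full = assign_rows(features, "full")
dense = assign_rows(features, "dense")
```

In this example the four features need four rows in full mode but only two in pack mode, since features a and c do not overlap and can share a line, as can b and d.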
Figure 4.2 The default view of the UCSC Genome Browser, showing the genomic context of the human HIF1A gene.
In order to simplify the display, most tracks are in hide mode by default. To change the mode, use the pull-down menu below the track name or on the Track Settings page. Other settings, such as color or annotation details, can also be configured on the Track Settings page. For example, the NCBI RefSeq track allows users to choose whether to view all reference sequences or only those that are curated or predicted (Box 1.2). One possible point of confusion is that the UCSC Genome Browser will “remember” the mode in which each track is displayed from session to session. Custom settings can be cleared by selecting Reset all User Settings under the Genome Browser pull-down menu at the top of any page.
The annotation tracks in the window below the chromosome are the focus of the Genome Browser (Figure 4.2). Tracks are depicted horizontally, with a title above the track and labels on the left. The first two lines show the scale and chromosomal position. The term that was searched for and matched (HIF1A in this case) is highlighted on the annotation tracks. The next tracks shown by default are gene prediction tracks. The default gene track on GRCh38 is the GENCODE Genes set, which replaces the UCSC Genes track that is still displayed on GRCh37 and older human assemblies. GENCODE genes are annotated using a combination of computational analysis and manual curation, and are used by the ENCODE Consortium and other groups as reference gene sets (Box 4.2). The GENCODE v24 track depicts all of the gene models from the GENCODE v24 release, which includes both protein-coding genes and non-coding RNA genes.
Box 4.2 GENCODE
The GENCODE gene set was originally developed by the ENCODE Consortium as a comprehensive source of high-quality human gene annotations (Harrow et al. 2012). It has now been expanded to include the mouse genome (Mudge and Harrow 2015). The goal of the GENCODE project is to include all alternative splice variants of protein-coding loci, as well as non-coding loci and pseudogenes. The GENCODE Consortium uses computational methods, manual curation, and experimental validation to identify these gene features. The