Bioinformatics. Group of authors
return an external identifier (i.e. the starting RefSeq accession number) and an ortholog in the same BioMart query. Starting with the same Filter and human RefSeq accession numbers as before, choose the Homologues section of the Attributes and select the human Ensembl gene identifier and gene name under Gene → Ensembl, as well as the mouse Ensembl gene identifier and gene name under Orthologues → Mouse Orthologues. The results are shown in Figure 4.22e. Note that not all of the human gene identifiers have been mapped to a corresponding mouse ortholog. The goal of this exercise was to identify the mouse orthologs of the human RefSeq accession numbers from the GWAS Catalog. Using the human Ensembl gene identifiers as a key, the human RefSeq accession numbers can be added to the list of mouse orthologs. This can be carried out by using the VLOOKUP function in Microsoft Excel, or by writing a script in your favorite programming language, and is left as an exercise for the reader.
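For readers who prefer a script to VLOOKUP, the join on the human Ensembl gene identifier can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed solution. It assumes two tab-delimited BioMart exports, one mapping RefSeq accession numbers to human Ensembl gene identifiers and one mapping human Ensembl gene identifiers to mouse orthologs. The column headers, accession numbers, and gene identifiers are placeholders invented for the example and should be replaced with those found in the actual export files.

```python
import csv
import io

# Hypothetical BioMart exports as tab-delimited text.
# All headers and identifiers below are invented placeholders.
refseq_to_human = (
    "RefSeq mRNA ID\tGene stable ID\n"
    "NM_EXAMPLE1\tENSG_EXAMPLE_A\n"
    "NM_EXAMPLE2\tENSG_EXAMPLE_B\n"
    "NM_EXAMPLE3\tENSG_EXAMPLE_C\n"
)
human_to_mouse = (
    "Gene stable ID\tGene name\tMouse gene stable ID\tMouse gene name\n"
    "ENSG_EXAMPLE_A\tGENEA\tENSMUSG_EXAMPLE_A\tGenea\n"
    "ENSG_EXAMPLE_B\tGENEB\t\t\n"  # no mouse ortholog was mapped
    "ENSG_EXAMPLE_C\tGENEC\tENSMUSG_EXAMPLE_C\tGenec\n"
)


def read_tsv(text):
    """Parse tab-delimited text into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def add_refseq_to_orthologs(refseq_rows, ortholog_rows):
    """Join the two tables, using the human Ensembl gene ID as the key."""
    # One gene can have several RefSeq accessions, so keep a list per gene.
    refseq_by_gene = {}
    for row in refseq_rows:
        refseq_by_gene.setdefault(row["Gene stable ID"], []).append(
            row["RefSeq mRNA ID"]
        )
    merged = []
    for row in ortholog_rows:
        for accession in refseq_by_gene.get(row["Gene stable ID"], []):
            merged.append(
                {
                    "refseq": accession,
                    "human_gene": row["Gene stable ID"],
                    # Empty string means no ortholog; store it as None.
                    "mouse_gene": row["Mouse gene stable ID"] or None,
                }
            )
    return merged


merged = add_refseq_to_orthologs(
    read_tsv(refseq_to_human), read_tsv(human_to_mouse)
)
for m in merged:
    print(m["refseq"], m["human_gene"],
          m["mouse_gene"] or "no ortholog mapped", sep="\t")
```

Keeping a list of accessions per gene, rather than a single value, avoids silently dropping rows when several RefSeq accession numbers map to the same Ensembl gene.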
JBrowse
While the UCSC and Ensembl Genome Browsers provide user-friendly interfaces for viewing genomic data from well-characterized organisms, there are fewer applications for displaying genome assemblies and annotations for newly sequenced organisms or non-standard assemblies. The source code and executables for the UCSC Genome Browser are freely available for academic, non-profit, and personal use, and can be set up to display custom data, not just those provided by UCSC. Thus, one option is for researchers to host their own UCSC Genome Browser and use it to share custom genomes with the bioinformatics community. An alternate method for sharing novel genome assemblies is to set up an Assembly Hub. Researchers host the specially formatted genomic sequence and data tracks on their own web site, and anyone with the URL can view the assembly through the UCSC Genome Browser.
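To give a sense of what "specially formatted" means in practice, the sketch below uses Python to write the two small text files that sit at the top of an Assembly Hub. It is an illustration under stated assumptions rather than a complete recipe: the hub name, labels, e-mail address, file paths, and assembly name are all hypothetical, and the setting names should be checked against the current UCSC hub documentation before use. A working hub also needs the sequence file and track definition file that these settings point to.

```python
def stanza(settings):
    """Format one hub stanza as 'setting value' lines, one per line."""
    return "\n".join(f"{key} {value}" for key, value in settings.items()) + "\n"


# hub.txt: describes the hub and points to the list of genomes.
# All values are hypothetical placeholders.
hub_txt = stanza(
    {
        "hub": "exampleHub",
        "shortLabel": "Example assembly hub",
        "longLabel": "Example assembly hub for a newly sequenced organism",
        "genomesFile": "genomes.txt",
        "email": "curator@example.org",
    }
)

# genomes.txt: one stanza per assembly, pointing to the sequence
# (2bit format) and to the file that defines the annotation tracks.
genomes_txt = stanza(
    {
        "genome": "exampleAsm1",
        "trackDb": "exampleAsm1/trackDb.txt",
        "twoBitPath": "exampleAsm1/exampleAsm1.2bit",
        "organism": "Example organism",
        "defaultPos": "scaffold1:1-10000",
    }
)

print(hub_txt)
print(genomes_txt)
```

The resulting text would be saved as hub.txt and genomes.txt on the researcher's own web server, alongside the files they reference.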
Another way to share novel genome assemblies is to use JBrowse (Buels et al. 2016), a web-based genome browser that is part of the Generic Model Organism Database (GMOD) project, a suite of tools for generating genomic databases. JBrowse can handle data in a variety of formats, and is relatively easy to install on a Linux- or Mac OS X-based web server (Skinner and Holmes 2010). JBrowse browsers support plant genomes (e.g. Phytozome), animal genomes (e.g. the Rat Genome Database), and disease-related databases of human data (e.g. the COSMIC Genome Browser).
An example of using JBrowse to view a customized genome assembly and associated annotations is at the Mnemiopsis Genome Project (MGP) Portal at the National Human Genome Research Institute (NHGRI) of the US National Institutes of Health (NIH). Mnemiopsis leidyi is a ctenophore, or comb jelly, a member of a phylum of gelatinous zooplankton found in all the world's seas. The members of this phylum are called comb jellies because of their highly ciliated comb rows, which provide their primary means of locomotion. These early-branching metazoans have proven to be important model organisms for understanding the diversity and complexity seen in the early evolution of animals. The Mnemiopsis data featured in this portal are the first set of whole genome sequencing data on any ctenophore species to be published and made available to the scientific community (Moreland et al. 2014). The portal provides not only genomic and protein model sequence data, but also a BLAST search interface, pathway and protein domain analysis, and a customized genome browser, implemented in JBrowse, to display the annotation data.
The Mnemiopsis genome was assembled into 5100 scaffolds using next-generation sequencing data from the Roche 454 and Illumina GA-II platforms (Ryan et al. 2013). The Mnemiopsis protein-coding gene models were predicted by integrating the results of ab initio gene prediction programs with RNA-seq transcript data and sequence similarity to other protein datasets. A view of one of those scaffolds is shown in Figure 4.23. As with the UCSC and Ensembl Genome Browsers, data are organized in horizontal tracks, and exons are shown as colored boxes. The first track, SCF, is the scaffold. The gene model track, labeled 2.2, displays the exons of the predicted gene models. The next track, called PFAM2.2, highlights Pfam domains found in the gene model. The Mnemiopsis RNA-seq reads were assembled into transcripts using the Cufflinks program (Trapnell et al. 2010), and the CL2 track shows the alignment of those transcripts to the genomic scaffold. The MASK track highlights repetitive regions. The EST and GBNT tracks show, respectively, the alignments of publicly available Mnemiopsis ESTs and other RNA sequences from GenBank. These two tracks are empty in this region, meaning that no previously deposited GenBank sequence corresponds to the gene in the gene model track; it is therefore a novel gene prediction. The overlap between the exons on the Pfam and gene model tracks shows that the predicted gene contains known protein domains. The CL2 track lends further support to the gene prediction, as the exons of the experimentally derived Mnemiopsis transcripts overlap the exons on the gene model track.
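The visual comparison of tracks described above can also be expressed numerically. The Python sketch below computes the fraction of gene-model exon bases that are covered by features from an evidence track. It is an illustration only, not the method used by the MGP Portal: the coordinates are invented, and intervals are assumed to be 1-based and inclusive on a single scaffold.

```python
def overlap_length(a, b):
    """Number of bases shared by two inclusive (start, end) intervals."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def supported_fraction(model_exons, evidence_features):
    """Fraction of exon bases covered by at least one evidence feature."""
    total = sum(end - start + 1 for start, end in model_exons)
    covered = 0
    for exon in model_exons:
        # Clip each overlapping evidence feature to the exon boundaries.
        clipped = sorted(
            (max(s, exon[0]), min(e, exon[1]))
            for s, e in evidence_features
            if overlap_length(exon, (s, e)) > 0
        )
        # Merge clipped pieces so shared bases are counted only once.
        cur_start, cur_end = None, None
        for s, e in clipped:
            if cur_end is None or s > cur_end + 1:
                if cur_end is not None:
                    covered += cur_end - cur_start + 1
                cur_start, cur_end = s, e
            else:
                cur_end = max(cur_end, e)
        if cur_end is not None:
            covered += cur_end - cur_start + 1
    return covered / total if total else 0.0


# Invented coordinates standing in for the tracks of a single scaffold.
gene_model = [(100, 200), (300, 400), (500, 600)]               # 2.2 track
transcripts = [(90, 200), (300, 380), (350, 400), (520, 560)]   # CL2 track
pfam_domains = [(150, 200), (300, 330)]                         # PFAM2.2 track
ests = []                                                       # empty EST track

print(round(supported_fraction(gene_model, transcripts), 3))
print(round(supported_fraction(gene_model, pfam_domains), 3))
print(supported_fraction(gene_model, ests))
```

An empty evidence track yields a fraction of zero, which corresponds to the situation in which the EST and GBNT tracks provide no support for a gene model.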
Navigation in JBrowse is fairly straightforward, especially for those already accustomed to using the UCSC or Ensembl Genome Browsers. Tracks can be added to or removed from the display by using the checkboxes on the left side of the window. On the display window, click on a track name and drag it to move the track up or down. To shift the focus of the display window upstream or downstream, click on the display and drag it to the left or right. The left and right arrows at the top of the page also move the display window. JBrowse provides multiple ways to zoom in and out. One option is to use the plus and minus magnifying glasses at the top of the page. Alternatively, place the mouse in the sequence coordinates above the top track, then click and drag to highlight a region and zoom in on it. Double-clicking on a region also zooms in. Clicking on a track feature opens a window with additional information about that feature. For example, on the MGP Portal, clicking on a gene model in the 2.2 track opens the Gene Wiki for that model, a detailed page that includes nucleotide and protein sequences, pre-computed BLAST searches, and annotated Pfam domains. Note that although the general look and feel of JBrowse will remain similar across different genomes, individual JBrowse developers will create tracks and customizations that are specific to their genome project.
Figure 4.23 JBrowse display of a predicted Mnemiopsis gene (ML05372a) from the Mnemiopsis Genome Project Portal at the National Human Genome Research Institute. Seven tracks are shown on this display: SCF, assembled genomic regions are solid black and intermittent gaps are shaded bright pink; 2.2, consensus Mnemiopsis gene models; PFAM2.2, non-redundant Mnemiopsis protein domains derived from Pfam; CL2, RNA-seq reads derived from Mnemiopsis embryos, assembled into transcripts using Cufflinks (Trapnell et al. 2010); MASK, genomic regions that have been repeat-masked using VMatch are shaded in light blue; EST, Mnemiopsis expressed sequence tags (ESTs) from GenBank; GBNT, Mnemiopsis mRNAs and other non-EST RNAs from GenBank.
Summary
The UCSC and Ensembl Genome Browsers are sophisticated tools that provide free, web-based access to genome assemblies and annotations. This chapter has focused on examples from the human genome and a subset of the annotation tracks available for it. By adding tracks to the default view, users are able to view annotated genes, sequence variants, gene regulatory regions, gene expression data, and much more. The displays are highly customizable, and users can choose which data to view, the display style, and, in some cases, even change the colors of the annotated features. Both browsers can be accessed not only by text-based queries, such as gene symbol or chromosomal position, but also by searches with either nucleotide or protein sequences. The UCSC Genome Browser supports the BLAT search engine, while Ensembl supports both BLAT and BLAST, depending on the analysis type. Furthermore, the UCSC Table Browser and Ensembl's BioMart provide alternate entry points into the underlying data at each site, in which queries can be constructed using a web-based interface and data returned as text that can be downloaded and further manipulated. Although the examples illustrated in this chapter all derive from the GRCh38 assembly of the human genome, both UCSC