Bioinformatics. Group of authors

step is carried out by the same Ensembl gene annotation pipeline that is used to annotate all vertebrate genomes displayed at Ensembl (Aken et al. 2016). This pipeline aligns cDNAs, proteins, and RNA-seq data to the human genome in order to create candidate transcript models. All Ensembl transcript models are supported by experimental evidence; no models are created solely from ab initio predictions. The Human and Vertebrate Analysis and Annotation (HAVANA) group produces manually curated gene sets for several vertebrate genomes, including mouse and human. These manually curated genes are merged with the Ensembl transcript models to create the GENCODE gene sets for mouse and human. A subset of the human models has been confirmed by an experimental validation pipeline (Howald et al. 2012).

      The consortium makes available two types of GENCODE gene sets. The Comprehensive set encompasses all gene models, and may include many alternatively spliced transcripts (isoforms) for each gene. The Basic set includes a subset of representative transcripts for each gene that prioritizes full-length protein-coding transcripts over partial- or non-protein-coding transcripts. The Ensembl Genome Browser displays the Comprehensive set by default. Although the UCSC Genome Browser displays the Basic set by default, the Comprehensive set can be selected by changing the GENCODE track settings. At the time of this writing, Ensembl is displaying GENCODE v27, released in August 2017. The GENCODE version available by default at the UCSC Genome Browser is v24, from December 2015. More recent versions of GENCODE can be added to the browser by selecting them in the All GENCODE super-track.
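The relationship between the two sets can be illustrated with a short sketch. It assumes a GENCODE-style GTF file in which transcripts belonging to the Basic set carry the attribute `tag "basic"` in the ninth column; the function names and the identifiers in the toy records (`GENE1`, `TX1`, and so on) are hypothetical and chosen only for illustration.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
# Unquoted numeric attributes (e.g. level 2) are deliberately ignored.
_ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(attr_field):
    """Return a dict mapping each attribute key to the list of its values."""
    attrs = {}
    for key, value in _ATTR_RE.findall(attr_field):
        attrs.setdefault(key, []).append(value)
    return attrs


def basic_transcripts(gtf_lines):
    """Yield (gene_id, transcript_id) for transcript records tagged "basic".

    Assumption: Basic-set membership is marked with tag "basic".
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript-level records are of interest here
        attrs = parse_attributes(fields[8])
        if "basic" in attrs.get("tag", []):
            yield attrs["gene_id"][0], attrs["transcript_id"][0]


# Toy records with hypothetical identifiers (not real annotation).
EXAMPLE_GTF = [
    "##description: toy example\n",
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "GENE1"; transcript_id "TX1"; level 2; tag "basic";\n',
    'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\t'
    'gene_id "GENE1"; transcript_id "TX2"; level 2;\n',
    'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\t'
    'gene_id "GENE1"; transcript_id "TX1"; tag "basic";\n',
]
```

Calling `list(basic_transcripts(EXAMPLE_GTF))` on the toy records keeps only `TX1`, whereas the Comprehensive view would retain both transcripts of `GENE1`.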

      GENCODE and RefSeq both aim to provide a comprehensive gene set for mouse and human. Frankish et al. (2015) have shown that, in human, the RefSeq gene set is more similar to the GENCODE Basic set, while the GENCODE Comprehensive set contains more alternative splicing and exons, as well as more novel protein-coding sequences, thus covering more of the genome. They also sought to determine which gene set would provide the best reference transcriptome for annotating variants. They found that the GENCODE Comprehensive set, because of its better genomic coverage, was better for discovering new variants with functional potential, while the GENCODE Basic set may be better suited for applications where a less complex set of transcripts is needed. Similarly, Wu et al. (2013) compared the use of different gene sets to quantify RNA-seq reads and determine gene expression levels. Like Frankish et al., they recommend using less complex gene annotations (such as the RefSeq gene set) for gene expression estimates, but more complex gene annotations (such as GENCODE) for exploratory research on novel transcriptional or regulatory mechanisms.
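One simple way to make "covering more of the genome" concrete is to count the bases spanned by the union of all exons in each gene set. The sketch below is only an illustration of that idea, not the method used by Frankish et al.; it assumes 1-based inclusive coordinates on a single chromosome, and the two toy interval lists are invented.

```python
def covered_bases(intervals):
    """Count bases covered by the union of 1-based, inclusive intervals.

    Overlapping exons are merged first so that shared bases are counted once.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # No overlap with the running block: close it and start a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlap: extend the running block if this interval reaches further.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


# Hypothetical exon coordinates on one chromosome (illustration only).
simpler_set = [(100, 200), (150, 250)]
richer_set = [(100, 200), (150, 250), (400, 450)]

simpler_coverage = covered_bases(simpler_set)  # 151 bases (100-250)
richer_coverage = covered_bases(richer_set)    # 202 bases (100-250 plus 400-450)
```

In this toy case the richer set covers 51 additional bases because it contains an exon that the simpler set lacks; variants falling in that region could only be annotated against the richer set.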

Figure: the genomic context of the human HIF1A gene, after clicking on zoom out three times.

Figure: the RefSeq Track Settings page.

Figure: the genomic context of the human HIF1A gene, after displaying RefSeq Curated genes in full mode.