Bioinformatics. Group of authors

overall small number of available sequences. By the late 1970s, when a significant number of nucleotide sequences became available, those were also included in later editions of the Atlas. As this collection evolved, it included text-based descriptions to accompany the protein sequences, as well as information regarding the evolution of many protein families. This work, in essence, was the first annotated sequence database, even though it was in printed form. Over time, the amount of data contained in the Atlas became unwieldy and the need for it to be available in electronic form became obvious. From the early 1970s to the late 1980s, the contents of the Atlas were distributed electronically by NBRF (and later by the Protein Information Resource, or PIR) on magnetic tape, and the distribution included some basic programs that could be used to search and evaluate distant evolutionary relationships.

      In parallel with the early work being done on DNA sequence databases, the foundations for the Swiss-Prot protein sequence database were being laid in the early 1980s by Amos Bairoch, who recounts its history in an engaging first-person review (Bairoch 2000). Bairoch converted PIR's Atlas to a format similar to that used by EMBL for its nucleotide database. In this initial release, called PIR+, additional information about each of the proteins was added, increasing its value as a curated, well-annotated source of information on proteins. In the summer of 1986, Bairoch began distributing PIR+ on the US BIONET (a precursor to the Internet), renaming it Swiss-Prot. At that time, it contained the grand sum of 3900 protein sequences, which was seen as an overwhelming amount of data, in stark contrast to today's standards. As Swiss-Prot and EMBL followed similar formats, a natural collaboration developed between these two groups, and these collaborative efforts strengthened when both EMBL's and Swiss-Prot's operations were moved to EMBL's European Bioinformatics Institute (EBI; Cook et al. 2018) in Hinxton, UK. One of the first collaborative projects undertaken by the Swiss-Prot and EMBL teams was to create a new and much larger protein sequence database to supplement Swiss-Prot. Because maintaining the high quality of Swiss-Prot entries was a time-consuming process involving extensive sequence analysis and detailed curation by expert annotators (Apweiler 2001), a new database called TrEMBL (for “translation of EMBL nucleotide sequences”) was created to allow the quick release of protein data not yet annotated to Swiss-Prot's stringent standards. This supplement to Swiss-Prot initially consisted of computationally annotated sequence entries derived from the translation of all coding sequences (CDSs) found in INSDC databases.
In 2002, a new effort involving the Swiss Institute of Bioinformatics, EMBL-EBI, and PIR was launched, called the UniProt consortium (UniProt Consortium 2017). This effort gave rise to the UniProt Knowledgebase (UniProtKB), consisting of Swiss-Prot, TrEMBL, and PIR. A similar effort also gave rise to the NCBI Protein Database, bringing together data from numerous sources and described more fully in the text that follows.
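      The conceptual translation of coding sequences that populated TrEMBL, as described above, can be illustrated with a minimal sketch. It assumes the standard genetic code and a CDS that is already in frame and begins at the start codon; the function name translate_cds is hypothetical, and this is not the pipeline actually used to build TrEMBL.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Hypothetical helper for illustration only. Codons containing ambiguous
    bases are reported as 'X'.
    """
    seq = cds.upper().replace("U", "T")  # accept RNA as well as DNA input
    if len(seq) % 3 != 0:
        raise ValueError("CDS length must be a multiple of three")
    protein = []
    for i in range(0, len(seq), 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)
```

      Calling translate_cds("ATGGCCTAA") gives the two-residue peptide "MA". Real CDS features can also specify alternative genetic codes, reading-frame offsets, and partial ends, none of which this sketch handles.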

      As described above, the major sources of nucleotide sequence data are the databases involved in INSDC – DDBJ, ENA, and GenBank – with new or updated data being shared between these three entities once every 24 hours. This transfer is facilitated by the use of common data formats for the kinds of information described in detail below.

      The elementary format underlying the information held in sequence databases is a text file called the flatfile. The correspondence between individual flatfile formats greatly facilitates the daily exchange of data between each of these databases. In most cases, fields can be mapped on a one-to-one basis from one flatfile format to the other. Over time, various file formats have been adopted and have found continued widespread use; others have fallen by the wayside for a variety of reasons. The success of a given format depends on its usefulness in a variety of contexts, as well as its power in effectively containing and representing the types of biological data that need to be archived and communicated to scientists.
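      The one-to-one correspondence between flatfile fields can be pictured as a simple lookup table. The sketch below pairs a handful of GenBank flatfile keywords with the two-letter line codes of the EMBL flatfile format. The table is deliberately partial, and the function name remap_fields is hypothetical; it illustrates the idea and is not the exchange mechanism used by the INSDC databases.

```python
# Partial, illustrative mapping from GenBank flatfile keywords to the
# two-letter line codes used in the EMBL flatfile format.
GENBANK_TO_EMBL = {
    "LOCUS": "ID",
    "DEFINITION": "DE",
    "ACCESSION": "AC",
    "KEYWORDS": "KW",
    "COMMENT": "CC",
}


def remap_fields(fields):
    """Relabel (keyword, value) pairs from GenBank keywords to EMBL line codes.

    Hypothetical helper: returns the pairs that could be mapped, plus any
    pairs whose keyword is not in the (partial) table, in their input order.
    """
    mapped, unmapped = [], []
    for keyword, value in fields:
        code = GENBANK_TO_EMBL.get(keyword)
        if code is None:
            unmapped.append((keyword, value))
        else:
            mapped.append((code, value))
    return mapped, unmapped
```

      Keeping the unmapped pairs separate, rather than silently dropping them, makes the exceptions to the one-to-one rule visible.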

      In its simplest form, a sequence record can be represented as a string of nucleotides with some basic tag or identifier. The most widely used of these simple formats is FASTA, originally introduced as part of the FASTA software suite developed by Lipman and Pearson (1985) that is described in detail in Chapter 3. This inherently simple format provides an easy way of handling primary data for both humans and computers, taking the following form.

       >U54469.1
       CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA
       GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG
       TAACCTACGCAGCTTGAGTGCGTAACCGATATCTAGTATACATTTCGATACATCGAAATCATGGTAGTGT
       TGGAGACGGAGAAGGTAAGACGATGATAGACGGCGAGCCGCATGGGTTCGATTTGCGCTGAGCCGTGGCA
       GGGAACAACAAAAACAGGGTTGTTGCACAAGAGGGGAGGCGATAGTCGAGCGGAAAAGAGTGCAGTTGGC
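      Because the layout is so simple, a record like the one above can be read with a few lines of code. The following is a minimal sketch rather than a full-featured parser. It assumes that the first whitespace-delimited token on the definition line is the identifier and that everything up to the next “greater than” character is sequence. The names FastaRecord, parse_fasta, and split_accession_version are hypothetical.

```python
from typing import Iterator, List, NamedTuple, Optional, Tuple


class FastaRecord(NamedTuple):
    identifier: str   # first token on the definition line, e.g. "U54469.1"
    description: str  # remainder of the definition line (may be empty)
    sequence: str     # all sequence lines joined and uppercased


def _make_record(header: str, chunks: List[str]) -> FastaRecord:
    identifier, _, description = header.partition(" ")
    return FastaRecord(identifier, description.strip(), "".join(chunks).upper())


def parse_fasta(text: str) -> Iterator[FastaRecord]:
    """Yield one FastaRecord per '>' definition line found in text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence line belonging to the current record
    if header is not None:
        yield _make_record(header, chunks)


def split_accession_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'U54469.1' into ('U54469', 1); the version is None if absent."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version.isdigit() else None
```

      Joining the sequence lines means the parser does not depend on any particular line width, and the uppercasing reflects the fact that case carries no meaning in this simple form of the format.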

      For brevity, only the first few lines of the U54469.1 sequence are shown in the record above. In the simplest incarnation of the FASTA format, the “greater than” character (>) designates the beginning of a new sequence record; this line is referred to as the definition line (commonly called the “def line”). A unique identifier, in this case the accession.version number (U54469.1), is followed on the subsequent lines by the nucleotide sequence, in either uppercase or lowercase letters, broken at a fixed line width (70 characters per line in this record; 60 is also common). The accession number is the number that is always associated with this sequence (and should be cited in publications), while the version number suffix allows users to easily determine whether they are

