Bioinformatics. Group of authors
nodes in a taxonomic tree, with the most general grouping (Eukaryota) given first.
OS   Drosophila melanogaster (fruit fly)
OC   Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota;
OC   Neoptera; Holometabola; Diptera; Brachycera; Muscomorpha; Ephydroidea;
OC   Drosophilidae; Drosophila; Sophophora.
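The OS and OC lines above can be read programmatically. The following is a minimal Python sketch, not a full flat-file parser: the function name `parse_taxonomy` is hypothetical, and the code assumes that each line starts with its two-letter code followed by whitespace, as in the excerpt.

```python
# Sample OS/OC lines copied from the record above (ENA-style layout).
RECORD = """\
OS   Drosophila melanogaster (fruit fly)
OC   Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota;
OC   Neoptera; Holometabola; Diptera; Brachycera; Muscomorpha; Ephydroidea;
OC   Drosophilidae; Drosophila; Sophophora.
"""


def parse_taxonomy(text):
    """Return (organism, lineage) from the OS and OC lines of a record.

    Hypothetical helper: assumes the two-letter line code is followed by
    whitespace and that the OC lines form one semicolon-separated list
    terminated by a period.
    """
    organism_parts = []
    lineage_parts = []
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "OS":
            organism_parts.append(value)
        elif code == "OC":
            lineage_parts.append(value)
    organism = " ".join(organism_parts)
    # Merge the wrapped OC lines, drop the final period, split on ';'.
    joined = " ".join(lineage_parts).rstrip(".")
    lineage = [name.strip() for name in joined.split(";") if name.strip()]
    return organism, lineage


organism, lineage = parse_taxonomy(RECORD)
print(organism)
print(lineage[0], "->", lineage[-1])  # most general -> most specific grouping
```

Because the classification is listed from the most general grouping downward, the first element of `lineage` is Eukaryota and the last is the most specific grouping given in the record.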
Each record must have at least one reference or citation, noted within what are called reference blocks. These reference blocks offer scientific credit and set a context explaining why this particular sequence was determined. The reference blocks take the following form.
RN   [1]
RP   1-2881
RX   DOI; .1074/jbc.271.27.16393.
RX   PUBMED; 8663200.
RA   Lavoie C.A., Lachance P.E., Sonenberg N., Lasko P.;
RT   "Alternatively spliced transcripts from the Drosophila eIF4E gene produce
RT   two different Cap-binding proteins";
RL   J Biol Chem 271(27):16393-16398(1996).
XX
RN   [2]
RP   1-2881
RA   Lasko P.F.;
RT   ;
RL   Submitted (09-APR-1996) to the INSDC.
RL   Paul F. Lasko, Biology, McGill University, 1205 Avenue Docteur Penfield,
RL   Montreal, QC H3A 1B1, Canada
In this case, two references are shown, one referring to a published paper and the other referring to the submission of the sequence record itself. In the example above, the second block provides information on the senior author of the paper listed in the first block, as well as the author's postal address. The date shown in the second block indicates when the sequence (and accompanying information) was submitted to the database, not when the record was first made public, so it cannot support any inference or claim about the date of first public release. Additional submitter blocks may be added to the record each time the sequence is updated.
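To make the distinction between the two kinds of reference block concrete, here is a hedged Python sketch that groups the reference lines by RN block and labels a block as a submission when its RL text begins with "Submitted". The function name `parse_references` and the labeling rule are assumptions made for illustration; the sample is abridged to the lines needed (the RX DOI line is omitted).

```python
from collections import defaultdict

# Reference lines from the record above, abridged (RX DOI line omitted).
REFERENCES = """\
RN   [1]
RP   1-2881
RX   PUBMED; 8663200.
RA   Lavoie C.A., Lachance P.E., Sonenberg N., Lasko P.;
RT   "Alternatively spliced transcripts from the Drosophila eIF4E gene produce
RT   two different Cap-binding proteins";
RL   J Biol Chem 271(27):16393-16398(1996).
XX
RN   [2]
RP   1-2881
RA   Lasko P.F.;
RT   ;
RL   Submitted (09-APR-1996) to the INSDC.
RL   Paul F. Lasko, Biology, McGill University, 1205 Avenue Docteur Penfield,
RL   Montreal, QC H3A 1B1, Canada
"""


def parse_references(text):
    """Group reference lines into one dict per RN block.

    Hypothetical helper: wrapped lines that share a line code (for example
    two RT lines) are joined with a single space.
    """
    blocks = []
    current = None
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code == "RN":
            # Every RN line opens a new reference block.
            current = defaultdict(list)
            blocks.append(current)
        if current is not None and code.startswith("R"):
            current[code].append(value)  # the XX spacer line is skipped
    return [
        {code: " ".join(values) for code, values in block.items()}
        for block in blocks
    ]


refs = parse_references(REFERENCES)
for ref in refs:
    # Assumed rule: a submission block's RL text starts with "Submitted".
    kind = "submission" if ref["RL"].startswith("Submitted") else "publication"
    print(ref["RN"], kind, ref["RA"])
```

The date extracted from a submission block is still only the submission date; as noted above, it says nothing about when the record was first released.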
Some headers may contain COMMENT (DDBJ/GenBank) or CC (ENA) lines. These lines can include a great variety of notes and comments (descriptors) that refer to the entire record. Often, genome centers will use these lines to provide contact information and to confer acknowledgments. Comments also may include the history of the sequence. If the sequence of a particular record is updated, the comment will contain a pointer to the previous versions of the record. Alternatively, if an earlier version of the record is retrieved, the comment will point forward to the newer version, as well as backwards, if there was a still earlier version. Finally, there are database cross-reference lines (marked DR) that provide links to allied databases containing information related to the sequence of interest. Here, a cross-reference to FlyBase can be seen in the complete header for this record in Appendix 1.1. Note that the corresponding DDBJ/GenBank header in Appendix 1.2 does not contain these cross-references.
The Feature Table
Early on in the collaboration between INSDC partner organizations, an effort was made to come up with a common way to represent the biological information found within a given database record. This common representation is called the feature table. It consists of feature keys (a single word or abbreviation indicating the described biological property), location information denoting where the feature is located within the sequence, and qualifiers providing further descriptive information about the feature. The online INSDC feature table documentation is extensive and describes in great detail what features are allowed and what qualifiers can be used with each individual feature. Wording within the feature table uses common biological research terminology wherever possible and is consistent between DDBJ, ENA, and GenBank entries.
Here, we will dissect the feature table for the eukaryotic initiation factor 4E (eIF4E) gene from Drosophila melanogaster, shown in its entirety in both Appendices 1.3 (in ENA format) and 1.4 (in DDBJ/GenBank format). This particular sequence is alternatively spliced, producing two distinct gene products, 4E-I and 4E-II. The first block of information in the feature table is always the source feature, indicating the biological source of the sequence and additional information relating to the entire sequence. This feature must be present in all INSDC entries, as all DNA or RNA sequences derive from some specific biological source, including synthetic DNA.
FT   source          1..2881
FT                   /organism="Drosophila melanogaster"
FT                   /chromosome="3"
FT                   /map="67A8-B2"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:7227"
FT   gene            80..2881
FT                   /gene="eIF4E"
In the first line of the source key, notice that the numbering scheme shows the range of positions covered by this feature key as two numbers separated by two dots (1..2881). As the source key pertains to the entire sequence, we can infer that the sequence described in this entry is 2881 nucleotides in length. The various ways in which the location of any given feature can be indicated are shown in Table 1.1, accounting for a wide range of biological scenarios. The qualifiers then follow, each preceded by a slash. The full scientific name of the organism is provided, as are specific mapping coordinates, indicating that this sequence is at map location 67A8-B2 on chromosome 3. Also indicated is the type of molecule that was sequenced (genomic DNA). Finally, the last line indicates a database cross-reference (abbreviated as db_xref) to the NCBI taxonomy database, where taxon 7227 corresponds to D. melanogaster. In general, these cross-references are controlled qualifiers that allow entries to be connected to an external database, using an identifier that is unique to that external database. Following the source block above is the gene feature, indicating that the gene itself is a subset of the entire sequence in this entry, starting at position 80 and ending at position 2881.
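The structure just described (a feature key, a location, then qualifiers each preceded by a slash) can be sketched in Python as follows. This is an illustrative sketch under simplifying assumptions, not a complete feature table parser: `parse_features` and `span_length` are hypothetical names, qualifier values are assumed to fit on one line, and only simple `start..end` locations are measured.

```python
import re

# Source and gene features copied from the excerpt above.
FEATURES = """\
FT   source          1..2881
FT                   /organism="Drosophila melanogaster"
FT                   /chromosome="3"
FT                   /map="67A8-B2"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:7227"
FT   gene            80..2881
FT                   /gene="eIF4E"
"""


def parse_features(text):
    """Split FT lines into features with a key, a location, and qualifiers.

    Assumption: a line whose content starts with '/' is a qualifier of the
    most recent feature; any other FT line starts a new feature.
    """
    features = []
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if body.startswith("/"):
            name, _, value = body[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
        else:
            key, location = body.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


def span_length(location):
    """Length of a simple 'start..end' location, inclusive at both ends."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError(f"not a simple range: {location!r}")
    start, end = map(int, match.groups())
    return end - start + 1


features = parse_features(FEATURES)
source, gene = features
print(span_length(source["location"]))   # 2881, the length of the entry
print(source["qualifiers"]["db_xref"])   # taxon:7227
print(span_length(gene["location"]))     # 2802 positions, from 80 to 2881
```

Because both ends of a range are included, the gene feature at 80..2881 covers 2881 - 80 + 1 = 2802 positions.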
FT   mRNA            join(80..224,892..1458,1550..1920,1986..2085,2317..2404,
FT                   2466..2881)
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-I"
FT   mRNA            join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-II"
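The two join() locations above differ by a single segment. A short Python sketch, using the hypothetical helpers `exons` and `spliced_length` and assuming that a join contains only simple `start..end` ranges, shows how the spliced lengths and the differing segment can be derived from the locations alone.

```python
import re

# The two mRNA locations from the feature table above, with the wrapped
# continuation line of the first one already merged.
MRNA_4E_I = "join(80..224,892..1458,1550..1920,1986..2085,2317..2404,2466..2881)"
MRNA_4E_II = "join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)"


def exons(location):
    """Return the (start, end) pairs listed inside a join(...) location.

    Assumption: every element is a simple 'start..end' range; partial
    boundaries, complement() and remote accessions are not handled.
    """
    inner = re.fullmatch(r"join\((.+)\)", location)
    if inner is None:
        raise ValueError(f"not a join location: {location!r}")
    pairs = []
    for part in inner.group(1).split(","):
        start, end = part.split("..")
        pairs.append((int(start), int(end)))
    return pairs


def spliced_length(location):
    """Total length of the joined segments, each inclusive at both ends."""
    return sum(end - start + 1 for start, end in exons(location))


print(spliced_length(MRNA_4E_I))    # 1687
print(spliced_length(MRNA_4E_II))   # 1120
# Segments present in the 4E-I transcript but absent from 4E-II.
print(sorted(set(exons(MRNA_4E_I)) - set(exons(MRNA_4E_II))))  # [(892, 1458)]
```

The 567-position segment at 892..1458 accounts for the entire difference between the two transcripts (1687 - 1120 = 567).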
Table 1.1 Indicating locations within the feature table.
Location | Description
345 | Single position within the sequence
345..500 | A continuous range of positions bounded by and including the indicated positions
<345..500 | A continuous range of positions, where the exact lower boundary is not known; the feature begins somewhere prior to position 345 but ends at position 500
345..>500 | A continuous range of positions, where the exact upper boundary is not known; the feature begins at position 345 but ends somewhere after position 500
<1..888 | The feature starts before the first sequenced base and continues to position 888
(102.110) | Indicates that the exact location is unknown, but that it is one of the positions between 102 and 110, inclusive
123^124 | Points to a site between positions 123 and 124
123^177 | Points to a site between two adjacent nucleotides or amino acids anywhere between positions 123 and 177
join(12..78,134..202) | Regions 12–78 and 134–202 are joined to form one contiguous sequence
complement(4918..5126) | The sequence complementary to that found from 4918 to 5126 in the sequence record
J00194:100..202
|