Domain-Sensitive Temporal Tagging. Jannik Strötgen

foundations to fully understand the discipline of temporal tagging and the challenges that approaches to temporal tagging are faced with. For this, we survey annotation standards, evaluation methods, research competitions, and temporally annotated corpora.

      As introduced in the previous chapter, there are different types of temporal expressions: date, time, duration, and set expressions. In addition, temporal expressions can carry their meaning explicitly or implicitly, or they can be underspecified or relative to some context information. For the task of temporal tagging to be well defined, three points must be specified: (i) what types of temporal expressions are “markable” [Ferro et al., 2005b] and should thus be annotated; (ii) what extents should be annotated; and (iii) how the semantics of the expressions can be captured by normalization attributes whose values follow a standard format. Thus, annotation standards with precise specifications are a prerequisite when dealing with the task of temporal tagging.

      Currently, there are two widely used annotation standards for annotating temporal expressions in documents: TIDES TIMEX2 [Ferro et al., 2001, 2005b] and TimeML [Pustejovsky et al., 2003a, 2005, 2010], a specification language for temporal annotation using TIMEX3 tags for temporal expressions. Both standards present guidelines for the annotation of temporal expressions, including how to determine the extents of expressions and their normalizations. In both cases, the normalization is defined according to the ISO 8601 standard for temporal information with some extensions. For instance, a date expression of granularity day is normalized in the format YYYY-MM-DD. Since all widely used annotated corpora (cf. Section 3.4) as well as all state-of-the-art systems (cf. Chapter 5) are based on either one of the two above-mentioned standards, we describe the details of both of them in the following.

       TIDES TIMEX2

      While there have been several TIMEX definitions, ranging from extent-only coverage [see, e.g., Chinchor, 1998] to the inclusion of some normalization information [see, e.g., Mani and Wilson, 2000a, Setzer and Gaizauskas, 2000], the TIDES TIMEX2 definitions were the first annotation guidelines that were defined in sufficient detail to become broadly accepted as a standard. The annotation guidelines are based on the principles that temporal expressions should be tagged “if a human can determine a value for [it]” and that the value “must be based on evidence internal to the document” [Ferro et al., 2001]. By covering extent and normalization information, they address both questions: “What is a temporal expression?” and “What is the meaning of a temporal expression?” For the normalization, TIMEX2 tags may contain the following attributes [Ferro et al., 2005b]:

      • VAL: a normalized form of the date/time [or duration/set];

      • MOD: captures temporal modifiers;

      • ANCHOR_VAL: a normalized form of an anchoring date/time [of a duration];

      • ANCHOR_DIR: the relative direction between VAL and ANCHOR_VAL; and

      • SET: identifies expressions denoting sets of times.

      Apart from the SET attribute, TIMEX2 has no dedicated attribute for the type of a temporal expression. Nevertheless, because the VAL attribute reveals whether an expression is a time, a date, or a duration, and the SET attribute identifies set expressions, the classification of temporal expressions into the four types is implicitly covered by TIMEX2 annotations. However, it is rather difficult to use TIMEX2 annotations if only the extraction and classification of temporal expressions is targeted, without their full normalization.
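
      The implicit classification described above can be sketched in a few lines. This is only a rough illustration, not part of the TIMEX2 guidelines: the function name `type_from_val`, the boolean `is_set` flag standing in for the SET attribute, and the three regular expressions are assumptions, and the patterns cover only simple values such as 2009-09-13, 2014-10-12T07:00, P3Y, and PT5H.

```python
import re

# Hypothetical patterns for simple, fully specified values only.
DATE_RE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")
TIME_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}(:\d{2}(:\d{2})?)?$")
DURATION_RE = re.compile(r"^PT?\d+[A-Z]$")


def type_from_val(val: str, is_set: bool = False) -> str:
    """Guess the expression type from a TIMEX2-style VAL string.

    `is_set` stands in for the SET attribute; everything else is
    inferred from the shape of the value itself.
    """
    if is_set:
        return "set"
    if DURATION_RE.match(val):
        return "duration"
    if TIME_RE.match(val):
        return "time"
    if DATE_RE.match(val):
        return "date"
    # Values outside the simple patterns are not classified by this sketch.
    return "unknown"


for val in ["2009-09-13", "2014-10-12T07:00", "P3Y", "PT5H"]:
    print(val, "->", type_from_val(val))
```

      The sketch also shows why classification without normalization is awkward in TIMEX2: the type becomes available only once a VAL string has been produced.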

       TIMEML WITH TIMEX3 TAGS FOR TEMPORAL EXPRESSIONS

      TimeML, which has more recently been formalized to create the ISO standard ISO-TimeML [Pustejovsky et al., 2010], is based on the TIDES standard and was developed to capture further types of temporal information in documents. In contrast to TIDES, which has only one tag for temporal expressions, TimeML contains tags for annotating events, temporal links (i.e., temporal relations), and temporal signals in addition to the TIMEX3 tag for temporal expressions [Pustejovsky et al., 2003a, 2005, 2010]. In the following, we focus on a description of TimeML aspects that are relevant for the task of temporal tagging.

      Because TimeML focuses on temporal information in general and not only on temporal expressions, there are significant differences between TIMEX2 and TIMEX3. These differences concern both the attributes and the extents of temporal expressions. For example, events can be part of temporal expressions in TIMEX2 (<TIMEX2>two days after the revolution</TIMEX2>), while they are not part of temporal expressions following TimeML (<TIMEX3>two days</TIMEX3> after the revolution).

      In particular, specific types of pre- and post-modifiers of temporal expressions are part of TIMEX2 tags, while in TimeML they lie outside TIMEX3 tags [Mazur, 2012]. Such constructs are handled using the newly introduced tags for annotating relations between temporal expressions and events. In addition, TIMEX3 tags cannot be nested. However, TimeML introduces TIMEX3 tags with no extent, for example, to deal with unspecified time points, which are sometimes needed to anchor durations. Although such abstract tags, that is, annotations without any extent, are described in the TimeML annotation guidelines, they were used neither in annotated corpora nor by TIMEX3-compliant temporal taggers [cf. Mazur, 2012] until the Italian temporal tagging challenge EVENTI in 2014 [Caselli et al., 2014]. Abstract tags were also annotated in the MEANTIME corpus [Minard et al., 2016], which was released in 2016 and developed in the context of the NewsReader project. Before EVENTI, empty TIMEX3 tags were mostly ignored.

      To describe the semantics of temporal expressions, the most important attributes of TIMEX3 tags are:

      • type: defines whether the expression is of type date, time, duration, or set;

      • value: a normalized form of the expression;

      • mod: captures temporal modifiers;

      • quant and freq: specify the quantity and frequency of set expressions;

      • beginpoint and endpoint: anchor begin and end of a duration; and

      • tid: automatically assigned id number.

      While the attribute type—with possible values “date”, “time”, “duration”, and “set”—is newly introduced in TIMEX3, the attributes value and mod are similar to the VAL and MOD attributes in TIMEX2. These two attributes already capture a large part of the information of temporal expressions, and for many expressions—in particular for many date and time expressions—the value attribute is the only attribute besides type that is needed for normalization. This is also the reason why in several evaluations of temporal taggers, the value attribute is the focus of interest [see, e.g., UzZaman et al., 2013].
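
      To make the attribute list concrete, the following sketch serializes a single TIMEX3 annotation with Python's standard `xml.etree.ElementTree` module. The helper name `make_timex3` and the id `t1` are assumptions chosen for illustration. The type is written in upper case (DATE), as is usual in TimeML-annotated corpora.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def make_timex3(text: str, tid: str, type_: str, value: str,
                mod: Optional[str] = None) -> str:
    """Serialize one temporal expression as a TIMEX3 element (illustrative helper)."""
    elem = ET.Element("TIMEX3", {"tid": tid, "type": type_, "value": value})
    if mod is not None:
        # mod is optional; it is only set when a temporal modifier is present.
        elem.set("mod", mod)
    elem.text = text  # the extent of the expression
    return ET.tostring(elem, encoding="unicode")


print(make_timex3("September 13, 2009", "t1", "DATE", "2009-09-13"))
# <TIMEX3 tid="t1" type="DATE" value="2009-09-13">September 13, 2009</TIMEX3>
```

      For this date expression, type and value are the only semantic attributes required, which matches the observation that value is often the sole attribute needed besides type.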

      For explicit date and time expressions in particular, forming the value attribute (or the VAL attribute in TIMEX2) is straightforward. For example, the values of the expressions “September 13, 2009” and “Oct 12, 2014 7:00 am” are 2009-09-13 and 2014-10-12T07:00, respectively. For underspecified and relative date and time expressions, setting the value attribute is more challenging, because the information covered by their own extents is not sufficient. Instead, a reference time has to be used along with a temporal function to calculate the content of the value attribute. For instance, in a document published on November 27, 2014 (2014-11-27), the expression “yesterday” can be normalized to 2014-11-26.
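
      The difference between the two cases can be illustrated with a minimal sketch based on Python's standard `datetime` module. It is not the method of any particular temporal tagger: the function names, the fixed input formats, and the small `RELATIVE_OFFSETS` table are assumptions made for this example, and parsing the month names relies on an English (or C) locale.

```python
from datetime import date, datetime, timedelta


def normalize_explicit_date(text: str) -> str:
    """Normalize an explicit date such as 'September 13, 2009' to YYYY-MM-DD."""
    return datetime.strptime(text, "%B %d, %Y").date().isoformat()


def normalize_explicit_time(text: str) -> str:
    """Normalize an explicit time such as 'Oct 12, 2014 7:00 am' to YYYY-MM-DDThh:mm."""
    parsed = datetime.strptime(text, "%b %d, %Y %I:%M %p")
    return parsed.strftime("%Y-%m-%dT%H:%M")


# Hypothetical table of temporal functions: offset in days from the reference time.
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def normalize_relative_date(text: str, reference: date) -> str:
    """Normalize a relative expression using the document's reference time."""
    offset = RELATIVE_OFFSETS[text.lower()]
    return (reference + timedelta(days=offset)).isoformat()


print(normalize_explicit_date("September 13, 2009"))
print(normalize_explicit_time("Oct 12, 2014 7:00 am"))
print(normalize_relative_date("yesterday", date(2014, 11, 27)))
```

      The explicit functions need only the expression itself, whereas `normalize_relative_date` cannot produce a value without the reference time passed in as its second argument.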

      Value attributes in TIMEX3 (as VAL attributes in TIMEX2) assigned to duration expressions start with “P” (period), followed by an amount and an abbreviated unit, e.g., the value of “three years” is P3Y. If the unit of the duration is smaller than a day, the value attribute starts with “PT” (period, time), e.g., PT5H for the expression “five hours”. Thus, the value attribute of durations represents the length

