Natural Language Processing for the Semantic Web. Diana Maynard
Morphological analysis essentially concerns the identification and classification of the linguistic units of a word, typically breaking the word down into its root form and an affix. For example, the verb walked comprises a root form walk and an affix -ed. In English, morphological analysis is typically applied to verbs and nouns, because these may appear in the text as variants created by inflectional morphology. Inflectional morphology refers to the different forms of words reflected by mood, tense, number, and so on, such as the past tense of a verb or the plural of a noun. Inflection in English is typically expressed by adding a suffix to the root form (e.g., walk, walked, box, boxes) or another internal modification such as a vowel change (e.g., run, ran, goose, geese). In other languages, prefixes (adding to the beginning of a word), infixes (adding in the middle of a word), and other changes may be used. Some morphological analysis tools represent these internal modifications as an alternative representation of the default affix. What we mean by this is that if the plural of a noun is commonly represented by adding -s as a suffix, the output of the tool will show the value of the affix as -s even in the case of plural forms such as geese. Essentially, it treats an irregular vowel change form simply as a kind of surface representational variant of the standard affix. The GATE morphological analyzer, for example, depicts the word geese as having the root goose and affix -s.
Typically, NLP tools which perform morphological analysis deal only with inflectional morphology, as described above, but do not handle derivational morphology. Derivation is the process of adding derivational morphemes, which create a new word from existing words, usually involving a change in grammatical category (for example, creating the noun worker from the verb work, or the noun loudness from the adjective loud).
Morphological analyzers for English are often rule-based, since the majority of inflectional variants follow grammatical rules and set patterns (for example, plural nouns are typically created by adding -s or -es to the end of the singular noun). Exceptions can also be handled quite easily by rules, and unknown words are assumed to follow default rules. The English morphological analyzer in GATE is rule-based, with the rule language (flex) supporting rules and variables that can be used in regular expressions in the rules. POS tags can be taken into account if desired, depending on a configuration parameter. The analyzer takes as input a tokenized document and, considering one token and its POS tag at a time, identifies its lemma and affix. These values are then added as features of the token.
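The rule-based approach described above can be illustrated with a minimal sketch. It is not GATE's analyzer or its rule language: the function name analyze, the rule tables, and the use of Penn Treebank style tags (NNS, VBD, and so on) are all assumptions made for the example. It shows the three ingredients mentioned in the text: an exception list, default suffix rules selected by POS tag, and a fallback for words that match no rule.

```python
# Toy rule-based morphological analyzer returning (lemma, affix) for a token.
# All tables here are illustrative assumptions, not GATE's actual rules.

# Irregular forms are listed explicitly and mapped to the default affix,
# so "geese" is reported as goose + "s" rather than as a vowel change.
EXCEPTIONS = {
    ("geese", "NNS"): ("goose", "s"),
    ("mice", "NNS"): ("mouse", "s"),
}

# Default rules per POS tag: (suffix to strip, replacement, affix reported).
# More specific suffixes come first so that they win over shorter ones.
RULES = {
    "NNS": [("ies", "y", "s"), ("ches", "ch", "es"), ("xes", "x", "es"), ("s", "", "s")],
    "VBD": [("ied", "y", "ed"), ("ed", "", "ed")],
    "VBG": [("ing", "", "ing")],
    "VBZ": [("ies", "y", "s"), ("s", "", "s")],
}


def analyze(token, pos):
    """Return (lemma, affix); the affix is "" when no rule applies."""
    word = token.lower()
    # 1. Exceptions take priority over the default rules.
    if (word, pos) in EXCEPTIONS:
        return EXCEPTIONS[(word, pos)]
    # 2. Apply the first matching default rule for this POS tag.
    for suffix, replacement, affix in RULES.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement, affix
    # 3. No rule matched: the word is its own lemma, with a zero affix.
    return word, ""


if __name__ == "__main__":
    for token, pos in [("walked", "VBD"), ("boxes", "NNS"), ("geese", "NNS")]:
        print(token, pos, analyze(token, pos))
```

Because the rules are selected by POS tag, a noun such as loudness with the tag NN matches no inflectional rule and is returned unchanged with an empty affix.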
The Stanford Morphology tool also uses a rule-based approach, is based on a finite-state transducer, and is written in flex. Unlike the GATE tool, however, it requires the use of POS tags as well as tokens, and generates lemmas but not affixes.
NLTK provides an implementation of morphological analysis based on WordNet’s built-in morphy function. WordNet [16] is a large lexical database of English resembling a thesaurus, where nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The morphy function is designed to allow users to query an inflectional form against a base form listed in WordNet. It uses a rule-based method involving lists of inflectional endings, based on syntactic category, and an exception list for each syntactic category, in which a search for an inflected form is done. Like the Stanford tool, it returns only the lemma but not the affix. Furthermore, it can only handle words present in WordNet.
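The lookup strategy described for morphy can be sketched in a few lines. This is not NLTK's or WordNet's code: the function name lookup_lemma, the tiny lexicon, and the abbreviated ending and exception lists are stand-ins assumed for illustration. The sketch shows why such a method returns only a lemma, and why it fails on words that are absent from the lexicon.

```python
# Sketch of a morphy-style lemma lookup. The lexicon and lists below are tiny
# hypothetical stand-ins for WordNet's data, keyed by syntactic category.

LEXICON = {
    "n": {"box", "goose", "city", "church", "driver"},
    "v": {"walk", "drive", "run", "carry"},
}

# Irregular forms, one exception list per syntactic category.
EXCEPTIONS = {
    "n": {"geese": "goose"},
    "v": {"ran": "run", "drove": "drive"},
}

# Inflectional endings and their replacements, per syntactic category.
ENDINGS = {
    "n": [("ches", "ch"), ("xes", "x"), ("ies", "y"), ("s", "")],
    "v": [("ies", "y"), ("ing", "e"), ("ing", ""), ("ed", "e"),
          ("ed", ""), ("es", "e"), ("es", ""), ("s", "")],
}


def lookup_lemma(word, pos):
    """Return the base form listed in the lexicon, or None if none is found."""
    word = word.lower()
    # Search the exception list for this category first.
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    # The word may already be a base form.
    if word in LEXICON[pos]:
        return word
    # Otherwise detach each ending in turn and keep the first candidate
    # that is actually present in the lexicon.
    for ending, replacement in ENDINGS[pos]:
        if word.endswith(ending):
            candidate = word[: -len(ending)] + replacement
            if candidate in LEXICON[pos]:
                return candidate
    return None


if __name__ == "__main__":
    for word, pos in [("driving", "v"), ("geese", "n"), ("blorped", "v")]:
        print(word, pos, lookup_lemma(word, pos))
```

Checking every candidate against the lexicon is what keeps the output a real word, and it is also the reason an unknown word such as the invented blorped yields no result.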
OpenNLP does not currently provide any tools for morphological analysis.
2.7.1 STEMMING
Stemmers produce the stem form of each word, e.g., driving and drivers have the stem drive, whereas morphological analysis tends to produce the root/lemma forms of the words and their affixes, e.g., drive and driver for the above examples, with affixes -ing and -s respectively. There is much confusion about the difference between stemming and morphological analysis, due to the fact that stemmers can vary considerably in how they operate and in their output. In general, stemmers do not attempt to perform an analysis of the root or stem and its affix, but simply strip the word down to its stem. The main way in which stemmers themselves vary is in the presence or absence of the constraint that the stem must also be a real word in the given language. Basic stemming algorithms simply strip off the affix, e.g., driving would be stripped to the stem driv- by removing the suffix -ing. The distinction between verbs and nouns is often not maintained, so both driver and driving would be stripped down to the stem driv-. Information retrieval (IR) systems often make use of this kind of suffix stripping, since it can be performed by a simple algorithm and does not require other linguistic pre-processing such as POS tagging. Stemming is useful for IR systems because it brings together lexico-syntactic variants of a word which have a common meaning (so one can use either the singular or plural form of a word in the search query, and it will match against either form in a web page). Note that unlike most morphological analysis tools, stemming tools may also consider variants arising from derivational morphology, since they ignore the syntactic category of the word. A further difference is that typically, stemmers do not refer to the context surrounding the word, but only to the word in isolation, while morphological analyzers may also use the context.
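Basic suffix stripping of the kind just described can be sketched as follows. The suffix list, the minimum stem length, and the function name strip_suffix are assumptions made for the example; this is not the Porter algorithm or any published stemmer. The function looks only at the word in isolation, uses no POS tag, and makes no check that the result is a real word.

```python
# Naive suffix stripping: remove the longest matching suffix from a word.
# The suffix list is an illustrative assumption, not a published algorithm.
SUFFIXES = ["ness", "ing", "ers", "er", "ed", "es", "s"]


def strip_suffix(word, min_stem_len=3):
    """Strip the longest listed suffix, leaving at least min_stem_len letters."""
    word = word.lower()
    # Try longer suffixes first so that "ers" is preferred over "s".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for word in ["driving", "driver", "drivers", "loudness"]:
        print(word, "->", strip_suffix(word))
```

The noun driver and the verb form driving both reduce to driv, and the derivational suffix -ness is removed just like an inflectional one, since the syntactic category of the word is never consulted.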
Figure 2.5 shows an example of how stemming and morphological analysis may differ. The stemmer in GATE strips off the derivational affix -ness, reducing the noun loudness to the base adjective loud, as shown by the stem feature. The morphological analyzer, on the other hand, is not concerned with derivational morphology, so it leaves the word in its entirety, as shown by the root feature loudness, and produces a zero affix.
Figure 2.5: Comparison of stemming and morphological analysis in GATE.
Suffix-stripping algorithms may differ in their results for a variety of reasons. One such reason is whether the algorithm requires the output to be a real word in the given language. Some approaches do not require the stem to actually exist in the language lexicon (the set of all words in the language).
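The effect of the real-word constraint can be seen in a small sketch. The lexicon, the suffix list, the function name stem, and the strategy of restoring a final e are all assumptions for illustration, not the behaviour of any particular stemmer.

```python
# Toy lexicon standing in for "the set of all words in the language".
LEXICON = {"drive", "driver", "walk", "loud", "box"}

# Hypothetical suffix list, ordered longest first.
SUFFIXES = ["ness", "ing", "ers", "er", "ed", "es", "s"]


def stem(word, require_real_word=False):
    """Strip a suffix; optionally accept only stems found in LEXICON."""
    word = word.lower()
    for suffix in SUFFIXES:
        if not word.endswith(suffix) or len(word) <= len(suffix):
            continue
        bare = word[: -len(suffix)]
        if not require_real_word:
            # Unconstrained variant: return whatever is left.
            return bare
        # Constrained variant: accept the bare stem, or the stem with a
        # restored final "e", but only if the result is in the lexicon.
        for candidate in (bare, bare + "e"):
            if candidate in LEXICON:
                return candidate
    # No suffix matched, or no real-word stem was found.
    return word


if __name__ == "__main__":
    print(stem("driving"))                          # unconstrained
    print(stem("driving", require_real_word=True))  # constrained
```

With the constraint switched off, driving is reduced to driv; with it switched on, the same word is mapped to drive, and a word with no stem in the lexicon is returned unchanged.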
The best-known stemming algorithm is the Porter Stemmer [17], which has been re-implemented in many forms. Because of the problems caused by the many different variants being created, Porter later invented the Snowball language, a small string processing language designed specifically for creating stemming algorithms for use in Information Retrieval. A variety of useful open-source stemmers for many languages have since been created in Snowball. GATE provides a wrapper for a number of these, covering 11 European languages, while NLTK provides an implementation of them for Python. Because the stemmers are rule-based and easy to modify, following Porter’s original approach, they are very straightforward to combine with the other low-level linguistic components described previously in this chapter. OpenNLP and Stanford CoreNLP do not provide any stemmers.
2.8 SYNTACTIC PARSING
Syntactic parsing is concerned with analysing sentences to derive their syntactic structure according to a grammar. Essentially, parsing explains how different elements in a sentence are related to each other, such as how the subject and object of a verb are connected. There are many different syntactic theories in computational linguistics, which posit different kinds of syntactic structures. Parsing tools may therefore vary widely not only in performance but in the kind of representation they generate, based on the syntactic theory they make use of.
Freely available wide-coverage parsers include the Minipar dependency parser, the RASP [18] statistical parser, the Stanford [19] statistical parser, and the general-purpose SUPPLE parser [20]. These are all available within GATE, so that the user can try them all and decide which is the most appropriate for their needs.
Minipar is