Natural Language Processing for the Semantic Web. Diana Maynard
Читать онлайн книгу.methods and tools that fit most appropriately for a particular scenario. Ultimately, the reader should come away armed with the knowledge to understand the main principles and, if necessary, to choose suitable NLP technologies that can be used to enhance their Semantic Web applications.
The overall structure of the book is as follows. We first describe some of the core low-level components, in particular those which are commonly found in open source NLP toolkits and used widely in the community. We then show how these tools can be combined and used as input for the higher-level tasks such as Information Extraction, semantic annotation, social media analysis, and opinion mining, and finally how applications such as semantically enhanced information retrieval and visualization, and the modeling of online communities, can be built on top of these.
One point we should make clear is that when we talk about NLP in this book, we are referring principally to the subtask of Natural Language Understanding (NLU) and not to the related subtask of Natural Language Generation (NLG). While NLG is a useful task which is also relevant to the Semantic Web, for example in relaying the results of some application back to the user in a way that they can easily understand, and particularly in systems that require voice output of results, it goes outside the remit of this book, as it employs some very different techniques and tools. Similarly, there are a number of other tasks which typically fall under the category of NLP but are not discussed here, in particular those concerned with speech rather than written text. However, many applications for both speech processing and natural language generation make use of the low-level NLP tasks we describe. There are also some high-level NLP-based applications that we do not cover in this book, such as Summarization and Question Answering, although again these make use of the same low-level tools.
Most early NLP tools such as parsers (e.g., Schank’s conceptual dependency parser [1]) were rule-based, due partly to the predominance of certain linguistic theories (primarily those of Noam Chomsky [2]), but also due to the lack of computational power which made machine learning methods infeasible. In the 1980s, machine learning systems started coming to the fore, but were still mainly used just to automatically create sets of rules similar to existing manually developed rule systems, using techniques such as decision trees. As statistical models became more popular, particularly in fields such as Machine Translation and Part-of-Speech tagging, where hard rule-based systems were often not sufficient to resolve ambiguities, Hidden Markov Models (HMMs) became popular, introducing the idea of weighted features and probablistic decision-making. In the last few years, deep learning and neural networks have also become very popular, following their spectacular success in the field of image recognition and computer vision (for example in the technology behind self-driving cars), although their success for NLP tasks is currently nowhere near as dramatic. Deep learning is essentially a branch of Machine Learning that uses multiple hierarchical levels of features that are learned in an unsupervised fashion. This makes it very suitable for working with big data, because it is fast and efficient, and does not require the manual creation of training data, unlike supervised machine learning systems. However, as will be demonstrated throughout the course of this book, one of the problems of NLP is that tools almost always need adapting to specific domains and tasks, and for real-world applications this is often easier with rule-based systems. In most cases, combinations of different methods are used, depending on the task.
1.1 INFORMATION EXTRACTION
Information extraction is the process of extracting information and turning it into structured data. This may include populating a structured knowledge source with information from an unstructured knowledge source [3]. The information contained in the structured knowledge base can then be used as a resource for other tasks, such as answering natural language queries or improving on standard search engines with deeper or more implicit forms of knowledge than that expressed in the text. By unstructured knowledge sources, we mean free text, such as that found in newspaper articles, blog posts, social media, and other web pages, rather than tables, databases, and ontologies, which constitute structured text. Unless otherwise specified, we use the word text in the rest of this book to mean unstructured text.
When considering information contained in text, there are several types of information that can be of interest. Often regarded as the key components of text are proper names, also called named entities (NEs), such as persons, locations, and organizations. Along with proper names, temporal expressions, such as dates and times, are also often considered as named entities. Figure 1.1 shows some simple Named Entities in a sentence. Named entities are connected together by means of relations. Furthermore, there can be relations between relations, for example the relation denoting that someone is CEO of a company is connected to the relation that someone is an employee of a company, by means of a sub-property relation, since a CEO is a kind of employee. A more complex type of information is the event, which can be seen as a group of relations grounded in time. Events usually have participants, a start and an end date, and a location, though some of this information may be only implicit. An example for this is the opening of a restaurant. Figure 1.2 shows how entities are connected to form relations, which form events when grounded in time.
Figure 1.1: Examples of named entities.
Figure 1.2: Examples of relations and events.
Information extraction is difficult because there are many ways of expressing the same facts:
• BNC Holdings Inc. named Ms. G. Torretta as its new chairman.
• Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.
• Ms. Gina Torretta took the helm at BNC Holdings Inc.
Furthermore, information may need to be combined across several sentences, which may additionally not be consecutive.
• After a long boardroom struggle, Mr. Andrews stepped down as chairman of BNC Holdings Inc. He was succeeded by Ms. Torretta.
Information extraction typically consists of a sequence of tasks, comprising:
1. linguistic pre-processing (described in Chapter 2);
2. named entity recognition (described in Chapter 3);
3. relation and/or event extraction (described in Chapter 4).
Named entity recognition (NER) is the task of recognizing that a word or a sequence of words is a proper name. It is often solved jointly with the task of assigning types to named entities, such as Person, Location, or Organization, which is known as named entity classification (NEC). If the tasks are performed at the same time, this is referred to as NERC. NERC can either be an annotation task, i.e., to annotate a text with NEs, or the task can be to populate a knowledge base with these NEs. When the named entities are not simply a flat structure, but linked to a corresponding entity in an ontology, this is known as semantic annotation or named entity linking (NEL). Semantic annotation is much more powerful than flat NE recognition, because it enables inferences and generalizations to be made, as the linking of information provides access to knowledge not explicit in the text. When semantic annotation is part of the process, the information extraction task is often referred to as Ontology-Based Information Extraction (OBIE) or Ontology Guided Information Extraction (see Chapter 5). Closely associated with this is the process of ontology learning and population (OLP) as described in Chapter 6. Information extraction tasks are also a pre-requisite