Statistical Significance Testing for Natural Language Processing. Rotem Dror

tests, our discussion of these tasks is quite shallow. Furthermore, when we analyze experimental results with NLP tasks in Chapters 5 and 6, we do not provide the details of the tasks because we assume the reader is familiar with the basic tasks of NLP. Despite these assumptions about the reader’s background, we try to be as self-contained as possible when it comes to statistical hypothesis testing and the derived concepts and methodology, as presenting these ideas to the NLP audience is a core objective of this book.

      Further Reading. For broader and more in-depth reading on the fundamental concepts of statistics, we refer the reader to existing resources such as Montgomery and Runger [2007] (which provides an engineering perspective) and Johnson and Bhattacharyya [2019]. For further reading on the topic of multiple comparisons in statistics, we recommend the book by Bretz et al. [2016], which demonstrates the basic concepts and provides examples with R code.

      This book evolved from a series of conference and journal papers (Dror et al. [2017], Dror et al. [2018], Dror et al. [2019]), which have been greatly expanded to form this book. First, we added background chapters that discuss the foundations of statistical hypothesis testing and provide the details of the statistical significance tests that we find most relevant for NLP. Then, we took the handbook approach and provided the pseudocode of the various methods discussed throughout the book, along with concrete recommendations and guidelines; our goal is to allow the practitioner to directly and easily implement the methods described in this book. Finally, in Chapter 7, we critically discuss the ideas presented in this book and point to challenges that are yet to be addressed in order to perform statistically sound analysis of NLP experimental results.

       FOCUS OF THIS BOOK

      This book is intended to be self-contained, presenting the framework of statistical hypothesis testing and its derived concepts and methodology in the context of NLP research. However, the main focus of the book is on this statistical framework and its application to the analysis of NLP experimental results, rather than on providing in-depth coverage of the NLP field.

      Most of the book takes the handbook approach and aims to provide concrete solutions to practical problems. As such, it does not provide in-depth technical coverage of statistical hypothesis testing to a level that will allow the reader to propose alternative solutions to those proposed here, or to solve some of the open challenges we point to. Yet, our hope is that highlighting the challenges of statistically sound evaluation of NLP experiments, both those that already have decent solutions and those that are still open, will attract the attention of the community to these issues and facilitate future development of additional methods and techniques.

      Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart

      April 2020

      This book is an outcome of three years of exploration. The journey started with a course by Dr. Marina Bogomolov on multiple hypothesis testing, which was given in the fall of 2017 at the Faculty of Industrial Engineering and Management (IE&M) of the Technion. Marina and Gili Baumer, her M.Sc. student and the tutor of the course at the time, were instrumental in the research that resulted in Chapter 6 of this book.

      Many people commented on the ideas we discuss in the book, read drafts of the papers that were eventually extended into this book as well as versions of the book itself, and provided valuable feedback. Among these are David Azriel, Eustasio Del Barrio, Yuval Pinter, David Traum (who, as the program chair of ACL 2019, made a substantial contribution to the shaping of our ideas in Chapter 5), Or Zuk, and the members of the Natural Language Processing Group of the IE&M Faculty of the Technion: Reut Apel, Chen Badler, Eyal Ben David, Amichay Doitch, Ofer Givoli, Amir Feder, Ira Leviant, Rivka (Riki) Malka, Nadav Oved, Guy Rotman, Ram Yasdi, Yftah Ziser, and Dor Zohar.

      The anonymous reviewers of the book and the original papers provided detailed comments on various aspects of this work, from minor technical details to valuable suggestions on the structure, which dramatically improved its quality. Graeme Hirst, Michael Morgan, and Christine Kiilerich orchestrated the book-writing effort and provided valuable guidance.

      Finally, we would like to thank the Technion Graduate School for its generous support. Rotem Dror has also been supported by a generous Google Ph.D. fellowship.

      Needless to say, all the mistakes and shortcomings of the book are ours. Please let us know if you find any.

      Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart

      April 2020

      Introduction

      The field of Natural Language Processing (NLP) has made substantial progress in the last two decades. This progress has several sources: the data revolution that has made abundant textual data from a variety of languages and linguistic domains available, the development of increasingly effective predictive statistical models, and the availability of hardware that can apply these models to large datasets. This dramatic improvement in the capabilities of NLP algorithms carries the potential for a great impact.

      The extended reach of NLP algorithms has also resulted in NLP papers giving more and more emphasis to the experiment and result sections by showing comparisons between multiple algorithms on various datasets from different languages and domains. It can be safely argued that the ultimate test for the quality of an NLP algorithm is its performance on well-accepted datasets, sometimes referred to as “leader-boards”. This emphasis on empirical results highlights the role of statistical significance testing in NLP research: if we rely on empirical evaluation to validate our hypotheses and reveal the correct language processing mechanisms, we had better be sure that our results are not coincidental.

      The goal of this book is to discuss the main aspects of statistical significance testing in NLP. Particularly, we aim to briefly summarize the main concepts so that they are readily available to the interested researcher, address the key challenges of hypothesis testing in the context of NLP tasks and data, and discuss open issues and the main directions for future work.

      We start with two introductory chapters that present the basic concepts of statistical significance testing: Chapter 2 provides a brief presentation of the hypothesis testing framework and Chapter 3 introduces common statistical significance tests. Then, Chapter 4 discusses the application of statistical significance testing to NLP. In this chapter we assume that two algorithms are compared on a single dataset, based on a single output that each of them produces, and discuss the relevant significance tests for various NLP tasks and evaluation measures. The chapter emphasizes the aspects in which NLP tasks and data differ from common examples in the statistical literature, such as the non-Gaussian distribution of the data and the dependence between the participating examples (e.g., sentences in the same corpus). This chapter, which extends our ACL 2018 paper [Dror et al., 2018], provides our recommended matching of NLP tasks and their evaluation measures to statistical significance tests.
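
      To make the setting of Chapter 4 concrete, the following is a minimal sketch, not taken from the book, of one test that does not assume Gaussian data: a one-sided paired permutation (sign-flip) test for two algorithms evaluated on the same test examples. The function name, its parameters, and the use of per-example scores are illustrative assumptions. The test treats the per-example differences as exchangeable under the null hypothesis, which may not hold when the examples are dependent.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate a one-sided p-value for "A is better than B on average".

    scores_a, scores_b: per-example scores of the two algorithms on the
    same test examples (hypothetical inputs, e.g. 0/1 correctness).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired, one per test example")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the labels A and B are interchangeable
        # for each example, so every difference may flip its sign.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if resampled >= observed:
            hits += 1

    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Toy, made-up scores for illustration only.
    system_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
    system_b = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
    print(paired_permutation_test(system_a, system_b))
```

      A small p-value suggests that the observed advantage of the first algorithm is unlikely to arise from a random relabeling of the two outputs. Which test is appropriate for a given task and evaluation measure is the question Chapter 4 addresses.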

      The next two chapters relax two of the basic assumptions of Chapter 4: (a) that each of the compared algorithms produces a single output for each test example (e.g., a single parse tree for a given input sentence); and (b) that the comparison between the two algorithms is performed on a single dataset. In particular, Chapter 5 addresses the comparison between two algorithms based on multiple solutions that each of them produces for a single dataset, while Chapter 6 addresses the comparison between two algorithms across several datasets.

      The first challenge stems from the recent emergence of

