2009 IntroToLinguisticAnnotation
- (Wilcock, 2009) ⇒ Graham Wilcock. (2009). “Introduction to Linguistic Annotation and Text Analytics.” In: Synthesis Lectures on Human Language Technologies. Morgan & Claypool. doi:10.2200/S00194ED1V01Y200905HLT003 ISBN:1598297384
Subject Headings: Linguistic Annotation Task.
Notes
- It is a Book on Linguistic Annotation with a focus on encoding with XML.
- It describes the use of: WordFreak, OpenNLP, Stanford NLP Tools, UIMA, GATE.
- Companion website is at http://sites.morganclaypool.com/wilcock.
Cited By
Quotes
Author Keywords
- Linguistic Annotation, Statistical Natural Language Processing, Part-of-Speech Tagging, Named Entity Recognition, Information Extraction, Text Analytics.
Abstract
Linguistic annotation and text analytics are active areas of research and development, with academic conferences and industry events such as the Linguistic Annotation Workshops and the annual Text Analytics Summits. This book provides a basic introduction to both fields, and aims to show that good linguistic annotations are the essential foundation for good text analytics.
After briefly reviewing the basics of XML, with practical exercises illustrating in-line and stand-off annotations, a chapter is devoted to explaining the different levels of linguistic annotations. The reader is encouraged to create example annotations using the WordFreak linguistic annotation tool. The next chapter shows how annotations can be created automatically using statistical NLP tools, and compares two sets of tools, the OpenNLP and Stanford NLP tools.
The second half of the book describes different annotation formats and gives practical examples of how to interchange annotations between different formats using XSLT transformations. The two main text analytics architectures, GATE and UIMA, are then described and compared, with practical exercises showing how to configure and customize them. The final chapter is an introduction to text analytics, describing the main applications and functions including named entity recognition, coreference resolution and information extraction, with practical examples using both open source and commercial tools. Copies of the example files, scripts, and stylesheets used in the book are available from the companion website, located at http://sites.morganclaypool.com/wilcock.
Table of Contents
- 1 Working with XML
- 1.1 Introduction
- 1.2 XML Basics
- 1.3 XML Parsing and Validation
- 1.4 XML Transformations
- 1.5 In-Line Annotations
- 1.6 Stand-Off Annotations
- 1.7 Annotation Standards
- 1.8 Further Reading
- 2 Linguistic Annotation
- 2.1 Levels of Linguistic Annotation
- 2.2 WordFreak Annotation Tool
- 2.3 Sentence Boundaries
- 2.4 Tokenization
- 2.5 Part-of-Speech Tagging
- 2.6 Syntactic Parsing
- 2.7 Semantics and Discourse
- 2.8 WordFreak with OpenNLP
- 2.9 Further Reading
- 3 Using Statistical NLP Tools
- 3.1 Statistical Models
- 3.2 OpenNLP and Stanford NLP Tools
- 3.3 Sentences and Tokenization
- 3.4 Statistical Tagging
- 3.5 Chunking and Parsing
- 3.6 Named Entity Recognition
- 3.7 Coreference Resolution
- 3.8 Further Reading
- 4 Annotation Interchange
- 4.1 XSLT Transformations
- 4.2 WordFreak-OpenNLP Transformation
- 4.3 GATE XML Format
- 4.4 GATE-WordFreak Transformation
- 4.5 XML Metadata Interchange: XMI
- 4.6 WordFreak-XMI Transformation
- 4.7 Towards Interoperability
- 4.8 Further Reading
- 5 Annotation Architectures
- 5.1 GATE
- 5.2 GATE Information Extraction Tools
- 5.3 Annotations with JAPE Rules
- 5.4 Customizing GATE Gazetteers
- 5.5 UIMA
- 5.6 UIMA Wrappers for OpenNLP Tools
- 5.7 Annotations with Regular Expressions
- 5.8 Customizing UIMA Dictionaries
- 5.9 Further Reading
- 6 Text Analytics
- 6.1 Text Analytics Tools
- 6.2 Named Entity Recognition
- 6.3 Training Statistical Models
- 6.4 Coreference Resolution
- 6.5 Information Extraction
- 6.6 Text Mining and Searching
- 6.7 New Directions
1 Working with XML
... In general, annotations are notes of some kind that are attached to an object of some kind. In this book, the objects that are annotated are texts. Linguistic annotations are notes about linguistic features of the annotated text that give information about the words and sentences of the text. …
1.1 Introduction
1.2 XML Basics
1.3 XML Parsing and Validation
1.4 XML Transformations
1.5 In-Line Annotations
1.6 Stand-Off Annotations
1.7 Annotation Standards
1.8 Further Reading
2 Linguistic Annotation
2.1 Levels of Linguistic Annotation
In linguistic theory, the analysis and description of linguistic phenomena are usually organized into several distinct levels. The different sounds used by a language are described at the level of phonology. The writing system is described at the level of orthography. Morphology describes the formation and inflection of individual words. Syntax describes the ordering of words and their combination into phrases and sentences. Semantics analyzes the meaning of individual words (lexical semantics) and the meaning of phrases and sentences (compositional semantics). How words and phrases are actually used to make things happen is the level of pragmatics. How people and things are introduced as topics and subsequently referred to in later utterances is the level of discourse.
The different levels of linguistic description can be thought of as layers, as shown in Figure 2.1. Phonology and orthography deal with the smallest units (individual sounds and letters) at the bottom. Morphology, syntax and semantics deal with the medium-sized units (words, phrases and sentences). Discourse and pragmatics deal with the largest units (whole paragraphs and dialogues) at the top.
discourse | cohesion in a text or dialogue |
pragmatics | functions of utterances |
semantics | meaning of words and sentences |
syntax | word order and sentence structure |
morphology | word formation and inflections |
orthography | spelling (written language) |
phonology | sounds (spoken language) |
The current state of the art in linguistic annotation also divides the different annotation tasks into different levels, which can be arranged into a similar set of layers as shown in Figure 2.2. However, there is only an approximate correspondence between the levels of the tasks performed in practical corpus annotation work and the levels of description in linguistic theory.
coreference resolution | linking references to same entities in a text |
named entity recognition | identifying and labeling named entities |
semantic analysis | labeling predicate-argument relations |
syntactic parsing | analyzing constituent phrases in a sentence |
part-of-speech tagging | labeling words with word categories |
tokenization | segmenting text into words |
sentence boundaries | segmenting text into sentences |
This book focusses on the annotation of texts, where the language is written not spoken, so we do not include an annotation level matching phonology. The annotation tasks that deal with the level of orthography are tokenization and sentence boundary detection. These tasks segment the text into distinct words (tokens) and distinct sentences. It does not usually matter which of these two tasks is performed first, but it is important that both tasks are performed before the higher-level tasks are done.
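As a rough illustration of these two orthography-level tasks, the following Python sketch segments a text into sentences and tokens using simple regular expressions. It is not the method used by WordFreak, OpenNLP or the Stanford tools; the function names and the splitting rules are assumptions made for this example, and the naive sentence rule would wrongly split after abbreviations such as "Mr.". Both functions work directly on the raw text and return character offsets, which is one way to see why the order of the two tasks usually does not matter.

```python
import re

def sentence_spans(text):
    """Return (start, end) character spans for sentences.

    Naive rule (an assumption for this sketch): a sentence ends at a run of
    '.', '!' or '?' that is followed by whitespace or the end of the text.
    """
    spans = []
    start = 0
    for match in re.finditer(r"[.!?]+(?=\s|$)", text):
        spans.append((start, match.end()))
        start = match.end()
        # Skip the whitespace that separates this sentence from the next one.
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):
        # Trailing text without final punctuation still counts as a sentence.
        spans.append((start, len(text)))
    return spans

def token_spans(text):
    """Return (start, end) spans of word-character runs or single punctuation marks."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]

def tokens_by_sentence(text):
    """Group token strings under the sentence whose span contains them."""
    sentences = sentence_spans(text)
    tokens = token_spans(text)
    return [
        [text[s:e] for (s, e) in tokens if s >= s0 and e <= e0]
        for (s0, e0) in sentences
    ]

sample = "I saw the sun. It was red!"
for sentence in tokens_by_sentence(sample):
    print(sentence)
```

Because each task only records offsets into the unchanged text, the two results can be computed independently and combined afterwards, as `tokens_by_sentence` does.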
2.2 WordFreak Annotation Tool
There are many tools that can be used for linguistic annotation. We will use WordFreak (http://wordfreak.sourceforge.net/), a Java-based linguistic annotation tool designed to support both human and automatic annotation of linguistic data. WordFreak is briefly described by its developers Thomas Morton and Jeremy LaCivita in (Morton and LaCivita 2003). There is no user manual, so we will give detailed examples here.
We use WordFreak in order to gain practical experience of doing linguistic annotations by hand. That’s the only way to learn the difficulties involved in making decisions in linguistic annotations. Later, when we use statistical NLP tools, we will appreciate the speed and power of automatic annotations, by contrast with manual annotations.
As an example text, we will use Shakespeare’s Sonnet 130. Figure 2.3 shows sonnet130.txt, a plain text version of Sonnet 130.
WordFreak creates stand-off XML annotations. We will describe the format and see examples in the following sections. Note that GATE and WordFreak deal with existing annotations differently.
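The general idea of stand-off annotation can be sketched in a few lines of Python: the source text is left untouched, and each annotation is stored separately with character offsets pointing back into it. The element and attribute names used here (`annotations`, `annotation`, `type`, `start`, `end`) are hypothetical and chosen only for illustration; they are not WordFreak's actual file format.

```python
import xml.etree.ElementTree as ET

def build_standoff(annotations):
    """Build a stand-off XML tree from (label, start, end) triples.

    The element and attribute names are hypothetical, not WordFreak's format.
    """
    root = ET.Element("annotations")
    for label, start, end in annotations:
        ET.SubElement(
            root,
            "annotation",
            attrib={"type": label, "start": str(start), "end": str(end)},
        )
    return root

def resolve(text, root):
    """Look up the text covered by each annotation via its character offsets."""
    return [
        (a.get("type"), text[int(a.get("start")):int(a.get("end"))])
        for a in root.iter("annotation")
    ]

text = "I saw the sun."
root = build_standoff([("sentence", 0, 14), ("token", 2, 5)])
print(ET.tostring(root, encoding="unicode"))
print(resolve(text, root))
```

Since the annotations only refer to offsets, several annotation layers can be kept for the same text without changing it, whereas in-line annotation would insert the markup into the text itself.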
2.3 Sentence Boundaries
2.4 Tokenization
2.5 Part-of-Speech Tagging
2.6 Syntactic Parsing
2.7 Semantics and Discourse
2.8 WordFreak with OpenNLP
We have learned in the practical work that doing linguistic annotations by hand is a slow process. In this section we combine WordFreak with automatic tagging and parsing tools to do linguistic annotations much faster. We will learn more about statistical annotation tools in Chapter 3.
Automatic annotation tools inevitably make some mistakes, but the errors can be corrected by hand using the WordFreak user interface. The combination of high-speed automatic annotation and high-quality human checking and correction can be a good solution for some annotation tasks.
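The combined workflow can be pictured as overlaying a small set of human corrections on a larger set of automatic annotations. The following Python sketch assumes the automatic output is a list of (token, tag) pairs and the corrections are keyed by token position; the function name, the data layout and the example tags are assumptions for illustration, and the "automatic" tags below are hand-written stand-ins rather than real tagger output.

```python
def apply_corrections(auto_tagged, corrections):
    """Overlay manual corrections on automatically assigned tags.

    auto_tagged: list of (token, tag) pairs from an automatic tagger.
    corrections: dict mapping token position to the corrected tag.
    Returns a new list; the automatic output is left unchanged.
    """
    for index in corrections:
        if not 0 <= index < len(auto_tagged):
            raise IndexError(f"no token at position {index}")
    return [
        (token, corrections.get(i, tag))
        for i, (token, tag) in enumerate(auto_tagged)
    ]

# Hand-written stand-in for tagger output, with "saw" mis-tagged as a noun.
automatic = [("I", "PRP"), ("saw", "NN"), ("the", "DT"), ("sun", "NN")]
# A human annotator corrects only the token at position 1.
checked = apply_corrections(automatic, {1: "VBD"})
print(checked)
```

Only the tokens the annotator disagrees with need to be touched, which is where the time saving over fully manual annotation comes from.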
2.9 Further Reading
3 Using Statistical NLP Tools
3.1 Statistical Models
3.2 OpenNLP and Stanford NLP Tools
3.3 Sentences and Tokenization
3.4 Statistical Tagging
3.5 Chunking and Parsing
3.6 Named Entity Recognition
3.7 Coreference Resolution
3.8 Further Reading
4 Annotation Interchange
4.1 XSLT Transformations
4.2 WordFreak-OpenNLP Transformation
4.3 GATE XML Format
4.4 GATE-WordFreak Transformation
4.5 XML Metadata Interchange: XMI
4.6 WordFreak-XMI Transformation
4.7 Towards Interoperability
4.8 Further Reading
5 Annotation Architectures
5.1 GATE
5.2 GATE Information Extraction Tools
5.3 Annotations with JAPE Rules
5.4 Customizing GATE Gazetteers
5.5 UIMA
5.6 UIMA Wrappers for OpenNLP Tools
5.7 Annotations with Regular Expressions
5.8 Customizing UIMA Dictionaries
5.9 Further Reading
6 Text Analytics
6.1 Text Analytics Tools
6.2 Named Entity Recognition
6.3 Training Statistical Models
6.4 Coreference Resolution
6.5 Information Extraction
6.6 Text Mining and Searching
6.7 New Directions
 | Author | title | journal | titleUrl | doi | year |
---|---|---|---|---|---|---|
2009 IntroToLinguisticAnnotation | Graham Wilcock | Introduction to Linguistic Annotation and Text Analytics | Synthesis Lectures on Human Language Technologies | http://books.google.com/books?id=TDQJb1UgVywC | 10.2200/S00194ED1V01Y200905HLT003 | 2009 |