2005 ExtractingPersonalNamesfromEmai

From GM-RKB


Abstract

There has been little prior work on Named Entity Recognition for "informal" documents like email. We present two methods for improving performance of person name recognizers for email: email-specific structural features and a recall-enhancing method which exploits name repetition across multiple documents.

1. Introduction

Named entity recognition (NER), the identification of entity names in free text, is a well-studied problem. In most previous work, NER has been applied to news articles (e.g., (Bikel et al., 1999; McCallum and Li, 2003)), scientific articles (e.g., (Craven and Kumlien, 1999; Bunescu and Mooney, 2004)), or web pages (e.g., (Freitag, 1998)). These genres of text share two important properties: documents are written for a fairly broad audience, and writers take care in preparing documents. Important genres that do not share these properties include instant messaging logs, newsgroup postings and email messages. We refer to these genres as “informal” text. Informal text is harder to process automatically. Informal documents do not obey strict grammatical conventions. They contain grammatical and spelling errors. Further, since the audience is more restricted, informal documents often use group- and task-specific abbreviations and are not self-contained. Because of these differences, existing NER methods may require modifications to perform well on informal text.

In this paper, we investigate NER for informal text with an experimental study of the problem of recognizing personal names in email, a task that is both useful and non-trivial. An application of interest is corpus anonymization. Automatic or semiautomatic email anonymization should allow large amounts of informal text, for example medical files, to be used for research purposes. Person-name extraction and other NER tasks are helpful for automatic processing of informal text for a large variety of applications (Culotta et al., 2004; Cohen et al., 2005). We first present four corpora of email text, annotated with personal names, each roughly comparable in size to the MUC-6 corpus (footnote 1). We experimentally evaluate the performance of conditional random fields (CRF) (Lafferty et al., 2001), a state-of-the-art machine-learning-based NER method, on these corpora. We then examine the special attributes of email text (vs. newswire) and suggest avenues for improving extraction performance. One important observation is that email messages often include some structured, easy-to-recognize names, such as names within a header, names appearing in automatically generated phrases, and names in signature files or sign-offs. We therefore suggest a set of specialized structural features for email; these features are shown to significantly improve performance on our corpora.

Footnote 1: Two of these are publicly available. The others cannot be distributed due to privacy considerations.

We also present and evaluate a novel method for exploiting repetition of names in a test corpus. Techniques for exploiting name repetition within documents have recently been applied to newswire text (e.g., (Humphreys et al., 1998)), scientific abstracts (e.g., (Bunescu and Mooney, 2004)) and seminar announcements (Sutton and McCallum, 2004); however, these techniques rely on either NP analysis or capitalization information to pre-identify candidate coreferent name mentions, features which are not reliable in email. Furthermore, we argue that name repetition in email should be inferred by examining multiple documents in a corpus, which is not common practice. We therefore present an alternative, efficient scheme for increasing recall in email, using the whole corpus. This technique is shown to always improve recall substantially, and to almost always improve F1 performance.

3. Existing NER Methods

In our first set of experiments we apply CRF, a machine-learning based probabilistic approach to labeling sequences of examples, and evaluate it on the problem of extracting personal names from email. Learning reduces NER to the task of tagging (i.e., classifying) each word in a document. We use a set of five tags, corresponding to (1) a one-token entity, (2) the first token of a multi-token entity, (3) the last token of a multi-token entity, (4) any other token of a multi-token entity and (5) a token that is not part of an entity.
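The five-tag scheme described above can be illustrated with a minimal sketch. The tag symbols ("U", "B", "L", "I", "O"), the function name encode_tags and the example sentence are assumptions made for illustration; the source only describes the five roles, not how they are named.

```python
def encode_tags(n_tokens, name_spans):
    """Assign one of five tags to every token position.

    name_spans is a list of (start, end) token offsets with end exclusive.
    Tag symbols are hypothetical labels for the five roles:
      U = one-token entity, B = first token, L = last token,
      I = any other token inside a multi-token entity, O = outside.
    """
    tags = ["O"] * n_tokens
    for start, end in name_spans:
        if end - start == 1:
            tags[start] = "U"
        else:
            tags[start] = "B"
            tags[end - 1] = "L"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
    return tags


tokens = "please ask William W. Cohen or Einat".split()
tags = encode_tags(len(tokens), [(2, 5), (6, 7)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequential learner such as a CRF is then trained to predict one of these tags per token, so extraction reduces to per-token classification.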

The sets of features used are presented in Table 2. All features are instantiated for the focus word, as well as for a window of 3 tokens to the left and to the right of the focus word. The basic features include the lower-case value of a token t, and its capitalization pattern, constructed by replacing all capital letters with the letter “X”, all lower-case letters with “x”, all digits with “9”, and compressing runs of the same letter into a single letter. The dictionary features define various categories of words including common words, first names, last names (footnote 3) and “roster names” (footnote 4) (an international names list in which first and last names are mixed). In addition, we constructed some composite dictionary features, as specified in Table 2: for example, a word that is in the first-name dictionary and is not in the common-words or last-name dictionaries is designated a “sure first name”. The common-words dictionary consists of base forms, conjugations and plural forms of common English words, plus a relatively small ad-hoc dictionary of words especially common in email (e.g., “email”, “inbox”). We also use small manually created dictionaries of prefixes and suffixes indicative of persons (e.g., “mr”, “jr”), locations (e.g., “ave”) and organizations (e.g., “inc”).
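As a rough sketch of the basic features, the capitalization pattern and the 3-token window can be computed as follows. The function names, the feature-string format and the choice to leave punctuation unchanged are assumptions for illustration, not details given in the source.

```python
import re


def cap_pattern(token):
    """Map capitals to X, lower-case letters to x, digits to 9,
    then compress runs of the same mapped letter into one letter."""
    mapped = []
    for ch in token:
        if ch.isupper():
            mapped.append("X")
        elif ch.islower():
            mapped.append("x")
        elif ch.isdigit():
            mapped.append("9")
        else:
            mapped.append(ch)  # assumption: other characters are kept as-is
    return re.sub(r"([Xx9])\1+", r"\1", "".join(mapped))


def window_features(tokens, i, width=3):
    """Binary features for the focus token i and its neighbours
    up to `width` positions to the left and right."""
    feats = {}
    for offset in range(-width, width + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"tok[{offset}]={tokens[j].lower()}"] = 1
            feats[f"cap[{offset}]={cap_pattern(tokens[j])}"] = 1
    return feats


feats = window_features(["Thanks", ",", "Einat"], 2)
```

With this mapping, "William" becomes "Xx", "McCallum" becomes "XxXx" and "hello" becomes "x".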

Table 2: Feature sets

Basic Features
  • t, lexical value, lowercase (binary form, e.g. f(t=“hello”)=1), capitalization pattern of t (binary form, e.g. f(t.cap=x+)=1)
Dictionary Features
  • inCommon: t in common words dictionary
  • inFirst: t in first names dictionary
  • inLast: t in last names dictionary
  • inRoster: t in roster names dictionary
  • First: inFirst ∩ ¬inLast ∩ ¬inCommon
  • Last: ¬inFirst ∩ inLast ∩ ¬inCommon
  • Name: (First ∪ Last ∪ inRoster) ∩ ¬inCommon
  • Title: t in a personal prefixes/suffixes dictionary
  • Org: t in organization suffixes dictionary
  • Loc: t in location suffixes dictionary
Email Features
  • t appears in the header
  • t appears in the “from” field
  • t is a probable “signoff” (≈ after two line breaks and near end of message)
  • t is part of an email address (regular expression)
  • t starts a new sentence (≈ capitalized after a period, question or exclamation mark)
  • t is a probable initial (X or X.)
  • t followed by the bigram “and I”
  • t capitalized and followed by a pronoun within 15 tokens
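The composite dictionary features in Table 2, and two of the simpler email features, could be computed along these lines. The tiny dictionaries, the function names and both regular expressions are illustrative assumptions; the source does not give the actual word lists or patterns.

```python
import re

# Toy dictionaries (hypothetical contents, for illustration only).
COMMON = {"will", "mark", "email", "inbox", "the"}
FIRST = {"william", "einat", "mark", "will", "richard"}
LAST = {"cohen", "minkov", "wang", "richard"}
ROSTER = {"minkov", "wang"}


def dictionary_features(token):
    """Compute the dictionary features of Table 2 for one token."""
    t = token.lower()
    in_common, in_first = t in COMMON, t in FIRST
    in_last, in_roster = t in LAST, t in ROSTER
    first = in_first and not in_last and not in_common   # "sure first name"
    last = not in_first and in_last and not in_common    # "sure last name"
    name = (first or last or in_roster) and not in_common
    return {
        "inCommon": in_common, "inFirst": in_first,
        "inLast": in_last, "inRoster": in_roster,
        "First": first, "Last": last, "Name": name,
    }


def is_probable_initial(token):
    """A single capital letter, optionally followed by a period (X or X.)."""
    return re.fullmatch(r"[A-Z]\.?", token) is not None


def is_email_address(token):
    """Simplified address pattern (an assumption, not the paper's regex)."""
    return re.fullmatch(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", token) is not None
```

Note how an ambiguous token such as "Will" is in the first-name dictionary but fails First and Name because it is also a common word.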

5. Improving Recall With Inferred Dictionaries

Sequential learners of the sort used here classify tokens from each document independently; moreover, the classification of a word w is independent of the classification of other occurrences of w elsewhere in the document. That is, the fact that a word w has appeared somewhere in a context that clearly indicates that it is a name does not increase the probability that it will be classified as a name in other, more ambiguous contexts.

Recently, sequential learning methods have been extended to directly utilize information about name co-occurrence in learning the sequential classifier. This approach provides an elegant solution to modeling repetition within a single document. However, it requires identifying candidate related entities in advance by applying some heuristic. Thus, Bunescu and Mooney (2004) link similar NPs (requiring their heads to be identical), and Sutton and McCallum (2004) connect pairs of identical capitalized words. Given that capitalization conventions are often not followed in email corpora, there is no adequate heuristic that would link candidate entities prior to extraction. Further, it is not clear whether a collective classification approach can scale to modeling repetition across multiple documents.

We suggest an alternative approach of recall-enhancing name matching, which is appropriate for email. Our approach has points of similarity to the methods described by Stevenson and Gaizauskas (2000), who suggest matching text against name dictionaries, filtering out names that are also common words or that appear as non-names in high proportion in the training data. The approach described here is more systematic and general. In a nutshell, we suggest applying the noisy dictionary of predicted names over the test corpus, and using the approximate (predicted) name to non-name proportions over the test set itself to filter out ambiguous names. Therefore, our approach does not require a large amount of annotated training data. It also does not require the word distribution to be similar between training and test data. We will now describe our approach in detail.
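The general idea can be sketched as follows, under strong simplifying assumptions: names are single tokens, a word is kept in the inferred dictionary when the fraction of its test-set occurrences predicted as a name reaches a threshold, and the threshold value 0.5 is arbitrary. The function names and the ratio-based filter are hypothetical stand-ins; the excerpt does not specify the actual filtering score.

```python
from collections import Counter


def infer_name_dictionary(tokens, predicted_is_name, min_name_ratio=0.5):
    """Build a dictionary of words the model predicted as names,
    keeping only words that are predicted as names often enough
    across the whole test corpus."""
    total = Counter()
    as_name = Counter()
    for tok, is_name in zip(tokens, predicted_is_name):
        word = tok.lower()
        total[word] += 1
        if is_name:
            as_name[word] += 1
    return {w for w, k in as_name.items() if k / total[w] >= min_name_ratio}


def apply_dictionary(tokens, predicted_is_name, names):
    """Mark every occurrence of a dictionary word as a name,
    in addition to the model's original predictions."""
    return [p or tok.lower() in names
            for tok, p in zip(tokens, predicted_is_name)]


# Tokens pooled from the whole test corpus, with the model's predictions.
tokens = ["Einat", "will", "call", "einat", "and", "Will", "will", "reply"]
predicted = [True, False, False, False, False, True, False, False]

names = infer_name_dictionary(tokens, predicted)
enhanced = apply_dictionary(tokens, predicted, names)
```

In this toy corpus "einat" is predicted as a name in half of its occurrences and is kept, so its lower-case mention is recovered; "will" is mostly predicted as a non-name and is filtered out, although the one occurrence the model already tagged stays tagged.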

5.1 Matching names from dictionary

First, we construct a dictionary comprising all spans predicted as names by the learned model. For personal names, we suggest expanding this dictionary further, using a transformation scheme. …


References

Author: William W. Cohen, Einat Minkov, Richard C. Wang
Title: Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text
DOI: 10.3115/1220575.1220631
Year: 2005