2009 WibNEDWikipediaBasedNED

(Gentile et al., 2009) ⇒ Anna L. Gentile, Pierpaolo Basile, Giovanni Semeraro. (2009). “WibNED Wikipedia Based Named Entity Disambiguation.” In: Proceedings of the 5th Italian Research Conference on Digital Libraries (IRCDL 2009).

Subject Headings: Entity Mention Normalization Algorithm, Lesk Algorithm, (Wikipedia, 2009), Yamcha System.

Notes

It adapts the Lesk Algorithm (for Word Sense Disambiguation) to Named Entity Mention Disambiguation.
It uses (Wikipedia, 2009) as the Sense Inventory/Entity Database.
It does not disambiguate Common Nouns, such as “Drosophila Melanogaster”. (Oddly, the authors suggest that the word Muscomorpha is not a Proper Noun; see http://en.wikipedia.org/wiki/Muscomorpha.)
It cannot handle missing Entity Records. It always picks the nearest Entity Record.
It uses the Yamcha Chunking System.
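The notes above can be illustrated with a minimal sketch of a simplified Lesk-style overlap measure applied to entity mentions. This is not the authors' WibNED implementation: the stop-word list, the candidate descriptions, and the function names are all hypothetical, and the score is a plain word-overlap count.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "and", "to", "for", "on", "at", "by", "was"}

def content_words(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS}

def disambiguate(context, candidates):
    """Pick the candidate whose description shares the most words with the context.

    candidates maps an entity identifier to a short textual description.
    Ties are broken by insertion order, because max() keeps the first maximum.
    """
    ctx = content_words(context)
    scores = {entity: len(ctx & content_words(desc)) for entity, desc in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical entity records; a real system would draw them from Wikipedia.
candidates = {
    "Paris_(France)": "capital city of France on the river Seine",
    "Paris_(mythology)": "Trojan prince in Greek mythology who abducted Helen",
}
best, scores = disambiguate("She crossed the Seine in the French capital Paris", candidates)

# With no overlap at all, a candidate is still returned.
fallback, fallback_scores = disambiguate("Quarterly earnings rose sharply", candidates)
```

Because `disambiguate` always returns some candidate, even when every overlap is zero, the sketch shares the limitation noted above: it has no way to report that the correct Entity Record is missing.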

Cited By

http://www.di.uniba.it/~lops/IIANLP/materiale/IIA_NED.pdf

Quotes

Abstract.

Natural Language is a means to express and discuss concepts, which are taken to be abstractions from perceptions of the experienced real world: what texts describe consists of objects and events. Objects of the real world are identified by proper names, which are words, thus raising the problem of proper linkage between the textual reference and the real object. This work addresses the problem of automatically associating meanings with words within an unstructured text and focuses on words representing Named Entities. The proposed solution consists of a knowledge-based algorithm for Named Entity Disambiguation: we used an ad hoc corpus, extracted from Wikipedia articles, to prove the soundness of the algorithm.

1 Introduction

Many philosophical theories have been proposed about proper names. The descriptive theory of proper names is the view that the meaning of a given use of a proper name is a set of properties that can be expressed as a description that picks out an object that satisfies the description. Gottlob Frege in Sinn und Bedeutung [5], which can be translated as Sense and Reference, stated that sense and reference are two different aspects of the significance of an expression. A proper name, according to Frege, has a reference (Bedeutung) and a sense (Sinn). The reference is the object that the expression refers to (different linguistic expressions can have the same reference). The sense is the cognitive significance, the way by which the referent is presented. Linguistic expressions with the same reference may have different senses. In contrast to this theory, the causal theory of names combines the referential view with the idea that the name’s referent is fixed by a baptismal act, whereupon the name becomes a rigid designator of the referent. Subsequent uses of the name succeed in referring to the referent by being linked by a causal chain to that original baptismal act. The major representative of this theory was Saul Kripke, with his three lectures Naming and Necessity [9].
When translating the naming problem from the philosophical field to Computer Science, it is natural to turn to Natural Language Processing (NLP) techniques. NLP steps include text normalization, tokenization, stop word elimination, stemming, Part Of Speech tagging, and lemmatization. Further steps, such as Word Sense Disambiguation (WSD) or Named Entity Recognition (NER), are aimed at enriching texts with semantic information. Entity Disambiguation resolves the correspondence between real-world entities and mentions within text. The proposed approach exploits freely available knowledge bases and automatically associates each entity in a text with a URI, using Wikipedia as “entity-provider”.
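The first preprocessing steps listed above (normalization, tokenization, stop word elimination, stemming) can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stop-word list is made up, and `naive_stem` is a toy suffix stripper standing in for a real stemmer; none of it comes from the paper.

```python
import re

# Illustrative stop-word list, not a standard one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}

def normalize(text):
    """Trim, lowercase, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())

def tokenize(text):
    """Split normalized text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text)

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run the four steps in order and return the resulting stems."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(normalize(text)))]

tokens = preprocess("  The mentions are linked   to Wikipedia articles ")
# tokens == ["mention", "link", "wikipedia", "article"]
```

Part Of Speech tagging and lemmatization are left out because they need trained models or lexical resources that a self-contained sketch cannot reasonably include.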

2 Related Work

2.1 Named Entity Recognition

Named Entity Recognition (NER) involves the identification and classification of so-called named entities: expressions that refer to people, places, organizations, products, companies, and even dates, times, or monetary amounts, as stated in the Message Understanding Conferences (MUC) [7]. Approaches to this task can be deductive, relying on heuristic rules, which are mainly regular expressions, or inductive, using machine learning to build recognition rules instead of defining them entirely by hand. Regardless of the adopted approach, NER can take advantage of external resources, such as dictionaries, lexical resources, encyclopaedias and so on, both to build recognition rules and to train learning methods with additional features. Commonly used resources are, among others, WordNet and Wikipedia. The latter, in particular, can be considered not only as a source of semantic data, as suggested in [8], but also as a provider of URIs.
Multilingual benchmarking and evaluations have been performed within several events, such as the Message Understanding Conferences (MUC) series organized by DARPA, the International Conference on Language Resources and Evaluation (LREC), the Computational Natural Language Learning (CoNLL) workshops, the Automatic Content Extraction (ACE) series organized by NIST, the Multilingual Entity Task Conference (MET), and the Information Retrieval and Extraction Exercise (IREX). Evaluation for the Italian language has been performed in the context of Evalita (http://evalita.itc.it/), using part of the Italian Content Annotation Bank (I-CAB at http://tcc.itc.it/projects/ontotext/i-cab/download-icab.html) as the evaluation corpus.
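The deductive approach mentioned in this section, where recognition rules are mainly regular expressions, can be sketched as follows. The three rules and their labels are hypothetical and far simpler than the patterns and gazetteers a practical system would need.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical hand-written rules: (label, compiled pattern).
RULES = [
    ("DATE", re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")),
    ("MONEY", re.compile(r"[$€£]\d+(?:[.,]\d+)*")),
    ("PERSON", re.compile(r"\b(?:Mr|Mrs|Ms|Dr|Prof)\. [A-Z][a-z]+(?: [A-Z][a-z]+)*")),
]

def recognize(text):
    """Apply every rule and return (matched text, label) pairs in text order."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(span, label) for _, span, label in sorted(found)]

entities = recognize("Dr. Saul Kripke received $1,500 on 12 March 2009.")
# entities == [("Dr. Saul Kripke", "PERSON"), ("$1,500", "MONEY"), ("12 March 2009", "DATE")]
```

An inductive system would instead learn such patterns from annotated data, possibly using resources like Wikipedia as additional features.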

2.2 Named Entity Disambiguation

Entity disambiguation is the problem of determining whether two mentions of entities refer to the same object. After performing NER, it is useful to collect properties of the identified entities, particularly those that help the most to discriminate between individuals. Lexical and encyclopaedic knowledge can turn out to be relevant for this task.
A recent work by Bunescu and Paşca [3] treated Named Entity Disambiguation as a ranking problem, defining a cosine-based similarity function. Wikipedia was used as a dataset of disambiguated occurrences of proper names, and the context article for each entity was taken into account within the similarity function. Silviu Cucerzan proposed the Vector Space Model as a solution for the NED problem: the vector representation of the document is compared with the vector representation of the Wikipedia entities [4].
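The idea of comparing a document vector with entity vectors by cosine similarity can be sketched with raw term counts. This is a generic bag-of-words illustration, not the feature set of [3] or [4]; the entity texts and identifiers are hypothetical.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words vector: term -> raw frequency."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(document, entity_texts):
    """Return (score, entity) pairs sorted from most to least similar."""
    doc_vec = term_vector(document)
    scored = [(cosine(doc_vec, term_vector(text)), entity)
              for entity, text in entity_texts.items()]
    return sorted(scored, reverse=True)

# Hypothetical entity texts standing in for Wikipedia articles.
entity_texts = {
    "Java_(island)": "island of indonesia with volcanoes",
    "Java_(programming_language)": "programming language for software running on a virtual machine",
}
ranking = rank("the java virtual machine runs compiled software", entity_texts)
top_entity = ranking[0][1]
```

In this toy example the programming-language entry ranks first because it shares the terms "virtual", "machine" and "software" with the document, while the island entry shares none. A weighting scheme such as TF-IDF would normally replace the raw counts.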

A particular kind of Entity Disambiguation, the disambiguation of person names in a Web search scenario, was addressed in one of the SEMEVAL 2007 tasks, with the goal of grouping documents referring to the same individual.

2.3 Open Access Resources

Two Open Access Resources have been taken into account within this work: Wikipedia and WordNet [12].
An important role in NLP is played by Wikipedia, which is a free, multilingual, open content encyclopedia project. Its name is a portmanteau of the words wiki (a technology for creating collaborative websites) and encyclopedia: articles have been written collaboratively by volunteers around the world. Wikipedia was launched in 2001 by Jimmy Wales and Larry Sanger and is now the most popular general reference work available on the Internet. Wikipedia has steadily gained status as a general reference website and its content has also been used in academic studies, books and conferences. Many studies that try to exploit Wikipedia as a knowledge source have recently emerged [13] [14] [16]. In particular, for the problem of entities, Bunescu and Paşca exploited internal links in Wikipedia as training examples [3], Toral and Munoz [15] tried to extract gazetteers from Wikipedia by focusing on the first sentences, while Cucerzan used Wikipedia to extract entities to validate his proposal [4].
WordNet is a semantic lexicon for the English language. It groups English words into sets of synonyms called synsets. It provides short, general definitions and it records the various semantic relations between these synonym sets (but also includes lexical relations between words). Every synset contains a group of synonymous words or collocations (a collocation is a sequence of words that go together to form a specific meaning, such as “car pool”); different senses of a word are in different synsets. The meaning of the synsets is further clarified with short defining glosses. In this work WordNet is used as a lexical resource to accomplish basic NLP steps.

References

1. S. Banerjee and T. Pedersen. An adapted lesk algorithm for word sense disambiguation using wordnet. In CICLing ’02: Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 136–145, London, UK, (2002). Springer-Verlag.
2. P. Basile, M. de Gemmis, A.L. Gentile, L. Iaquinta, P. Lops, and G. Semeraro. Meta multilanguage text analyzer. In: Proceedings of the Language and Speech Technology Conference LangTech 2008, Rome, Italy, February 28-29, 2008, pages 137–140, 2008.
3. Razvan C. Bunescu and Marius Paşca. Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), Trento, Italy, pages 9–16, April 2006.
4. S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL 2007, pages 708–716, 2007.
5. G. Frege. Über Sinn und Bedeutung. In Mark Textor, editor, Funktion - Begriff - Bedeutung, volume 4 of Sammlung Philosophie. Vandenhoeck & Ruprecht, Göttingen, 1892.
6. A. L. Gentile, P. Basile, L. Iaquinta, and G. Semeraro. Lexical and semantic resources for nlp: From words to meanings. In Knowledge based Intelligent Information and Engineering Systems, volume 5179/2008 of Lecture Notes in Computer Science, pages 277–284. Springer Berlin / Heidelberg, 2008.
7. R. Grishman and B. Sundheim. Message understanding conference-6: A brief history. In COLING, pages 466–471, 1996.
8. J. Kazama and K. Torisawa. Exploiting wikipedia as external knowledge for named entity recognition. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 698–707, 2007.
9. S. A. Kripke. Naming and necessity. Blackwell Publishing, 1981.
10. Taku Kudo and Y. Matsumoto. Fast methods for kernel-based text analysis. In Erhard Hinrichs and Dan Roth, editors, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 24–31, 2003.
11. Michael E. Lesk. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In: Proceedings of the Fifth International Conference on Systems Documentation, pages 24–26, Toronto, CA, (1986). ACM.
12. George A. Miller. Introduction to wordnet: an on-line lexical database. International Journal of Lexicography, 3(4):235–244, (1990). (Special Issue).
13. Simone P. Ponzetto and Michael Strube. Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In: Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, 2006.
14. Michael Strube and S.P. Ponzetto. Wikirelate! computing semantic relatedness using wikipedia. In AAAI, pages 1419–1424. AAAI Press, 2006.
15. A. Toral and R. Munoz. A proposal to automatically build and maintain gazetteers for Named Entity Recognition by using Wikipedia. EACL 2006, 2006.
16. T. Zesch, I. Gurevych, and M. Mühlhäuser. Analyzing and accessing wikipedia as a lexical semantic resource. In Biannual Conference of the Society for Computational Linguistics and Language Technology, 2007.

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2009 WibNEDWikipediaBasedNED	Anna L. Gentile Pierpaolo Basile Giovanni Semeraro			WibNED Wikipedia Based Named Entity Disambiguation		Proceedings of the 5th Italian Research Conference on Digital Libraries				2009