SensEval Benchmark Task
A SensEval Benchmark Task is an NLP benchmark task for word-sense disambiguation systems.
- Context:
- It has been superseded by SemEval Benchmark Tasks (beginning with the SemEval-2007 workshop).
- Example(s):
- a Senseval-1 Benchmark Task (1998), a Senseval-2 Benchmark Task (2001), or a Senseval-3 Benchmark Task (2004).
- Counter-Example(s):
- a SemEval Benchmark Task, such as a SemEval-2007 Task.
- See: Semantic Analysis, Word-Sense Induction System, Natural Language Processing System, Machine Translation System, Information Retrieval System, Multilingual Semantic Textual Similarity.
References
2019a
- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/SemEval Retrieved:2019-11-10.
- SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. While meaning is intuitive to humans, transferring those intuitions to computational analysis has proved elusive.
This series of evaluations provides a mechanism to characterize in more precise terms exactly what is necessary to compute in meaning. As such, the evaluations provide an emergent mechanism to identify the problems and solutions for computations with meaning. These exercises have evolved to articulate more of the dimensions that are involved in our use of language. They began with apparently simple attempts to identify word senses computationally. They have evolved to investigate the interrelationships among the elements in a sentence (e.g., semantic role labeling), relations between sentences (e.g., coreference), and the nature of what we are saying (semantic relations and sentiment analysis).
The purpose of the SemEval and Senseval exercises is to evaluate semantic analysis systems. “Semantic analysis” refers to a formal analysis of meaning, and “computational” refers to approaches that in principle support effective implementation. [1] The first three evaluations, Senseval-1 through Senseval-3, were focused on word sense disambiguation, each time growing in the number of languages offered in the tasks and in the number of participating teams. Beginning with the fourth workshop, SemEval-2007 (SemEval-1), the nature of the tasks evolved to include semantic analysis tasks outside of word sense disambiguation.
Triggered by the conception of the *SEM conference, the SemEval community decided to hold the evaluation workshops yearly in association with the *SEM conference. It was also decided that not every evaluation task would be run every year; e.g., none of the WSD tasks were included in the SemEval-2012 workshop.
- ↑ Blackburn, P., and Bos, J. (2005). Representation and Inference for Natural Language: A First Course in Computational Semantics. CSLI Publications.
2019
- (Mihalcea, 2019) ⇒ Rada Mihalcea (2019). Senseval: http://web.eecs.umich.edu/~mihalcea/senseval/ Retrieved:2019-11-10.
- QUOTE: There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of Senseval is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.
- [1] Senseval-1 took place in the summer of 1998 for English, French, and Italian, culminating in a workshop held at Herstmonceux Castle, Sussex, England on September 2-4.
- [2] Senseval-2 took place in the summer of 2001, and was followed by a workshop held in July 2001 in Toulouse, in conjunction with ACL 2001. Senseval-2 included tasks for Basque, Chinese, Czech, Danish, Dutch, English, Estonian, Italian, Japanese, Korean, Spanish, and Swedish.
- [3] Senseval-3 took place in March-April 2004, followed by a workshop held in July 2004 in Barcelona, in conjunction with ACL 2004. Senseval-3 included 14 different tasks for core word sense disambiguation, as well as identification of semantic roles, multilingual annotations, logic forms, and subcategorization acquisition.
- [4] Semeval-2007 took place in July 2007, and was followed by a workshop held in conjunction with ACL in Prague. Semeval-1 included 18 different tasks targeting the evaluation of systems for the semantic analysis of text.
- [5] Semeval-2010 took place in July 2010 in Uppsala, Sweden. Check the Semeval-2010 site for more information.
2001
- (Edmonds & Cotton, 2001) ⇒ Phil Edmonds, and Scott Cotton. (2001). “Senseval-2: Overview.” In: Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems.
- The results of recent WSD exercises, e.g. Senseval-2 (Edmonds and Cotton, 2001), show clearly that Word Sense Disambiguation (WSD) methods based on hand-tagged examples are the ones performing best.
2000
- (Kilgarriff & Rosenzweig, 2000) ⇒ Adam Kilgarriff, and Joseph Rosenzweig (2000). "Framework and Results for English SENSEVAL". Computers and the Humanities, 34(1-2), 15-48. DOI:10.1023/A:1002693207386
- QUOTE: Senseval was the first open, community-based evaluation exercise for Word Sense Disambiguation programs. It adopted the quantitative approach to evaluation developed in MUC and other ARPA evaluation exercises. It took place in 1998. In this paper we describe the structure, organisation and results of the SENSEVAL exercise for English. We present and defend various design choices for the exercise, describe the data and gold-standard preparation, consider issues of scoring strategies and baselines, and present the results for the 18 participating systems. The exercise identifies the state-of-the-art for fine-grained word sense disambiguation, where training data is available, as 74–78% correct, with a number of algorithms approaching this level of performance. For systems that did not assume the availability of training data, performance was markedly lower and also more variable. Human inter-tagger agreement was high, with the gold standard taggings being around 95% replicable.
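The quote above refers to scoring strategies and fine-grained accuracy figures for the English SENSEVAL exercise. The sketch below illustrates one plausible reading of fine-grained (exact-match) scoring: precision over the instances a system attempted and recall over all gold-standard instances. The dictionaries, instance identifiers, and sense keys are illustrative assumptions, not the official SENSEVAL scorer or data format.

```python
# Minimal sketch of fine-grained (exact-match) WSD scoring, under the
# assumptions stated above; not the official SENSEVAL scoring software.

def fine_grained_score(gold, system):
    """Compute exact-match precision and recall.

    gold:   dict mapping instance-id -> gold-standard sense key
    system: dict mapping instance-id -> system-assigned sense key
    Precision is over attempted instances; recall is over all gold instances,
    so unattempted instances count against recall only.
    """
    attempted = [i for i in gold if i in system]
    correct = sum(1 for i in attempted if system[i] == gold[i])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

if __name__ == "__main__":
    # Toy data standing in for gold-standard and system answer files.
    gold = {"art.001": "art%1", "art.002": "art%2", "bar.001": "bar%1"}
    system = {"art.001": "art%1", "art.002": "art%1"}  # bar.001 left unattempted
    p, r = fine_grained_score(gold, system)
    print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.500 recall=0.333
```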
1998
- (Kilgarriff, 1998) ⇒ Adam Kilgarriff (1998). "SENSEVAL: An Exercise in Evaluating Word Sense Disambiguation Programs". In: Proc. EURALEX-98, Liege, Belgium (pp 176–174), and Proc. LREC-1, Granada, Spain (pp 581–588).
- QUOTE: A call for participants has been published and there are over 20 systems (hereafter “the participants”), from three continents, planning to take part. Participation involves, minimally,
- 1. receiving corpus data from the organisers
- 2. applying the participant’s WSD program to it
- 3. returning the program's word sense decisions to the organisers for evaluation.
- This will take place over the summer, 1998, and there will be a workshop in Sussex, England, in September, by which time the performance of a number of WSD programs will have been evaluated, and where we shall discuss
- systems’ results (from different sites, for different words etc.)
- the difficulties faced by the human lexicographers/taggers
- the way forward.
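The three participation steps quoted above (receive corpus data, apply a WSD program, return sense decisions for evaluation) amount to a simple batch pipeline. The sketch below walks through that flow using a most-frequent-sense baseline as a stand-in for a real WSD program; the toy data, sense-key strings, and output format are illustrative assumptions rather than the actual SENSEVAL distribution formats.

```python
# Minimal sketch of the participation flow described above: take corpus
# instances, apply a WSD program (here a trivial most-frequent-sense
# baseline), and emit per-instance sense decisions for evaluation.
# All names and formats are illustrative assumptions.

from collections import Counter

def train_most_frequent_sense(tagged_examples):
    """tagged_examples: iterable of (lemma, sense_key) pairs from sense-tagged training data."""
    counts = {}
    for lemma, sense in tagged_examples:
        counts.setdefault(lemma, Counter())[sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

def disambiguate(instances, mfs):
    """instances: iterable of (instance_id, lemma, context) triples.

    Returns (instance_id, sense_key) decisions; lemmas unseen in training
    get the placeholder tag 'U' (unassigned).
    """
    return [(iid, mfs.get(lemma, "U")) for iid, lemma, _context in instances]

if __name__ == "__main__":
    # Toy data in place of the organisers' corpus files.
    training = [("bank", "bank%1"), ("bank", "bank%1"), ("bank", "bank%2")]
    test = [("d001.s01", "bank", "she sat on the river bank"),
            ("d001.s02", "plant", "the plant closed last year")]
    mfs = train_most_frequent_sense(training)
    for instance_id, sense in disambiguate(test, mfs):
        print(instance_id, sense)  # the decisions that would be returned for scoring
```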