2014 Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
- (Sabou et al., 2014) ⇒ Marta Sabou, Kalina Bontcheva, Leon Derczynski, and Arno Scharl. (2014). “Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines.” In: Proc. LREC.
Subject Headings: Crowdsourced Corpus Annotation
Notes
Cited By
Quotes
Author Keywords
Abstract
Crowdsourcing is an emerging collaborative approach that can be used for the acquisition of annotated corpora and a wide range of other linguistic resources. Although the use of this approach is intensifying in all its key genres (paid-for crowdsourcing, games with a purpose, volunteering-based approaches), the community still lacks a set of best-practice guidelines similar to the annotation best practices for traditional, expert-based corpus acquisition. In this paper we focus on the use of crowdsourcing methods for corpus acquisition and propose a set of best practice guidelines based on our own experiences in this area and an overview of related literature. We also introduce GATE Crowd, a plugin of the GATE platform that relies on these guidelines and offers tool support for using crowdsourcing in a more principled and efficient manner.
1. Introduction
Over the past ten years, Natural Language Processing (NLP) research has been driven forward by a growing volume of annotated corpora, produced by evaluation initiatives such as ACE (ACE, 2004), TAC,[1] SemEval and Senseval,[2] and large annotation projects such as OntoNotes (Hovy et al., 2006). These corpora have been essential for training and domain adaptation of NLP algorithms and their quantitative evaluation, as well as for enabling algorithm comparison and repeatable experimentation. Thanks to these efforts, there are now well-understood best practices in how to create annotations of consistently high quality, by employing, training, and managing groups of linguistic and/or domain experts. This process is referred to as “the science of annotation” (Hovy, 2010).
More recently, the emergence of crowdsourcing platforms (e.g. paid-for marketplaces such as Amazon Mechanical Turk (AMT) and CrowdFlower (CF); games with a purpose; and volunteer-based platforms such as crowdcrafting), coupled with growth in internet connectivity, motivated NLP researchers to experiment with crowdsourcing as a novel, collaborative approach for obtaining linguistically annotated corpora. The advantages of crowdsourcing over expert-based annotation have already been discussed elsewhere (Fort et al., 2011; Wang et al., 2012), but in a nutshell, crowdsourcing tends to be cheaper and faster.
There is now a large and continuously growing number of papers that use crowdsourcing to create annotated data for training and testing a wide range of NLP algorithms, as detailed in Section 2. and listed in Table 1. As the practice of using crowdsourcing for corpus annotation has become more widespread, so has the need for a best practice synthesis, spanning all three crowdsourcing genres and generalising from the specific NLP annotation tasks reported in individual papers. The meta-review of (Wang et al., 2012) discusses the trade-offs of the three crowdsourcing genres, alongside dimensions such as contributor motivation, setup effort, and human participants. While this review answers some key questions in using crowdsourcing, it does not provide a summary of best practice in how to set up, execute, and manage a complete crowdsourcing annotation project. In this paper we aim to address this gap by putting forward a set of best practice guidelines for crowdsourced corpus acquisition (Section 3.) and introducing GATE Crowd, an extension of the GATE NLP platform that facilitates the creation of crowdsourced tasks based on best practices, as well as their integration into larger NLP processes (Section 4.).
2. Crowdsourcing Approaches
Crowdsourcing paradigms for corpus creation can be placed into one of three categories: mechanised labour, where workers are rewarded financially; games with a purpose, where the task is presented as a game; and altruistic work, relying on goodwill.
Mechanised labour has been used to create corpora that support a broad range of NLP problems (Table 1). Highly popular are NLP problems that are inherently subjective and cannot yet be reliably solved automatically, such as sentiment and opinion mining (Mellebeek et al., 2010), word sense disambiguation (Parent and Eskenazi, 2010), textual entailment (Negri et al., 2011), and question answering (Heilman and Smith, 2010). Others create corpora of special resource types such as emails (Lawson et al., 2010), Twitter feeds (Finin et al., 2010), and augmented and alternative communication texts (Vertanen and Kristensson, 2011).
One advantage of crowdsourcing is “access to foreign markets with native speakers of many rare languages” (Zaidan and Callison-Burch, 2011). This feature is particularly useful for those who work on less-resourced languages such as Arabic (El-Haj et al., 2010) and Urdu (Zaidan and Callison-Burch, 2011). Irvine and Klementiev (2010) demonstrated that it is possible to create lexicons between English and 37 out of the 42 low-resource languages they examined.

Games with a purpose (GWAPs) for annotation include Phratris (annotating sentences with syntactic dependencies) (Attardi, 2010), PhraseDetectives (Poesio et al., 2012) (anaphora annotations), and Sentiment Quiz (Scharl et al., 2012) (sentiment). GWAP-based approaches for collecting speech data include VoiceRace (McGraw et al., 2009), a GWAP+MTurk approach where participants see a definition on a flashcard and need to guess and speak the corresponding word, which is then transcribed automatically by a speech recognizer; VoiceScatter (Gruenstein et al., 2009), where players must connect word sets with their definitions; Freitas et al.’s GWAP (Freitas et al., 2010), where players speak answers to graded questions in different knowledge domains; and MarsEscape (Chernova et al., 2010), a two-player game for collecting large-scale data for human-robot interaction.
An early example of leveraging volunteer contributions is Open Mind Word Expert, a Web interface that allows volunteers to tag words with their appropriate sense from WordNet in order to collect training data for the Senseval campaigns (Chklovski and Mihalcea, 2002). Also, the MNH (“Translation for all”) platform tries to foster the formation of a community through functionalities such as social networking and group definition support (Abekawa et al., 2010). Lastly, crowdcrafting.org is a community platform where NLP-based applications can be deployed.
Notably, volunteer projects that were not conceived with a primary NLP interest but which delivered results useful for solving NLP problems include (i) Wikipedia; (ii) the Open Mind Common Sense project, which collects general world knowledge from volunteers in multiple languages and is a key source for the ConceptNet semantic network that can enable various text understanding tasks; and (iii) Freebase, a structured, graph-based knowledge repository offering information about almost 22 million entities, constructed both by automatic means and through contributions from thousands of volunteers.
3. Best Practice Guidelines
Conceptually, the process of crowdsourcing language resources can be broken down into four main stages, outlined in Figure 3. and discussed in the following subsections. These stages have been identified based on generalising our experience with crowdsourced corpus acquisition (Rafelsberger and Scharl, 2009; Scharl et al., 2012; Sabou et al., 2013a; Sabou et al., 2013b) and a meta-analysis of other crowdsourcing projects summarized in Table 1.
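To make this staged workflow more concrete, the following is a minimal sketch of how micro-task design, judgement collection, and answer aggregation could be wired together in Python. It is purely illustrative: the CrowdTask structure, the stage groupings, the simulated contributors, and the majority-vote aggregation with an agreement threshold are assumptions introduced for this example, not the GATE Crowd plugin's API or the workflow prescribed by the paper.

```python
# Hypothetical sketch of a staged crowdsourced-annotation workflow.
# Names and stage groupings are illustrative; they do not mirror GATE Crowd.
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CrowdTask:
    """One micro-task: a text snippet to be labelled by several contributors."""
    text: str
    options: list                                    # candidate labels shown to contributors
    judgements: list = field(default_factory=list)   # labels collected so far


def design_tasks(snippets, options):
    """Task design and data preparation: wrap raw snippets as micro-tasks."""
    return [CrowdTask(text=s, options=options) for s in snippets]


def collect_judgements(task, workers):
    """Execution: gather one judgement per (simulated) contributor.

    In a real project this step would instead publish the task on a paid-for
    marketplace, a game with a purpose, or a volunteer platform.
    """
    for worker in workers:
        task.judgements.append(worker(task.text, task.options))


def aggregate(task, min_agreement=0.6):
    """Evaluation and aggregation: simple majority voting.

    Returns the winning label, or None when agreement falls below the
    threshold (such items would typically go to expert adjudication).
    """
    label, votes = Counter(task.judgements).most_common(1)[0]
    return label if votes / len(task.judgements) >= min_agreement else None


if __name__ == "__main__":
    # Toy sentiment-labelling example with three simulated contributors.
    workers = [
        lambda text, opts: "positive" if "great" in text else "negative",
        lambda text, opts: "positive" if "great" in text else "neutral",
        lambda text, opts: "positive" if "great" in text else "negative",
    ]
    tasks = design_tasks(["The service was great!", "Delivery took ages."],
                         options=["positive", "neutral", "negative"])
    for t in tasks:
        collect_judgements(t, workers)
        print(t.text, "->", aggregate(t))
```

In practice, the simulated contributors would be replaced by judgements collected through one of the crowdsourcing genres discussed in Section 2., and low-agreement items would be flagged for further review rather than silently discarded.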
…
References
- Abekawa, T., Utiyama, M., Sumita, E., and Kageura, K. (2010). Community-based Construction of Draft and Final Translation Corpus through a Translation Hosting Site Minna no Hon’yaku (MNH). In: Proceedings LREC
- ACE. (2004). Annotation Guidelines for Event Detection and Characterization (EDC), Feb. Available at http://www.ldc.upenn.edu/Projects/ACE/.
- Aker, A., El-Haj, M., Albakour, M.-D., and Kruschwitz, U. (2012). Assessing crowdsourcing quality through objective tasks. In: Proceedings LREC, pages 1456–1461.
- Attardi, G. (2010). Phratris – A Phrase Annotation Game. In INSEMTIVES Game Idea Challenge
- Behrend, T., Sharek, D., Meade, A., and Wiebe, E. (2011). The viability of crowdsourcing for survey research. Behav. Res., 43(3).
- Biewald, L. (2012). Massive multiplayer human computation for fun, money, and survival. In Current Trends in Web Engineering, pages 171–176. Springer.
- Bontcheva, K., Derczynski, L., and Roberts, I. (2014a). Crowdsourcing named entity recognition and entity linking corpora. In Handbook of Linguistic Annotation
- Bontcheva, K., Roberts, I., and Derczynski, L. (2014b). The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy. In: Proceedings EACL
- Callison-Burch, C. and Dredze, M., editors. (2010). Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk
- Chamberlain, J., Poesio, M., and Kruschwitz, U. (2009). A new life for a dead parrot: Incentive structures in the Phrase Detectives game. In: Proceedings of the Webcentives Workshop
- Chernova, S., Orkin, J., and Breazeal, C. (2010). Crowdsourcing HRI through Online Multiplayer Games. In Dialog with Robots: Papers from the AAAI Fall Symposium (FS-10-05)
- Chklovski, T. and Mihalcea, R. (2002). Building a Sense Tagged Corpus with Open Mind Word Expert. In: Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions
- Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker, D., Popovic, Z., and players, F. (2010). Predicting protein structures with a multiplayer online game. Nature, 466(7307).
- Cunningham, H., Tablan, V., Roberts, A., and Bontcheva, K. (2013). Getting more out of biomedical documents with GATE’s full lifecycle open source text analytics. PLoS Computational Biology, 9(2):e1002854.
- Doan, A., Ramakrishnan, R., and Halevy, A. Y. (2011). Crowdsourcing Systems on the World-Wide Web. Commun. ACM, 54(4), April.
- El-Haj, M., Kruschwitz, U., and Fox, C. (2010). Using Mechanical Turk to Create a Corpus of Arabic Summaries. In: Proceedings LREC
- Feng, D., Besana, S., and Zajac, R. (2009). Acquiring High Quality Non-Expert Knowledge from On-Demand Workforce. In: Proceedings of The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources
- Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., and Dredze, M. (2010). Annotating Named Entities in Twitter Data with Crowdsourcing. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
- Fort, K. and Sagot, B. (2010). Influence of Pre-annotation on POS-tagged Corpus Development. In: Proceedings of the Fourth Linguistic Annotation Workshop
- Fort, K., Adda, G., and Cohen, K. (2011). Amazon Mechanical Turk: Gold Mine or Coal Mine? Computational Linguistics, 37(2):413–420.
- Freitas, J., Calado, A., Braga, D., Silva, P., and Dias, M. (2010). Crowdsourcing platform for large-scale speech data collection. Proceedings of FALA, Vigo
- Gruenstein, E., McGraw, I., and Sutherland, A. (2009). A Self-Transcribing Speech Corpus: Collecting Continuous Speech with an Online Educational Game. In: Proceedings of The Speech and Language Technology in Education (SLaTE) Workshop
- Heilman, M. and Smith, N. A. (2010). Rating Computer-Generated Questions with Mechanical Turk. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
- Hong, J. and Baker, C. F. (2011). How Good is the Crowd at “real” WSD? In: Proceedings of the 5th Linguistic Annotation Workshop
- Hovy, E., Marcus, M. P., Palmer, M., Ramshaw, L. A., and Weischedel, R. M. (2006). OntoNotes: The 90% Solution. In: Proceedings NAACL
- Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., and Hovy, E. (2013). Learning Whom to trust with MACE. In: Proceedings of NAACL-HLT , pages 1120–1130.
- Hovy, E. (2010). Annotation. In Tutorial Abstracts of ACL
- Irvine, A. and Klementiev, A. (2010). Using Mechanical Turk to Annotate Lexicons for Less Commonly Used Languages. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
- Jha, M., Andreas, J., Thadani, K., Rosenthal, S., and McKeown, K. (2010). Corpus Creation for New Genres: A Crowdsourced Approach to PP Attachment. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
- Kawrykow, A., Roumanis, G., Kam, A., Kwak, D., Leung, C., Wu, C., Zarour, E., and players, P. (2012). Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment. PLoS ONE, 7(3):e31362.
- Khanna, S., Ratan, A., Davis, J., and Thies, W. (2010). Evaluating and improving the usability of Mechanical Turk for low-income workers in India. In: Proceedings of the first ACM symposium on Computing for Development. ACM.
- Kittur, A., Chi, E. H., and Suh, B. (2008). Crowdsourcing User Studies with Mechanical Turk. In: Proceedings of the 26th Conference on Human Factors in Computing Systems
- Laws, F., Scheible, C., and Schütze, H. (2011). Active Learning with Amazon Mechanical Turk. In: Proceedings EMNLP
- Lawson, N., Eustice, K., Perkowitz, M., and Yetisgen- Yildiz, M. (2010). Annotating Large Email Datasets for Named Entity Recognition with Mechanical Turk. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
- Mason, W. and Watts, D. J. (2010). Financial incentives and the performance of crowds. ACM SigKDD Explorations Newsletter , 11(2):100–108.
- McCreadie, R., Macdonald, C., and Ounis, I. (2012). Identifying Top News Using Crowdsourcing. Information Retrieval. doi:10.1007/s10791-012-9186-z.
- McGraw, I., Gruenstein, A., and Sutherland, A. (2009). A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game. In: Proceedings of INTERSPEECH
- Mellebeek, B., Benavent, F., Grivolla, J., Codina, J., Costa-jussà, M. R., and Banchs, R. (2010). Opinion Mining of Spanish Customer Comments with Non-Expert Annotations on Mechanical Turk. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
- Munro, R., Bethard, S., Kuperman, V., Lai, V. T., Melnick, R., Potts, C., Schnoebelen, T., and Tily, H. (2010). Crowdsourcing and Language Studies: The New Generation of Linguistic Data. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
- Negri, M. and Mehdad, Y. (2010). Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
- Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D., and Marchetti, A. (2011). Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora. In: Proceedings EMNLP
- Parent, G. and Eskenazi, M. (2010). Clustering Dictionary Definitions Using Amazon Mechanical Turk. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
- Poesio, M., Kruschwitz, U., Chamberlain, J., Robaldo, L., and Ducceschi, L. (2012). Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. Transactions on Interactive Intelligent Systems
- Poesio, M., Chamberlain, J., and Kruschwitz, U. (2014). Crowdsourcing. In Handbook of Linguistic Annotation. Springer.
- Rafelsberger, W. and Scharl, A. (2009). Games with a Purpose for Social Networking Platforms. In: Proceedings ACM conference on Hypertext and Hypermedia
- Rosenthal, S., Lipovsky, W., McKeown, K., Thadani, K., and Andreas, J. (2010). Towards Semi-Automated Annotation for Prepositional Phrase Attachment. In: Proceedings LREC
- Sabou, M., Bontcheva, K., Scharl, A., and Föls, M. (2013a). Games with a Purpose or Mechanised Labour? A Comparative Study. In: Proceedings International Conference on Knowledge Management and Knowledge Technologies
- Sabou, M., Scharl, A., and Föls, M. (2013b). Crowdsourced Knowledge Acquisition: Towards Hybrid-Genre Workflows. International Journal on Semantic Web and Information Systems, 9(3).
- Sayeed, A. B., Rusk, B., Petrov, M., Nguyen, H. C., Meyer, T. J., and Weinberg, A. (2011). Crowdsourcing syntactic relatedness judgements for opinion mining in the study of information technology adoption. In: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH ’11)
- Scharl, A., Sabou, M., Gindl, S., Rafelsberger, W., and Weichselbraun, A. (2012). Leveraging the wisdom of the crowds for the acquisition of multilingual language resources. In: Proceedings LREC
- Snow, R., O’Connor, B., Jurafsky, D., and Ng, A. Y. (2008). Cheap and Fast — but is it Good?: Evaluating Non-Expert Annotations for Natural Language Tasks. In: Proceedings EMNLP
- Vertanen, K. and Kristensson, P. O. (2011). The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources. In: Proceedings EMNLP
- Voyer, R., Nygaard, V., Fitzgerald, W., and Copperman, H. (2010). A Hybrid Model for Annotating Named Entity Training Corpora. In: Proceedings of the Fourth Linguistic Annotation Workshop (LAW IV ’10)
- Wang, A., Hoang, C., and Kan, M. Y. (2012). Perspectives on Crowdsourcing Annotations for Natural Language Processing. Language Resources and Evaluation
- Yano, T., Resnik, P., and Smith, N. A. (2010). Shedding (a Thousand Points of) Light on Biased Language. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
- Yetisgen-Yildiz, M., Solti, I., Xia, F., and Halgrim, S. R. (2010). Preliminary Experience with Amazon’s Mechanical Turk for Annotating Medical Named Entities. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
- Zaidan, O. F. and Callison-Burch, C. (2011). Crowdsourcing Translation: Professional Quality from Non-Professionals. In: Proceedings ACL.
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2014 CorpusAnnotationthroughCrowdsou | Kalina Bontcheva, Marta Sabou, Leon Derczynski, Arno Scharl | | | Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines | | | | | | 2014 |