Open-Ended Text Generation Task


An Open-Ended Text Generation Task is a text generation task in which no single correct output exists, allowing flexibility in the resulting text's length, style, and content (e.g., story generation or poetry generation).
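
The sketch below is an illustrative example (not drawn from the cited works) of how such a task is typically instantiated with a neural language model: the model name "gpt2", the prompt, and the sampling settings are assumptions, and stochastic decoding (temperature and nucleus sampling) is what produces many acceptable continuations rather than one reference answer.

```python
# Minimal illustrative sketch of open-ended text generation.
# Assumptions: the "gpt2" checkpoint, the prompt, and the sampling
# hyperparameters are chosen for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Once upon a time, in a quiet village by the sea,"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling (rather than greedy decoding) reflects the open-ended nature of
# the task: each run can yield a different, equally acceptable continuation
# that varies in length, style, and content.
outputs = model.generate(
    **inputs,
    do_sample=True,              # stochastic decoding
    top_p=0.95,                  # nucleus sampling
    temperature=0.8,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```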



References

2022

  • (Butlerean et al., 2022) ⇒ Rhys S. Butlerean, Vishnu Dutt Duggirala, and Farnoush Banaei-Kashani. (2022). “Efficient and Accurate Closed-Domain and Open-Domain Long-Form Question Answering.” In: International Conference on Computer Supported Education.
    • ABSTRACT: We present an efficient and accurate long-form question answering platform, dubbed iLFQA (i.e., short for intelligent Long-Form Question Answering). iLFQA was created as a continuation of iTA (i.e., short for intelligent Teaching Assistant). iTA was originally designed as a narrow domain question answering platform that performed generative question answering with a single textbook as a reference. The core purpose of iTA was expanded into iLFQA as we attempted to expand the narrow domain of iTA into an open-domain question answering system. iLFQA functions as a platform that accepts unscripted questions and efficiently produces semantically meaningful, explanatory, and accurate long-form responses. iLFQA uses classification tools as well as Transformer-based text generation modules, and is unique in the question answering space because it is an example of a deployable and efficient long-form question answering system. Question Answering systems exist in many forms, but long-form question answering remains relatively unexplored. The source code for both iLFQA and iTA are freely available for the benefit of researchers and practitioners in this field.

2021

  • (Karpinska et al., 2021) ⇒ Marzena Karpinska, Nader Akoury, and Mohit Iyyer. (2021). “The Perils of Using Mechanical Turk to Evaluate Open-ended Text Generation.” In: arXiv preprint arXiv:2109.06835.
    • ABSTRACT: Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the vast majority of them fail to report crucial details about their AMT tasks, hindering reproducibility. We then run a series of story evaluation experiments with both AMT workers and English teachers and discover that even with strict qualification filters, AMT workers (unlike teachers) fail to distinguish between model-generated text and human-generated references. We show that AMT worker judgments improve when they are shown model-generated output alongside human-generated references, which enables the workers to better calibrate their ratings. Finally, interviews with the English teachers provide deeper insights into the challenges of the evaluation process, particularly when rating model-generated text.