2023 SyntheticallyGeneratedTextforSu
- (Halterman, 2023) ⇒ Andrew Halterman. (2023). “Synthetically Generated Text for Supervised Text Analysis.” doi:10.48550/arXiv.2303.16028
Subject Headings: Synthetic Text Generation.
Notes
- It proposes generating synthetic text with language models to lower the costs of supervised text analysis, such as hand-labeling documents and sharing annotated text.
- It gives guidance on when to adapt a model and when to prompt one for controlled text generation, depending on the research goal.
- It introduces an "adversarial method" to quantitatively evaluate and maximize the similarity of synthetic text to real text.
- It demonstrates three applications of synthetic text: tweets describing the fighting in Ukraine, news articles for training an event detection system, and multilingual populist manifesto statements for training a populism classifier.
- It highlights limitations of using party manifestos to study populism, raising questions about the incentives behind manifesto content.
- It discusses ethical concerns, such as factual inaccuracies in synthetic text, and provides guidelines for handling them.
- It suggests future work, including hybrid real/synthetic datasets, improvements to synthetic text quality, and theoretical work on party manifesto incentives.
- It describes a simple technique for improving the quality of synthetic text and weighs the benefits of synthetic text against the tradeoffs of generating it.
- It offers guidance on tailoring synthetic text to the needs of supervised text analysis.
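The "adversarial method" noted above can be illustrated with a small sketch. These notes do not give the paper's exact procedure, so the code assumes one plausible reading: train a discriminator to tell real text from synthetic text, then keep the synthetic documents it scores as most real-looking. The naive Bayes discriminator and every function name here (`train_discriminator`, `real_log_odds`, `keep_most_realistic`) are hypothetical choices for illustration, not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately crude tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train_discriminator(real_texts, synthetic_texts):
    """Collect per-class word counts for a tiny naive Bayes discriminator."""
    counts = {"real": Counter(), "synthetic": Counter()}
    for text in real_texts:
        counts["real"].update(tokenize(text))
    for text in synthetic_texts:
        counts["synthetic"].update(tokenize(text))
    vocab = set(counts["real"]) | set(counts["synthetic"])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    priors = {"real": len(real_texts), "synthetic": len(synthetic_texts)}
    return counts, totals, vocab, priors


def real_log_odds(text, model):
    """Log-odds that `text` is real rather than synthetic (higher = more real-looking)."""
    counts, totals, vocab, priors = model
    score = math.log(priors["real"] / priors["synthetic"])
    for word in tokenize(text):
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        # Add-one (Laplace) smoothing avoids zero probabilities.
        p_real = (counts["real"][word] + 1) / (totals["real"] + len(vocab))
        p_syn = (counts["synthetic"][word] + 1) / (totals["synthetic"] + len(vocab))
        score += math.log(p_real / p_syn)
    return score


def keep_most_realistic(real_texts, synthetic_texts, k):
    """Rank synthetic texts by how real they look to the discriminator; keep the top k."""
    model = train_discriminator(real_texts, synthetic_texts)
    ranked = sorted(
        synthetic_texts,
        key=lambda text: real_log_odds(text, model),
        reverse=True,
    )
    return ranked[:k]
```

For simplicity the sketch scores the same synthetic texts the discriminator was trained on. A more careful version would score held-out documents and could also report the discriminator's held-out accuracy as a similarity measure, where accuracy near chance suggests the synthetic text is hard to tell apart from real text.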
Cited By
Quotes
Abstract
Supervised text models are a valuable tool for political scientists but present several obstacles to their use, including the expense of hand-labeling documents, the difficulty of retrieving rare relevant documents for annotation, and copyright and privacy concerns involved in sharing annotated documents. This article proposes a partial solution to these three issues, in the form of controlled generation of synthetic text with large language models. I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, and a simple technique for improving the quality of synthetic text. I demonstrate the usefulness of synthetic text with three applications: generating synthetic tweets describing the fighting in Ukraine, synthetic news articles describing specified political events for training an event detection system, and a multilingual corpus of populist manifesto statements for training a sentence-level populism classifier.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023 SyntheticallyGeneratedTextforSu | Andrew Halterman | | | Synthetically Generated Text for Supervised Text Analysis | | | | 10.48550/arXiv.2303.16028 | | 2023 |