2013 TextbasedMeasuresofDocumentDive

Subject Headings: Text Diversity Analysis.

Notes

Cited By

Quotes

Author Keywords

Abstract

Quantitative notions of diversity have been explored across a variety of disciplines ranging from conservation biology to economics. However, there has been relatively little work on measuring the diversity of text documents via their content. In this paper we present a text-based framework for quantifying how diverse a document is in terms of its content. The proposed approach learns a topic model over a corpus of documents and computes a distance matrix between pairs of topics using measures such as topic co-occurrence. These pairwise distance measures are then combined with the distribution of topics within a document to estimate each document's diversity relative to the rest of the corpus. The method offers several advantages over existing approaches: it is fully data-driven, requiring only the text from a corpus of documents as input; it produces human-readable explanations; and it can be generalized to score the diversity of other entities such as authors, academic departments, or journals. We describe experimental results on several large data sets which suggest that the approach is effective and accurate in quantifying how diverse a document is relative to other documents in a corpus.
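
The pipeline the abstract describes (per-document topic distributions, a topic-to-topic distance matrix derived from co-occurrence, and a score that combines the two) can be illustrated with a minimal sketch. The abstract does not state the exact formulas, so everything specific here is an assumption: the distance is taken as one minus the Jaccard overlap of the document sets in which two topics appear, the score is a Rao-style quadratic form over the document's topic proportions, and the names `cooccurrence_distance`, `diversity`, `threshold`, and the toy `doc_topics` matrix are hypothetical. In practice `doc_topics` would come from a trained topic model rather than being written by hand.

```python
from itertools import combinations


def cooccurrence_distance(doc_topics, threshold=0.1):
    """Assumed distance: 1 - Jaccard overlap of the documents two topics appear in."""
    n_topics = len(doc_topics[0])
    # docs_with[t] = indices of documents where topic t has weight >= threshold
    docs_with = [
        {d for d, theta in enumerate(doc_topics) if theta[t] >= threshold}
        for t in range(n_topics)
    ]
    dist = [[0.0] * n_topics for _ in range(n_topics)]
    for i, j in combinations(range(n_topics), 2):
        union = docs_with[i] | docs_with[j]
        overlap = len(docs_with[i] & docs_with[j]) / len(union) if union else 0.0
        dist[i][j] = dist[j][i] = 1.0 - overlap
    return dist


def diversity(theta, dist):
    """Rao-style quadratic form: sum over i, j of theta[i] * theta[j] * dist[i][j]."""
    n = len(theta)
    return sum(theta[i] * theta[j] * dist[i][j] for i in range(n) for j in range(n))


# Toy topic proportions (rows: documents, columns: topics); illustrative only.
doc_topics = [
    [0.5, 0.5, 0.0],  # mixes topics 0 and 1, which often co-occur
    [0.6, 0.4, 0.0],
    [0.0, 0.0, 1.0],  # single-topic document
    [0.5, 0.0, 0.5],  # mixes topics 0 and 2, which rarely co-occur
]

dist = cooccurrence_distance(doc_topics)
scores = [diversity(theta, dist) for theta in doc_topics]
for d, score in enumerate(scores):
    print(f"document {d}: diversity = {score:.3f}")
```

Under these assumptions a single-topic document scores zero, and a document that mixes rarely co-occurring topics scores higher than one that mixes topics commonly found together, which matches the intuition of diversity "relative to the rest of the corpus".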



Author: Padhraic Smyth, David Newman, Kevin Bache
Title: Text-based Measures of Document Diversity
DOI: 10.1145/2487575.2487672
Year: 2013