2006 EffiLinkingTextDocs

(Chakaravarthy et al., 2006) ⇒ Venkatesan T. Chakaravarthy, Himanshu Gupta, Prasan Roy, Mukesh Mohania. (2006). “Efficiently Linking Text Documents with Relevant Structured Information.” In: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB 2006).

Subject Headings: Entity Mention Normalization Algorithm, TF-IDF Ranking Function

Notes

It proposes an unsupervised algorithm, named EROCS, for an entity mention normalization task where Text Passages are linked to a single entity record.
Its proposed algorithm uses a TF-IDF-like ranking function that restricts itself to the terms that are available to describe entities.
Its proposed algorithm requires that each passage be linked to at most one Entity Record. This restriction is acceptable for their scenario where each entity relates to a purchase transaction, because few Passages will discuss more than one Transaction.
It proposes a greedy iterative cache refinement strategy in order to reduce the amount of data retrieved from the entity database.
Its proposal is related to the Factoid QA Task, if the entity description is treated as the query and if the supporting passages are required as evidence. So, it would be interesting to test the performance of the TF-IDF Vector Cosine Similarity approach used for the Factoid QA task.
Presentation Slides: http://aitrc.kaist.ac.kr/~vldb06/slides/R19-1.ppt
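
The ranking idea in the notes above can be illustrated with a minimal sketch. This is not the EROCS scoring function from the paper; it is a plain TF-IDF-style score, written under the assumptions that each entity record is available as a bag of descriptive terms, that IDF is computed over the entity records only, and that a passage is linked to at most one entity. The names tokenize, build_idf, score, and link, and the threshold parameter, are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(entities):
    """Compute a smoothed IDF weight per term over the entity records only.

    entities: dict mapping an entity id to its descriptive text (assumed form).
    """
    n = len(entities)
    df = Counter()
    for text in entities.values():
        df.update(set(tokenize(text)))
    return {term: math.log(1 + n / d) for term, d in df.items()}


def score(passage, entity_text, idf):
    """Sum tf * idf over the terms that describe the entity.

    Passage terms that do not occur in the entity description are ignored,
    so the score is restricted to the entity's own vocabulary.
    """
    tf = Counter(tokenize(passage))
    entity_terms = set(tokenize(entity_text))
    return sum(tf[t] * idf[t] for t in entity_terms if t in tf)


def link(passage, entities, idf, threshold=0.0):
    """Return the id of the best-scoring entity, or None if no score exceeds threshold."""
    best_id, best_score = None, threshold
    for entity_id, text in entities.items():
        s = score(passage, text, idf)
        if s > best_score:
            best_id, best_score = entity_id, s
    return best_id


# Toy transaction records (invented for illustration only).
entities = {
    "T1": "Alice Smith laptop Dell Bangalore",
    "T2": "Bob Jones printer Canon Delhi",
}
idf = build_idf(entities)
print(link("Customer Alice called about her Dell laptop.", entities, idf))
```

Returning None when no entity scores above the threshold reflects the "at most one Entity Record" restriction: a passage with no overlapping entity terms is left unlinked rather than forced onto a record.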

Cited By

~34 papers http://scholar.google.com/scholar?cluster=5244668485072194483

2009

(Dalvi et al., 2009) ⇒ Nilesh Dalvi, Ravi Kumar, Bo Pang, and Andrew Tomkins. (2009). “Matching Reviews to Objects Using a Language Model.” In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009).
- QUOTE: The EROCS system (Chakaravarthy et al., 2006), which uses information extraction and entity matching, is closest in spirit to our problem; they, however, employ tf-idf to match, which we show to be significantly sub-optimal in our setting.

2008

(Michelson & Knoblock, 2008) ⇒ Matthew Michelson, and Craig A. Knoblock. (2008). “Creating Relational Data from Unstructured and Ungrammatical Data Sources.” In: Journal of Artificial Intelligence Research, 31.
- QUOTE: We note with interest the EROCS system (Chakaravarthy, Gupta, Roy, & Mohania, 2006) where the authors tackle the problem of linking full text documents with relational databases. The technique involves filtering out all non-nouns from the text, and then finding the matches in the database. This is an intriguing approach; interesting future work would involve performing a similar filtering for larger documents and then applying the Phoebus algorithm to match the remaining nouns to reference sets.

2007

(Bhide et al., 2007) ⇒ Manish A. Bhide, Ajay Gupta, Rahul Gupta, Prasan Roy, Mukesh K. Mohania, and Zenita Ichhaporia. (2007). “LIPTUS: associating structured and unstructured information in a banking environment.” In: Proceedings of the 2007 ACM SIGMOD Conference.
- QUOTE: EROCS views the database as an set of entities, and identifies the entities that best match a given document – it performs the matching even if the identifier of the entity does not appear in the document text, and allows different segments in the document to match different entities.

Quotes

Abstract

Faced with growing knowledge management needs, enterprises are increasingly realizing the importance of interlinking critical business information distributed across structured and unstructured data sources. We present a novel system, called EROCS, for linking a given text document with relevant structured data. EROCS views the structured data as a predefined set of "entities" and identifies the entities that best match the given document. EROCS also embeds the identified entities in the document, effectively creating links between the structured data and segments within the document. Unlike prior approaches, EROCS identifies such links even when the relevant entity is not explicitly mentioned in the document. EROCS uses an efficient algorithm that performs this task keeping the amount of information retrieved from the database at a minimum. Our evaluation shows that EROCS achieves high accuracy with reasonable overheads.

References

1 Eugene Agichtein, Venkatesh Ganti, Mining reference tables for automatic text segmentation, Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 22-25, 2004, Seattle, WA, USA
2 AGRAWAL, S., CHAUDHURI, S., and DAS, G. DBXplorer: A System for Keyword-based Search over Relational databases. In ICDE (2002).
3 Ricardo A. Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1999
4 Thierry Barsalou, Gio Wiederhold, View objects for relational databases, 1990
5 Thierry Barsalou, Niki Siambela, Arthur M. Keller, Gio Wiederhold, Updating relational databases through object-based views, Proceedings of the 1991 ACM SIGMOD Conference, p.248-257, May 29-31, 1991, Denver, Colorado, United States
6 Arvind Hulgeri, Charuta Nakhe, Keyword Searching and Browsing in Databases using BANKS, Proceedings of the 18th International Conference on Data Engineering, p.431, February 26-March 01, 2002
7 BORTHWICK, A., STERLING, J., Eugene Agichtein, and GRISHMAN, R. Exploiting diverse sources via maximum entropy in named entity recognition. In Sixth Workshop on Very Large Corpora (1998).
8 Soumen Chakrabarti, Breaking through the syntax barrier: searching with entities and relations, Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, p.9-16, September 20-24, 2004, Pisa, Italy
9 Amit Chandel, P. C. Nagesh, Sunita Sarawagi, Efficient Batch Top-k Search for Dictionary-based Entity Recognition, Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), p.28, April 03-07, 2006 doi:10.1109/ICDE.2006.55
10 (Chaudhuri et al., 2005) ⇒ Surajit Chaudhuri, Venkatesh Ganti, Rajeev Motwani. (2005). “Robust Identification of Fuzzy Duplicates.” Proceedings of the 21st International Conference on Data Engineering (ICDE'05), p.865-876, April 05-08, 2005 doi:10.1109/ICDE.2005.125
11 Peter Pin-Shan Chen, The entity-relationship model — toward a unified view of data, ACM Transactions on Database Systems (TODS), v.1 n.1, p.9-36, March 1976 doi:10.1145/320434.320440
12 (CoSa, 2005) ⇒ William W. Cohen, Sunita Sarawagi. (2005). “Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods.” In: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 22-25, 2004, Seattle, WA, USA doi:10.1145/1014052.1014065
13 AnHai Doan, Alon Y. Halevy, Semantic-integration research in the database community, AI Magazine, v.26 n.1, p.83-94, March 2005
14 HRISTIDIS, V., GRAVANO, L., and PAPAKONSTANTINOU, Y. Efficient IR-Style Keyword Search over Relational Databases. In VLDB (2003).
15 IBM. IBM DB2 UDB Net Search Extender : Administration and User Guide (version 8.1), 2003.
16 Xin Li, Paul Morie, Dan Roth, Semantic integration in text: from ambiguous names to identifiable entities, AI Magazine, v.26 n.1, p.45-58, March 2005
17 Imran R. Mansuri, Sunita Sarawagi, Integrating Unstructured Data into Relational Databases, Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), p.29, April 03-07, 2006 doi:10.1109/ICDE.2006.83
18 William J. Premerlani, Michael R. Blaha, An approach for reverse engineering of relational databases, Communications of the ACM, v.37 n.5, p.42-ff., May 1994 doi:10.1145/175290.175293
19 Prasan Roy, Mukesh Mohania, Bhuvan Bamba, Shree Raman, Towards automatic association of relevant unstructured content with structured query results, Proceedings of the 14th ACM International Conference on Information and knowledge management, October 31-November 05, 2005, Bremen, Germany doi:10.1145/1099554.1099676
20 Sunita Sarawagi, Automation in information extraction and integration (tutorial). In VLDB (2002).
21 Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, Jeffrey F. Naughton, Relational Databases for Querying XML Documents: Limitations and Opportunities, Proceedings of the 25th International Conference on Very Large Data Bases, p.302-314, September 07-10, 1999
22 Mark H. Walker, Nanette J. Eaton, Nanette Eaton, Microsoft Office Visio 2003 Inside Out, Microsoft Press, Redmond, WA, 2003
23 WINKLER, W. E. The state of record linkage and current research problems. Tech. rep., U.S. Census Bureau, (1999).

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2006 EffiLinkingTextDocs	Venkatesan T. Chakaravarthy Himanshu Gupta Prasan Roy Mukesh Mohania			Efficiently Linking Text Documents with Relevant Structured Information		Proceedings of the 32nd International Conference on Very Large Data Bases	http://www.vldb.org/conf/2006/p667-chakaravarthy.pdf			2006