2008 StructuredEntityIdentForDocCateg
Jump to navigation
Jump to search
- (Bhattacharya et al., 2008) ⇒ Indrajit Bhattacharya, Shantanu Godbole, and Sachindra Joshi. (2008). “Structured entity identification and document categorization: two tasks with one joint model." In: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2008) [doi>10.1145/1401890.1401899].
Subject Headings: EM Algorithm.
Notes
- Input
- Given a set of documents that each describe one (central) entity. All entities have the same attributes and a special class attribute, e.g. movies of different genres where the genre defines the movie class, but all movies have the same descriptive attributes. It is assumed that the document contains all the attribute values for that entity plus some additional words from a distribution that depends on the entity class.
- Goal
- For a given document, identify the central entity (e.g., the movie reviewed) and its entity class (e.g., its genre).
- Idea
- The tasks of entity identification and categorization / classification can benefit from each other, and the authors propose an integrated approach to exploit this observation.
- Probabilistic model
- It uses a natural and sophisticated probabilistic model.
- A document is assumed to consist of a structured part with attribute values and an unstructured part containing a bag of words.
- A probabilistic generative model is proposed with random variables t, e, w, c, and a. In this model, the entity type (t) determines (in a probabilistic manner) the actual entity (e) and the set of words (w) in the unstructured parts. The entity and the column(c) / attribute determine (in a probabilistic manner that models noise) the actual value (a) of that attribute.
- Challenge
- No labeled training data available for the word distributions of the different entity classes.
- Parameter estimation using EM
- t, e, and c are hidden variables. Using EM, they find parameter values that maximize the
probability / likelihood of the observed a (attribute) and [math]\displaystyle{ w }[/math] (word) values.
- Experimental evaluation
- Method has been applied very successfully to some proprietary CRM email and call logs dataset.
- It experimetns on public domain IMDB Dataset.
- The experimental results are disappointing given the sophistication of the approach. The proposed method does not gain too much compared to the baseline methods. Part of the reason seems to be that movies tend to belong to multiple genres, but the authors had to assign each movie to a unique genre as required by their method.
- Discussion
- Relating the problem of this paper to biomedical relationship extraction, do biomedical papers have one central entity? Do entities such as proteins have a fixed set of attributes? As mentioned already, it seems that we could consider the localization of a protein as its class, but the class can probably not be accurately predicted from the text of the paper (needs protein sequence data).
Cited By
Quotes
Abstract
- Traditionally, research in identifying structured entities in documents has proceeded independently of document categorization research. In this paper, we observe that these two tasks have much to gain from each other. Apart from direct references to entities in a database, such as names of person entities, documents often also contain words that are correlated with discriminative entity attributes, such age-group and income-level of persons. This happens naturally in many enterprise domains such as CRM, Banking, etc. Then, entity identification, which is typically vulnerable against noise and incompleteness in direct references to entities in documents, can benefit from document categorization with respect to such attributes. In return, entity identification enables documents to be categorized according to different label-sets arising from entity attributes without requiring any supervision. In this paper, we propose a probabilistic generative model for joint entity identification and document categorization. We show how the parameters of the model can be estimated using an EM algorithm in an unsupervised fashion. Using extensive experiments over real and semi-synthetic data, we demonstrate that the two tasks can benefit immensely from each other when performed jointly using the proposed model.
,