Distant-Supervision Learning Algorithm
A Distant-Supervision Learning Algorithm is a semi-supervised learning algorithm that uses a weakly labeled training set (produced by a heuristic labeling function, typically one that relies on a knowledge base).
- Context:
- Counter-Example(s):
- See: Supervised Learning Algorithm.
References
2013
- (Min et al., 2013) ⇒ Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. (2013). “Distant Supervision for Relation Extraction with An Incomplete Knowledge Base.” In: HLT-NAACL, pp. 777-782.
2010
- (Riedel et al., 2010) ⇒ Sebastian Riedel, Limin Yao, and Andrew McCallum. (2010). “Modeling Relations and their Mentions without Labeled Text.” In: Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases.
- http://dl.acm.org/citation.cfm?id=1889799
- QUOTE: Several recent works on relation extraction have been applying the distant supervision paradigm: instead of relying on annotated text to learn how to predict relations, they employ existing knowledge bases (KBs) as source of supervision. Crucially, these approaches are trained based on the assumption that each sentence which mentions the two related entities is an expression of the given relation. Here we argue that this leads to noisy patterns that hurt precision, in particular if the knowledge base is not directly related to the text we are working with. We present a novel approach to distant supervision that can alleviate this problem based on the following two ideas: First, we use a factor graph to explicitly model the decision whether two entities are related, and the decision whether this relation is mentioned in a given sentence; second, we apply constraint-driven semi-supervision to train this model without any knowledge about which sentences express the relations in our training KB. We apply our approach to extract relations from the New York Times corpus and use Freebase as knowledge base. When compared to a state-of-the-art approach for relation extraction under distant supervision, we achieve 31% error reduction.
2009
- (Mintz et al., 2009) ⇒ Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. (2009). “Distant Supervision for Relation Extraction Without Labeled Data.” In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL 2009).
- QUOTE: Modern models of relation extraction for tasks like ACE are based on supervised learning of relations from small hand-labeled corpora. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size. Our experiments use Freebase, a large semantic database of several thousand relations, to provide distant supervision. For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large unlabeled corpus and extract textual features to train a relation classifier. … We propose an alternative paradigm, distant supervision, that combines some of the advantages of each of these approaches. Distant supervision is an extension of the paradigm used by Snow et al. (2005) for exploiting WordNet to extract hypernym (is-a) relations between entities, and is similar to the use of weakly labeled data in bioinformatics (Craven and Kumlien, 1999; Morgan et al., 2004). Our algorithm uses Freebase (Bollacker et al., 2008), a large semantic database, to provide distant supervision for relation extraction. Freebase contains 116 million instances of 7,300 relations between 9 million entities. The intuition of distant supervision is that any sentence that contains a pair of entities that participate in a known Freebase relation is likely to express that relation in some way. Since there may be many sentences containing a given entity pair, we can extract very large numbers of (potentially noisy) features that are combined in a logistic regression classifier.
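The labeling heuristic described in the quote above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy knowledge base `KB`, the corpus `CORPUS`, and the helper names `distant_label`, `between_words`, and `aggregate_features` are all hypothetical, entity matching is plain substring search, and the "words between the entities" feature is a simplified stand-in for the richer textual features the paper extracts. The sketch assumes only that any sentence containing both entities of a known pair receives that pair's relation as a (potentially noisy) label, and that features are pooled per entity pair before a classifier is trained.

```python
from collections import defaultdict

# Hypothetical toy knowledge base: (entity1, entity2) -> relation name.
KB = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Steve Jobs", "Apple"): "founder_of",
}

# Hypothetical unlabeled corpus.
CORPUS = [
    "Barack Obama was born in Honolulu in 1961.",
    "Barack Obama visited Honolulu last week.",  # mentions the pair, but not the relation
    "Steve Jobs co-founded Apple in 1976.",
    "Honolulu is the capital of Hawaii.",
]


def distant_label(corpus, kb):
    """Heuristic labeling: a sentence that contains both entities of a
    KB pair is labeled with that pair's relation, whether or not the
    sentence actually expresses it."""
    labeled = []
    for sentence in corpus:
        for (e1, e2), relation in kb.items():
            if e1 in sentence and e2 in sentence:
                labeled.append((sentence, e1, e2, relation))
    return labeled


def between_words(sentence, e1, e2):
    """Simplified feature: the words between the two entity mentions."""
    i, j = sentence.find(e1), sentence.find(e2)
    if i < j:
        middle = sentence[i + len(e1):j]
    else:
        middle = sentence[j + len(e2):i]
    return tuple(middle.split())


def aggregate_features(labeled):
    """Pool the features of all sentences that mention the same entity pair."""
    features = defaultdict(list)
    for sentence, e1, e2, relation in labeled:
        features[(e1, e2, relation)].append(between_words(sentence, e1, e2))
    return dict(features)


if __name__ == "__main__":
    for key, feats in aggregate_features(distant_label(CORPUS, KB)).items():
        print(key, feats)
```

In this toy run the sentence "Barack Obama visited Honolulu last week." is labeled `born_in` even though it does not express that relation, which is the kind of label noise that Riedel et al. (2010) argue hurts precision. The pooled feature lists would then be passed to a classifier such as logistic regression; that step is omitted here.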
2007
- (Wu & Weld, 2007) ⇒ Fei Wu, and Daniel S. Weld. (2007). “Autonomously Semantifying Wikipedia.” In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM 2007). doi:10.1145/1321440.1321449
2004
- (Snow et al., 2004) ⇒ Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. (2004). “Learning Syntactic Patterns for Automatic Hypernym Discovery.” In: Advances in Neural Information Processing Systems 17. (NIPS 2004).
- (Morgan et al., 2004) ⇒ Alexander A. Morgan, Lynette Hirschman, Marc E. Colosimo, Alexander S. Yeh, and Jeff B. Colombe. (2004). “Gene Name Identification and Normalization Using a Model Organism Database.” In: Journal of Biomedical Informatics 37(6). doi:10.1016/j.jbi.2004.08.010
- QUOTE: … To provide training data, we were able to reverse engineer the gene lists from the associated articles and abstracts, to generate text labeled (imperfectly) with gene mentions.
1999
- (Craven & Kumlien, 1999) ⇒ Mark Craven, and Johan Kumlien. (1999). “Constructing Biological Knowledge-bases by Extracting Information from Text Sources.” In: Proceedings of the International Conference on Intelligent Systems for Molecular Biology.
- QUOTE: … we present an approach to learning information extractors that relies on existing databases to provide something akin to labeled training instances. Our approach is motivated by the observation that, for many IE tasks, there are existing information sources (knowledge bases, databases, or even simple lists or tables) that can be coupled with documents to provide what we term "weakly" labeled training examples. We call this form of training data weakly labeled because each instance consists not of a precisely marked document, but instead it consists of a fact to be extracted along with a document that may assert the fact. … In this section we evaluate the utility of learning from weakly labeled training instances. …