1998 CombiningLabAndUnlabDataCotraining

From GM-RKB
(Redirected from Blum & Mitchell, 1998)
Jump to navigation Jump to search

Subject Headings: Semi-Supervised Learning Algorithm, Co-Training Algorithm.

Notes

Cited By

  • ~1,900++ …

2000

Quotes

Abstract

We consider the problem of using a large unlabeled sample to boost performance of a learning algorithm when only a small set of labeled examples is available. In particular, we consider a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views. For example, the description of a web page can be partitioned into the words occurring on that page, and the words occurring in hyperlinks that point to that page. We assume that either view of the example would be sufficient for learning if we had enough labeled data, but our goal is to use both views together to allow inexpensive unlabeled data to augment, a much smaller set of labeled examples. Specifically, the presence of two distinct views of each example suggests strategies in which two learning algorithms are trained separately on each view, and then each algorithm’s predictions on new unlabeled examples are used to enlarge the training set of the other. Our goal in this paper is to provide a PAC-style analysis for this setting, and, more broadly, a PAC-style framework for the general problem of learning from both labeled and unlabeled data. We also provide empirical results on real web-page data indicating that this use of unlabeled examples can lead to significant improvement of hypotheses in practice.

CONCLUSIONS AND OPEN QUESTIONS

We have described a model in which unlabeled data can be used to augment labeled data, based on having two views (x1, x2) of an example that are redundant but not completely correlated. Our theoretical model is clearly an over-simplification of real-world target functions and distributions. In particular, even for the optimal pair of functions fl, f2 e Cl x C2 we would expect to occasionally see inconsistent examples (i.e., examples (X1,x2) such that f1(x1) != fz(~z)). Nonetheless, it provides a way of looking at the notion of the “friendliness” of a distribution (in terms of the components and minimum cuts) and at how unlabeled examples can potentially be used to prune away “incompatible” target concepts to reduce the number of labeled examples needed to learn. It is an open question to what extent the consistency constraints in the model and the mutual independence assumption of Section 5 can be relaxed and still allow provable results on the utility of co-training from unlabeled data. The preliminary experimental results presented suggest that this method of using unlabeled data has a potential for significant benefits in practice, though further studies are clearly needed.

We conjecture that there are many practical learning problems that fit or approximately fit the co-training model. For example, consider the problem of learning to classify segments of television broadcasts [7, 14]. We might be interested, say, in learning to identify televised segments containing the US President. Here Xl could be the set of possible video images, X2 the set of possible audio signals, and X their cross product. Given a small sample of labeled segments, we might learn a weakly predictive recognizer hl that spots full-frontal images of the president's face, and a recognizer h1 that spots his voice when no background noise is present. We could then use co-training applied to the large volume of unlabeled television broadcasts, to improve the accuracy of both classifiers. Similar problems exist in many perception learning tasks involving multiple sensors. For example, consider a mobile robot that must learn to recognize open doorways based on a collection of vision (Xl). sonar (X2), and laser range (X3) sensors. The important structure in the above problems is that each instance x can be partitioned into subcomponents xi, where the [math]\displaystyle{ x_i }[/math] are not perfectly correlated, where each [math]\displaystyle{ x_i }[/math] can in principle be used on its own to make the classification, and where a large volume of unlabeled instances can easily be collected.

References

  • 1 M. Craven, Dayne Freitag, Andrew McCallum, Tom M. Mitchell, K. Nigam, artd C.Y. Quek. Learning to extract symbolic knowledge from the world wide web. Technical report, Carnegie Mellon University, January (1997).
  • 2 Scott E. Decatur, PAC Learning with Constant-Partition Classification Noise and Applications to Decision Tree Induction, Proceedings of the Fourteenth International Conference on Machine Learning, p.83-91, July 08-12, 1997
  • 3 Arthur P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1 38, 1!)77.
  • 4 Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
  • 5 Zoubin Ghahramani and Michael I. Jordan. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems (NIPS 6). Morgan Kauffman, (1994).
  • 6 Sally A. Goldman, Michael J. Kearns, On the complexity of teaching, Journal of Computer and System Sciences, v.50 n.1, p.20-31, Feb. 1995 doi:10.1006/jcss.1995.1003
  • 7 Alexander G. Hauptmann, Michael J. Witbrock, Informedia: news-on-demand multimedia information acquisition and retrieval, Intelligent multimedia information retrieval, MIT Press, Cambridge, MA, 1997
  • 8 Jeffrey Jackson, Andrew Tomkins, A computational model of teaching, Proceedings of the fifth annual workshop on Computational learning theory, p.319-326, July 27-29, 1992, Pittsburgh, Pennsylvania, United States doi:10.1145/130385.130421
  • 9 David R. Karger, Random sampling in cut, flow, and network design problems, Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, p.648-657, May 23-25, 1994, Montreal, Quebec, Canada doi:10.1145/195058.195422
  • 10 D. R. Karger. Randonl sampling in cut, flow, and network design problems. Journal version draft, (1997).
  • 11 Michael Kearns, Efficient noise-tolerant learning from statistical queries, Proceedings of the twenty-fifth annual ACM symposium on Theory of computing, p.392-401, May 16-18, 1993, San Diego, California, United States doi:10.1145/167088.167200
  • 12 D. D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81-93, (1994).
  • 13 Joel Ratsaby, Santosh S. Venkatesh, Learning from a mixture of labeled and unlabeled examples with parametric side information, Proceedings of the eighth annual conference on Computational learning theory, p.412-417, July 05-08, 1995, Santa Cruz, California, United States doi:10.1145/225298.225348
  • 14 M.J. Witbrock and A.G. Hauptmann. Improving acoustic models by watching television. Technical Report CMU-CS-98-110, Carnegie Mellon University, March 19 (1998).
  • 15 David Yarowsky, Unsupervised word sense disambiguation rivaling supervised methods, Proceedings of the 33rd annual meeting on Association for Computational Linguistics, p.189-196, June 26-30, 1995, Cambridge, Massachusetts doi:10.3115/981658.981684,


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
1998 CombiningLabAndUnlabDataCotrainingAvrim Blum
Tom M. Mitchell
Combining Labeled and Unlabeled Data with Co-trainingProceedings of COLT 1998 Conferencehttp://www.cs.cmu.edu/afs/cs/Web/People/avrim/Papers/cotrain.ps.gz10.1145/279943.2799621998