- (Vepstas & Goertzel, 2014) ⇒ Linas Vepstas, and Ben Goertzel. (2014). "Learning Language from a Large (Unannotated) Corpus". eprint arXiv:1401.3372
Subject Headings: RelEx System; OpenCog System; Link Grammar; Natural Language Comprehension System; Natural Language Generation System.
A novel approach to the fully automated, unsupervised extraction of dependency grammars and associated syntax-to-semantic-relationship mappings from large text corpora is described. The suggested approach builds on the authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well as on a number of prior papers and approaches from the statistical language learning literature. If successful, this approach would enable the mining of all the information needed to power a natural language comprehension and generation system, directly from a large, unannotated corpus.
1 Introduction
2 Algorithmic Overview
3 Assumed Linguistic Infrastructure
4 Linguistic Content To Be Learned
4.1 Deep Comprehension
5 A Methodology for Unsupervised Language Learning from a Large Corpus
5.1 A High Level Perspective on Language Learning
5.2 Learning Syntax
5.2.1 Loose language
5.2.2 Elaboration Of the Syntactic Learning Loop
5.3 Learning Semantics
Syntactic relationships provide only the shallowest interpretation of language; semantics comes next. One may view semantic relationships (including semantic relationships close to the syntax level, which we may call "syntactico-semantic" relationships) as ensuing from syntactic relationships, via a similar but separate learning process to the one proposed above. Just as our approach to syntax learning is heavily influenced by our work with Link Grammar, our approach to semantics is heavily influenced by our work on the RelEx system [RVG05, LGE10, GPPG06, LGK+12], which maps the output of the Link Grammar parser into a more abstract semantic form. Prototype systems [GPA+10, LGK+12] have also been written that map the output of RelEx into an even more abstract semantic form, consistent with the semantics of the Probabilistic Logic Networks [GIGH08] formalism as implemented in the OpenCog [HG08] framework. These systems are largely based on hand-coded rules, and thus are not in the spirit of language learning pursued in this proposal. However, they display the same structure that we assume here; the difference is that here we specify a mechanism for learning the linguistic content that fills in the structure via unsupervised corpus learning, obviating the need for hand-coding.
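The layered structure described above (syntactic links in, more abstract relations out) can be illustrated with a toy sketch. This is not the RelEx implementation or its API: the `Link` type, the `RULES` table and the `to_relations` function are hypothetical names introduced only for illustration, and the example parse is written by hand rather than produced by the Link Grammar parser. It assumes the base Link Grammar link types `S` (subject), `O` (object) and `D` (determiner), and emits relations in the `_subj` / `_obj` style. The rule table stands in for the hand-coded content that the paper proposes to learn from a corpus instead.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Link(NamedTuple):
    """A single syntactic link between two words (hypothetical type)."""
    label: str   # base link type, e.g. "S", "O", "D"
    left: str    # word on the left end of the link
    right: str   # word on the right end of the link


Relation = Tuple[str, str, str]  # (relation name, head word, dependent word)

# Hand-written toy rules: each maps one link type to one semantic relation.
# In the approach summarized here, content of this kind would be learned
# from a corpus rather than coded by hand.
RULES: Dict[str, Callable[[Link], Relation]] = {
    # Subject link: noun on the left, verb on the right.
    "S": lambda link: ("_subj", link.right, link.left),
    # Object link: verb on the left, noun on the right.
    "O": lambda link: ("_obj", link.left, link.right),
}


def to_relations(links: List[Link]) -> List[Relation]:
    """Apply the rule table to each link; links with no rule are skipped."""
    relations = []
    for link in links:
        rule = RULES.get(link.label)
        if rule is not None:
            relations.append(rule(link))
    return relations


# Hand-written links for the sentence "John threw the ball".
parse = [
    Link("S", "John", "threw"),
    Link("D", "the", "ball"),
    Link("O", "threw", "ball"),
]

for relation in to_relations(parse):
    print(relation)
```

The determiner link has no rule in this table, so it contributes nothing to the output; only the subject and object links yield relations.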
5.3.1 Elaboration Of the Semantic Learning Loop
6 The Importance of Incremental Learning
7 Conclusion
Appendix A: Meaning-Text Theory
Appendix B: Mutual Information
