Document Representation
A Document Representation is a knowledge representation of a document.
- …
- Example(s):
- See: Document Data Record, Document Classification System, Natural Language Processing.
References
2017a
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Knowledge_representation_and_reasoning Retrieved: 2017-07-01.
- Knowledge representation and reasoning (KR) is the field of artificial intelligence (AI) dedicated to representing information about the world in a form that a computer system can utilize to solve complex tasks such as diagnosing a medical condition or having a dialog in a natural language. Knowledge representation incorporates findings from psychology about how humans solve problems and represent knowledge in order to design formalisms that will make complex systems easier to design and build. Knowledge representation and reasoning also incorporates findings from logic to automate various kinds of reasoning, such as the application of rules or the relations of sets and subsets.
Examples of knowledge representation formalisms include semantic nets, systems architecture, frames, rules, and ontologies. Examples of automated reasoning engines include inference engines, theorem provers, and classifiers.
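To make the formalism examples above concrete, the following is a minimal frame-style sketch in Python; the Frame class, its slot names, and the Document/Report example are assumptions invented for illustration, not part of any system the quote names.

```python
# A minimal, illustrative frame-style knowledge representation.
# The Frame class and the example slots are assumptions of this
# sketch, not taken from any specific KR system cited above.

class Frame:
    """A frame: a named concept with slots and an optional parent frame."""
    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = slots

    def get(self, slot):
        """Look up a slot, inheriting from the parent frame if absent."""
        if slot in self.slots:
            return self.slots[slot]
        if self.parent is not None:
            return self.parent.get(slot)
        raise KeyError(slot)

document = Frame("Document", medium="text")
report = Frame("Report", parent=document, sections=["intro", "body"])

print(report.get("medium"))    # inherited from Document -> "text"
print(report.get("sections"))  # own slot -> ["intro", "body"]
```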
2017b
- (W3C, 2017) ⇒ "5 HTML Document Representation" https://www.w3.org/TR/html4/charset.html Retrieved: 2017-07-01
- QUOTE: In this chapter, we discuss how HTML documents are represented on a computer and over the Internet.
The section on the document character set addresses the issue of what abstract characters may be part of an HTML document. Characters include the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
The section on character encodings addresses the issue of how those characters may be represented in a file or when transferred over the Internet. As some character encodings cannot directly represent all characters an author may want to include in a document, HTML offers other mechanisms, called character references, for referring to any character.
Since there are a great number of characters throughout human languages, and a great variety of ways to represent those characters, proper care must be taken so that documents may be understood by user agents around the world.
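As an illustration of the distinction the quote draws between abstract characters, character encodings, and character references, here is a short Python sketch; the specific character and the use of Python's standard html module are assumptions of this sketch, not anything the W3C text prescribes.

```python
# Illustrates the distinction drawn above between abstract characters,
# their byte-level encodings, and HTML character references.
# Uses only the Python standard library; the specific examples are
# assumptions of this sketch, not taken from the W3C spec.
import html

water = "\u6C34"  # the abstract character U+6C34, the Chinese character "water"

# Character encodings: the same abstract character as bytes in two encodings.
print(water.encode("utf-8"))      # b'\xe6\xb0\xb4'
print(water.encode("utf-16-le"))  # b'4l' (two bytes, little-endian)

# Character references: a way to write the character in ASCII-only HTML.
numeric_ref = "&#x6C34;"
print(html.unescape(numeric_ref) == water)  # True

# Some encodings cannot represent every character an author may want;
# character references sidestep this. ASCII, for example, cannot
# encode U+6C34 directly:
try:
    water.encode("ascii")
except UnicodeEncodeError as e:
    print("ascii cannot encode it:", e.reason)
```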
2009
- (Cambridge University Press, 2009) ⇒ "Document representations and measures of relatedness in vector spaces" https://nlp.stanford.edu/IR-book/html/htmledition/document-representations-and-measures-of-relatedness-in-vector-spaces-1.html Updated: 2009-04-07
- QUOTE: As in Chapter 6, we represent documents as vectors in [math]\displaystyle{ \mathbb{R}^{\vert V\vert} }[/math] in this chapter. To illustrate properties of document vectors in vector classification, we will render these vectors as points in a plane as in the example in Figure 14.1. In reality, document vectors are length-normalized unit vectors that point to the surface of a hypersphere. We can view the 2D planes in our figures as projections onto a plane of the surface of a (hyper-)sphere as shown in Figure 14.2. Distances on the surface of the sphere and on the projection plane are approximately the same as long as we restrict ourselves to small areas of the surface and choose an appropriate projection (Exercise 14.1).
Decisions of many vector space classifiers are based on a notion of distance, e.g., when computing the nearest neighbors in kNN classification. We will use Euclidean distance in this chapter as the underlying distance measure. We observed earlier (Exercise 6.4.4) that there is a direct correspondence between cosine similarity and Euclidean distance for length-normalized vectors. In vector space classification, it rarely matters whether the relatedness of two documents is expressed in terms of similarity or distance.
However, in addition to documents, centroids or averages of vectors also play an important role in vector space classification. Centroids are not length-normalized. For unnormalized vectors, dot product, cosine similarity and Euclidean distance all have different behavior in general (Exercise 14.8). We will be mostly concerned with small local regions when computing the similarity between a document and a centroid, and the smaller the region the more similar the behavior of the three measures is.
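The correspondence the quote notes between cosine similarity and Euclidean distance for length-normalized vectors can be checked directly: for unit vectors [math]\displaystyle{ \vec{x}, \vec{y} }[/math], [math]\displaystyle{ \Vert \vec{x}-\vec{y}\Vert^2 = 2(1-\cos(\vec{x},\vec{y})) }[/math]. A minimal numpy sketch follows; the toy term-count vectors are invented for illustration.

```python
# Verifies the correspondence noted above: for length-normalized
# vectors, squared Euclidean distance is a monotone function of
# cosine similarity, ||x - y||^2 = 2 * (1 - cos(x, y)).
# The toy term-count vectors are invented for this sketch.
import numpy as np

def normalize(v):
    """Length-normalize a vector onto the unit hypersphere."""
    return v / np.linalg.norm(v)

d1 = normalize(np.array([3.0, 0.0, 1.0]))  # toy document vectors
d2 = normalize(np.array([1.0, 2.0, 0.0]))  # in R^|V| with |V| = 3

cos = float(d1 @ d2)                        # cosine similarity of unit vectors
euclid_sq = float(np.sum((d1 - d2) ** 2))   # squared Euclidean distance

print(cos, euclid_sq, 2 * (1 - cos))        # the last two values agree
assert np.isclose(euclid_sq, 2 * (1 - cos))
```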
2008a
- (Arguello et al., 2008) ⇒ Arguello, J., Elsas, J. L., Callan, J., & Carbonell, J. G. (2008). Document Representation and Query Expansion Models for Blog Recommendation. In: Proceedings of ICWSM 2008.
- ABSTRACT: We explore several different document representation models and two query expansion models for the task of recommending blogs to a user in response to a query. Blog relevance ranking differs from traditional document ranking in ad-hoc information retrieval in several ways: (1) the unit of output (the blog) is composed of a collection of documents (the blog posts) rather than a single document, (2) the query represents an ongoing – and typically multifaceted – interest in the topic rather than a passing ad-hoc information need and (3) due to the propensity of spam, splogs, and tangential comments, the blogosphere is particularly challenging to use as a source for high-quality query expansion terms. We address these differences at the document representation level, by comparing retrieval models that view either the blog or its constituent posts as the atomic units of retrieval, and at the query expansion level, by making novel use of the links and anchor text in Wikipedia to expand a user's initial query. We develop two complementary models of blog retrieval that perform at comparable levels of precision and recall. We also show consistent and significant improvement across all models using our Wikipedia expansion strategy.
2008b
- (Ranzato & Szummer, 2008) ⇒ Ranzato, M. A., & Szummer, M. (2008, July). Semi-supervised learning of compact document representations with deep networks. In: Proceedings of the 25th International Conference on Machine Learning (pp. 792-799). ACM.
- ABSTRACT: Finding good representations of text documents is crucial in information retrieval and classification systems. Today the most popular document representation is based on a vector of word counts in the document. This representation neither captures dependencies between related words, nor handles synonyms or polysemous words. In this paper, we propose an algorithm to learn text document representations based on semi-supervised autoencoders that are stacked to form a deep network. The model can be trained efficiently on partially labeled corpora, producing very compact representations of documents, while retaining as much class information and joint word statistics as possible. We show that it is advantageous to exploit even a few labeled samples during training.
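The following is a structural sketch of the semi-supervised objective described in the abstract: a compact code learned by reconstruction, with a classification loss added for the few labeled documents. It is a single-layer illustrative version in Python/numpy, not the authors' actual stacked deep network; all dimensions and data are invented.

```python
# A structural sketch of a semi-supervised autoencoder objective:
# compress word-count vectors, reconstruct them, and add a
# classification loss on the (few) labeled documents. This is an
# illustrative single-layer version, not the authors' exact deep model.
import numpy as np

rng = np.random.default_rng(0)
V, H, C = 1000, 32, 4                  # vocab size, code size, num classes

W_enc = rng.normal(0, 0.01, (H, V))    # encoder weights
W_dec = rng.normal(0, 0.01, (V, H))    # decoder weights
W_cls = rng.normal(0, 0.01, (C, H))    # classifier on the compact code

def encode(x):
    """Map a word-count vector to its compact document representation."""
    return np.tanh(W_enc @ x)

def loss(x, y=None):
    """Reconstruction loss, plus cross-entropy when a label is given."""
    h = encode(x)
    recon = np.sum((W_dec @ h - x) ** 2)
    if y is None:
        return recon                   # unlabeled document
    logits = W_cls @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return recon - np.log(p[y])        # labeled document

x_unlabeled = rng.poisson(0.05, V).astype(float)  # toy word counts
x_labeled = rng.poisson(0.05, V).astype(float)
print(loss(x_unlabeled), loss(x_labeled, y=2))
```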
2003
- (Blostein et al., 2003) ⇒ Blostein, D., Zanibbi, R., Nagy, G., & Harrap, R. (2003). Document Representations.
- ABSTRACT: Many document representations are in use. Each representation explicitly encodes different aspects of a document. External document representations, using standard file formats (such as JPEG, postscript, HTML, LaTeX), are used to communicate document-data between programs. Internal document representations are used within document analysis or document production software, to store intermediate results in the transformation from the input to output document representation. These document representations are central to defining and solving document analysis problems. Issues that can be investigated include defining equivalence of documents and distance between documents, mathematically characterizing the mapping between document representations, characterizing the external information needed to carry out these mappings, and characterizing the differences between the forward and inverse mappings that occur during document analysis and document production. From our ongoing investigation of these issues, we present a summary of internal document representations used in the table-recognition literature, and case studies of external document representations in the domains of circuit diagrams and text documents.
1994
- (Strzalkowski, 1994) ⇒ Strzalkowski, T. (1994, March). Document representation in natural language text retrieval. In: Proceedings of the workshop on Human Language Technology (pp. 364-369). Association for Computational Linguistics.
- ABSTRACT: In information retrieval, the content of a document may be represented as a collection of terms: words, stems, phrases, or other units derived or inferred from the text of the document. These terms are usually weighted to indicate their importance within the document, which can then be viewed as a vector in an N-dimensional space. In this paper we demonstrate that a proper term weighting is at least as important as their selection, and that different types of terms (e.g., words, phrases, names), and terms derived by different means (e.g., statistical, linguistic) must be treated differently for a maximum benefit in retrieval. We report some observations made during and after the second Text REtrieval Conference (TREC-2).
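Strzalkowski's point that term weighting matters at least as much as term selection is commonly illustrated with tf-idf weighting; the following minimal Python sketch (with an invented three-document corpus, and tf-idf standing in for the paper's more elaborate weighting schemes) shows how the same selected terms receive very different weights:

```python
# A minimal tf-idf weighting sketch illustrating the point above that
# how terms are weighted matters, not just which terms are selected.
# The three-document "corpus" is invented for illustration.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

N = len(docs)
df = Counter(term for d in docs for term in set(d))  # document frequency

def tfidf(doc):
    """Weight each term by tf * log(N / df): terms frequent in the
    document but rare in the collection get the highest weights."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

weights = tfidf(docs[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}: {w:.3f}")   # "cat"/"mat" outweigh the common "the"
```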