Record Canonicalization Algorithm
A Record Canonicalization Algorithm is an Algorithm that can solve a Record Canonicalization Task and be implemented into a Record Canonicalization System.
- AKA: Canonicalization Algorithm.
- Context:
- It can select the appropriate Record Attribute Values for each Record Attribute.
- Example(s):
- Return the most common string for each field value.
- (Culotta et al., 2007)
- See: Record Deduplication Task.
References
2009
- (Wick et al., 2009) ⇒ Michael Wick, Aron Culotta, Khashayar Rohanimanesh, and Andrew McCallum. (2009). “An Entity Based Model for Coreference Resolution.” In: Proceedings of the SIAM International Conference on Data Mining (SDM 2009).
2008
- (Wick et al., 2008) ⇒ Michael Wick, Khashayar Rohanimanesh, Karl Schultz, and Andrew McCallum. (2008). “A Unified Approach for Schema Matching, Coreference, and Canonicalization.” In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2008).
- Proposes an algorithm that performs Joint Inference over the Coreference Resolution Task, Entity Reference Resolution Task, and Schema Matching Task.
- Applies Conditional Random Fields.
- Describes Features that encode Clauses in First-Order Logic.
- Implements Efficient Inference by Metropolis-Hastings.
- Achieves good Experimental Results on multiple Data Sets.
2007
- (Culotta et al., 2007) ⇒ Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, and Andrew McCallum. (2007). “Canonicalization of Database Records using Adaptive Similarity Measures.” In: Proceedings of KDD-2007.
- Consider a research publication database such as Citeseer or Rexa that contains records gathered from a variety of sources using automated extraction techniques. Because the data comes from multiple sources, it is inevitable that an attribute such as a conference name will be referenced in multiple ways. Since the data is also the result of extraction, it may also contain errors. In the presence of this noise and variability, the system must generate a single, canonical record to display to the user.
- Record canonicalization is the problem of constructing one standard record representation from a set of duplicate records. In many databases, canonicalization is enforced with a set of rules that place limitations or guidelines for data entry. However, obeying these constraints is often tedious and error-prone. Additionally, such rules are not applicable when the database contains records extracted automatically from unstructured sources.
- Simple solutions to the canonicalization problem are often insufficient. For example, one can simply return the most common string for each field value. However, incomplete records are often more common than complete records. For instance, this approach may canonicalize a record as “J. Smith” when in fact the full name (John Smith) is much more desirable.
- In addition to being robust to noise, the system must also be able to adapt to user preferences. For example, some users may prefer abbreviated forms (e.g., KDD) instead of expanded forms (e.g., Conference on Knowledge Discovery and Data Mining). The system must be able to detect and react to such preferences.
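The "most common string for each field value" baseline mentioned in the notes above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of Culotta et al. (2007): the function name `canonicalize_by_majority`, the dict-based record layout, and the sample records are all assumptions made for the example.

```python
from collections import Counter


def canonicalize_by_majority(records):
    """Build one record by taking, for each field, the most common
    non-empty string among the duplicate records (hypothetical helper)."""
    # Collect every field name that appears in any duplicate record.
    fields = sorted({name for record in records for name in record})
    canonical = {}
    for field in fields:
        # Ignore records where the field is missing or empty.
        values = [record[field] for record in records if record.get(field)]
        if values:
            # most_common(1) returns [(value, count)] for the top value.
            canonical[field] = Counter(values).most_common(1)[0][0]
    return canonical


# Illustrative duplicates: the abbreviated forms outnumber the full forms.
duplicates = [
    {"author": "J. Smith", "venue": "KDD"},
    {"author": "J. Smith", "venue": "KDD"},
    {"author": "John Smith",
     "venue": "Conference on Knowledge Discovery and Data Mining"},
]

result = canonicalize_by_majority(duplicates)
print(result)
```

On these sample records the baseline selects "J. Smith" because it outnumbers "John Smith", which is the weakness the quoted passage describes: frequency alone cannot tell that the longer string is the more complete one, nor can it adapt to a user who prefers the expanded venue name.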