2005 EmailDataCleaning
- (Tang, Li et al., 2005) ⇒ Jie Tang, Hang Li, Yunbo Cao, Zhaohui Tang. (2005). “Email Data Cleaning.” In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD-2005). doi:10.1145/1081870.1081926
Subject Headings: Email Data, Data Cleaning Task.
Notes
Cited By
Quotes
Abstract
- Addressed in this paper is the issue of 'email data cleaning' for text mining. Many text mining applications need take emails as input. Email data is usually noisy and thus it is necessary to clean it before mining. Several products offer email cleaning features, however, the types of noises that can be eliminated are restricted. Despite the importance of the problem, email cleaning has received little attention in the research community. A thorough and systematic investigation on the issue is thus needed. In this paper, email cleaning is formalized as a problem of non-text filtering and text normalization. In this way, email cleaning becomes independent from any specific text mining processing. A cascaded approach is proposed, which cleans up an email in four passes including non-text filtering, paragraph normalization, sentence normalization, and word normalization. As far as we know, non-text filtering and paragraph normalization have not been investigated previously. Methods for performing the tasks on the basis of Support Vector Machines (SVM) have also been proposed in this paper. Features in the models have been defined. Experimental results indicate that the proposed SVM based methods can significantly outperform the baseline methods for email cleaning. The proposed method has been applied to term extraction, a typical text mining processing. Experimental results show that the accuracy of term extraction can be significantly improved by using the data cleaning method.
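The four-pass cascade named in the abstract can be illustrated with a minimal sketch. The paper's passes rely on SVM models with hand-defined features; the sketch below swaps in simple regular-expression heuristics only to show how the passes chain together. All function names, the header list, and the rules themselves are assumptions made for illustration and are not taken from the paper.

```python
import re

# Hypothetical heuristics standing in for the paper's SVM-based classifiers.

def filter_non_text(lines):
    """Pass 1 (non-text filtering): drop header-like lines and quoted
    replies, and stop at a conventional "--" signature delimiter."""
    kept = []
    for line in lines:
        if line.strip() == "--":
            break  # everything after the delimiter is treated as signature
        if re.match(r"^(From|To|Cc|Subject|Date):", line):
            continue
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return kept

def normalize_paragraphs(lines):
    """Pass 2 (paragraph normalization): join hard-wrapped lines;
    a blank line ends the current paragraph."""
    paragraphs, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

def normalize_sentences(paragraph):
    """Pass 3 (sentence normalization): split after ., ! or ? and
    capitalize the first character of each sentence."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p[0].upper() + p[1:] for p in parts if p]

def normalize_words(sentence):
    """Pass 4 (word normalization): collapse repeated ! or ? and
    runs of whitespace."""
    sentence = re.sub(r"([!?])\1+", r"\1", sentence)
    return re.sub(r"\s+", " ", sentence).strip()

def clean_email(raw):
    """Run the four passes in order; returns a list of paragraphs,
    each a list of cleaned sentences."""
    lines = filter_non_text(raw.splitlines())
    paragraphs = normalize_paragraphs(lines)
    return [[normalize_words(s) for s in normalize_sentences(p)]
            for p in paragraphs]
```

Each pass consumes the output of the previous one, so a pass can be replaced by a learned model without changing the others, which is the property a cascade is meant to provide.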