2016 FeatureSelectionADataPerspectiv
- (Li et al., 2016) ⇒ Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, and Huan Liu. (2016). “Feature Selection: A Data Perspective.” In: CoRR, abs/1601.07996.
Subject Headings: Feature Selection System.
Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing high-dimensional data for data mining and machine learning problems. The objectives of feature selection include: building simpler and more comprehensible models, improving data mining performance, and preparing clean, understandable data. The recent proliferation of big data has presented some substantial challenges and opportunities for feature selection algorithms. In this survey, we provide a comprehensive and structured overview of recent advances in feature selection research. Motivated by current challenges and opportunities in the big data age, we revisit feature selection research from a data perspective, and review representative feature selection algorithms for generic data, structured data, heterogeneous data and streaming data. Methodologically, to emphasize the differences and similarities of most existing feature selection algorithms for generic data, we generally categorize them into four groups: similarity based, information theoretical based, sparse learning based and statistical based methods. Finally, to facilitate and promote the research in this community, we also present an open-source feature selection repository that consists of most of the popular feature selection algorithms (this http URL). At the end of this survey, we also discuss some open problems and challenges that need more attention in future research.
1. Introduction
We are now in the era of big data, where massive amounts of high-dimensional data have become ubiquitous in our daily life, in domains such as social media, e-commerce, health care, bioinformatics, transportation, and online education. Figure 1 shows an example by plotting the growth trend of the UCI machine learning repository (Bache and Lichman, 2013). The rapid growth of data presents challenges for effective and efficient data management. It is therefore desirable, and of great importance, to apply data mining and machine learning techniques to automatically discover knowledge from these data.
Figure 1: Growth in the number of samples and the number of features in the UCI machine learning repository over the past thirty years.
When applying data mining and machine learning algorithms to high-dimensional data, a critical issue is the curse of dimensionality (Hastie et al., 2005). It refers to the phenomenon that data become sparser in high-dimensional space, which adversely affects algorithms designed for low-dimensional space. In addition, with a large number of features, learning models tend to overfit, which may degrade performance on unseen data. Moreover, high-dimensional data significantly increase memory storage requirements and computational costs for data analytics.
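The sparsity effect described above can be illustrated numerically. The following is a minimal sketch, not taken from the survey: it assumes points drawn uniformly from the unit hypercube, and the function name `mean_nearest_neighbor_distance`, the sample size and the chosen dimensions are illustrative choices. With the number of samples held fixed, the average distance from a point to its nearest neighbor grows as the dimension increases.

```python
import numpy as np

def mean_nearest_neighbor_distance(n_samples, n_dims, seed=0):
    """Average distance from each point to its nearest neighbor,
    for points drawn uniformly from the unit hypercube [0, 1]^n_dims."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_samples, n_dims))
    # Squared pairwise distances: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)        # clip tiny negatives caused by round-off
    np.fill_diagonal(d2, np.inf)    # ignore each point's distance to itself
    return float(np.sqrt(d2.min(axis=1)).mean())

# Same number of samples, increasing dimensionality (illustrative values).
nn_by_dim = {d: mean_nearest_neighbor_distance(200, d) for d in (2, 10, 100, 1000)}
for d, dist in nn_by_dim.items():
    print(f"dims={d:5d}  mean nearest-neighbor distance={dist:.3f}")
```

Part of this growth simply reflects the longer diagonal of the hypercube in higher dimensions, so the sketch should be read as an illustration of the trend rather than a precise measurement of sparsity.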
Dimensionality reduction is one of the most powerful tools to address the previously described issues. It can be categorized mainly into two components: feature extraction and feature selection. Feature extraction projects the original high-dimensional feature space onto a new feature space of low dimensionality. The newly constructed features are usually linear or nonlinear combinations of the original features. Examples of feature extraction methods include Principal Component Analysis (PCA) (Jolliffe, 2002), Linear Discriminant Analysis (LDA) (Scholkopft and Mullert, 1999), Canonical Correlation Analysis (CCA) (Hardoon et al., 2004), Singular Value Decomposition (Golub and Van Loan, 2012), ISOMAP (Tenenbaum et al., 2000) and Locally Linear Embedding (LLE) (Roweis and Saul, 2000). Feature selection, on the other hand, directly selects a subset of relevant features for use in model construction. Lasso (Tibshirani, 1996), Information Gain (Cover and Thomas, 2012), Relief (Kira and Rendell, 1992a), MRMR (Peng et al., 2005), Fisher Score (Duda et al., 2012), Laplacian Score (He et al., 2005), and SPEC (Zhao and Liu, 2007) are some of the well-known feature selection techniques.
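The contrast between the two families can be sketched on toy data. The example below is an illustration under stated assumptions, not the survey's implementation or its repository's API: PCA is computed from an SVD of the centered data, the Fisher score uses one common textbook form (between-class scatter divided by within-class scatter, per feature), and the helper names `pca_extract`, `fisher_scores` and `select_top_k`, as well as the synthetic data, are hypothetical.

```python
import numpy as np

def pca_extract(X, k):
    """Feature extraction: project X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions; each mixes all original features.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T, components

def fisher_scores(X, y):
    """Per-feature Fisher score: between-class over within-class scatter."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        X_class = X[y == label]
        between += len(X_class) * (X_class.mean(axis=0) - overall_mean) ** 2
        within += len(X_class) * X_class.var(axis=0)
    return between / (within + 1e-12)   # small constant avoids division by zero

def select_top_k(X, y, k):
    """Feature selection: keep the k original columns with the highest score."""
    top = np.argsort(fisher_scores(X, y))[::-1][:k]
    idx = np.sort(top)
    return X[:, idx], idx

# Synthetic data (assumption): 5 features, only features 0 and 3 depend on the label.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 5))
X[:, 0] += 3.0 * y
X[:, 3] -= 3.0 * y

Z, components = pca_extract(X, k=2)          # new features, mixtures of all 5 columns
X_sel, selected = select_top_k(X, y, k=2)    # a subset of the original columns
print("selected original feature indices:", selected)
print("PCA loadings (rows = new features):")
print(np.round(components, 2))
```

In this sketch the selected output consists of unchanged columns of `X`, identified by their original indices, whereas each row of `components` assigns a weight to every original feature.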
Both feature extraction and feature selection have the advantages of improving learning performance, increasing computational efficiency, decreasing memory storage requirements, and building models that generalize better. However, since feature extraction builds a set of new features, further analysis is problematic because the features in the transformed space have no clear physical meaning. In contrast, by keeping a subset of the original features, feature selection preserves their physical meaning and gives models better readability and interpretability. Therefore, feature selection is often preferred over feature extraction in many real-world applications, such as text mining and genetic analysis.