Collective Classification Task
A Collective Classification Task is a classification task in which the classification items are graph members.
- AKA: LBC, Link-based Object Classification Task.
- Context:
- It can range from being a Collective Graph Node Classification Task to being a Collective Graph Edge Classification Task.
- It can be solved by a Collective Classification System (that implements a collective classification algorithm).
- It can (typically) assume that class labels are correlated (correlated labels) and that exploiting these correlations can improve classification performance.
- Example(s):
- Person Classification based on social network data, e.g. labeling the people that are likely to partake in some activity (e.g., smoking, IV drug use), have some disease (e.g., tuberculosis, obesity), or exhibit some behavior (buying a product, spreading a rumor). (Bilgic & Getoor, 2009).
- Publication Classification based on a citation network, e.g. labeling publication topics, or labeling papers as seminal vs. non-seminal. (Bilgic & Getoor, 2009).
- Protein Classification based on a biological network, e.g. inferring protein function. (Bilgic & Getoor, 2009).
- See: Collective Inference Task, Collective Regression Algorithm, Link Mining Task, Link-based Edge Prediction Task.
References
2009
- (Bilgic & Getoor, 2009) ⇒ Mustafa Bilgic, and Lise Getoor. (2009). “Reflect and Correct: A misclassification prediction approach to active inference.” In: ACM Transactions on Knowledge Discovery from Data (TKDD), 3(4). doi:10.1145/1631162.1631168
- We begin by reviewing the collective classification problem and define the objective function for label acquisition for collective classification. In this problem, we assume that our data is represented as a graph with nodes and edges, [math]\displaystyle{ G = (V, E) }[/math]. Each node [math]\displaystyle{ V_i \in V }[/math] is described by an attribute vector [math]\displaystyle{ \bf{X}_i }[/math] and class label [math]\displaystyle{ Y_i }[/math] pair, [math]\displaystyle{ V_i = \langle \bf{X}_i, Y_i \rangle }[/math]. [math]\displaystyle{ \bf{X}_i }[/math] is a vector of individual attributes [math]\displaystyle{ \langle X_{i1}, X_{i2}, \ldots, X_{ip} \rangle }[/math]. The domain of [math]\displaystyle{ X_{ij} }[/math] can be either discrete or continuous, whereas the domain of the class label [math]\displaystyle{ Y_i }[/math] is discrete and denoted as [math]\displaystyle{ \{y_1, y_2, \ldots, y_m\} }[/math]. Each edge [math]\displaystyle{ E_{ij} \in E }[/math] describes some sort of relationship between its endpoints, [math]\displaystyle{ E_{ij} = \langle V_i, V_j \rangle }[/math]. Examples include:
- Social Networks. Here, the nodes are people, the attributes may include demographic information such as age and income and the edges are friendships. The labels indicate categories of people, for example we may be interested in labeling the people that are likely to partake in some activity (e.g., smoking, IV drug use), have some disease (e.g., tuberculosis, obesity), or exhibit some behavior (buying a product, spreading a rumor).
- Citation Networks. The nodes are publications, the attributes include content information and the edges represent citations. The labels may be the topics of the publications, or an indication of the reputation of the paper, for example whether the paper is seminal or not.
- Biological Networks. Where for example, the nodes represent proteins, attributes include annotations, and edges represent interactions. In this domain for example, we may be interested in inferring protein function.
- In graph data, the labels of neighboring nodes are often correlated (though not necessarily positively correlated). For example, friends tend to have similar smoking behaviors, papers are likely to have similar topics to the papers that they cite, and proteins are likely to have complementary functions. Exploiting these correlations can significantly improve classification performance over using only the attributes, [math]\displaystyle{ \bf{X}_i }[/math], for the nodes. However, when predicting the label of a node, the labels of the related instances are also unknown and need to be predicted. Collective classification is the term used for simultaneously predicting the labels [math]\displaystyle{ Y }[/math] of [math]\displaystyle{ V }[/math] in the graph [math]\displaystyle{ G }[/math], where [math]\displaystyle{ Y }[/math] denotes the set of labels of all of the nodes, [math]\displaystyle{ Y = \{Y_1, Y_2, \ldots, Y_n\} }[/math]. In general, the label [math]\displaystyle{ Y_i }[/math] of a node can be influenced by its own attributes [math]\displaystyle{ \bf{X}_i }[/math] as well as the labels [math]\displaystyle{ Y_j }[/math] and attributes [math]\displaystyle{ \bf{X}_j }[/math] of other nodes in the graph.
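- The node-and-edge data representation described in this excerpt might be encoded as follows. This is an illustrative sketch only: the `Node` class, its field names, and the example nodes are hypothetical, not from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    attrs: list            # attribute vector X_i = <X_i1, ..., X_ip>
    label: Optional[str]   # class label Y_i from {y_1, ..., y_m}; None if unknown

# G = (V, E): nodes keyed by id, edges as endpoint pairs <V_i, V_j>
V = {
    "p1": Node(attrs=[0.4, 1.0], label="seminal"),  # labeled node
    "p2": Node(attrs=[0.1, 0.0], label=None),       # label to be inferred collectively
}
E = [("p1", "p2")]  # e.g., publication p1 cites publication p2
```

In a citation-network instance like the second example above, `attrs` would hold content features of each publication and the unlabeled nodes are the ones collective classification must fill in.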
2005
- (Getoor & Diehl, 2005) ⇒ Lise Getoor, and Christopher P. Diehl. (2005). “Link Mining: A survey.” In: SIGKDD Explorations, 7(2). doi:10.1145/1117454.1117456
- Traditional machine learning has focused on the classification of data consisting of identically structured objects that are typically assumed to be IID. Many real-world datasets, however, lack this homogeneity of structure. In the link-based object classification (LBC) problem, a data graph G = (O, L) is composed of a set of objects O connected to each other via a set of links L. The task is to label the members of O from a finite set of categorical values. The discerning feature of LBC that makes it different from traditional classification is that in many cases, the labels of related objects tend to be correlated. The challenge is to design algorithms for collective classification that exploit such correlations and jointly infer the categorical values associated with the objects in the graph.
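The joint inference both excerpts describe can be sketched with a simple iterative classification loop, in the spirit of ICA-style collective classifiers. This is a minimal toy illustration, not Bilgic & Getoor's or Getoor & Diehl's algorithm: the graph, the attribute-only initial predictions, and the majority-vote relational classifier are all simplifying assumptions.

```python
from collections import Counter

def collective_classify(graph, attr_pred, labels, n_iters=10):
    """Toy iterative collective classification.

    graph:     dict node -> list of neighbor nodes
    attr_pred: dict node -> label predicted from attributes X_i alone
    labels:    dict node -> known label Y_i, or None if unknown
    """
    # Bootstrap: unlabeled nodes start from their attribute-only prediction.
    current = {v: labels[v] if labels[v] is not None else attr_pred[v]
               for v in graph}
    unknown = [v for v in graph if labels[v] is None]

    for _ in range(n_iters):
        updated = dict(current)
        for v in unknown:
            # Relational step: majority vote over current neighbor labels.
            votes = Counter(current[u] for u in graph[v])
            if votes:
                updated[v] = votes.most_common(1)[0][0]
        if updated == current:  # labels stabilized: stop early
            break
        current = updated
    return current

# A paper cited by two "ML" papers is pulled to "ML" despite its
# attribute-only guess of "DB" -- neighbor labels override attributes.
graph = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
labels = {"a": "ML", "b": "ML", "c": None}
attr_pred = {"a": "ML", "b": "ML", "c": "DB"}
print(collective_classify(graph, attr_pred, labels)["c"])  # -> ML
```

The re-prediction loop is what makes the classification "collective": each pass lets newly inferred labels propagate to neighbors, rather than labeling each node once in isolation.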