2012 AnOverviewoftheCPROD1ContestonC


Subject Headings: CPROD1 Contest; CPROD1 Dataset

Notes

Cited By

Quotes

Abstract

A significant proportion of web content and its usage is due to the discussion of and research into consumer products. Currently, however, no benchmark dataset exists to evaluate the performance of text mining systems that can accurately identify and disambiguate product entities within a large product catalog. This paper presents an overview of the CPROD1 text mining contest, which ran from July 2nd to September 24th, 2012 as part of the 21st International Data Mining Conference (ICDM-2012) and addressed this gap.

I. INTRODUCTION

A significant proportion of web usage relates to the discussion, research, and purchase of consumer products. Hundreds of thousands of blogs, forums, product review sites, and e-commerce merchants currently publish information on consumer products, and a growing number of consumers use the Web to locate information on products.

This paper presents an overview of the CPROD1 text mining contest that ran from July 2nd to September 24th, 2012 as part of the 21st International Data Mining Conference (ICDM-2012). The contest required contestants to develop a system that can automatically recognize mentions of consumer products in previously unseen user-generated web content and link each mention to the corresponding set of products in a large product catalog. The contest datasets include hundreds of thousands of text-items, a product catalog with over fifteen million products, and hundreds of manually annotated product mentions to support data-driven approaches.
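
To make the task concrete, the following is a minimal sketch of a naive baseline under assumed data structures (the field names "id", "text", and "name" are illustrative, not the contest's actual schema): it scans a text-item for exact, case-insensitive occurrences of catalog product names and links each hit to the matching product ids.

    # Naive baseline sketch: exact-match product mentions against a catalog.
    # All field names and record structures here are illustrative assumptions.
    def build_name_index(products):
        """Map a lower-cased product name to the ids of the products carrying it."""
        index = {}
        for product in products:  # e.g. {"id": "p1", "name": "Acme Phone X"}
            index.setdefault(product["name"].lower(), []).append(product["id"])
        return index

    def link_mentions(text_item, name_index):
        """Return (mention string, matching product ids) pairs found in one text-item."""
        text = text_item["text"].lower()
        return [(name, ids) for name, ids in name_index.items() if name in text]

    products = [{"id": "p1", "name": "Acme Phone X"}, {"id": "p2", "name": "Acme Phone X Case"}]
    item = {"id": "t1", "text": "Just bought the Acme Phone X and I love it."}
    print(link_mentions(item, build_name_index(products)))  # [('acme phone x', ['p1'])]

A real system would need far more scalable matching (for example, a trie or inverted index) and genuine disambiguation, given the fifteen-million-product catalog; the sketch only illustrates the recognize-then-link structure of the task.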

A high-level goal of the competition was to better understand which types of solutions can achieve winning performance on such a task. To incentivize participation, we offered a prize pool of $10,000 ($6,000 for first, $3,000 for second, and $1,000 for third).

The remainder of the paper is structured as follows: first, we present the rules and the data files that contestants were given. Second, we describe the annotation and data-separation process. Next, we describe the evaluation metric and how contestants performed during the two rounds of testing. Finally, we review related benchmark tasks and present preliminary observations about the top submitted solutions.

...

IV. DATA PREPARATION PROCESS

The annotation process for text-items involved two phases: 1) identifying the span of tokens within text-items that mention products, and 2) labeling product items with True/False labels for each annotated product mention. During the first phase, a set of text-items was randomly selected and reviewed by at least two different annotators; if they disagreed, a third annotator broke the tie. In the second phase, annotators classified which candidate products were legitimate references for each mention. This phase was more time-consuming, so only a small portion of product candidates was reviewed by multiple annotators. The annotated text-items were then randomly separated into a training set, a leaderboard set, and a model-evaluation set (50%, 25%, and 25%, respectively).
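
A minimal sketch of such a random 50%/25%/25% separation, assuming the annotated text-items are simply held in a Python list (the variable names are illustrative):

    import random

    def split_annotated_items(items, seed=0):
        """Randomly separate annotated text-items into 50% training, 25% leaderboard, 25% evaluation."""
        items = list(items)
        random.Random(seed).shuffle(items)
        n = len(items)
        return items[: n // 2], items[n // 2 : (3 * n) // 4], items[(3 * n) // 4 :]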

V. EVALUATION METRIC

Submissions of annotations were scored based on the average F1 score (between 0.0 and 1.0) for the union of predicted and true disambiguated product mentions. Tables 2 and 3 illustrate the performance calculation and the final leaderboard results, respectively.
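
As a concrete illustration of the scoring, the sketch below computes precision, recall, and F1 over sets of predicted and true disambiguated mentions; representing each mention as a (text-item id, mention, product id) triple is an assumption made for illustration, not the contest's exact submission schema.

    def f1_score(predicted, true):
        """F1 over two sets of disambiguated product mentions.

        Each element is assumed to be a (text_item_id, mention, product_id) triple.
        """
        if not predicted or not true:
            return 0.0
        hits = len(predicted & true)
        precision = hits / len(predicted)
        recall = hits / len(true)
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    true = {("t1", "acme phone x", "p1"), ("t2", "widget pro", "p9")}
    predicted = {("t1", "acme phone x", "p1"), ("t2", "widget pro", "p7")}
    print(f1_score(predicted, true))  # 1 hit over 2 predicted and 2 true -> P = R = F1 = 0.5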

VI. MODEL TRAINING PHASE

From July 2nd to September 15th, teams submitted predictions against the leaderboard-text.json file to evaluate performance and determine rankings. Table 3 shows the final scores, rankings, and the number of submissions made by each team.

VII. WINNER SELECTION

Eight teams advanced to the final phase of the competition, submitting their predictive systems by September 15th and their predictions against the evaluation-text.json file released on September 16th. Table 4 summarizes the F1 scores of all eight participants, with the three winning teams in bold.

II. CONTEST RULES

Beyond Kaggle’s standard terms and conditions, participants were requested to constrain their solutions in the following manner:

  • Participants were allowed to use additional data sources beyond the data provided by the contest, so long as the data was publicly available and was not manually transformed (for example, by creating additional annotated content). If the data was based on a large Web crawl, then we required that they include the corresponding crawler code and statistics of the resulting extract.
  • Participants were required to provide the following prior to the release of the winner-selection evaluation set: 1) a trained model, 2) any additional dataset(s) used, 3) the source code and documentation required to produce predictions using their model and additional dataset(s).

III. DATA FILES

The CPROD1 competition involved the release of six data files. Five of the files were provided immediately, while the model-evaluation text-items were released near the end of the contest to determine the contest winners. Files are provided in two formats: JSON and CSV. The six files are as follows:

  1. leaderboard-text.json: This JSON file contains the text-items that participants must disambiguate to determine their leaderboard score.
  2. products.json: This JSON file contains the product catalog that must be matched against.
  3. training-annotated-text.json: This JSON file contains the text-items that were manually reviewed for product mentions. This file, along with training-disambiguated-product-mentions.csv, could be used to train a supervised model.
  4. training-disambiguated-product-mentions.csv: This CSV file contains disambiguated product mentions. This file, along with the training-annotated-text.json file, could be used to train a supervised model. This file was in the same format as the required solutions.
  5. training-non-annotated-text.json: This JSON file contains supplementary text-item data drawn from the same domain as the other text-items in the contest. It is provided for participants who opt to produce semi-supervised models.
  6. evaluation-text.json: This JSON file was provided near the end of the contest. It contained the text-items to be annotated for the final submission.
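
As a convenience, the sketch below shows one plausible way to load the released files in Python; the internal structure of each file (field names, CSV columns) is not reproduced here, so the code only parses the containers and is an illustrative assumption rather than the official schema.

    import csv
    import json

    def load_json(path):
        """Load one of the contest's JSON files, assuming each is a single JSON document."""
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)

    def load_mentions_csv(path):
        """Load training-disambiguated-product-mentions.csv as a list of row dictionaries."""
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))

    products = load_json("products.json")
    training_text = load_json("training-annotated-text.json")
    training_mentions = load_mentions_csv("training-disambiguated-product-mentions.csv")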

References

Gabor Melli, and Christian Romming. (2012). "An Overview of the CPROD1 Contest on Consumer Product Recognition Within User Generated Postings and Normalization Against a Large Product Catalog."