Data Curation Task

A Data Curation Task is a curation task that enhances the data quality and the data coverage of a curated database.

AKA: Curation, Digital Curation.
Context:
- Task Input:
  - A Database or dataset.
  - Additional Information such as new Data Records.
- Task Output: A Curated Database or dataset.
- Task Requirement(s): a Data Curator.
- It can range from being a Manual Data Curation Task to being an Automatic Data Curation Task.
- It can be supported by a Data Curation System.
- It can include a Data Cleaning Task.
- It can include the review of an Annotated Artifact.
- It can be a Labor Intensive Task.
- It can support a Data Stewardship Task.
- …
Example(s):
- a Manual Information Extraction Task (which can be supported by an Information Retrieval System and an Information Extraction System),
- a Database Population Task,
- a Deep Learning Data Curation Task (Thirumuruganathan et al., 2018),
- a PDB Data Curation Task (Wang & Sunderraman, 2006),
- a LinkedCT Data Curation Task (Hassanzadeh & Miller, 2015),
- a ChemEx Curation Task (Tharatipyakul et al., 2012),
- …
Counter-Example(s):
See: Data Warehousing Task, Annotation Task, Data Provenance Task, Extract-Transform-Load (ETL) Task, Entity Mention, Text Processing Task.

Reference(s)

2021a

(Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Data_curation Retrieved:2021-8-1.
- Data curation is the organization and integration of data collected from various sources. It involves annotation, publication and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation. Data curation includes "all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data".^[1] In science, data curation may indicate the process of extraction of important information from scientific texts, such as research articles by experts, to be converted into an electronic format, such as an entry of a biological database.^[2] In the modern era of big data, the curation of data has become more prominent, particularly for software processing high volume and complex data systems.^[3] The term is also used in historical occasions and the humanities,^[4] where increasing cultural and scholarly data from digital humanities projects requires the expertise and analytical practices of data curation.^[5] In broad terms, curation means a range of activities and processes done to create, manage, maintain, and validate a component.^[6] Specifically, data curation is the attempt to determine what information is worth saving and for how long (Borgman, C. (2015). Big Data, Little Data, No Data: Scholarship in the Networked World. Cambridge, Massachusetts: MIT Press. p. 13. ISBN 978-0-262-02856-1).

[1] Renée J. Miller, “Big Data Curation” in 20th International Conference on Management of Data (COMAD) 2014, Hyderabad, India, December 17–19, 2014.
[2] BioCreative Glossary. Retrieved on 3 October 2016.
[3] Furht, Borko; Armando Escalante (2011). Handbook of Data Intensive Computing. Springer Science & Business Media. p. 32. ISBN 9781461414155. Retrieved 2 October 2016.
[4] Sabharwal, Arjun (2015). Digital Curation in the Digital Humanities: Preserving and Promoting Archival and Special Collections. Chandos Publishing. p. 60. ISBN 9780081001783. Retrieved 2 October 2016.
[5] "An Introduction to Humanities Data Curation" by Julia Flanders and Trevor Muñoz, http://guide.dhcuration.org/intro/. No longer available; see archive.org.
[6] Pilin Glossary. No longer available; see archive.org.

2021b

(BioCreAtIvE Glossary, 2021) ⇒ http://biocreative.sourceforge.net/biocreative_glossary.html Retrieved:2021-8-1.
- QUOTE: Curation (Biology): curation of biological databases in this context means basically the manual extraction of biological information from the literature by a domain expert. The aim is to transform information contained in free text (scientific literature) to information stored in form of a structured database record (biological databases).

2018

(Thirumuruganathan et al., 2018) ⇒ Saravanan Thirumuruganathan, Nan Tang, and Mourad Ouzzani (2018). "Data Curation with Deep Learning (Vision): Towards Self Driving Data Curation". In: arXiv:1803.01384 (http://arxiv.org/abs/1803.01384).
- QUOTE: Data Curation – the process of discovering, integrating and cleaning data for a specific analytics task, as shown in Figure 1 – is critical for any organization to extract real business value from their data; feeding flawed, redundant or incomplete data as input will produce nonsense output or “garbage” (a.k.a. garbage in, garbage out)(...)

**Figure 1:** A Data Curation Pipeline.

Data Curation Meets Deep Learning. This article investigates intriguing research opportunities for answering the following questions:

- What does it take to significantly advance a challenging area such as DC?
- How can we leverage techniques from DL for DC?
- Given the many DL research efforts, how can we identify the most promising leads that are most relevant to DC?
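
The three stages named in the quote above (discovering, integrating and cleaning data) can be illustrated with a minimal sketch. This is not the deep-learning approach the paper proposes; it is a hand-written, rule-based toy. The `Record` type, the field names and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical record type; the fields are illustrative and not from the paper.
@dataclass(frozen=True)
class Record:
    name: str
    city: str

def discover(sources, required_fields):
    """Keep only non-empty sources whose rows all carry the fields the task needs."""
    return [
        rows for rows in sources
        if rows and all(required_fields.issubset(row) for row in rows)
    ]

def integrate(sources):
    """Merge the rows of every selected source into one list of Records."""
    return [Record(row["name"], row["city"]) for rows in sources for row in rows]

def clean(records):
    """Normalize whitespace and case, drop incomplete rows, remove duplicates."""
    seen, result = set(), []
    for r in records:
        fixed = Record(r.name.strip().title(), r.city.strip().title())
        if not fixed.name or not fixed.city:
            continue  # incomplete record
        if fixed not in seen:
            seen.add(fixed)
            result.append(fixed)
    return result

def curate(sources, required_fields=frozenset({"name", "city"})):
    """Run the discover -> integrate -> clean stages in order."""
    return clean(integrate(discover(sources, required_fields)))

example_sources = [
    [{"name": " ada lovelace ", "city": "london"},
     {"name": "Ada Lovelace", "city": "London"}],      # duplicates after cleaning
    [{"name": "alan turing"}],                          # lacks "city": not discovered
    [{"name": "", "city": "Paris"},                     # incomplete: dropped
     {"name": "grace hopper", "city": "new york"}],
]
curated = curate(example_sources)
```

Each stage is a separate function so that any one of them could be swapped for a learned component without touching the others.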

2016

(Arocena et al., 2016) ⇒ Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renee J. Miller, Paolo Papotti, and Donatello Santoro (2016). "Benchmarking Data Curation Systems". In: Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.
- QUOTE: Data curation includes the many tasks needed to ensure data maintains its value over time. Given the maturity of many data curation tasks, including data transformation and data cleaning, it is surprising that rigorous empirical evaluations of research ideas are so scarce. In this work, we argue that thorough evaluation of data curation systems imposes several major obstacles that need to be overcome (...)

2015

(Hassanzadeh & Miller, 2015) ⇒ Oktie Hassanzadeh, and Renee J. Miller (2015). "Automatic Curation of Clinical Trials Data in LinkedCT". In: Proceedings of the 14th International Semantic Web Conference (ISWC 2015) Part II. DOI:10.1007/978-3-319-25010-6_16
- QUOTE: In this section, we describe the end-to-end curation process we have designed to construct an up-to-date high-quality Linked Data source out of the XML data published by ClinicalTrials.gov. Figure 1 shows the overall architecture of the system.

**Figure 1:** LinkedCT Platform Architecture.
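
As a rough illustration of turning XML records into Linked Data, the sketch below maps one XML document to subject-predicate-object triples. It does not reproduce the LinkedCT pipeline: the element names (`trial`, `id`, `title`, `condition`), the `example.org` URIs and the function name are hypothetical, and the real ClinicalTrials.gov schema is not assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical URI prefixes; LinkedCT's actual namespaces are not assumed here.
RESOURCE = "http://example.org/resource/trial/"
VOCAB = "http://example.org/vocab/"

def trial_to_triples(xml_text):
    """Convert one flat XML trial record into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    trial_id = (root.findtext("id") or "").strip()
    if not trial_id:
        raise ValueError("record has no <id> element to build a subject URI from")
    subject = RESOURCE + trial_id
    triples = []
    for child in root:
        text = (child.text or "").strip()
        if child.tag == "id" or not text:
            continue  # the id is already in the subject; empty elements carry no value
        triples.append((subject, VOCAB + child.tag, text))
    return triples

sample = (
    "<trial><id>T001</id><title> Aspirin study </title>"
    "<condition>Headache</condition><sponsor/></trial>"
)
triples = trial_to_triples(sample)
```

A production system would also link objects to external resources and emit a standard RDF serialization; this sketch stops at plain tuples.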

2012

(Tharatipyakul et al., 2012) ⇒ Atima Tharatipyakul, Somrak Numnark, Duangdao Wichadakul, and Supawadee Ingsriswang (2012). "ChemEx: information extraction system for chemical data curation". In: Proceedings of the Eleventh International Conference on Bioinformatics (InCoB2012).
- QUOTE: We have developed ChemEx, a chemical information extraction system. ChemEx processes both text and images in publications. Text annotator is able to extract compound, organism, and assay entities from text content while structure image recognition enables translation of chemical raster images to machine readable format. A user can view annotated text along with summarized information of compounds, organism that produces those compounds, and assay tests.
  ChemEx facilitates and speeds up chemical data curation by extracting compounds, organisms, and assays from a large collection of publications. The software and corpus can be downloaded from http://www.biotec.or.th/isl/ChemEx.

2011

(Wiktionary) ⇒ http://en.wiktionary.org/wiki/curation
- … 3. (databases) The manual updating of information in a database.

2010

(DCC-JISC) ⇒ http://www.dcc.ac.uk/sites/default/files/DC%20101%20What%20is%20Digital%20Curation.pdf
- QUOTE: Digital curation, broadly interpreted, is about maintaining and adding value to a trusted body of digital information for current and future use.

2009

(Cusick et al., 2009) ⇒ Michael E Cusick, Haiyuan Yu, Alex Smolyar, Kavitha Venkatesan, Anne-Ruxandra Carvunis, Nicolas Simonis, Jean-François Rual, Heather Borick, Pascal Braun, Matija Dreze, Jean Vandenhaute, Mary Galli, Junshi Yazaki, David E Hill, Joseph R Ecker, Frederick P Roth, and Marc Vidal. (2009). “Literature-Curated Protein Interaction Datasets.” In: Nature Methods 6, 39 - 46 (2009)
- QUOTE: Our findings of large error rates in curated protein interaction databases, at least for yeast and human, are consistent with recent hints that the quality of literature curated datasets may not be as high as widely perceived 23,29,43–45. Perhaps occasionally curator error is responsible. However, we suggest that the errors are due not so much to curators but to the simple reality that extracting accurate information from a long free-text document can be extremely difficult. Gene name confusion is particularly thorny30,46. An example from our curated yeast sample illustrates the difficulties. A purification with a tandem affinity purification tag with Vps71/Swc6 (slash separates synonymous approved names) as bait47 pulls down a protein named Swc3, but double-checking this finds that the corresponding open reading frame is actually SWC3 (locus name YAL011w), and not the ALR1/SWC3 (locus name YOL130w) open reading frame curated in the database. A shared synonym thoroughly muddled the curation.
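
The shared-synonym problem in the quote above suggests a simple safeguard: resolve a gene name to a locus only when exactly one locus carries that name, and otherwise return all candidates for manual review. The sketch below uses the two loci and the names mentioned in the quote; the table layout and function names are assumptions, not part of the cited study.

```python
from collections import defaultdict

# Locus names and gene names are taken from the quoted example;
# the dictionary layout itself is an illustrative assumption.
GENE_NAMES = {
    "YAL011w": ["SWC3"],
    "YOL130w": ["ALR1", "SWC3"],
}

def build_name_index(gene_names):
    """Map each upper-cased gene name to the set of loci that use it."""
    index = defaultdict(set)
    for locus, names in gene_names.items():
        for name in names:
            index[name.upper()].add(locus)
    return index

def resolve(name, index):
    """Return (locus, []) if the name is unambiguous, else (None, candidates)."""
    candidates = sorted(index.get(name.upper(), ()))
    if len(candidates) == 1:
        return candidates[0], []
    return None, candidates  # zero or several loci: needs a curator's decision

name_index = build_name_index(GENE_NAMES)
swc3_result = resolve("Swc3", name_index)
alr1_result = resolve("ALR1", name_index)
```

Here `resolve("Swc3", ...)` declines to pick a locus because the name matches both entries, which is the situation the quote describes.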

2008

(Howe et al., 2008) ⇒ Doug Howe, Maria Costanzo, Petra Fey, Takashi Gojobori, Linda Hannick, Winston Hide, David P. Hill, Renate Kania, Mary Schaeffer, Susan St Pierre, Simon Twigger, Owen White, and Seung Yon Rhee. (2008). “Big Data: The future of biocuration.” In: Nature, 455.

2006

(Wang & Sunderraman, 2006) ⇒ Yan Chao Wang, and Rajshekhar Sunderraman (2006). "PDB Data Curation". In: Proceedings of the 28th International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2006). DOI:10.1109/IEMBS.2006.259891.
- QUOTE: Figure 1 shows the architecture of PDB Data Curation System. From User Interface, Checking Filter gets PDB file, which checks the errors and reports them to Curation Engine. And then curated data by Curation Engine are sent to Database storage. The users can get better PDB file from our database after this process.

**Figure 1:** PDB Data Curation System Architecture.
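
The flow described in the quote (a checking filter reports errors to a curation engine, whose output goes to database storage) can be sketched as three small functions. The two checks used here (lines longer than 80 characters and a missing final `END` record) are toy stand-ins chosen for illustration; they are not the checks implemented in the cited system, and all function names are hypothetical.

```python
def checking_filter(lines):
    """Return a list of (error_code, line_index) problems found in the file."""
    errors = []
    for i, line in enumerate(lines):
        if len(line) > 80:
            errors.append(("LINE_TOO_LONG", i))
    if not lines or lines[-1].strip() != "END":
        errors.append(("MISSING_END", len(lines)))
    return errors

def curation_engine(lines, errors):
    """Apply one repair per reported error and return the curated lines."""
    curated = list(lines)
    for code, i in errors:
        if code == "LINE_TOO_LONG":
            curated[i] = curated[i].rstrip()  # only trims trailing padding
        elif code == "MISSING_END":
            curated.append("END")
    return curated

def curate_file(file_id, lines, database):
    """Check, repair, and store one file; return the errors that were reported."""
    errors = checking_filter(lines)
    database[file_id] = curation_engine(lines, errors)
    return errors

database = {}  # stand-in for the database storage component
raw_lines = ["HEADER    TEST" + " " * 90, "ATOM      1"]
reported = curate_file("demo", raw_lines, database)
```

Keeping the filter separate from the engine means the list of reported errors can be logged or shown to a user before any repair is applied.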

2005

(DCC, 2005) ⇒ http://www.dcc.ac.uk/about/what/
- QUOTE: What is Digital Curation?: Digital curation, broadly interpreted, is about maintaining and adding value to a trusted body of digital information for current and future use. The digital archiving and preservation community now looks beyond the preservation, cataloguing and cross referencing of static digital objects such as documents. The scientific community has data characterised by structure, volatility and scale. These require us to extend our notions of curation. We must also investigate the principles that underlie appraisal, and lessons learnt about the economics of preservation.
