Automatic Data Curation Task

An Automatic Data Curation Task is a Data Curation Task that can be solved by an automatic data curation system that implements an automatic data curation algorithm.

Example(s):
Counter-Example(s):
- Manual Data Curation Task.
See: Automated Task Solving System, Biocurator, Data Archaeology, Data Format Management, Annotation Task, Data Provenance Task, Entity Mention, Text Processing Task.

Reference(s)

2021a

(Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Data_curation Retrieved:2021-8-1.
- Data curation is the organization and integration of data collected from various sources. It involves annotation, publication and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation. Data curation includes "all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data".^[1] In science, data curation may indicate the process of extraction of important information from scientific texts, such as research articles by experts, to be converted into an electronic format, such as an entry of a biological database.^[2] In the modern era of big data, the curation of data has become more prominent, particularly for software processing high volume and complex data systems.^[3] The term is also used in historical occasions and the humanities,^[4] where increasing cultural and scholarly data from digital humanities projects requires the expertise and analytical practices of data curation.^[5] In broad terms, curation means a range of activities and processes done to create, manage, maintain, and validate a component.^[6] Specifically, data curation is the attempt to determine what information is worth saving and for how long (Borgman, C. (2015). Big Data, Little Data, No Data: Scholarship in the Networked World. Cambridge, Massachusetts: MIT Press. p. 13. ISBN 978-0-262-02856-1).

[1] Renée J. Miller, “Big Data Curation” in 20th International Conference on Management of Data (COMAD) 2014, Hyderabad, India, December 17–19, 2014.
[2] Bio creative Glossary. Retrieved on 3 October 2016.
[3] Furht, Borko; Armando Escalante (2011). Handbook of Data Intensive Computing. Springer Science & Business Media. p. 32. ISBN 9781461414155. Retrieved 2 October 2016.
[4] Sabharwal, Arjun (2015). Digital Curation in the Digital Humanities: Preserving and Promoting Archival and Special Collections. Chandos Publishing. p. 60. ISBN 9780081001783. Retrieved 2 October 2016.
[5] "An Introduction to Humanities Data Curation" by Julia Flanders and Trevor Muñoz http://guide.dhcuration.org/intro/. Not available any more: archive.org
[6] Pilin Glossary. Not available any more: archive.org

2018

(Thirumuruganathan et al., 2018) ⇒ Saravanan Thirumuruganathan, Nan Tang, and Mourad Ouzzani (2018). "Data Curation with Deep Learning (Vision): Towards Self Driving Data Curation". In: arXiv:1803.01384 (http://arxiv.org/abs/1803.01384).
- QUOTE: Data Curation – the process of discovering, integrating and cleaning data for a specific analytics task, as shown in Figure 1 – is critical for any organization to extract real business value from their data; feeding flawed, redundant or incomplete data as input will produce nonsense output or “garbage” (a.k.a. garbage in, garbage out)(...)

**Figure 1:** A Data Curation Pipeline.

Data Curation Meets Deep Learning. This article investigates intriguing research opportunities for answering the following questions:

What does it take to significantly advance a challenging area such as DC?
How can we leverage techniques from DL for DC?
Given the many DL research efforts, how can we identify the most promising leads that are most relevant to DC?
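
The three stages named in the quote above (discovering, integrating, and cleaning data) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the approach of Thirumuruganathan et al.: the record format (plain dictionaries), the function names `discover`, `integrate`, `clean`, and `curate`, and the rule-based logic are all hypothetical, and no deep learning is involved.

```python
from typing import Dict, List, Set

Record = Dict[str, str]


def discover(sources: Dict[str, List[Record]], required: Set[str]) -> List[str]:
    """Discovery (hypothetical rule): keep sources whose records all carry the required fields."""
    return [
        name
        for name, rows in sources.items()
        if rows and all(required.issubset(row.keys()) for row in rows)
    ]


def integrate(sources: Dict[str, List[Record]], names: List[str]) -> List[Record]:
    """Integration: concatenate the selected sources, tagging each record with its origin."""
    return [{**row, "source": name} for name in names for row in sources[name]]


def clean(rows: List[Record], key: str) -> List[Record]:
    """Cleaning: normalize the key, drop records with an empty key, keep the first duplicate."""
    seen: Set[str] = set()
    cleaned: List[Record] = []
    for row in rows:
        normalized = row[key].strip().lower()
        if not normalized or normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append({**row, key: normalized})
    return cleaned


def curate(sources: Dict[str, List[Record]], required: Set[str], key: str) -> List[Record]:
    """Run discovery, integration, and cleaning in sequence."""
    return clean(integrate(sources, discover(sources, required)), key)
```

Calling `curate(sources, {"email", "name"}, "email")` on a dictionary of record lists would skip any source lacking those fields and return one record per normalized email address.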

2016

(Arocena et al., 2016) ⇒ Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renee J. Miller, Paolo Papotti, and Donatello Santoro (2016). "Benchmarking Data Curation Systems". In: Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.
- QUOTE: Data curation includes the many tasks needed to ensure data maintains its value over time. Given the maturity of many data curation tasks, including data transformation and data cleaning, it is surprising that rigorous empirical evaluations of research ideas are so scarce. In this work, we argue that thorough evaluation of data curation systems imposes several major obstacles that need to be overcome (...)

2015

(Hassanzadeh & Miller, 2015) ⇒ Oktie Hassanzadeh, and Renee J. Miller (2015). "Automatic Curation of Clinical Trials Data in LinkedCT". In: Proceedings of the 14th International Semantic Web Conference (ISWC 2015) Part II. DOI:10.1007/978-3-319-25010-6_16
- QUOTE: In this section, we describe the end-to-end curation process we have designed to construct an up-to-date high-quality Linked Data source out of the XML data published by ClinicalTrials.gov. Figure 1 shows the overall architecture of the system.

**Figure 1:** LinkedCT Platform Architecture.
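
The core idea in the quote above, turning XML records into Linked Data, can be illustrated with a toy XML-to-triples mapping. This is only a sketch under assumptions: the element names (`trial`, `id`, `condition`, ...), the `example.org` namespaces, and the function `trial_to_triples` are hypothetical and do not reflect the ClinicalTrials.gov schema or the actual LinkedCT implementation.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical namespaces, chosen only for this illustration.
RESOURCE_BASE = "http://example.org/trial/"
VOCAB_BASE = "http://example.org/vocab/"

Triple = Tuple[str, str, str]


def trial_to_triples(xml_text: str) -> List[Triple]:
    """Map each non-empty child element of a <trial> record to one (subject, predicate, literal) triple."""
    root = ET.fromstring(xml_text)
    trial_id = (root.findtext("id") or "").strip()
    if not trial_id:
        raise ValueError("record has no <id> element")
    subject = f"<{RESOURCE_BASE}{trial_id}>"
    triples: List[Triple] = []
    for child in root:
        text = (child.text or "").strip()
        if child.tag == "id" or not text:
            continue  # the id is already used as the subject; empty fields carry no data
        # Escape backslashes and quotes so the literal stays well-formed.
        literal = text.replace("\\", "\\\\").replace('"', '\\"')
        triples.append((subject, f"<{VOCAB_BASE}{child.tag}>", f'"{literal}"'))
    return triples
```

A production system would additionally need entity linking, incremental updates, and a real vocabulary; the sketch covers only the structural mapping step.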

2012

(Tharatipyakul et al., 2012) ⇒ Atima Tharatipyakul, Somrak Numnark, Duangdao Wichadakul, and Supawadee Ingsriswang (2012). "ChemEx: information extraction system for chemical data curation". In: Proceedings of the Eleventh International Conference on Bioinformatics (InCoB2012).
- QUOTE: We have developed ChemEx, a chemical information extraction system. ChemEx processes both text and images in publications. Text annotator is able to extract compound, organism, and assay entities from text content while structure image recognition enables translation of chemical raster images to machine readable format. A user can view annotated text along with summarized information of compounds, organism that produces those compounds, and assay tests.
  ChemEx facilitates and speeds up chemical data curation by extracting compounds, organisms, and assays from a large collection of publications. The software and corpus can be downloaded from http://www.biotec.or.th/isl/ChemEx.

2006

(Wang & Sunderraman, 2006) ⇒ Yan Chao Wang, and Rajshekhar Sunderraman (2006). "PDB Data Curation". In: Proceedings of the 28th International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2006). DOI:10.1109/IEMBS.2006.259891.
- QUOTE: Figure 1 shows the architecture of PDB Data Curation System. From User Interface, Checking Filter gets PDB file, which checks the errors and reports them to Curation Engine. And then curated data by Curation Engine are sent to Database storage. The users can get better PDB file from our database after this process.

**Figure 1:** PDB Data Curation System Architecture.
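
The flow described in the quote above (a checking filter reports errors, a curation engine repairs them, and the result goes to storage) can be sketched as follows. This is a hedged illustration, not the authors' system: the two checks (blank lines and lowercase record names), the `Issue` type, and the in-memory dictionary standing in for database storage are all assumptions made for the example and are not actual PDB validation rules.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Issue:
    line_no: int  # 1-based line number in the input file
    kind: str     # hypothetical error category


def checking_filter(lines: List[str]) -> List[Issue]:
    """Scan the file and report problems without modifying it."""
    issues: List[Issue] = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            issues.append(Issue(line_no, "blank_line"))
        elif line.split()[0] != line.split()[0].upper():
            issues.append(Issue(line_no, "lowercase_record_name"))
    return issues


def curation_engine(lines: List[str], issues: List[Issue]) -> List[str]:
    """Apply one repair per reported issue and return the curated lines."""
    kind_by_line = {issue.line_no: issue.kind for issue in issues}
    curated: List[str] = []
    for line_no, line in enumerate(lines, start=1):
        kind = kind_by_line.get(line_no)
        if kind == "blank_line":
            continue  # drop the empty line
        if kind == "lowercase_record_name":
            name, sep, rest = line.lstrip().partition(" ")
            line = name.upper() + sep + rest
        curated.append(line)
    return curated


def curate_file(database: Dict[str, List[str]], file_id: str, lines: List[str]) -> List[Issue]:
    """Run filter then engine, store the curated file, and return the issue report."""
    issues = checking_filter(lines)
    database[file_id] = curation_engine(lines, issues)
    return issues
```

Keeping the filter read-only and the engine separate mirrors the division of responsibilities in the quoted architecture: the issue report can be inspected or logged independently of the repairs.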
