Structured Dataset
A Structured Dataset is a Dataset that has a pre-defined data structure.
- AKA: Structured Data Type.
- Context:
- It is composed of data items that can be processed (read and understood) by both humans and machines.
- It can be displayed and stored as a Tabular Dataset.
- It can range from being a Primitive Data Type, to being a Composite Data Type, to being an Abstract Data Type.
- Example(s):
- Counter-Example(s):
- See: Data Model, Data Structure Diagram, Structured Data Analysis Task, Structure Data Mining Task, Structured Variable, Data Type, Database Management System.
References
2020a
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Data_structure Retrieved:2020-3-7.
- In computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data.
2020b
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Unstructured_data Retrieved:2020-3-7.
- Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.
In 1998, Merrill Lynch said "unstructured data comprises the vast majority of data found in an organization, some estimates run as high as 80%." It's unclear what the source of this number is, but nonetheless it is accepted by some[1]. Other sources have reported similar or higher percentages of unstructured data. IDC and Dell EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold growth from the beginning of 2010[2]. More recently, IDC and Seagate predict that the global datasphere will grow to 163 zettabytes by 2025 and that the majority of that will be unstructured. The Computer World magazine states that unstructured information might account for more than 70%–80% of all data in organizations.
- ↑ Grimes, Seth (1 August 2008). "Unstructured Data and the 80 Percent Rule". Breakthrough Analysis - Bridgepoints. Clarabridge.
- ↑ "EMC News Press Release: New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World's Data is Analyzed; Less Than 20% is Protected". www.emc.com. EMC Corporation. December 2012.
2020c
- (BDF, 2020) ⇒ https://www.bigdataframework.org/data-types-structured-vs-unstructured-data/ Retrieved:2020-3-7.
- QUOTE: Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyse. Structured data conforms to a tabular format with relationship between the different rows and columns. Common examples of structured data are Excel files or SQL databases. Each of these have structured rows and columns that can be sorted.
Structured data depends on the existence of a data model – a model of how data can be stored, processed and accessed. Because of a data model, each field is discrete and can be accessed separately or jointly along with data from other fields. This makes structured data extremely powerful: it is possible to quickly aggregate data from various locations in the database.
Structured data is considered the most ‘traditional’ form of data storage, since the earliest versions of database management systems (DBMS) were able to store, process and access structured data.
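The point about discrete fields and aggregation can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical and are not taken from the quoted source.

```python
import sqlite3

# In-memory database; the table, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "widget", 120.0),
        ("north", "gadget", 80.0),
        ("south", "widget", 50.0),
    ],
)

# A single field can be read on its own...
regions = [
    row[0]
    for row in conn.execute("SELECT DISTINCT region FROM sales ORDER BY region")
]

# ...or combined with other fields, here to aggregate amounts per region.
totals = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
)
conn.close()
```

Because the schema is declared up front, the query can name `region` and `amount` directly instead of parsing them out of free text.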
2020d
- (Wikimedia, 2020) ⇒ https://commons.wikimedia.org/wiki/Commons:Structured_data Retrieved:2020-3-7.
- QUOTE: Structured data on Wikimedia Commons is multilingual information about a media file that can be understood by humans, with enough consistency that it can also be uniformly processed by machines. Files on Wikimedia Commons can be described with multilingual concepts from Wikidata, Wikimedia's knowledge base.
2020e
- (US SEC, 2020) ⇒ https://www.sec.gov/structureddata/what-is-structured-data Retrieved:2020-3-7.
- QUOTE: Structured data is data that is divided into standardized pieces that are identifiable and accessible by both humans and computers. The granularity of these pieces can range from an individual data point, such as a number (e.g., revenues), date (e.g., the date of a transaction), or text (e.g., a name), to data that includes multiple individual data points (e.g., an entire section of narrative disclosure). Structured data can be created and communicated using data standards like XBRL, XML, and JSON, or generated with web and pdf forms.
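As a rough illustration of the "standardized pieces" described above, the following sketch encodes a number, a date, and a text value as JSON using only the Python standard library. The field names and values are assumptions made for this example and do not come from any SEC data standard.

```python
import json
from datetime import date

# Hypothetical record: field names and values are illustrative only.
record = {
    "name": "Example Corp",                             # text
    "revenues": 1250000,                                # number
    "transaction_date": date(2020, 3, 7).isoformat(),   # date as ISO 8601 text
}

# Serialize to JSON so that both people and programs can read it.
payload = json.dumps(record, indent=2, sort_keys=True)

# A consumer can parse the payload and address each piece by its label.
parsed = json.loads(payload)
revenues = parsed["revenues"]
```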
2019
- (Scipy, 2019) ⇒ https://docs.scipy.org/doc/numpy/user/basics.rec.html Last updated on Jul 26, 2019.
- QUOTE: Structured datatypes are designed to be able to mimic ‘structs’ in the C language, and share a similar memory layout. They are meant for interfacing with C code and for low-level manipulation of structured buffers, for example for interpreting binary blobs. For these purposes they support specialized features such as subarrays, nested datatypes, and unions, and allow control over the memory layout of the structure(...)
A structured datatype can be thought of as a sequence of bytes of a certain length (the structure’s itemsize) which is interpreted as a collection of fields. Each field has a name, a datatype, and a byte offset within the structure. The datatype of a field may be any numpy datatype including other structured datatypes, and it may also be a subarray data type which behaves like an ndarray of a specified shape.
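The field, offset, and itemsize notions in this quote can be seen in a short sketch, assuming NumPy is installed. The field names and sample records are illustrative choices.

```python
import numpy as np

# Illustrative record layout: a 10-character name, a 32-bit int, a 32-bit float.
person_dtype = np.dtype([("name", "U10"), ("age", "i4"), ("weight", "f4")])

people = np.array(
    [("Rex", 9, 81.0), ("Fido", 3, 27.0)],
    dtype=person_dtype,
)

# Each field has a name, a datatype, and a byte offset within the structure.
# dtype.fields maps a field name to a tuple whose second item is the offset.
field_offsets = {
    name: person_dtype.fields[name][1] for name in person_dtype.names
}

# Total size in bytes of one record.
record_size = person_dtype.itemsize

# Indexing by field name selects that field across all records.
ages = people["age"]
```

With this packed layout the `U10` field occupies 40 bytes (4 bytes per character), so `age` starts at offset 40 and `weight` at offset 44.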
2013
- (Bellet et al., 2013) ⇒ Aurelien Bellet, Amaury Habrard, and Marc Sebban (2013). "A Survey on Metric Learning for Feature Vectors and Structured Data". In: Technical Report, Department of Computer Science, University of Southern California. arXiv:1306.6709.
- QUOTE: In many domains, data naturally come structured, as opposed to the “flat” feature vector representation we have focused on so far. Indeed, instances can come in the form of strings, such as words, text documents or DNA sequences; trees like XML documents, secondary structure of RNA or parse trees; and graphs, such as networks, 3D objects or molecules. In the context of structured data, metrics are especially appealing because they can be used as a proxy to access data without having to manipulate these complex objects. Indeed, given an appropriate structured metric, one can use any metric-based algorithm as if the data consisted of feature vectors.
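One way to picture a metric acting as a proxy for structured objects is the sketch below. It uses the standard (unlearned) Levenshtein edit distance on strings, which is a swapped-in example and not a learned metric from the survey, and feeds it to a 1-nearest-neighbor rule that sees the data only through pairwise distances. The training strings and labels are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def nearest_neighbor(query: str, labeled: list[tuple[str, str]]) -> str:
    """1-nearest-neighbor classification using only pairwise distances."""
    _, label = min(labeled, key=lambda pair: edit_distance(query, pair[0]))
    return label


# Hypothetical labeled strings, for illustration only.
training = [("ACGTAC", "family_a"), ("ACGTTC", "family_a"), ("TTGGCC", "family_b")]
predicted = nearest_neighbor("ACGAAC", training)
```

The classifier never inspects the strings directly, so the same function would work for trees or graphs given a suitable distance function.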
1997
- (Koller & Pfeffer, 1997) ⇒ Daphne Koller, and Avi Pfeffer. (1997). “Object-Oriented Bayesian Networks.” In: Proceedings of UAI (UAI 1997).
- Definition 2.2: A structured type is a set of values defined by a tuple ..., where ... are attribute labels and ... are corresponding (basic or structured) types. The set of values of this type are all those of the form ... A structured variable is a variable which takes values in some structured type.
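A loose programming analogy for this definition, not the paper's own formalism, is a record type whose attributes are either basic types or other record types. The sketch below uses Python dataclasses, and every attribute label and value in it is hypothetical.

```python
from dataclasses import dataclass, fields


# Attribute labels and types below are hypothetical, chosen for illustration.
@dataclass(frozen=True)
class Address:
    city: str        # basic type
    postcode: str    # basic type


@dataclass(frozen=True)
class Person:
    name: str         # basic type
    age: int          # basic type
    address: Address  # structured type used as an attribute type


# A structured variable takes one value of the structured type.
person = Person(name="Ada", age=36, address=Address(city="London", postcode="N1"))

# The attribute labels of the type, in declaration order.
labels = [f.name for f in fields(Person)]
```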