Data Binning System
A Data Binning System is a Data Pre-Processing System that implements a Data Binning Algorithm to solve a Data Binning Task.
- AKA: Data Bucketing System.
- Example(s):
- Counter-Examples:
- See: Data Bin, Histogram, Data Binning Task, Bucket Sorting Algorithm, Bin Evaluation Criterion.
References
2018a
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/BIN#Science_and_mathematics Retrieved:2018-5-20.
- In mathematics, the histogram of a discrete variable that can acquire m different values is called an "m-bin histogram"
- In statistics, each of a series of ranges of numerical value into which data are sorted in statistical analysis (see Data binning)
- Bin (computational geometry), space partitioning data structure to enable fast region queries and nearest neighbor search
2018b
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Data_binning Retrieved:2018-5-20.
- Data binning or bucketing is a data pre-processing technique used to reduce the effects of minor observation errors. The original data values which fall in a given small interval, a bin, are replaced by a value representative of that interval, often the central value. It is a form of quantization.
Statistical data binning is a way to group a number of more or less continuous values into a smaller number of "bins". For example, if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals. [1] It can also be used in multivariate statistics, binning in several dimensions at once.
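As an illustration of the description above, the following minimal sketch replaces each value with the central value of the fixed-width interval it falls in. The function name `bin_to_midpoints`, the sample ages, and the choice of 10-year intervals starting at 20 are assumptions made for this example; they do not come from the cited article.

```python
def bin_to_midpoints(values, lower, width):
    """Replace each value with the midpoint of the fixed-width bin it falls in."""
    binned = []
    for v in values:
        index = int((v - lower) // width)         # bin number, counted from `lower`
        midpoint = lower + (index + 0.5) * width  # central value of that bin
        binned.append(midpoint)
    return binned


# Hypothetical ages, grouped into 10-year intervals starting at age 20.
ages = [23, 27, 31, 38, 45, 52]
decades = bin_to_midpoints(ages, lower=20, width=10)
print(decades)
```

Small differences between observations (23 versus 27) disappear after binning, which is the quantization effect the quoted passage refers to.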
2018c
- (Google ML Glossary, 2018) ⇒ (2018). binning/bucketing. In: Machine Learning Glossary https://developers.google.com/machine-learning/glossary/ Retrieved: 2018-05-20.
- QUOTE: Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on value range. For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin.
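The temperature example in the quote can be sketched as follows. The three ranges are taken from the quote; the function name `bucketize`, the use of inclusive upper edges, and the decision to reject values outside 0.0 to 50.0 are assumptions of this sketch, not part of the glossary entry.

```python
from bisect import bisect_left

# Inclusive upper bound of each bucket: 0.0-15.0, 15.1-30.0, 30.1-50.0.
UPPER_EDGES = [15.0, 30.0, 50.0]


def bucketize(temperature):
    """Return one binary feature per bucket, with a 1 in the matching bucket."""
    if not 0.0 <= temperature <= UPPER_EDGES[-1]:
        raise ValueError(f"temperature {temperature} is outside the binned range")
    # bisect_left gives the first edge that is >= temperature,
    # so a value equal to an edge stays in the lower bucket.
    index = bisect_left(UPPER_EDGES, temperature)
    return [1 if i == index else 0 for i in range(len(UPPER_EDGES))]


print(bucketize(12.3))  # first bucket
print(bucketize(15.1))  # second bucket
print(bucketize(42.0))  # third bucket
```

A single continuous feature becomes three binary features, so a downstream model can learn a separate weight for each temperature range.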
2018d
- (NMRProcFlow, 2018) ⇒ Bucketing. In: NMRProcFlow Quick Tutorial. Retrieved: 2018-05-20
- QUOTE: An NMR spectrum may contain several thousands of points, and therefore of variables. In order to reduce the data dimensionality binning is commonly used. In binning the spectra are divided into bins (so-called buckets) and the total area within each bin is calculated to represent the original spectrum. The more simple approach consists to divide all the spectra with uniform areas width (typically 0.04 ppm). Due to the arbitrary division of peaks, one bin may contain pieces from two or more peaks which may affect the data analysis. We have chosen to implement the Adaptive, Intelligent Binning method (De Meyer et al. 2008) that attempt to split the spectra so that each area common to all spectra contains the same resonance, i.e. belonging to the same metabolite. In such methods, the width of each area is then determined by the maximum difference of chemical shift among all spectra.
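The simpler uniform approach described in the quote (fixed-width buckets, typically 0.04 ppm) might look like the sketch below. It is not NMRProcFlow code and it does not implement Adaptive, Intelligent Binning. The function name `uniform_buckets`, the toy spectrum, and the approximation of bucket area by the sum of point intensities (reasonable only for evenly sampled spectra) are assumptions of this example.

```python
from collections import defaultdict


def uniform_buckets(points, start, width=0.04):
    """Sum intensities of (ppm, intensity) points within fixed-width buckets.

    Returns a dict mapping bucket index to summed intensity. Points that sit
    exactly on a bucket edge can land on either side because of floating-point
    rounding; real tools handle edges more carefully.
    """
    totals = defaultdict(float)
    for ppm, intensity in points:
        index = int((ppm - start) / width)
        totals[index] += intensity
    return dict(totals)


# Toy spectrum: five points reduced to three 0.04 ppm buckets.
spectrum = [(0.01, 1.0), (0.03, 2.0), (0.05, 4.0), (0.07, 1.0), (0.09, 3.0)]
buckets = uniform_buckets(spectrum, start=0.0)
print(buckets)
```

Because the bucket edges are arbitrary, a single peak can straddle two buckets, which is the drawback that motivates the adaptive method mentioned in the quote.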
2016
- (Izrailev, 2016) ⇒ Sergei Izrailev (2016). Cut Numeric Values into Evenly Distributed Groups (PDF). Package ‘binr’ https://github.com/jabiru/binr
- QUOTE: bins - Cuts points in vector x into evenly distributed groups (bins). bins takes 3 separate approaches to generating the cuts, picks the one resulting in the least mean square deviation from the ideal cut - length(x) / target.bins points in each bin - and then merges small bins unless exact.groups is TRUE. The 3 approaches are:
- Use quantiles, and increase the number of even cuts up to max.breaks until the number of groups reaches the desired number. See bins.quantiles.
- Start with a single bin with all the data in it and perform bin splits until either the desired number of bins is reached or there’s no reduction in error (the latter is ignored if exact.groups is TRUE). See bins.split.
- Start with length(table(x)) bins, each containing exactly one distinct value and merge bins until the desired number of bins is reached. If exact.groups is FALSE, continue merging until there’s no further reduction in error. See bins.merge.
- For each of these approaches, apply redistribution of points among existing bins until there’s no further decrease in error …
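The idea of evenly distributed groups, and of scoring a cut by its deviation from the ideal of length(x) / target.bins points per bin, can be illustrated with the sketch below. This is not the binr implementation: it is written in Python rather than R, it cuts by rank and so may split tied values across bins, and the function names and the averaging of squared deviations over bins are assumptions of this example.

```python
def equal_frequency_bins(values, target_bins):
    """Split the sorted values into `target_bins` groups of near-equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bins = []
    for b in range(target_bins):
        start = b * n // target_bins
        stop = (b + 1) * n // target_bins
        bins.append(ordered[start:stop])
    return bins


def mean_square_deviation(bins, n_points):
    """Mean squared difference between each bin's size and the ideal size."""
    ideal = n_points / len(bins)
    return sum((len(b) - ideal) ** 2 for b in bins) / len(bins)


data = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
groups = equal_frequency_bins(data, target_bins=3)
error = mean_square_deviation(groups, len(data))
print(groups)
print(round(error, 4))
```

A score of this kind gives a way to compare candidate cuts: the lower the value, the closer the bin sizes are to the ideal even split.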
2008
- (De Meyer et al., 2008) ⇒ De Meyer, T., Sinnaeve, D., Van Gasse, B., Tsiporkova, E., Rietzschel, E. R., De Buyzere, M. L., ..., & Van Criekinge, W. (2008). "NMR-based characterization of metabolic alterations in hypertension using an adaptive, intelligent binning algorithm" (PDF). Analytical chemistry, 80(10), 3783-3790.
- ABSTRACT: As with every -omics technology, metabolomics requires new methodologies for data processing. Due to the large spectral size, a standard approach in NMR-based metabolomics implies the division of spectra into equally sized bins, thereby simplifying subsequent data analysis. Yet, disadvantages are the loss of information and the occurrence of artifacts caused by peak shifts. Here, a new binning algorithm, Adaptive Intelligent Binning (AI-Binning), which largely circumvents these problems, is presented. AI-Binning recursively identifies bin edges in existing bins, requires only minimal user input, and avoids the use of arbitrary parameters or reference spectra. The performance of AI-Binning is demonstrated using serum spectra from 40 hypertensive and 40 matched normotensive subjects from the Asklepios study. Hypertension is a major cardiovascular risk factor characterized by a complex biochemistry and, in most cases, an unknown origin. The binning algorithm resulted in an improved classification of hypertensive status compared with that of standard binning and facilitated the identification of relevant metabolites. Moreover, since the occurrence of noise variables is largely avoided, AI-Binned spectra can be unit-variance scaled. This enables the detection of relevant, low-intensity metabolites. These results demonstrate the power of AI-Binning and suggest the involvement of α-1 acid glycoproteins and choline biochemistry in hypertension.