Spark MLlib Module
A Spark MLlib Module is a Spark module that serves as a data analytics platform, providing a machine learning library for Spark-based cluster computing jobs.
- Context:
- It can be part of a Distributed Machine Learning System.
- It can have been superseded by a Spark.ML module.
- It can be used by a Spark ML Program.
- It can be accessed via a pyspark API.
- It can contain libraries, such as:
- Example(s):
- MLlib 2.2.0.
- MLlib 1.1.
- …
- Counter-Example(s):
- See: Scala Library, Scipy Library, Python, Pytorch, TensorFlow, Numpy.
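As a rough illustration of accessing the module through the pyspark API mentioned in the context list, the sketch below trains a k-means model with the RDD-based `pyspark.mllib` package. The helper name `train_kmeans`, its parameters, and the sample points in the usage comment are hypothetical; the sketch assumes a working PySpark installation and an existing `SparkContext`.

```python
def train_kmeans(sc, points, k):
    """Cluster `points` (a list of numeric lists) into `k` groups with MLlib.

    `sc` is an existing SparkContext. The pyspark imports are kept inside the
    function (an illustrative choice) so the definition loads without Spark.
    """
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    # Distribute the points as an RDD of dense MLlib vectors.
    rdd = sc.parallelize([Vectors.dense(p) for p in points])

    # Train on the distributed data; the seed fixes the random initialization.
    return KMeans.train(rdd, k, maxIterations=20, seed=42)


# Hypothetical usage, assuming Spark is available locally:
#   from pyspark import SparkContext
#   sc = SparkContext.getOrCreate()
#   model = train_kmeans(sc, [[0.0, 0.0], [0.1, 0.2], [9.0, 9.0], [9.2, 8.8]], k=2)
#   print(model.clusterCenters)         # one center per cluster
#   print(model.predict([0.05, 0.1]))   # index of the nearest center
#   sc.stop()
```

The function takes the `SparkContext` as an argument rather than creating one, so the same code can run in a local session or on a cluster.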
References
2016a
- (Github, 2016) ⇒ http://lintool.github.io/bigdata-2016w/
- QUOTE: ... One might break down the "big data" stack in the manner shown on the right. At the bottom resides the execution infrastructure, which is responsible for coordinating computations across a cluster (examples include MapReduce and Spark). In the middle resides analytics infrastructure, which implements data mining and machine learning algorithms on top of the execution infrastructure (an example would be MLlib in Spark). At the top are the tools data scientists use to generate insights, built on top of the analytics infrastructure.
2016b
- (Meng et al., 2016) ⇒ Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. (2016). “MLlib: Machine Learning in Apache Spark.” In: The Journal of Machine Learning Research, 17. ISSN:1938-7228 arXiv:1505.06807
- QUOTE: In this work we present MLlib, Spark’s distributed machine learning library, and the largest such library. The library targets large-scale learning settings that benefit from data-parallelism or model-parallelism to store and operate on data or models. MLlib consists of fast and scalable implementations of standard learning algorithms for common learning settings including classification, regression, collaborative filtering, clustering, and dimensionality reduction. It also provides a variety of underlying statistics, linear algebra, and optimization primitives. Written in Scala and using native (C++ based) linear algebra libraries on each node, MLlib includes Java, Scala, and Python APIs, and is released as part of the Spark project under the Apache 2.0 license.
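The data-parallelism mentioned in the quote can be illustrated without Spark. The pure-Python sketch below is not MLlib code: it only shows the general pattern of computing a partial gradient on each data partition and summing the results, here for a one-parameter least-squares model. All names (`partial_gradient`, `data_parallel_gradient`, `fit`) and the toy data are hypothetical.

```python
from functools import reduce


def partial_gradient(partition, w):
    """Sum of least-squares gradients over one partition for the model y ~ w * x."""
    return sum((w * x - y) * x for x, y in partition)


def data_parallel_gradient(partitions, w):
    """Map the gradient computation over partitions, then reduce by addition.

    On a cluster each partition would live on a different worker; here the
    'workers' are simply iterated in sequence.
    """
    partials = map(lambda part: partial_gradient(part, w), partitions)
    return reduce(lambda a, b: a + b, partials, 0.0)


def fit(partitions, steps=200, lr=0.01):
    """Plain gradient descent driven by the partition-wise gradient."""
    n = sum(len(p) for p in partitions)
    w = 0.0
    for _ in range(steps):
        w -= lr * data_parallel_gradient(partitions, w) / n
    return w


# Toy data generated from y = 2 * x, split into three partitions.
partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]
w = fit(partitions)
print(round(w, 3))
```

Because the gradient of a sum is the sum of the gradients, the partitioned computation gives the same result as a single pass over all the data, which is what makes this kind of workload easy to distribute.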
2015
- (Apache Spark, 2015) ⇒ https://spark.apache.org/docs/latest/mllib-guide.html
- QUOTE: MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:
- Data types
- Basic statistics.
- Classification and regression.
- Collaborative filtering.
- Clustering.
- Dimensionality reduction.
- Feature extraction and transformation.
- Optimization (developer)
2014a
- (Apache Spark, 2014) ⇒ https://spark.apache.org/docs/1.1.0/mllib-guide.html
- QUOTE: MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:
2014b
- (Apache Spark, 2014) ⇒ https://spark.apache.org/docs/latest/mllib-classification-regression.html
- QUOTE: MLlib supports various methods for binary classification, multiclass classification, and regression analysis. The table below outlines the supported algorithms for each type of problem.
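For binary classification, one of the problem types named in the quote, a minimal sketch with the RDD-based `pyspark.mllib` API might look like the following. The helper name `train_binary_classifier`, its parameters, and the toy examples in the usage comment are hypothetical; the sketch assumes a working PySpark installation and an existing `SparkContext`.

```python
def train_binary_classifier(sc, examples):
    """Fit a logistic regression model on (label, features) pairs with MLlib.

    `sc` is an existing SparkContext; `examples` is a list such as
    [(0.0, [0.1]), (1.0, [2.3])]. The pyspark imports are kept inside the
    function (an illustrative choice) so the definition loads without Spark.
    """
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    # LabeledPoint pairs a numeric label with a feature vector.
    data = sc.parallelize(
        [LabeledPoint(label, features) for label, features in examples]
    )

    # L-BFGS is one of the optimizers MLlib offers for logistic regression.
    return LogisticRegressionWithLBFGS.train(data, iterations=50)


# Hypothetical usage, assuming Spark is available locally:
#   from pyspark import SparkContext
#   sc = SparkContext.getOrCreate()
#   model = train_binary_classifier(
#       sc, [(0.0, [0.0]), (0.0, [0.5]), (1.0, [2.5]), (1.0, [3.0])]
#   )
#   print(model.predict([2.8]))   # predicted class label, 0 or 1
#   sc.stop()
```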