Dask Framework
A Dask Framework is a Parallel Computing Framework.
- Context:
- …
- Example(s):
- Dask v2021.01.1 (~2021-01-21).
- Dask v1.2.2 (2019-05-08).
- Dask v0.20.2 (2018-11-15).
- …
- Counter-Example(s):
- See: Python (Programming Language), Parallel Computing, RAPIDS Framework.
References
2021
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Dask_(software) Retrieved: 2021-01-26.
- Dask is an open source library for parallel computing written in Python. Originally developed by Matthew Rocklin, Dask is a community project maintained and sponsored by developers and organizations.
2021
- https://stackoverflow.com/a/60740519
- QUOTE:
- … Dask is a Python library which enables out-of-core parallelism and distribution of some popular Python libraries as well as custom functions. Take Pandas for example. Pandas is a popular library for working with DataFrames in Python. However, it is single-threaded and the DataFrames you are working on must fit within memory. Dask has a subpackage called dask.dataframe which follows most of the same API as Pandas but instead breaks your DataFrame down into partitions which can be operated on in parallel and can be swapped in and out of memory. Dask uses Pandas under the hood, so each partition is a valid Pandas DataFrame. The overall Dask DataFrame can scale out and use multiple cores or multiple machines.
- RAPIDS is a collection of GPU accelerated Python libraries which follow the API of other popular Python packages.
2021
- https://docs.dask.org/en/latest/
- QUOTE: Dask is a flexible library for parallel computing in Python.
- Dask is composed of two parts:
- Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
- “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.
- Dask emphasizes the following virtues:
- Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
- Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
- Native: Enables distributed computing in pure Python with access to the PyData stack.
- Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
- Scales up: Runs resiliently on clusters with 1000s of cores
- Scales down: Trivial to set up and run on a laptop in a single process
- Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans
2020
- https://medium.com/@prayankkul27/which-one-should-i-use-apache-spark-or-dask-22ad4a20ab77
- QUOTE: ... Dask is smaller and lighter-weight compared to Spark, and has fewer features. Dask uses and couples with libraries like NumPy, Pandas, and Scikit-learn to gain high-level functionality.
- Spark is written in Scala and supports several other languages such as R, Python, and Java, whereas Dask is written in Python and supports only Python.
- Spark has its own ecosystem and is well integrated with other Apache projects, whereas Dask is a component of the larger Python ecosystem; its main aim is to enhance and reuse libraries like Pandas, NumPy, and Scikit-learn.
- Spark is older and has become a dominant, well-trusted tool in the Big Data world, whereas Dask is younger and is an extension of the well-trusted NumPy/Pandas/Scikit-learn/Jupyter stack.
- Spark DataFrames have their own API and memory model, and Spark implements a large subset of SQL queries. Dask instead reuses the Pandas API and memory model; it implements neither SQL nor a query optimizer.
- For machine learning, Spark has MLlib, which fits its map-reduce-style system, whereas Dask relies on and interoperates with existing machine learning and data science libraries like Scikit-Learn and XGBoost.
- Spark does not support multi-dimensional array structures, whereas Dask supports the full functionality of the NumPy model.
- Spark can process graphs using its GraphX library, whereas Dask has no library or model for graph processing.
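The multi-dimensional array point above can be illustrated with `dask.array`, whose chunked arrays follow the NumPy API (the array and chunk sizes here are arbitrary examples):

```python
import dask.array as da

# A 1000x1000 array of ones, split into 250x250 chunks (16 blocks);
# each block is a plain NumPy array processed independently.
x = da.ones((1000, 1000), chunks=(250, 250))

# NumPy-style expressions build a lazy task graph over the chunks.
y = (x + x.T).sum()

print(y.compute())  # every element of x + x.T is 2.0, so the sum is 2,000,000.0
```

Because each chunk is an ordinary NumPy array, operations run block-by-block in parallel and the array as a whole can be larger than memory.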