Ray Machine Learning System
Ray Machine Learning System is a distributed Machine Learning system that supports deep learning and reinforcement learning applications.
- Context:
- It is an open-source project available at https://github.com/ray-project/ray.
- It uses a lightweight API based on dynamic task graphs and actors (see the sketch after this list).
- It uses the Python programming language.
- It uses a bottom-up hierarchical scheduling architecture.
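A minimal sketch of that dynamic-task-graph API, assuming a local `ray.init()` (the `square` and `add` task names are illustrative; `@ray.remote`, `.remote()`, and `ray.get()` are Ray's core primitives):

```python
import ray

ray.init()

@ray.remote
def square(x):
    # Runs asynchronously as a task in a separate worker process.
    return x * x

@ray.remote
def add(x, y):
    return x + y

# Futures returned by .remote() can be passed directly to other tasks,
# so the task graph is built dynamically at run time (a tree reduce here).
futures = [square.remote(i) for i in range(4)]
while len(futures) > 1:
    futures = futures[2:] + [add.remote(futures[0], futures[1])]
print(ray.get(futures[0]))  # 0 + 1 + 4 + 9 = 14
```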
- Example(s):
- Ray RLlib, a scalable reinforcement learning library built on Ray.
- Ray.tune, a distributed hyperparameter search library built on Ray.
- Counter-Example(s):
- See: Distributed Algorithms, Resilient Distributed Datasets (RDDs), Distributed Graph System, Iterative Machine Learning System.
References
2019a
- (RiseLab, 2019) ⇒ https://rise.cs.berkeley.edu/projects/ray/ Retrieved: 2019-05-05
- QUOTE: Ray is a high-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications. It achieves scalability and fault tolerance by abstracting the control state of the system in a global control store and keeping all other components stateless. It uses a shared-memory distributed object store to efficiently handle large data through shared memory, and it uses a bottom-up hierarchical scheduling architecture to achieve low-latency and high-throughput scheduling. It uses a lightweight API based on dynamic task graphs and actors to express a wide range of applications in a flexible manner.
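A hedged sketch of the actor side of that API (the `Counter` class is illustrative; decorating a class with `@ray.remote` is Ray's documented way to create stateful actors):

```python
import ray

ray.init()

@ray.remote
class Counter:
    # A stateful actor: its state lives in a dedicated worker process.
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

counter = Counter.remote()                 # start the actor process
refs = [counter.increment.remote() for _ in range(3)]
print(ray.get(refs))                       # [1, 2, 3]: calls run in order
```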
2019b
- (Ray Tutorial Docs, 2019) ⇒ https://ray.readthedocs.io/en/latest/tutorial.html Retrieved: 2019-05-05
- QUOTE: Ray is a distributed execution engine. The same code can be run on a single machine to achieve efficient multiprocessing, and it can be used on a cluster for large computations.
When using Ray, several processes are involved.
- Multiple worker processes execute tasks and store results in object stores. Each worker is a separate process.
- One object store per node stores immutable objects in shared memory and allows workers to efficiently share objects on the same node with minimal copying and deserialization.
- One raylet per node assigns tasks to workers on the same node.
- A driver is the Python process that the user controls. For example, if the user is running a script or using a Python shell, then the driver is the Python process that runs the script or the shell. A driver is similar to a worker in that it can submit tasks to its raylet and get objects from the object store, but it is different in that the raylet will not assign tasks to the driver to be executed.
- A Redis server maintains much of the system’s state. For example, it keeps track of which objects live on which machines and of the task specifications (but not data). It can also be queried directly for debugging purposes.
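The process roles above can be made concrete with a short sketch (the `mean` task is illustrative; `ray.put()` and `ray.get()` are the real object-store APIs):

```python
import numpy as np
import ray

ray.init()

# The driver puts a large immutable object into its node's object store.
array_ref = ray.put(np.zeros(10**6))

@ray.remote
def mean(arr):
    # Workers on the same node read the array out of shared memory,
    # with minimal copying and deserialization of the buffer.
    return float(arr.mean())

print(ray.get(mean.remote(array_ref)))  # 0.0
```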
2018a
- (Nishihara & Moritz, 2018) ⇒ Robert Nishihara, and Philipp Moritz (Jan 9, 2018). "Ray: A Distributed System for AI" Retrieved on 2019-04-14
- QUOTE: One of Ray’s goals is to enable practitioners to turn a prototype algorithm that runs on a laptop into a high-performance distributed application that runs efficiently on a cluster (or on a single multi-core machine) with relatively few additional lines of code. Such a framework should include the performance benefits of a hand-optimized system without requiring the user to reason about scheduling, data transfers, and machine failures.
(...) Relation to deep learning frameworks: Ray is fully compatible with deep learning frameworks like TensorFlow, PyTorch, and MXNet, and it is natural to use one or more deep learning frameworks along with Ray in many applications (for example, our reinforcement learning libraries use TensorFlow and PyTorch heavily).
Relation to other distributed systems: Many popular distributed systems are used today, but most of them were not built with AI applications in mind and lack the required performance for supporting, and the APIs for expressing, AI applications.
(...) There are two main ways of using Ray: through its lower-level APIs and higher-level libraries. The higher-level libraries are built on top of the lower-level APIs. Currently these include Ray RLlib, a scalable reinforcement learning library, and Ray.tune, an efficient distributed hyperparameter search library.
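The "few additional lines" claim from the quote above can be illustrated with a sketch (the `slow_step` function is a stand-in for real work; only the decorator and the `.remote()`/`ray.get()` calls differ between the two versions):

```python
import time
import ray

def slow_step(i):
    time.sleep(1)          # stand-in for real computation
    return i

# Serial prototype: runs in about 4 seconds on one core.
serial = [slow_step(i) for i in range(4)]

ray.init()

@ray.remote
def slow_step_remote(i):
    time.sleep(1)
    return i

# Ray version: the four tasks run in parallel across worker processes,
# on one multi-core machine or across a cluster, with the same code.
parallel = ray.get([slow_step_remote.remote(i) for i in range(4)])
assert parallel == serial
```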
2018b
- (Moritz et al., 2018) ⇒ Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. (2018). “Ray: A Distributed Framework for Emerging AI Applications.” In: Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18).
2018c
- (Liang et al., 2018) ⇒ Eric Liang, Richard Liaw, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. (2018). “RLlib: Abstractions for Distributed Reinforcement Learning.” In: Proceedings of Machine Learning Research.