Spark GraphX Graph Processing System
A Spark GraphX Graph Processing System is a distributed graph data processing system that is built on Apache Spark.
- Context:
- It was originally developed by Berkeley's AMPLab.
- It can support graph algorithms such as: PageRank, Connected Components, Label Propagation, SVD++, Strongly Connected Components, and Triangle Count.
- …
- Counter-Example(s):
- See: REST Web API.
References
2018
- https://spark.apache.org/docs/latest/graphx-programming-guide.html#overview
- QUOTE: GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
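The abstraction described in this quote can be illustrated with a minimal plain-Python sketch. It is not the GraphX API (GraphX itself is used from Spark); the names `PropertyGraph`, `Edge`, and `aggregate_messages` are hypothetical and only mimic the ideas of a directed multigraph with vertex and edge properties, a `subgraph` operator, and an `aggregateMessages`-style operator that sends a message along each edge and merges the messages arriving at the same vertex.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Edge:
    src: int
    dst: int
    attr: Any


class PropertyGraph:
    """Toy in-memory directed multigraph with properties on vertices and edges.

    Hypothetical illustration only; this is not the GraphX Graph class.
    """

    def __init__(self, vertices: Dict[int, Any], edges: List[Edge]):
        self.vertices = dict(vertices)
        self.edges = list(edges)  # a list, so parallel edges are allowed

    def subgraph(self, vpred: Callable[[int, Any], bool]) -> "PropertyGraph":
        # Keep vertices that satisfy the predicate, and edges whose endpoints both survive.
        kept = {vid: attr for vid, attr in self.vertices.items() if vpred(vid, attr)}
        kept_edges = [e for e in self.edges if e.src in kept and e.dst in kept]
        return PropertyGraph(kept, kept_edges)

    def aggregate_messages(
        self,
        send: Callable[[Edge, Any, Any], Iterable[Tuple[int, Any]]],
        merge: Callable[[Any, Any], Any],
    ) -> Dict[int, Any]:
        # send(edge, src_attr, dst_attr) yields (vertex_id, message) pairs;
        # merge combines two messages addressed to the same vertex.
        inbox: Dict[int, Any] = {}
        for e in self.edges:
            for target, msg in send(e, self.vertices[e.src], self.vertices[e.dst]):
                inbox[target] = merge(inbox[target], msg) if target in inbox else msg
        return inbox


graph = PropertyGraph(
    vertices={1: "alice", 2: "bob", 3: "carol"},
    edges=[
        Edge(1, 2, "follows"),
        Edge(1, 2, "likes"),  # parallel edge: same endpoints, different property
        Edge(3, 2, "follows"),
        Edge(2, 1, "follows"),
    ],
)

# In-degree of each vertex: every edge sends the message 1 to its destination.
in_degrees = graph.aggregate_messages(
    send=lambda e, src_attr, dst_attr: [(e.dst, 1)],
    merge=lambda a, b: a + b,
)
```

Vertices that receive no message (here vertex 3) are absent from the result, which is also how a message-aggregation operator of this kind is usually specified.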
2017
- https://spark.apache.org/graphx/
- QUOTE: GraphX is Apache Spark's API for graphs and graph-parallel computation.
2015
- http://en.wikipedia.org/wiki/Graph_database#Distributed_Graph_Processing
- GraphX is built on the Spark cluster computing system. Its project lead, Dr. Joseph Gonzalez, is also the creator of GraphLab.
2015
- https://amplab.cs.berkeley.edu/projects/graphx/
- Increasingly, data-science applications require the creation, manipulation, and analysis of large graphs ranging from social networks to language models. While existing graph systems (e.g., GraphBuilder, Titan, and Giraph) address specific stages of a typical graph-analytics pipeline (e.g., graph construction, querying, or computation), they do not address the entire pipeline, forcing the user to deal with multiple systems, complex and brittle file interfaces, and inefficient data-movement and duplication.
The GraphX project unifies graphs and tables enabling users to express an entire graph analytics pipeline within a single system. The GraphX interactive API makes it easy to build, query, and compute on large distributed graphs. In addition, GraphX includes a growing repository of graph algorithms for a range of analytics tasks. By casting recent advances in graph processing systems as distributed join optimizations, GraphX is able to achieve performance comparable to specialized graph processing systems while exposing a more flexible API. By building on top of recent advances in data-parallel systems, GraphX is able to achieve fault-tolerance while retaining in-memory performance and without the need for explicit checkpoint recovery.
GraphX is available as part of the Spark Apache Incubator project as of version 0.9.0, and the active research version of GraphX can be obtained from the github project page.
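The idea of unifying graphs and tables can be sketched in plain Python by storing a graph as a vertex table and an edge table and deriving a "triplet" view with a join. This is only an illustration under assumed data: the rows and the `triplets` helper below are hypothetical, and the single-machine hash join stands in for the distributed joins the quote refers to.

```python
from typing import List, Tuple

VertexRow = Tuple[int, str, int]  # (vertex id, name, age)
EdgeRow = Tuple[int, int, str]    # (source id, destination id, relation)

# Hypothetical example data.
vertex_rows: List[VertexRow] = [(1, "alice", 34), (2, "bob", 27), (3, "carol", 41)]
edge_rows: List[EdgeRow] = [(1, 2, "follows"), (3, 2, "follows"), (2, 1, "follows")]


def triplets(vertices: List[VertexRow], edges: List[EdgeRow]):
    """Join each edge row with the rows of its source and destination vertices."""
    by_id = {vid: (name, age) for vid, name, age in vertices}  # build side of a hash join
    return [
        (by_id[src], relation, by_id[dst])
        for src, dst, relation in edges
        if src in by_id and dst in by_id
    ]


# A table-style query over the graph view:
# pairs (follower, followed) where the follower is older than the person followed.
older_followers = [
    (src[0], dst[0])
    for src, relation, dst in triplets(vertex_rows, edge_rows)
    if relation == "follows" and src[1] > dst[1]
]
```

Because the graph is just two tables plus a join, the same data can be filtered and aggregated like a table or traversed like a graph without moving it between systems.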