Spark User-Defined Function (UDF)
A Spark User-Defined Function (UDF) is a user-defined function that extends the built-in function vocabulary of Apache Spark and can be applied column-wise to Spark DataFrames and Datasets.
- See: SQL UDF.
References
2018
- https://changhsinlee.com/pyspark-udf/
- QUOTE: Here’s the problem: I have a Python function that iterates over my data, but going through each row in the dataframe takes several days. If I have a computing cluster with many nodes, how can I distribute this Python function in PySpark to speed up this process — maybe cut the total time down to less than a few hours — with the least amount of work?
In other words, how do I turn a Python function into a Spark user defined function, or UDF? I’ll explain my solution here. ...
Registering a UDF: PySpark UDFs work in a similar way as the pandas .map() and .apply() methods for pandas series and dataframes. If I have a function that can use values from a row in the dataframe as input, then I can map it to the entire dataframe. The only difference is that with PySpark UDFs I have to specify the output data type.
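The registration pattern described in the quote above can be sketched as follows; this is a minimal, hedged example in which the sample data, column names, and the capitalize_name function are assumptions, not part of the cited post. The key point it illustrates is that, unlike pandas .map()/.apply(), a PySpark UDF must declare its output data type.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

# Hypothetical sample dataframe for illustration.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# An ordinary Python function operating on a single value.
def capitalize_name(name):
    return name.capitalize()

# Wrap it as a Spark UDF; the output type must be specified explicitly.
capitalize_udf = udf(capitalize_name, StringType())

# Apply the UDF column-wise, analogous to mapping over a pandas Series.
df.withColumn("name_cap", capitalize_udf(df["name"])).show()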
2017
- https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html
- QUOTE: User-Defined Functions (aka UDF) is a feature of Spark SQL to define new Column-based functions that extend the vocabulary of Spark SQL’s DSL for transforming Datasets.
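As a sketch of how a UDF extends Spark SQL's vocabulary, a function can also be registered under a name and then invoked from SQL like a built-in; the function name squared and the temporary view numbers below are hypothetical choices for this example.

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("sql-udf-example").getOrCreate()

# Register a Python function under a SQL-callable name with an explicit return type.
spark.udf.register("squared", lambda x: x * x, IntegerType())

# Hypothetical table for demonstration.
spark.createDataFrame([(1,), (2,), (3,)], ["id"]).createOrReplaceTempView("numbers")

# The registered UDF can now be used in SQL alongside built-in functions.
spark.sql("SELECT id, squared(id) AS id_squared FROM numbers").show()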