
Creating User Defined Function in Spark-SQL


You can do this, at least for filtering, if you're willing to use a language-integrated query.

For a data file dates.txt containing:

one,2014-06-01
two,2014-07-01
three,2014-08-01
four,2014-08-15
five,2014-09-15

You can pack as much Scala date magic in your UDF as you want but I'll keep it simple:

def myDateFilter(date: String) = date contains "-08-"

Set it all up as follows -- a lot of this is from the Programming guide.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// case class for your records
case class Entry(name: String, when: String)

// read and parse the data
val entries = sc.textFile("dates.txt").map(_.split(",")).map(e => Entry(e(0), e(1)))

You can use the UDF as part of your WHERE clause:

val augustEntries = entries.where('when)(myDateFilter).select('name, 'when)

and see the results:

augustEntries.map(r => r(0)).collect().foreach(println)
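
For the sample dates.txt above, this should print three and four (the two entries with August dates).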

Notice the version of the where method I've used, declared as follows in the doc:

def where[T1](arg1: Symbol)(udf: (T1) ⇒ Boolean): SchemaRDD

So, the UDF can only take one argument, but you can compose several .where() calls to filter on multiple columns.
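
For example, here is a minimal sketch of chaining two filters (myNameFilter is a hypothetical second predicate, not part of the original example):

// a second single-argument UDF, filtering on the name column
def myNameFilter(name: String) = name startsWith "f"

// chain where() calls; each returns a SchemaRDD, so they compose
val augustFs = entries
  .where('when)(myDateFilter)
  .where('name)(myNameFilter)
  .select('name, 'when)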

Edit for Spark 1.2.0 (and really 1.1.0 too)

While it's not really documented, Spark now supports registering a UDF so it can be queried from SQL.

The above UDF could be registered using:

sqlContext.registerFunction("myDateFilter", myDateFilter)

and if the table was registered

sqlContext.registerRDDAsTable(entries, "entries")

it could be queried using

sqlContext.sql("SELECT * FROM entries WHERE myDateFilter(when)")

For more details see this example.


In Spark 2.0, you can do this:

// define the UDF
def convert2Years(date: String) = date.substring(7, 11)

// register to session
sparkSession.udf.register("convert2Years", convert2Years(_: String))

val moviesDf = getMoviesDf // create dataframe usual way
moviesDf.createOrReplaceTempView("movies") // 'movies' is used in sql below
val years = sparkSession.sql("select convert2Years(releaseDate) from movies")
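
If you prefer the DataFrame API over SQL strings, the same function can also be wrapped with org.apache.spark.sql.functions.udf; a minimal sketch, reusing convert2Years and moviesDf from above:

import org.apache.spark.sql.functions.{col, udf}

// wrap the Scala function as a column-expression UDF
val convert2YearsUdf = udf(convert2Years(_: String))

// use it directly on DataFrame columns instead of in SQL text
val yearsDf = moviesDf.select(convert2YearsUdf(col("releaseDate")))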


In PySpark 1.5 and above, we can easily achieve this with a built-in function.

Following is an example:

from pyspark.sql.functions import to_date

raw_data = [("2016-02-27 23:59:59", "Gold", 97450.56),
            ("2016-02-28 23:00:00", "Silver", 7894.23),
            ("2016-02-29 22:59:58", "Titanium", 234589.66)]
Time_Material_revenue_df = sqlContext.createDataFrame(raw_data, ["Sold_time", "Material", "Revenue"])

Day_Material_revenue_df = Time_Material_revenue_df.select(to_date("Sold_time").alias("Sold_day"), "Material", "Revenue")