Spark DataFrame TimestampType - how to get Year, Month, Day values from field?


Since Spark 1.5 you can use a number of date processing functions:

import datetime
from pyspark.sql.functions import year, month, dayofmonth

elevDF = sc.parallelize([
    (datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
]).toDF(["date", "hour", "value"])

elevDF.select(
    year("date").alias('year'),
    month("date").alias('month'),
    dayofmonth("date").alias('day')
).show()

# +----+-----+---+
# |year|month|day|
# +----+-----+---+
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# +----+-----+---+
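If you want to keep the original columns and simply append the extracted parts, the same functions work with withColumn as well. A minimal sketch, assuming the elevDF built above (elevDF_with_parts is just an illustrative name):

from pyspark.sql.functions import year, month, dayofmonth

# Append year/month/day columns next to the existing ones instead of
# projecting them out with select() ("elevDF_with_parts" is illustrative)
elevDF_with_parts = (elevDF
    .withColumn("year", year("date"))
    .withColumn("month", month("date"))
    .withColumn("day", dayofmonth("date")))

elevDF_with_parts.show()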

You can also use a simple map over the underlying RDD, as with any other RDD:

from pyspark.sql import Row  # needed for the Row(...) constructor below

elevDF = sqlContext.createDataFrame(sc.parallelize([
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]))

(elevDF
 .rdd  # drop to the underlying RDD; DataFrame.map was removed in Spark 2.x
 .map(lambda row: (row.date.year, row.date.month, row.date.day))
 .collect())

and the result is:

[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]

By the way, datetime.datetime stores the hour anyway, so keeping it in a separate column seems to be a waste of memory.
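If the hour were folded into the timestamp itself, it could be recovered the same way as the other parts. A minimal sketch, assuming the elevDF defined above (whose sample timestamps all happen to carry hour 0):

from pyspark.sql.functions import hour

# Pull the hour straight from the timestamp column; for the sample data
# above this yields 0 for every row, since the timestamps have hour 0.
elevDF.select(hour("date").alias("hour_from_ts")).show()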


You can use the functions in pyspark.sql.functions, such as year, month, etc.

Refer to: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

from pyspark.sql.functions import *

newdf = elevDF.select(
    year(elevDF.date).alias('dt_year'),
    month(elevDF.date).alias('dt_month'),
    dayofmonth(elevDF.date).alias('dt_day'),
    dayofyear(elevDF.date).alias('dt_dayofy'),
    hour(elevDF.date).alias('dt_hour'),
    minute(elevDF.date).alias('dt_min'),
    weekofyear(elevDF.date).alias('dt_week_no'),
    unix_timestamp(elevDF.date).alias('dt_int'))

newdf.show()

+-------+--------+------+---------+-------+------+----------+----------+
|dt_year|dt_month|dt_day|dt_dayofy|dt_hour|dt_min|dt_week_no|    dt_int|
+-------+--------+------+---------+-------+------+----------+----------+
|   2015|       9|     6|      249|      0|     0|        36|1441497601|
|   2015|       9|     6|      249|      0|     0|        36|1441497601|
|   2015|       9|     6|      249|      0|     0|        36|1441497603|
|   2015|       9|     6|      249|      0|     1|        36|1441497694|
|   2015|       9|     6|      249|      0|    20|        36|1441498808|
|   2015|       9|     6|      249|      0|    20|        36|1441498811|
|   2015|       9|     6|      249|      0|    20|        36|1441498815|
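If you only need the parts as a formatted string rather than separate numeric columns, pyspark.sql.functions also provides date_format. A minimal sketch, assuming the same elevDF (the dt_str alias is just illustrative):

from pyspark.sql.functions import date_format

# Render the timestamp as a "yyyy-MM-dd" string in a single expression
# ("dt_str" is an illustrative alias)
elevDF.select(date_format(elevDF.date, "yyyy-MM-dd").alias("dt_str")).show()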


Actually, we do not need to import any Python library at all; the year, month, and day can be separated with plain SQL. See the example below, where the DataFrame has a single date column named _c0:

+----------+
|       _c0|
+----------+
|1872-11-30|
|1873-03-08|
|1874-03-07|
|1875-03-06|
|1876-03-04|
|1876-03-25|
|1877-03-03|
|1877-03-05|
|1878-03-02|
|1878-03-23|
|1879-01-18|

I have a date column in my DataFrame which contains the day, month, and year; assume I want to extract only the year from that column.

df.createOrReplaceTempView("res")
sqlDF = spark.sql("SELECT EXTRACT(year FROM `_c0`) FROM res")

Here I create a temporary view and select the year values with this single query; the output will be:

+-----------------------+
|year(CAST(_c0 AS DATE))|
+-----------------------+
|                   1872|
|                   1873|
|                   1874|
|                   1875|
|                   1876|
|                   1876|
|                   1877|
|                   1877|
|                   1878|
|                   1878|
|                   1879|
|                   1879|
|                   1879|
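The same approach pulls out the month and day as well. A minimal sketch, assuming the res view registered above and a Spark version that supports EXTRACT (the column aliases are just illustrative):

# Extract year, month, and day from the date column in one SQL statement
# (assumes the "res" temporary view created above; aliases are illustrative)
sqlDF = spark.sql("""
    SELECT EXTRACT(year  FROM `_c0`) AS year,
           EXTRACT(month FROM `_c0`) AS month,
           EXTRACT(day   FROM `_c0`) AS day
    FROM res
""")
sqlDF.show()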