I have the following Spark DataFrame:
sale_id / created_at
1 / 2016-05-28T05:53:31.042Z
2 / 2016-05-30T12:50:58.184Z
3 / 2016-05-23T10:22:18.858Z
4 / 2016-05-27T09:20:15.158Z
5 / 2016-05-21T08:30:17.337Z
6 / 2016-05-28T07:41:14.361Z
I need to add a year_week column that contains the year and week of each row's created_at value:
sale_id / created_at / year_week
1 / 2016-05-28T05:53:31.042Z / 2016-21
2 / 2016-05-30T12:50:58.184Z / 2016-22
3 / 2016-05-23T10:22:18.858Z / 2016-21
4 / 2016-05-27T09:20:15.158Z / 2016-21
5 / 2016-05-21T08:30:17.337Z / 2016-20
6 / 2016-05-28T07:41:14.361Z / 2016-21
PySpark, SparkR, or Spark SQL would all be acceptable. I have already tried the lubridate package, but my column is S4 and I get the error below:
Error in as.Date.default(head_df$created_at) :
  do not know how to convert 'head_df$created_at' to class "Date"
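For reference, the input frame can be rebuilt in a few lines of PySpark (a minimal sketch using the sample rows from the table above; it assumes pyspark is installed and simply creates its own SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample rows copied from the table above; created_at is a plain ISO-8601 string
df = spark.createDataFrame(
    [
        (1, "2016-05-28T05:53:31.042Z"),
        (2, "2016-05-30T12:50:58.184Z"),
        (3, "2016-05-23T10:22:18.858Z"),
        (4, "2016-05-27T09:20:15.158Z"),
        (5, "2016-05-21T08:30:17.337Z"),
        (6, "2016-05-28T07:41:14.361Z"),
    ],
    ["sale_id", "created_at"],
)

df.show(truncate=False)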
import org.apache.spark.sql.functions.{concat_ws, weekofyear, year}

val data = spark.read.option("header", "true").option("inferSchema", "true").csv("file location")
import spark.implicits._

// create the year column from created_at
val withYear = data.withColumn("year", year(data("created_at")))
// create the week column
val withWeek = withYear.withColumn("week", weekofyear(withYear("created_at")))
// concatenate the year and week columns into year_week
val new_df = withWeek.withColumn("year_week", concat_ws("-", withWeek("year"), withWeek("week")))
new_df.show()
// NOTE: this is Scala code. I haven't tested it in an IDE, but it should work fine.
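Since the question asks for PySpark, here is a rough, untested equivalent of the same approach: year and weekofyear from pyspark.sql.functions produce the two parts, and concat_ws joins them with "-". The to_timestamp call and its format string are an assumption about how the ISO-8601 created_at string should be parsed (to_timestamp is available from Spark 2.2 onward); if created_at is already a timestamp column, it can be used directly.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, to_timestamp, weekofyear, year

spark = SparkSession.builder.getOrCreate()

# a few of the sample rows from the question; created_at starts out as a string
df = spark.createDataFrame(
    [
        (1, "2016-05-28T05:53:31.042Z"),
        (2, "2016-05-30T12:50:58.184Z"),
        (5, "2016-05-21T08:30:17.337Z"),
    ],
    ["sale_id", "created_at"],
)

# parse the string into a timestamp, then build "year-week" from year() and weekofyear()
ts = to_timestamp(col("created_at"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
result = df.withColumn("year_week", concat_ws("-", year(ts), weekofyear(ts)))

result.show(truncate=False)
# expected year_week values for these rows: 2016-21, 2016-22, 2016-20

Note that weekofyear returns the ISO week number, which matches the expected output in the question (for example, 2016-05-28 falls in week 21).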