I want to achieve the below:
lag(column1, datediff(column2, column3)).over(window)
The offset is dynamic. I have tried using UDF as well, but it didn't work.
Any thoughts on how to achieve the above?
Vikas Sharma
2,2061 gold badge15 silver badges26 bronze badges
asked Aug 30, 2017 at 13:00
asif syed
311 gold badge1 silver badge2 bronze badges
-
check the answer in here: stackoverflow.com/questions/36725353/… otherwise put more details about the problems and the data setMohamed Ali JAMAOUI– Mohamed Ali JAMAOUI2017年08月30日 13:04:34 +00:00Commented Aug 30, 2017 at 13:04
1 Answer 1
The argument count of the lag function takes an integer not a column object :
psf.lag(col, count=1, default=None)
Therefore it cannot be a "dynamic" value. Instead you can build your lag in a column and then join the table with itself.
First let's create our dataframe:
df = spark.createDataFrame(
sc.parallelize(
[[1, "2011-01-01"], [1, "2012-01-01"], [2, "2013-01-01"], [1, "2014-01-01"]]
),
["int", "date"]
)
We want to enumerate the rows:
from pyspark.sql import Window
import pyspark.sql.functions as psf
df = df.withColumn(
"id",
psf.monotonically_increasing_id()
)
w = Window.orderBy("id")
df = df.withColumn("rn", psf.row_number().over(w))
+---+----------+-----------+---+
|int| date| id| rn|
+---+----------+-----------+---+
| 1|2011年01月01日|17179869184| 1|
| 1|2012年01月01日|42949672960| 2|
| 2|2013年01月01日|68719476736| 3|
| 1|2014年01月01日|94489280512| 4|
+---+----------+-----------+---+
Now to build the lag:
df1 = df.select(
"int",
df.date.alias("date1"),
(df.rn - df.int).alias("rn")
)
df2 = df.select(
df.date.alias("date2"),
'rn'
)
Finally we can join them and compute the date difference:
df1.join(df2, "rn", "inner").withColumn(
"date_diff",
psf.datediff("date1", "date2")
).drop("rn")
+---+----------+----------+---------+
|int| date1| date2|date_diff|
+---+----------+----------+---------+
| 1|2012年01月01日|2011年01月01日| 365|
| 2|2013年01月01日|2011年01月01日| 731|
| 1|2014年01月01日|2013年01月01日| 365|
+---+----------+----------+---------+
answered Aug 30, 2017 at 14:57
MaFF
10.2k2 gold badges39 silver badges43 bronze badges
Sign up to request clarification or add additional context in comments.
Comments
Explore related questions
See similar questions with these tags.
lang-py