2

I have a PySpark dataframe like this but with a lot more data:

user_id event_date
123 '2024-01-01 14:45:12.00'
123 '2024-01-02 14:45:12.00'
456 '2024-01-01 14:45:12.00'
456 '2024-03-01 14:45:12.00'

I drop duplicates of users, leaving the last event. I am using something like this:

df = df.orderBy(['user_id', 'event_date'], ascending=False).dropDuplicates(['user_id'])

When I was searching for the solution to some other problem, I found information that this approach may be non-deterministic. Am I doing it wrong? Should I use window functions instead?

Vikas Sharma
2,2061 gold badge15 silver badges26 bronze badges
asked Dec 6, 2024 at 10:38

1 Answer 1

2

When you call dropDuplicates() without passing any columns to it - it just drops the identical rows, no matter what, for all columns (so we may kind of call it "deterministic" - as all columns will have same values in different rows being dropped and only one is kept - doesn't matter which one).

This non-deterministic behaviour of dropDuplicates(), when passing a subset of columns, is a known issue and using orderBy() doesn't enforce determinism either because it's not guaranteed that the ordering will be maintained after applying dropDuplicates() due to Spark's internal implementations (partitioning, logical and physical plans etc).

It's recommended to use Windows functions instead:

from pyspark.sql import functions as F
from pyspark.sql.window import Window
window_spec = Window.partitionBy("user_id").orderBy(F.col("event_date").desc())
df_with_rank = df.withColumn("row_number", F.row_number().over(window_spec))
df_last_event = df_with_rank.filter(F.col("row_number") == 1).drop("row_number")
answered Dec 6, 2024 at 13:34
Sign up to request clarification or add additional context in comments.

2 Comments

"it's not guaranteed that the ordering will be maintained .." You mean that if I will do orderBy([order_columns]).dropDuplicates([drop_columns]) several times, each time result may be different?
@Myakotka247 Yes.

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.