As per the documentation (https://docs.databricks.com/en/optimizations/spark-ui-guide/one-spark-task.html), a window function without a PARTITION BY clause results in a single task in Spark.
Is this true? Given that Spark does distributed parallel processing, wouldn't Spark first perform the window aggregation (for example, max(date) over (order by some_column) or row_number() over (order by date_col)) at the partition level and then combine the results from all partitions? Why does it result in a single task?
1 Answer
Window functions based on a global ordering cannot be computed across multiple partitions.
For example, the value of row_number() over (order by date_col) for a given row depends on the count of ALL rows, across ALL partitions, with a lower date_col. Therefore all the data needs to be gathered into a single partition, sorted, and then assigned row numbers one by one.
There are some tricks to overcome this. You could precalculate a rolling sum of per-partition row counts and add the resulting offsets to the partition-local row numbers. This approach is used in the Dataset.withRowNumbers function, but you would need to reimplement it for other functions like max.
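For illustration, here is a minimal PySpark sketch of that offset trick (this is not the actual withRowNumbers implementation, and the helper name with_global_row_numbers is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def with_global_row_numbers(df, order_col):
    # Range-partition and sort so rows are globally ordered across partitions.
    ordered = df.repartitionByRange(order_col).sortWithinPartitions(order_col)
    rdd = ordered.rdd

    # Step 1: count the rows in each partition (a cheap extra job).
    counts = rdd.mapPartitionsWithIndex(
        lambda idx, rows: [(idx, sum(1 for _ in rows))]
    ).collect()

    # Step 2: rolling sum of the counts = the offset each partition
    # must add to its partition-local row numbers.
    offsets, running = {}, 0
    for idx, n in sorted(counts):
        offsets[idx] = running
        running += n

    # Step 3: assign local row number + offset, fully in parallel.
    def assign(idx, rows):
        for local, row in enumerate(rows, start=1):
            yield row + (offsets[idx] + local,)

    return rdd.mapPartitionsWithIndex(assign).toDF(ordered.columns + ["row_number"])
```

The price is an extra pass over the data to collect the per-partition counts, but both passes run in parallel instead of funnelling everything through one task.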
2 Comments
So Spark (a) computes max() within each partition, and (b) then moves the max() of each partition to a single executor to arrive at the final max() value?
There is a difference between max(...) (1 record on output; here yes, we can aggregate at the partition level and then combine the results) and max(...) over (order by ...) (N records on output, one per input row).
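A small sketch making that distinction concrete (date_col is taken from the question; the comments describe the plans you would typically see, which vary by Spark version):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumnRenamed("id", "date_col")

# Plain aggregate: one record on output. Spark computes a partial max
# per partition and then combines the partial results -- fully parallel.
df.agg(F.max("date_col")).explain()

# Windowed max over a global ORDER BY: one record on output per input
# row. Each row's running max depends on every row sorted before it,
# so the plan shuffles all data into a single partition first.
df.select(F.max("date_col").over(Window.orderBy("date_col"))).explain()
```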
Running explain() on your dataframe is the best way to see whether it's true or not.
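For example (the exact plan text varies by Spark version, but a window with no PARTITION BY typically shows an Exchange SinglePartition node):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumnRenamed("id", "date_col")

df.withColumn("rn", F.row_number().over(Window.orderBy("date_col"))).explain()
# == Physical Plan == (abridged)
# Window [row_number() ... ORDER BY date_col ASC NULLS FIRST ...]
# +- Sort [date_col ASC NULLS FIRST], false, 0
#    +- Exchange SinglePartition
#       +- ...
```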