As per the documentation (https://docs.databricks.com/en/optimizations/spark-ui-guide/one-spark-task.html), a window function without a PARTITION BY clause results in a single task in Spark.
Is this true? Given that Spark does distributed parallel processing, wouldn't Spark first perform the window aggregation (for example, max(date) over (order by some_column) or row_number() over (order by date_col)) at the partition level and then combine the results from all partitions? Why does it result in a single task?
1 Answer
Window functions based on a global ordering cannot be computed across multiple partitions.
For example, the value of row_number() over (order by date_col) for a given row depends on the count of ALL rows, across ALL partitions, with a lower date_col. Therefore all the data needs to be gathered into a single partition, sorted, and then assigned row numbers one by one.
There are some tricks to overcome this. You could precalculate a rolling sum of per-partition row counts and add the resulting offsets to the partition-local row numbers. This approach is used in the Dataset.withRowNumbers function, but you would need to reimplement it for other functions like max.
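For illustration, here is a minimal PySpark sketch of that offset trick (this is not the actual withRowNumbers implementation, and the helper name with_global_row_numbers is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def with_global_row_numbers(df, order_col):
    # Range-partition and sort so rows are globally ordered across partitions.
    ordered = df.repartitionByRange(order_col).sortWithinPartitions(order_col)
    rdd = ordered.rdd

    # Step 1: count the rows in each partition (a cheap extra job).
    counts = rdd.mapPartitionsWithIndex(
        lambda idx, rows: [(idx, sum(1 for _ in rows))]
    ).collect()

    # Step 2: rolling sum of the counts = the offset each partition
    # must add to its partition-local row numbers.
    offsets, running = {}, 0
    for idx, n in sorted(counts):
        offsets[idx] = running
        running += n

    # Step 3: assign local row number + offset, fully in parallel.
    def assign(idx, rows):
        for local, row in enumerate(rows, start=1):
            yield row + (offsets[idx] + local,)

    return rdd.mapPartitionsWithIndex(assign).toDF(ordered.columns + ["row_number"])
```

The price is an extra pass over the data to collect the per-partition counts, but both passes run in parallel instead of funnelling everything through one task.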
2 Comments
So Spark (a) computes max() within each partition, and (b) then moves the max() of each partition to a single executor to arrive at the final max() value?
There is a difference between max(...) (1 record on output; here yes, we can aggregate at the partition level and then combine the results) and max(...) over (order by ...) (N records on output, one per input row).
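A small sketch making that distinction concrete (date_col is taken from the question; the comments describe the plans you would typically see, which vary by Spark version):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumnRenamed("id", "date_col")

# Plain aggregate: one record on output. Spark computes a partial max
# per partition and then combines the partial results -- fully parallel.
df.agg(F.max("date_col")).explain()

# Windowed max over a global ORDER BY: one record on output per input
# row. Each row's running max depends on every row sorted before it,
# so the plan shuffles all data into a single partition first.
df.select(F.max("date_col").over(Window.orderBy("date_col"))).explain()
```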
Running explain() on your dataframe is the best way to see whether it's true or not.
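For example (the exact plan text varies by Spark version, but a window with no PARTITION BY typically shows an Exchange SinglePartition node):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumnRenamed("id", "date_col")

df.withColumn("rn", F.row_number().over(Window.orderBy("date_col"))).explain()
# == Physical Plan == (abridged)
# Window [row_number() ... ORDER BY date_col ASC NULLS FIRST ...]
# +- Sort [date_col ASC NULLS FIRST], false, 0
#    +- Exchange SinglePartition
#       +- ...
```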