I want to run the following code:
df = df.coalesce(1).orderBy(["my_col"])
but its execution will obviously bottleneck on a single task doing all the sort work.
I know it's possible to run the following:
df = df.orderBy(["my_col"]).coalesce(1)
however, I am uncertain whether Spark maintains the ordering after the partitions are collapsed. Does it?
If it does, the second version would be preferable, since the sort would be performed in a distributed fashion and the results merged afterwards; I am just worried that the ordering might not be preserved.
If it is preserved, this would mean the two are commutative!
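One way to check empirically (a minimal sketch; it uses a local SparkSession and a small random dataset standing in for df, with "my_col" kept as the column name, and it only demonstrates observed behaviour on a toy example, not a formal guarantee):

import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Small unordered stand-in for df, spread across several partitions.
data = [(v,) for v in random.sample(range(1000), 200)]
test_df = spark.createDataFrame(data, ["my_col"]).repartition(4)

# Sort first, then collapse to a single partition.
collapsed = test_df.orderBy("my_col").coalesce(1)

# Compare the collected values against the expected sorted sequence.
values = [row.my_col for row in collapsed.collect()]
print(values == sorted(values))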
2 Answers
It's easy to see what Spark will do by using explain():
> df = spark.range(1,100)
> df.coalesce(1).orderBy('id').explain()
== Physical Plan ==
*(2) Sort [id#0L ASC NULLS FIRST], true, 0
+- Coalesce 1
   +- *(1) Range (1, 100, step=1, splits=4)
So the answer is: no, they are not commutative. In this plan the Sort runs after Coalesce 1, i.e. over a single partition, which is exactly the single-task bottleneck described in the question.
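For comparison, the same technique can be applied to the other call order (a sketch only; the exact plan text depends on the Spark version, so it is described in comments rather than reproduced):

> df = spark.range(1, 100)
> df.orderBy('id').coalesce(1).explain()
# Here the Coalesce node typically appears above the Sort instead of below it,
# so the two call orders produce different physical plans. Note that coalesce
# is a narrow transformation, so collapsing to one partition can still force
# the stage that performs the sort to run as a single task, which is the
# caveat raised in the answer below.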
The first form, df.coalesce(1).orderBy("my_col"), is not efficient because coalesce(1) brings no real benefit before a shuffle-inducing operation like orderBy. The second form, df.orderBy("my_col").coalesce(1), can be used to write out a single sorted file, but it can be very inefficient for large datasets. Consider the size of your data and your cluster resources before combining coalesce(1) with orderBy.
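For example, a single sorted output file could be produced like this (a sketch only; the output path and file format are placeholders, and df / my_col are the DataFrame and column from the question):

# Sort, then collapse to one partition so the write produces a single file.
# The final stage runs as one task, so this is only advisable when the
# sorted data comfortably fits on a single executor.
(df.orderBy("my_col")
   .coalesce(1)
   .write
   .mode("overwrite")
   .parquet("/tmp/sorted_output"))  # hypothetical output path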