I want to run the following code:
df = df.coalesce(1).orderBy(["my_col"])
but its execution will obviously bottleneck on a single task doing all the sort work.
I know it's possible to run the following:
df = df.orderBy(["my_col"]).coalesce(1)
however, I am uncertain whether Spark maintains the ordering after the partitions are collapsed. Does it?
If it does, the second version would be preferable, since the sort would be performed in a distributed fashion and the results merged afterwards; I am just worried that the ordering might not be preserved.
If it is preserved, this would mean the two are commutative!
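One way to check empirically (a minimal sketch; it uses a local SparkSession and a small random dataset standing in for df, with "my_col" kept as the column name, and it only demonstrates observed behaviour on a toy example, not a formal guarantee):

import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Small unordered stand-in for df, spread across several partitions.
data = [(v,) for v in random.sample(range(1000), 200)]
test_df = spark.createDataFrame(data, ["my_col"]).repartition(4)

# Sort first, then collapse to a single partition.
collapsed = test_df.orderBy("my_col").coalesce(1)

# Compare the collected values against the expected sorted sequence.
values = [row.my_col for row in collapsed.collect()]
print(values == sorted(values))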
2 Answers
It's easy to see what Spark will do by using explain():
> df = spark.range(1,100)
> df.coalesce(1).orderBy('id').explain()
== Physical Plan ==
*(2) Sort [id#0L ASC NULLS FIRST], true, 0
+- Coalesce 1
   +- *(1) Range (1, 100, step=1, splits=4)
So the answer is: no, they are not commutative. In this plan the Sort runs after Coalesce 1, i.e. over a single partition, which is exactly the single-task bottleneck described in the question.
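For comparison, the same technique can be applied to the other call order (a sketch only; the exact plan text depends on the Spark version, so it is described in comments rather than reproduced):

> df = spark.range(1, 100)
> df.orderBy('id').coalesce(1).explain()
# Here the Coalesce node typically appears above the Sort instead of below it,
# so the two call orders produce different physical plans. Note that coalesce
# is a narrow transformation, so collapsing to one partition can still force
# the stage that performs the sort to run as a single task, which is the
# caveat raised in the answer below.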
The first form, df.coalesce(1).orderBy("my_col"), is not efficient because coalesce(1) brings no real benefit before a shuffle-inducing operation like orderBy. The second form, df.orderBy("my_col").coalesce(1), can be used to write out a single sorted file, but it can be very inefficient for large datasets. Consider the size of your data and your cluster resources before combining coalesce(1) with orderBy.
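For example, a single sorted output file could be produced like this (a sketch only; the output path and file format are placeholders, and df / my_col are the DataFrame and column from the question):

# Sort, then collapse to one partition so the write produces a single file.
# The final stage runs as one task, so this is only advisable when the
# sorted data comfortably fits on a single executor.
(df.orderBy("my_col")
   .coalesce(1)
   .write
   .mode("overwrite")
   .parquet("/tmp/sorted_output"))  # hypothetical output path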