9 questions
0 votes · 1 answer · 57 views
Why are 2 tables bucketed by col1 and joined on (col1, col2) shuffled?
// Enable all bucketing optimizations
spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false")
spark.conf.set("spark.sql.sources.bucketing.enabled"...
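A minimal sketch of the configuration the excerpt is reaching for (assumptions: a running SparkSession `spark`, and Spark 3.3+ for `spark.sql.requireAllClusterKeysForDistribution`):

```python
# Keep bucketing-based joins enabled (this is the default).
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")
# Allow the bucket distribution to satisfy a join whose keys are a
# superset of the bucket column (Spark 3.3+).
spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false")
# Disable auto-broadcast so the plan shows a sort-merge join, making it
# easy to check whether an Exchange (shuffle) was actually avoided.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```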
0 votes · 1 answer · 231 views
Bucket records into batches of a certain size in Snowflake
What would be the best way to bucket records into batches of a predefined size? I would like to tag each record with a batch/bucket number for further processing.
For example, let's say I have 1110 ...
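One common approach is the SQL idiom `CEIL(ROW_NUMBER() OVER (ORDER BY ...) / batch_size)`; the same arithmetic, sketched in plain Python (the record values here are placeholders):

```python
def assign_batches(records, batch_size):
    """Tag each record with a 1-based batch number, mirroring the SQL
    idiom CEIL(ROW_NUMBER() OVER (ORDER BY ...) / batch_size)."""
    return [(rec, i // batch_size + 1) for i, rec in enumerate(records)]

tagged = assign_batches(range(1110), 100)
# 1110 records at 100 per batch -> 12 batches; the last holds the final 10.
```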
0 votes · 1 answer · 272 views
'save' does not support bucketBy and sortBy right now
I am trying to apply bucketing to my dataframe when saving it to HDFS using the command below.
df.write
.format("parquet")
.bucketBy(200,"groupIdProjection")
.sortBy("...
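A sketch of the usual fix (assumptions: an existing SparkSession and DataFrame `df`; the table name is hypothetical). `bucketBy` records bucket metadata in the metastore, so it only works with `saveAsTable`; the path-based `save()` is what raises this error:

```python
# bucketBy requires a metastore table target, so use saveAsTable
# instead of save(path).
(df.write
   .format("parquet")
   .bucketBy(200, "groupIdProjection")
   .sortBy("groupIdProjection")
   .saveAsTable("my_bucketed_table"))  # hypothetical table name
```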
1 vote · 0 answers · 113 views
What is bucketBy equivalent in spark dataframe V2 API or Iceberg?
The Spark DataFrame V1 API has a bucketBy option.
df0.write
.bucketBy(50, "userid")
.saveAsTable("myHiveTable")
I don't see a similar option in the DataFrameWriterV2 API.
What ...
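A sketch of the closest V2 equivalent (assumptions: a SparkSession with an Iceberg catalog configured; the table name is hypothetical). In the V2 writer, bucketing is expressed as a partition transform via `pyspark.sql.functions.bucket`:

```python
from pyspark.sql.functions import bucket, col

# Iceberg treats bucketing as a partition transform rather than a
# writer option, so it goes into partitionedBy.
(df0.writeTo("catalog.db.my_table")   # hypothetical catalog/table
    .partitionedBy(bucket(50, col("userid")))
    .create())
```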
2 votes · 0 answers · 278 views
Why does Spark shuffle the data while joining two partitioned & bucketed tables?
I am trying to create a view on top of two tables.
Table 1:
Partitioned by col1
Bucketed by col2 (number of buckets: 3600)
Table 2:
Partitioned by col1
Bucketed by col2 (number of buckets: 3600)
View:
Table1
...
1 vote · 1 answer · 76 views
CQL: retrieve time-series data by time range
I have sensors at different locations, each measuring multiple parameters. There will be around 2 million measurements per day per sensor. I need to query by location/time range, but the range ...
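A common Cassandra pattern for this volume is to add a time bucket to the partition key so no single partition grows unbounded; a minimal sketch of the bucket computation (the sensor id and daily granularity are assumptions, chosen because ~2 million rows/day/sensor fits a day-sized partition):

```python
from datetime import datetime, timezone

def day_bucket(ts: datetime) -> str:
    """Day-level bucket key; one bucket per sensor per day keeps each
    Cassandra partition at a manageable size."""
    return ts.strftime("%Y-%m-%d")

# Partition key = (sensor_id, bucket); a time-range query then touches
# only the buckets between the start and end of the requested range.
key = ("sensor-42", day_bucket(datetime(2024, 5, 1, 13, 30, tzinfo=timezone.utc)))
# key -> ("sensor-42", "2024-05-01")
```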
1 vote · 0 answers · 752 views
Bucketed joins in PySpark/Iceberg
I'm trying to perform a join between two tables in PySpark using the Iceberg format. I'm trying to use bucketing to improve performance and avoid a shuffle, but it appears to have no effect ...
1 vote · 0 answers · 514 views
Bucketing values in Python
I want to split my values into buckets, associating each value with a bucket via a hash.
For hashing I am using Python's built-in hash() function, whose range is -sys.maxsize to sys.maxsize.
I have created a function to ...
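A minimal sketch of hash-based bucketing (the function name is hypothetical). The key detail is that `hash()` can be negative, but Python's `%` always returns a non-negative result for a positive modulus, so no `abs()` is needed:

```python
def bucket_for(value, n_buckets: int) -> int:
    """Map a value into one of n_buckets using Python's built-in hash().
    hash() may be negative, but % with a positive modulus always yields
    a value in [0, n_buckets)."""
    return hash(value) % n_buckets

buckets = [bucket_for(v, 10) for v in ("a", "b", "c")]
assert all(0 <= b < 10 for b in buckets)
```

Note that string hashes are randomized per interpreter run (PYTHONHASHSEED), so bucket assignments for strings are stable within a process but not across runs.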
0 votes · 1 answer · 374 views
Can I increase the number of buckets after table creation in Hive?
In Hive, once a table is created with n buckets, is there any way to increase the number of buckets?