Stack Overflow
0 votes
1 answer
39 views

// Enable all bucketing optimizations spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false") spark.conf.set("spark.sql.sources.bucketing.enabled"...
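
A minimal PySpark sketch of how those settings fit together, assuming a hypothetical `bucketed_users` table; whatever further optimizations the question enables are elided, so only the two visible confs appear here:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-demo").getOrCreate()

# Relax the requirement that join keys cover all bucket keys, and make
# sure bucketed reads are enabled at all.
spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false")
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")

# Hypothetical table: written bucketed so joins on user_id can avoid a shuffle.
df = spark.range(1_000_000).withColumnRenamed("id", "user_id")
(df.write
   .bucketBy(16, "user_id")
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("bucketed_users"))
```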
1 vote
0 answers
44 views

buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the ...
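
One possible approach, sketched with built-in higher-order functions rather than a per-row Python UDF; the sample data and the `wanted` prefix list are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data: each element of `buckets` looks like "prefix-suffix".
df = spark.createDataFrame([(["a-1", "b-2"],), (["c-9"],)], ["buckets"])
wanted = F.array(F.lit("a"), F.lit("c"))  # prefixes we care about

# Reduce each element to the text before the first '-', then compare
# with built-in array functions instead of a Python UDF.
prefixes = F.transform("buckets", lambda x: F.split(x, "-")[0])
df.withColumn("hit", F.arrays_overlap(prefixes, wanted)).show()
```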
Best practices
0 votes
5 replies
96 views

I have been working as a Data Engineer and ran into this issue. I came across a use case where I have a view (let's name it inputView) which is created by reading data from some source. Now somewhere ...
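
The question is truncated, so this is only a hedged sketch of the setup it describes: registering a view over a source read and caching it so later references don't recompute the source (the path and the follow-up query are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source read; any format behaves the same way here.
df = spark.read.parquet("/data/source")

# Register the view, then cache it so each downstream query that
# references inputView does not re-read the source.
df.createOrReplaceTempView("inputView")
spark.sql("CACHE TABLE inputView")

spark.sql("SELECT count(*) FROM inputView").show()
```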
Advice
0 votes
6 replies
163 views

So I am doing some SQL aggregation transformations of a dataset, and there is a certain condition that I would like to apply, but I'm not sure how. Here is a basic code block: le_test = spark.sql("""...
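
A common way to express a condition inside a SQL aggregation is a CASE WHEN inside the aggregate function; a minimal sketch with an invented table `t` and condition:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("a", 1), ("a", 5), ("b", 3)], ["grp", "val"]
).createOrReplaceTempView("t")

# Conditional aggregation: only rows matching the condition feed the sum.
le_test = spark.sql("""
    SELECT grp,
           SUM(CASE WHEN val > 2 THEN val ELSE 0 END) AS big_vals,
           COUNT(*)                                   AS n_rows
    FROM t
    GROUP BY grp
""")
le_test.show()
```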
0 votes
0 answers
88 views

I created a table as follows: CREATE TABLE IF NOT EXISTS raw_data.civ ( date timestamp, marketplace_id int, ... some more columns ) USING ICEBERG PARTITIONED BY ( marketplace_id, ...
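
A runnable sketch of that DDL, assuming a Spark session already configured with an Iceberg catalog; the question elides the rest of the partition spec, so the `days(date)` transform below is purely illustrative:

```python
# Assumes spark.sql.catalog.* is set up for Iceberg. The days(date)
# partition transform is an assumption, not taken from the question.
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw_data.civ (
        date           timestamp,
        marketplace_id int
    )
    USING ICEBERG
    PARTITIONED BY (marketplace_id, days(date))
""")
```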
3 votes
1 answer
134 views

I have a PySpark job that reads data from table a, performs some transformations and filters, and then writes the result to table b. Here’s a simplified version of the code: import pyspark.sql....
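
A hedged sketch of the job shape described; the actual transformations and filters are elided in the question, so the filter and derived column here are invented stand-ins:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Invented filter and derived column standing in for the elided logic.
df = spark.table("a")
out = (df
       .filter(F.col("status") == "active")
       .withColumn("ingested_at", F.current_timestamp()))
out.write.mode("overwrite").saveAsTable("b")
```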
0 votes
0 answers
97 views

I am observing different write behaviors when executing queries on an EMR Notebook (correct behavior) vs when using spark-submit to submit a Spark application to the EMR cluster (incorrect behavior). When I ...
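
One way to narrow down notebook-vs-spark-submit divergence is to dump the effective configuration in both environments and diff the output; a small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Run this in both the notebook and the spark-submit job, then diff the
# two outputs; Livy-backed notebooks and spark-submit often resolve
# different defaults, a frequent cause of divergent write behavior.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(f"{key}={value}")
```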
0 votes
0 answers
48 views

We have a scenario where we need to read a VSAM file directly, along with a copybook, to understand the column lengths; we were using the Cobrix library as part of the Spark read. However, we could see that the same is not properly ...
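
For reference, a minimal sketch of a Cobrix read; the paths are hypothetical, and `is_record_sequence` is a commonly used option for variable-length records rather than necessarily what this scenario needs:

```python
# Cobrix registers the "cobol" data source; the spark-cobol jar must be
# on the classpath (e.g. --packages za.co.absa.cobrix:spark-cobol_2.12:<version>).
df = (spark.read
      .format("cobol")
      .option("copybook", "/path/to/copybook.cpy")  # hypothetical path
      .option("is_record_sequence", "true")         # common for variable-length records
      .load("/path/to/vsam_extract"))               # hypothetical path
df.printSchema()
```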
0 votes
0 answers
41 views

I'm analyzing Spark event logs and have already retrieved the SparkListenerStageSubmitted and SparkListenerTaskEnd events to collect metrics such as spill, skew ratio, memory, and CPU usage. However, ...
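
Spark event logs are newline-delimited JSON with an "Event" discriminator field, so the two listener events can be pulled out with a plain parser; a sketch with a hypothetical log path:

```python
import json

# The path is hypothetical; event logs live wherever
# spark.eventLog.dir points.
events = []
with open("/tmp/spark-events/application_1234_0001") as f:
    for line in f:
        event = json.loads(line)
        if event.get("Event") in ("SparkListenerStageSubmitted",
                                  "SparkListenerTaskEnd"):
            events.append(event)

print(len(events), "stage/task events collected")
```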
0 votes
0 answers
63 views

I have a Scala (v2.12.15) Spark (v3.5.1) job that works correctly and looks something like this: import org.apache.spark.sql.DataFrame ... val myDataFrame = myReadDataFunction(...) ....
0 votes
0 answers
57 views

I'm currently working with a specific version of Apache Spark (3.1.1) that I cannot upgrade. Because of that I can't use a current Apache Sedona, and version 1.3.1 is too slow. My problem is the following code that ...
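
Absent Sedona, one fallback on Spark 3.1.1 is running Shapely inside a pandas UDF, which crosses the Python boundary per batch rather than per row; a sketch with an illustrative polygon, not the question's actual geometry logic:

```python
import pandas as pd
from shapely import wkt
from shapely.geometry import Point
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Illustrative polygon; the question's real geometry is elided.
POLY = wkt.loads("POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))")

@F.pandas_udf(BooleanType())
def inside(lon: pd.Series, lat: pd.Series) -> pd.Series:
    # Evaluated once per Arrow batch instead of once per row.
    return pd.Series([POLY.contains(Point(x, y)) for x, y in zip(lon, lat)])

# usage: df = df.withColumn("in_area", inside("lon", "lat"))
```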
1 vote
4 answers
115 views

Does anybody know what I am doing wrong? The following is a reduced code snippet that works in spark-3.x but doesn't work in spark-4.x. In my use case I need to pass a complex data structure to a UDF (let's say ...
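
The snippet is truncated, so this is only a generic sketch of passing a struct into a Python UDF and reading its fields by name, which tends to be more stable across major versions than positional access; all names are invented:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "x", 2)], ["id", "name", "qty"])

# The struct arrives in the UDF as a Row; access fields by name.
@F.udf(returnType=StringType())
def describe(p):
    return f"{p['name']}:{p['qty']}"

df.select(describe(F.struct("name", "qty")).alias("d")).show()
```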
0 votes
1 answer
147 views

I am trying to read the _delta_log folder of a Delta Lake table via Spark to export some custom metrics. I have figured out how to get some metrics from the history and description, but I have a problem ...
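
The commit files under _delta_log are plain JSON lines, so they can be read directly; a sketch with a hypothetical table path, selecting the commitInfo fields that usually carry operation metrics:

```python
# Hypothetical table path; each commit file holds one JSON action per line.
log = spark.read.json("/path/to/table/_delta_log/*.json")
(log.where("commitInfo IS NOT NULL")
    .select("commitInfo.operation", "commitInfo.operationMetrics")
    .show(truncate=False))
```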
1 vote
0 answers
143 views

When I try to convert a PySpark DataFrame with a VariantType column to a pandas DataFrame, the conversion fails with the error 'NoneType' object is not iterable. Am I doing it incorrectly? Sample code: ...
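
The sample code is truncated; as a hedged workaround sketch, serializing the variant column to JSON text before toPandas() sidesteps the Arrow conversion of VariantType (assuming to_json accepts variant input in your Spark 4.x build; `variant_col` is a hypothetical name):

```python
from pyspark.sql import functions as F

# Convert the variant to a JSON string, drop the original column,
# and only then collect to pandas.
pdf = (df
       .withColumn("variant_json", F.to_json("variant_col"))
       .drop("variant_col")
       .toPandas())
```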
3 votes
0 answers
78 views

I am trying to write a custom decoder function in Java targeting Spark 4.0: public class MyDataToCatalyst extends UnaryExpression implements NonSQLExpression, ExpectsInputTypes, Serializable { //.....
