Newest 'apache-spark-sql' Questions


26,919 questions

0 votes

1 answer

39 views

Why 2 tables bucketed by col1 and joined by (col1, col2) are shuffled?

// Enable all bucketing optimizations spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false") spark.conf.set("spark.sql.sources.bucketing.enabled"...

user2417458

asked Dec 25 at 12:03

1 vote

0 answers

44 views

How to optimize special array_intersect in hive sql executed by spark engine?

buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the ...

Dong Ye

asked Nov 22 at 17:27

Best practices

0 votes

5 replies

96 views

Pushing down filters in RDBMS with Java Spark

I have been working as a Data Engineer and ran into this issue. I came across a use case where I have a view (let's call it inputView) which is created by reading data from some source. Now somewhere ...

Parth Sarthi Roy

asked Nov 14 at 6:13

Advice

0 votes

6 replies

163 views

Pyspark SQL: How to do GROUP BY with specific WHERE condition

I am doing some SQL aggregation transformations on a dataset, and there is a certain condition that I would like to apply but am not sure how. Here is a basic code block: le_test = spark.sql("""...

BeaverFever

asked Nov 2 at 6:39

0 votes

0 answers

88 views

How to Check if a Query Touches Data Files or just Uses Manifests and Metadata in Iceberg

I created a table as follows: CREATE TABLE IF NOT EXISTS raw_data.civ ( date timestamp, marketplace_id int, ... some more columns ) USING ICEBERG PARTITIONED BY ( marketplace_id, ...

shiva

2,781

asked Oct 25 at 15:11

3 votes

1 answer

134 views

How to collect multiple metrics with observe in PySpark without triggering multiple actions

I have a PySpark job that reads data from table a, performs some transformations and filters, and then writes the result to table b. Here’s a simplified version of the code: import pyspark.sql....

עומר אמזלג

asked Oct 22 at 15:17

0 votes

0 answers

97 views

Unexpected Write Behavior when using MERGE INTO/INSERT INTO Iceberg Spark Queries

I am observing different write behaviors when executing queries on EMR Notebook (correct behavior) vs when using spark-submit to submit a spark application to EMR Cluster (incorrect behavior). When I ...

shiva

2,781

asked Oct 21 at 20:58

0 votes

0 answers

48 views

Spark: VSAM File read issue with special character

We have a scenario where we read a VSAM file directly, along with a copybook to understand the column lengths. We were using the Cobrix library as part of the Spark read. However, we could see the same is not properly ...

Rocky1989

asked Oct 15 at 7:06

0 votes

0 answers

41 views

How to link Spark event log stages to PySpark code or query?

I'm analyzing Spark event logs and have already retrieved the SparkListenerStageSubmitted and SparkListenerTaskEnd events to collect metrics such as spill, skew ratio, memory, and CPU usage. However, ...

Carol C

asked Oct 9 at 19:40

0 votes

0 answers

63 views

Scala spark: Why does DataFrame.transform calling a transform hang?

I have a Scala (v. 2.12.15) Spark (v. 3.5.1) job that works correctly and looks something like this: import org.apache.spark.sql.DataFrame ... val myDataFrame = myReadDataFunction(...) ....

jd_sa

asked Oct 7 at 18:07

0 votes

0 answers

57 views

Spatial join without Apache Sedona

Currently I'm working with a specific version of Apache Spark (3.1.1) that cannot be upgraded. Because of that, I can't use Apache Sedona, and version 1.3.1 is too slow. My problem is the following code that ...

matdlara

asked Oct 3 at 1:35

1 vote

4 answers

115 views

How to pass array of structure as parameter to udf in spark 4

Does anybody know what I am doing wrong? The following is a reduced code snippet that works in spark-3.x but doesn't work in spark-4.x. In my use case I need to pass a complex data structure to a udf (let's say ...

Jiri Humpolicek

asked Sep 22 at 12:51

0 votes

1 answer

147 views

Problem reading the _last_checkpoint file from the _delta_log directory of a delta lake table on s3

I am trying to read the _delta_log folder of a Delta Lake table via Spark to export some custom metrics. I have configured how to get some metrics from history and description, but I have a problem ...

Melika Ghiasi

asked Sep 7 at 10:50

1 vote

0 answers

143 views

Conversion of a pyspark DataFrame with a Variant column to pandas fails with an error

When I try to convert a pyspark DataFrame with a VariantType column to a pandas DataFrame, the conversion fails with an error 'NoneType' object is not iterable. Am I doing it incorrectly? Sample code: ...

Ghislain Fourny

7,429

asked Aug 27 at 11:32

3 votes

0 answers

78 views

Cannot extend Spark UnaryExpression in Java

I am trying to write a custom decoder function in Java targeting Spark 4.0: public class MyDataToCatalyst extends UnaryExpression implements NonSQLExpression, ExpectsInputTypes, Serializable { //.....

Carsten

1,288

asked Aug 26 at 16:59
