26,906 questions
Advice
0 votes
1 reply
44 views
Replicate SAS Merge happening in loop into Pyspark
I have a use case where I need to replicate a SAS behaviour in PySpark. In SAS, a merge between two datasets is happening, and it's not a 1*1 key merge. It's an m*n key merge involving multiple keys. ...
- reputation score 331
Score of -5
0 answers
50 views
How do I format currency fields without commas (2 Dec) and date columns as ‘MM/DD/YYYY’? For example, for ‘2026年10月01日’ I want it to display as 10/02/2026 [closed]
If my output is a pipe-delimited text file, does the type of the fields matter, or is everything written as text, including numeric / currency fields and dates? For example, if I want to format my date ...
- reputation score 1
Score of 2
1 answer
84 views
How can a nested JSON field cause AMBIGUOUS_REFERENCE_TO_FIELDS in Spark?
I'm consuming two Binance streams: a trade stream and a kline (candlestick) stream. These are the schemas I'm using in my Spark job:
====================================================================...
- reputation score 21
Score of -1
1 answer
101 views
SQL Query for Unpivot compatible with SQLGLOT parser
I have imported the data from the attached Excel file. The dataset currently has the following structure:
ISO, Name, 1993, 1994, 1995, ..., 2023
Each year is represented as a separate column, and new ...
- reputation score 109
Best practices
0 votes
4 replies
107 views
Databricks run SQL query in python without spark
I am trying to run a big Databricks query with a lot of CTEs, etc., but I do not really want to run it in Spark. Some parts of the query that work on the normal SQL warehouse do not work on Spark. I am ...
- reputation score 105
Score of 0
0 answers
98 views
Can't SELECT anything in an AWS Glue Data Catalog view due to invalid view text: <REDACTED VIEW TEXT>
I created a Glue view through a Glue job like this:
CREATE OR REPLACE PROTECTED MULTI DIALECT VIEW risk_models_output.vw_behavior_special_limit_score
SECURITY DEFINER AS
[query ...
- reputation score 1
Score of 0
1 answer
108 views
Spark job fails with UnsafeExternalSorter OOM when using groupBy + collect_list + sort – how to optimize?
How to replace groupBy + collect_list + array_sort with a more memory-efficient approach in Spark SQL?
I have a Spark (Java) batch job that processes large telecom event data.
The job is failing with `...
- reputation score 1
Score of 0
1 answer
75 views
Why are 2 tables bucketed by col1 and joined by (col1, col2) shuffled?
// Enable all bucketing optimizations
spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false")
spark.conf.set("spark.sql.sources.bucketing.enabled...
- reputation score 31
Score of 1
0 answers
60 views
How to optimize special array_intersect in hive sql executed by spark engine?
buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the ...
- reputation score 11
Best practices
0 votes
5 replies
130 views
Pushing down filters in RDBMS with Java Spark
I have been working as a Data Engineer and ran into this issue.
I came across a use case where I have a view (let's call it inputView) which is created by reading data from some source.
Now somewhere ...
- reputation score 1
Advice
0 votes
6 replies
191 views
Pyspark SQL: How to do GROUP BY with specific WHERE condition
I am doing some SQL aggregation transformations on a dataset, and there is a certain condition that I would like to apply, but I'm not sure how.
Here is a basic code block:
le_test = spark.sql(""...
- reputation score 21
Score of 0
0 answers
115 views
How to Check if a Query Touches Data Files or just Uses Manifests and Metadata in Iceberg
I created a table as follows:
CREATE TABLE IF NOT EXISTS raw_data.civ (
date timestamp,
marketplace_id int,
... some more columns
)
USING ICEBERG
PARTITIONED BY (
marketplace_id,
...
- reputation score 2801
Score of 2
1 answer
252 views
How to collect multiple metrics with observe in PySpark without triggering multiple actions
I have a PySpark job that reads data from table a, performs some transformations and filters, and then writes the result to table b.
Here’s a simplified version of the code:
import pyspark.sql....
- reputation score 31
Score of 0
1 answer
268 views
Unexpected Write Behavior when using MERGE INTO/INSERT INTO Iceberg Spark Queries
I am observing different write behaviors when executing queries on EMR Notebook (correct behavior) vs when using spark-submit to submit a spark application to EMR Cluster (incorrect behavior).
When I ...
- reputation score 2801
Score of 0
0 answers
65 views
Spark: VSAM File read issue with special character
We have a scenario where we read a VSAM file directly, along with a copybook to understand the column lengths; we were using the COBRIX library as part of the Spark read.
However, we could see the same is not properly ...
- reputation score 409