Advice
0 votes
1 reply
44 views

I have a use case where I need to replicate a SAS behaviour in PySpark. In SAS, a merge between two datasets is happening, and it is not a 1*1 key merge. It is an m*n key merge involving multiple keys. ...
  • reputation score 331
Score of -5
0 answers
50 views

If my output is a pipe-delimited text file, does the type of the fields matter, or is everything written as text, including numeric/currency fields and dates? For example, if I want to format my date ...
  • reputation score 1
Score of 2
1 answer
84 views

I'm consuming two Binance streams: a trade stream and a kline (candlestick) stream. These are the schemas I'm using in my Spark job: ====================================================================...
  • reputation score 21
Score of -1
1 answer
101 views

I have imported the data from the attached Excel file. The dataset currently has the following structure: ISO, Name, 1993, 1994, 1995, ..., 2023 Each year is represented as a separate column, and new ...
Best practices
0 votes
4 replies
107 views

I am trying to run a big Databricks query with a lot of CTEs, etc., but I do not really want to run it in Spark. Some parts of the query that work on the normal SQL warehouse do not work on Spark. I am ...
  • reputation score 105
Score of 0
0 answers
98 views

I created a Glue view through a Glue job like this: CREATE OR REPLACE PROTECTED MULTI DIALECT VIEW risk_models_output.vw_behavior_special_limit_score SECURITY DEFINER AS [query ...
Score of 0
1 answer
108 views

How to replace groupBy + collect_list + array_sort with a more memory-efficient approach in Spark SQL? I have a Spark (Java) batch job that processes large telecom event data. The job is failing with `...
Score of 0
1 answer
75 views

// Enable all bucketing optimizations spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false") spark.conf.set("spark.sql.sources.bucketing.enabled"...
Score of 1
0 answers
60 views

buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the ...
  • reputation score 11
Best practices
0 votes
5 replies
130 views

I have been working as a Data Engineer and ran into this issue. I came across a use case where I have a view (let's call it inputView) that is created by reading data from some source. Now somewhere ...
Advice
0 votes
6 replies
191 views

So I am doing some SQL aggregation transformations of a dataset and there is a certain condition that I would like to apply, but I am not sure how. Here is a basic code block: le_test = spark.sql("""...
  • reputation score 21
Score of 0
0 answers
115 views

I created a table as follows: CREATE TABLE IF NOT EXISTS raw_data.civ ( date timestamp, marketplace_id int, ... some more columns ) USING ICEBERG PARTITIONED BY ( marketplace_id, ...
  • reputation score 2801
Score of 2
1 answer
252 views

I have a PySpark job that reads data from table a, performs some transformations and filters, and then writes the result to table b. Here’s a simplified version of the code: import pyspark.sql....
Score of 0
1 answer
268 views

I am observing different write behaviors when executing queries on EMR Notebook (correct behavior) vs when using spark-submit to submit a spark application to EMR Cluster (incorrect behavior). When I ...
  • reputation score 2801
Score of 0
0 answers
65 views

We have a scenario where we read a VSAM file directly, along with a copybook to understand the column lengths; we were using the COBRIX library as part of the Spark read. However, we could the same is not properly ...
