82,628 questions
0 votes | 1 answer | 41 views
Why are 2 tables bucketed by col1 and joined on (col1, col2) still shuffled?
// Enable all bucketing optimizations
spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false")
spark.conf.set("spark.sql.sources.bucketing.enabled"...
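As background for this question, these are the bucketing-related settings usually checked when a bucketed join still shuffles (a sketch assuming Spark 3.x; names as in the Spark SQL configuration reference, values illustrative):

```properties
# spark-defaults.conf fragment (illustrative values)
spark.sql.sources.bucketing.enabled                   true
spark.sql.sources.bucketing.autoBucketedScan.enabled  true
# Disable broadcast joins so the planner considers sort-merge join
spark.sql.autoBroadcastJoinThreshold                  -1
```

With broadcast joins disabled, Spark falls back to a sort-merge join, which is where bucketing can remove the shuffle - but only when the join keys line up with the bucket columns in a way the planner accepts.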
1 vote | 0 answers | 45 views
'JavaPackage' object is not callable error when trying to getOrCreate() local spark session
I have set up a small Xubuntu machine with the intention of making it my single-node Spark playground cluster. The cluster seems to be set up correctly - I can access the WebUI at port 8080, it shows a ...
0 votes | 0 answers | 15 views
EMR Spark cluster getting stuck on resizing
I have an EMR Spark cluster on which I have enabled EMR managed auto scaling with the following configuration:
Primary - c5a.xlarge
Core - c5a.xlarge
Task - c5a.xlarge
With these cluster ...
0 votes | 1 answer | 94 views
Optimize code to flatten Meta ads metrics data in Spark
I have two Spark scripts. The first, a bronze script, needs to read data from Kafka topics; each topic holds one ad platform's data (tiktok_insights, meta_insights, google_insights). The structure is the same:
( id, ...
0 votes | 0 answers | 68 views
spark flatMapToPair reaching "no space left on device" due to large duplication of entries
First, my question is not about increasing disk space to avoid the "no space left" error, but about understanding what Spark does, and hopefully how to improve my code.
In short, here is the pseudo code:
JavaRDD<...
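For context on this question: Spark writes shuffle spill files under spark.local.dir, so "no space left on device" usually refers to that scratch directory rather than the input data. A hedged config sketch (the directory path is made up):

```properties
# Point Spark's scratch space at a volume with room for shuffle spills
spark.local.dir               /mnt/large-disk/spark-tmp
# Compress shuffle output and spills (both default to true, worth verifying)
spark.shuffle.compress        true
spark.shuffle.spill.compress  true
```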
1 vote | 2 answers | 101 views
Difference between org.apache.hadoop.io.compress.CompressionCodec and org.apache.spark.io.CompressionCodec
I want to use compression in big data processing, but there are two compression codecs.
Does anyone know the difference?
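For context, the two interfaces live at different layers: org.apache.hadoop.io.compress.CompressionCodec is Hadoop's interface for compressing file data (the codec used when reading or writing, say, gzip files), while org.apache.spark.io.CompressionCodec is Spark's internal trait for compressing shuffle output, broadcast variables, and cached blocks. A hedged config sketch of where each shows up:

```properties
# Spark-internal codec (org.apache.spark.io.CompressionCodec): lz4, lzf, snappy, zstd
spark.io.compression.codec  lz4
# Hadoop file codec (org.apache.hadoop.io.compress.CompressionCodec), e.g. when
# writing compressed output files through Hadoop output formats:
spark.hadoop.mapreduce.output.fileoutputformat.compress        true
spark.hadoop.mapreduce.output.fileoutputformat.compress.codec  org.apache.hadoop.io.compress.GzipCodec
```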
Advice
0 votes | 4 replies | 78 views
Use an RSA key in Snowflake connection options instead of a password
I want to connect to a Snowflake database from a Databricks notebook. I have an RSA key (.pem file) and I don't want to use the traditional username-and-password method, as it is not as secure as ...
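A minimal sketch of how key-pair options are typically assembled for the Spark Snowflake connector, which documents a pem_private_key option expecting the base64 body of the key without the PEM header/footer. The account, user, and helper names here are illustrative, not the questioner's actual setup:

```python
# Hedged sketch: Snowflake connector options using key-pair auth
# instead of sfPassword. Names and paths are made up for illustration.

def pem_body(pem_text: str) -> str:
    """Strip the PEM header/footer lines and newlines, leaving the base64 body."""
    lines = [line for line in pem_text.strip().splitlines()
             if not line.startswith("-----")]
    return "".join(lines)

def snowflake_options(account: str, user: str, pem_text: str) -> dict:
    """Build the options dict for spark.read.format("snowflake")."""
    return {
        "sfURL": f"{account}.snowflakecomputing.com",
        "sfUser": user,
        "pem_private_key": pem_body(pem_text),
        # deliberately no "sfPassword" entry
    }
```

From a notebook this dict would then be passed along the lines of spark.read.format("snowflake").options(**snowflake_options(...)) - hedged; check the connector documentation for your version, and note that encrypted private keys need decrypting first.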
0 votes | 1 answer | 110 views
Does Databricks Spark SQL evaluate all CASE branches for UDFs?
I'm using Databricks SQL and have SQL UDFs for GeoIP / ISP lookups.
Each UDF branches on IPv4 vs IPv6 using a CASE expression like:
CASE
WHEN ip_address LIKE '%:%:%' THEN -- IPv6 path
...
1 vote | 0 answers | 118 views
Warning and performance issues when scanning delta tables
Why do I get multiple warnings WARN delta_kernel::engine::default::json] read_json receiver end of channel dropped before sending completed when scanning a Delta table (pl.scan_delta(temp_path)) that ...
1 vote | 1 answer | 50 views
How to detect Spark application failure in SparkListener when no jobs are executed?
I have a class that extends SparkListener and has access to SparkContext. I'm wondering if there is any way to check in onApplicationEnd whether the Spark application stopped because of an error or ...
0 votes | 0 answers | 39 views
How to dynamically cast columns in a dbt-spark custom materialization to resolve UNION ALL schema mismatch?
I am working on a custom materialization in dbt using the dbt-spark adapter (writing to Delta tables on S3). The goal is to handle a hybrid SCD Type 1 and Type 2 strategy.
The logic: I compare the ...
2 votes | 0 answers | 64 views
How to log a model in MLflow using Spark Connect
I have the following setup:
Kubernetes cluster with Spark Connect 4.0.1 and MLflow tracking server 3.5.0
The MLflow tracking server should serve all artifacts and is configured this way:
--backend-store-...
0 votes | 1 answer | 71 views
Handle corrupted files in spark load()
I have a Spark job that runs daily to load data from S3.
The data consist of thousands of gzip files. However, in some cases there are one or two corrupted files in S3, which causes the whole ...
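One knob often suggested for this situation is Spark's ignoreCorruptFiles setting, which skips unreadable files instead of failing the job (a sketch; whether silently skipping files is acceptable depends on the pipeline):

```properties
# Skip corrupted/unreadable input files instead of failing the whole load
spark.sql.files.ignoreCorruptFiles  true
```

The same behavior can usually be requested per read via the file source's ignoreCorruptFiles option rather than cluster-wide.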
-1 votes | 2 answers | 62 views
Connectivity issues in standalone Spark 4.0
On an Azure VM, I have installed standalone Spark 4.0. On the same VM I have Python 3.11 with Jupyter deployed. In my notebook I submitted the following program:
from pyspark.sql import SparkSession
...
1 vote | 1 answer | 143 views
PicklingError: Could not serialize object: RecursionError in pyspark code in Jupyter Notebook
I am very new to Spark (specifically, I have just started learning it), and I have encountered a recursion error in a very simple piece of code.
Background:
Spark Version 3.5.7
Java Version 11.0.29 (Eclipse ...