Stack Overflow
-1 votes
2 answers
67 views

I need to join two RDDs as part of my programming assignment. The problem is that the first RDD is nested, while the other is flat. I tried different things, but nothing seemed to work. Is there any ...
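
One common reading of this setup, sketched under assumptions (the nested RDD holds (key, list) pairs, the flat one holds (key, value) pairs; the data below is invented): flatten the nested side first, then join on the key.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nested = sc.parallelize([("a", [1, 2]), ("b", [3])])  # (key, list-of-values)
flat = sc.parallelize([("a", "x"), ("b", "y")])       # (key, value)

# Flatten the nested side into plain (key, value) pairs, then join on the key.
flattened = nested.flatMap(lambda kv: [(kv[0], v) for v in kv[1]])
joined = flattened.join(flat)  # -> (key, (nested_value, flat_value))
print(joined.collect())
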
1 vote
1 answer
111 views

Here is a minimal example using default data in Databricks (Spark 3.4): import org.apache.spark.sql.functions.col import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types._ sc....
0 votes
2 answers
144 views

I'm working with PySpark to process large amounts of data. However, I noticed that the function called by mapPartitions is executed one more time than expected. For instance, in the following code ...
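
A typical cause, sketched with assumed data: RDDs are lazily evaluated, so the function passed to mapPartitions re-runs for every action unless the result is cached.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def process(partition):
    print("partition processed")  # side effect that makes extra runs visible
    for x in partition:
        yield x * 2

rdd = sc.parallelize(range(8), 2).mapPartitions(process)
rdd.count()    # first action: process() runs once per partition
rdd.collect()  # second action: process() runs again unless rdd was cached
# Calling rdd.cache() before the first action avoids the recomputation.
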
0 votes
1 answer
34 views

I have RDD1 with columns col1, col2: (A, x123), (B, y123), (C, z123), and RDD2 with column col1: A, C. I want to run an intersection of the two RDDs and find the common elements, i.e. the items that are in RDD2; what is the data of ...
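
A sketch using the question's example data; the join-based filter is one common approach, not necessarily the asker's:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd1 = sc.parallelize([("A", "x123"), ("B", "y123"), ("C", "z123")])
rdd2 = sc.parallelize(["A", "C"])

# Turn RDD2 into (key, None) pairs and join: only keys present in both survive.
common = rdd1.join(rdd2.map(lambda k: (k, None))).map(lambda kv: (kv[0], kv[1][0]))
print(common.collect())  # [('A', 'x123'), ('C', 'z123')]
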
0 votes
1 answer
4k views

I have a DataFrame on Databricks on which I would like to use the RDD API. The type of the DataFrame is pyspark.sql.connect.dataframe.DataFrame after reading from the catalog. I found out that this ...
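
For context, a sketch of the situation (the table name is an assumption): Spark Connect DataFrames do not implement the RDD API, so .rdd fails on such clusters.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("samples.nyctaxi.trips")  # hypothetical catalog table

print(type(df))  # pyspark.sql.connect.dataframe.DataFrame on Spark Connect clusters
# Accessing df.rdd raises an error here, because Spark Connect does not
# implement the RDD API; rewriting the logic with DataFrame operations
# (e.g. mapInPandas instead of rdd.mapPartitions) is the usual way out.
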
0 votes
1 answer
68 views

The resources for this are scarce and I'm not sure that there's a solution to this issue. Suppose you have 3 simple RDDs, or more specifically 3 PairRDDs. val rdd1: RDD[(Int, Int)] = sc.parallelize(...
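
The excerpt cuts off before the actual question; purely as an illustration of combining three PairRDDs of this shape, a sketch using groupWith (PySpark's multi-RDD cogroup), with invented data:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd1 = sc.parallelize([(1, 10), (2, 20)])
rdd2 = sc.parallelize([(1, 100), (3, 300)])
rdd3 = sc.parallelize([(2, 1000)])

# groupWith gathers the values for each key from all three RDDs in one pass.
grouped = rdd1.groupWith(rdd2, rdd3)
print(sorted((k, tuple(map(list, vs))) for k, vs in grouped.collect()))
# [(1, ([10], [100], [])), (2, ([20], [], [1000])), (3, ([], [300], []))]
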
0 votes
0 answers
156 views

While using the following code: import pyspark from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import SparkSession from pyspark.sql.types import Row from datetime ...
-1 votes
1 answer
369 views

I was using the code below in an Azure Databricks notebook before enabling a Unity Catalog cluster, but after changing to a shared-access-mode (Unity Catalog enabled) cluster I could not use this logic any more. How should we achieve ...
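
The blocked code is not shown; as a hedged illustration of the usual migration, a sketch assuming the logic used the RDD API (which Unity Catalog shared access mode does not allow), rewritten with DataFrame operations:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Instead of df.rdd.map(...) or sc.parallelize(...), which shared access mode
# blocks, express the transformation with DataFrame functions:
result = df.withColumn("id_doubled", F.col("id") * 2)
result.show()
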
1 vote
1 answer
79 views

I see that dataframe.agg(avg(col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results based on which column I use with orderBy. ...
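
A sketch of the behavior described (data invented): adding orderBy to a window changes the default frame to a running RANGE frame ending at the current row, so avg() becomes a running average whose value depends on the ordering column (ties count as peers). An explicit frame restores the whole-column average.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "val"])

# No orderBy: the frame is the whole partition, so every row sees the global average.
w_all = Window.partitionBy()
# With orderBy: the default frame is unboundedPreceding..currentRow (a running average).
w_ordered = Window.partitionBy().orderBy("id")
# An explicit frame restores the whole-column average even with orderBy.
w_fixed = w_ordered.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.select(
    F.avg("val").over(w_all).alias("avg_all"),
    F.avg("val").over(w_ordered).alias("running_avg"),
    F.avg("val").over(w_fixed).alias("avg_fixed"),
).show()
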
3 votes
1 answer
93 views

I have code like the below, which uses PySpark. test_truth_value = RDD. test_predictor_rdd = RDD. valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[...
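
A sketch of the zip-with-predictions pattern in the excerpt, with an assumed pyspark.mllib Lasso model and invented data; note that RDD.zip requires both sides to have the same number of partitions and elements per partition.

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, LassoWithSGD

sc = SparkContext.getOrCreate()

train = sc.parallelize([
    LabeledPoint(1.0, [1.0, 2.0]),
    LabeledPoint(2.0, [2.0, 3.0]),
    LabeledPoint(3.0, [3.0, 4.0]),
])
lasso_model = LassoWithSGD.train(train, iterations=10)

test_truth_value = train.map(lambda lp: lp.label)       # RDD of true labels
test_predictor_rdd = train.map(lambda lp: lp.features)  # RDD of feature vectors

# Both RDDs derive from the same parent via map, so their partition layouts
# match and zip pairs element i of one side with element i of the other.
valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd))
print(valuesAndPred.collect())
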
1 vote
1 answer
62 views

I trained TF-IDF on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD, using PySpark's HashingTF and IDF implementations. I tried to ...
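
A sketch of the described setup with invented tokens, assuming the pyspark.mllib (RDD-based) implementations:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext.getOrCreate()

docs = sc.parallelize([
    ["spark", "rdd", "join"],
    ["spark", "dataframe"],
])

hashingTF = HashingTF()
tf = hashingTF.transform(docs)  # sparse term-frequency vectors
tf.cache()                      # IDF.fit and idf.transform each pass over tf
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
print(tfidf.take(1))
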
1 vote
1 answer
633 views

I want to apply a schema to specific non-technical columns of a Spark DataFrame. Beforehand, I add an artificial ID using Window and row_number so that I can later join some other technical columns to ...
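
A sketch of the artificial-ID pattern described, with assumed column names; note that a window without partitionBy pulls all rows into a single partition, which can hurt on large data.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

# A window with no partitionBy spans the whole DataFrame.
w = Window.orderBy("value")
with_id = df.withColumn("row_id", F.row_number().over(w))
with_id.show()
# Other columns carrying the same row_id can later be joined back on it.
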
0 votes
0 answers
44 views

I need to solve a problem where a company wants to offer k different users free use (a kind of coupon) of their application for two months. The goal is to identify users who are likely to churn (leave ...
0 votes
1 answer
263 views

I have a PySpark DataFrame which needs ordering on a column ("Reference"). The values in the column typically look like: ["AA.1234.56", "AA.1101.88", "AA.904.33"...
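
A sketch of one way to get numeric-aware ordering for such values (the split logic is an assumption); plain string ordering would place AA.1101.88 before AA.904.33 even though 904 < 1101.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("AA.1234.56",), ("AA.1101.88",), ("AA.904.33",)], ["Reference"]
)

# Split on the dots and cast the numeric pieces, so 904 sorts before 1101.
parts = F.split("Reference", r"\.")
ordered = df.orderBy(
    parts.getItem(0),
    parts.getItem(1).cast("int"),
    parts.getItem(2).cast("int"),
)
ordered.show()
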
-1 votes
1 answer
64 views

When trying to map our 6-column PySpark RDD into a 4-tuple, we get a "list index out of range" error for any list element besides 0, which returns the normal result. The dataset is structured like this: X,Y,FID,...
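
A sketch of the usual cause of this error, with file contents invented from the question's header: if each line is split on the wrong delimiter (or not split at all), the resulting list has one element and any index other than 0 raises IndexError.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Stands in for sc.textFile(...); contents invented from the header X,Y,FID,...
lines = sc.parallelize(["1.0,2.0,10,a,b,c", "3.0,4.0,11,d,e,f"])

# Splitting on the actual delimiter first makes every index valid.
tuples = lines.map(lambda line: line.split(",")) \
              .map(lambda f: (float(f[0]), float(f[1]), int(f[2]), f[3]))
print(tuples.collect())
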
