Stack Overflow
-1 votes
2 answers
67 views

I need to join two RDDs as part of my programming assignment. The problem is that the first RDD is nested, while the other is flat. I tried different things, but nothing seemed to work. Is there any ...
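
One common reading of this setup, sketched under assumptions (the nested RDD holds (key, list) pairs, the flat one holds (key, value) pairs; the data below is invented): flatten the nested side first, then join on the key.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nested = sc.parallelize([("a", [1, 2]), ("b", [3])])  # (key, list-of-values)
flat = sc.parallelize([("a", "x"), ("b", "y")])       # (key, value)

# Flatten the nested side into plain (key, value) pairs, then join on the key.
flattened = nested.flatMap(lambda kv: [(kv[0], v) for v in kv[1]])
joined = flattened.join(flat)  # -> (key, (nested_value, flat_value))
print(joined.collect())
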
1 vote
1 answer
111 views

Here is a minimal example using default data in Databricks (Spark 3.4): import org.apache.spark.sql.functions.col import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types._ sc....
0 votes
2 answers
144 views

I'm working with PySpark to process large amounts of data. However, I noticed that the function called by mapPartitions is executed one more time than expected. For instance, in the following code ...
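
A typical cause, sketched with assumed data: RDDs are lazily evaluated, so the function passed to mapPartitions re-runs for every action unless the result is cached.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def process(partition):
    print("partition processed")  # side effect that makes extra runs visible
    for x in partition:
        yield x * 2

rdd = sc.parallelize(range(8), 2).mapPartitions(process)
rdd.count()    # first action: process() runs once per partition
rdd.collect()  # second action: process() runs again unless rdd was cached
# Calling rdd.cache() before the first action avoids the recomputation.
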
0 votes
1 answer
34 views

I have RDD1 with columns col1, col2: (A, x123), (B, y123), (C, z123), and RDD2 with column col1: A, C. I want to run an intersection of the two RDDs and find the common elements, i.e. the items that are in RDD2; what is the data of ...
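
A sketch using the question's example data; the join-based filter is one common approach, not necessarily the asker's:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd1 = sc.parallelize([("A", "x123"), ("B", "y123"), ("C", "z123")])
rdd2 = sc.parallelize(["A", "C"])

# Turn RDD2 into (key, None) pairs and join: only keys present in both survive.
common = rdd1.join(rdd2.map(lambda k: (k, None))).map(lambda kv: (kv[0], kv[1][0]))
print(common.collect())  # [('A', 'x123'), ('C', 'z123')]
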
0 votes
1 answer
4k views

I have a DataFrame on Databricks on which I would like to use the RDD API. The type of the DataFrame is pyspark.sql.connect.dataframe.DataFrame after reading from the catalog. I found out that this ...
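
For context, a sketch of the situation (the table name is an assumption): Spark Connect DataFrames do not implement the RDD API, so .rdd fails on such clusters.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("samples.nyctaxi.trips")  # hypothetical catalog table

print(type(df))  # pyspark.sql.connect.dataframe.DataFrame on Spark Connect clusters
# Accessing df.rdd raises an error here, because Spark Connect does not
# implement the RDD API; rewriting the logic with DataFrame operations
# (e.g. mapInPandas instead of rdd.mapPartitions) is the usual way out.
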
0 votes
1 answer
68 views

The resources for this are scarce and I'm not sure that there's a solution to this issue. Suppose you have 3 simple RDDs, or more specifically 3 PairRDDs. val rdd1: RDD[(Int, Int)] = sc.parallelize(...
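
The excerpt cuts off before the actual question; purely as an illustration of combining three PairRDDs of this shape, a sketch using groupWith (PySpark's multi-RDD cogroup), with invented data:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd1 = sc.parallelize([(1, 10), (2, 20)])
rdd2 = sc.parallelize([(1, 100), (3, 300)])
rdd3 = sc.parallelize([(2, 1000)])

# groupWith gathers the values for each key from all three RDDs in one pass.
grouped = rdd1.groupWith(rdd2, rdd3)
print(sorted((k, tuple(map(list, vs))) for k, vs in grouped.collect()))
# [(1, ([10], [100], [])), (2, ([20], [], [1000])), (3, ([], [300], []))]
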
0 votes
0 answers
156 views

While using the following code: import pyspark from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import SparkSession from pyspark.sql.types import Row from datetime ...
-1 votes
1 answer
369 views

I was using the code below in an Azure Databricks notebook before enabling a Unity Catalog cluster, but after changing to a shared-access-mode (Unity Catalog enabled) cluster I could not use this logic any more. How should we achieve ...
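
The blocked code is not shown; as a hedged illustration of the usual migration, a sketch assuming the logic used the RDD API (which Unity Catalog shared access mode does not allow), rewritten with DataFrame operations:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Instead of df.rdd.map(...) or sc.parallelize(...), which shared access mode
# blocks, express the transformation with DataFrame functions:
result = df.withColumn("id_doubled", F.col("id") * 2)
result.show()
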
1 vote
1 answer
79 views

I see that dataframe.agg(avg(col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results based on which column I use with orderBy. ...
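
A sketch of the behavior described (data invented): adding orderBy to a window changes the default frame to a running RANGE frame ending at the current row, so avg() becomes a running average whose value depends on the ordering column (ties count as peers). An explicit frame restores the whole-column average.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "val"])

# No orderBy: the frame is the whole partition, so every row sees the global average.
w_all = Window.partitionBy()
# With orderBy: the default frame is unboundedPreceding..currentRow (a running average).
w_ordered = Window.partitionBy().orderBy("id")
# An explicit frame restores the whole-column average even with orderBy.
w_fixed = w_ordered.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.select(
    F.avg("val").over(w_all).alias("avg_all"),
    F.avg("val").over(w_ordered).alias("running_avg"),
    F.avg("val").over(w_fixed).alias("avg_fixed"),
).show()
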
3 votes
1 answer
93 views

I have code like the below, which uses PySpark. test_truth_value = RDD. test_predictor_rdd = RDD. valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[...
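
A sketch of the zip-with-predictions pattern in the excerpt, with an assumed pyspark.mllib Lasso model and invented data; note that RDD.zip requires both sides to have the same number of partitions and elements per partition.

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, LassoWithSGD

sc = SparkContext.getOrCreate()

train = sc.parallelize([
    LabeledPoint(1.0, [1.0, 2.0]),
    LabeledPoint(2.0, [2.0, 3.0]),
    LabeledPoint(3.0, [3.0, 4.0]),
])
lasso_model = LassoWithSGD.train(train, iterations=10)

test_truth_value = train.map(lambda lp: lp.label)       # RDD of true labels
test_predictor_rdd = train.map(lambda lp: lp.features)  # RDD of feature vectors

# Both RDDs derive from the same parent via map, so their partition layouts
# match and zip pairs element i of one side with element i of the other.
valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd))
print(valuesAndPred.collect())
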
1 vote
1 answer
62 views

I trained TF-IDF on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD, using PySpark's HashingTF and IDF implementations. I tried to ...
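
A sketch of the described setup with invented tokens, assuming the pyspark.mllib (RDD-based) implementations:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext.getOrCreate()

docs = sc.parallelize([
    ["spark", "rdd", "join"],
    ["spark", "dataframe"],
])

hashingTF = HashingTF()
tf = hashingTF.transform(docs)  # sparse term-frequency vectors
tf.cache()                      # IDF.fit and idf.transform each pass over tf
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
print(tfidf.take(1))
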
1 vote
1 answer
633 views

I want to apply a schema to specific non-technical columns of a Spark DataFrame. Beforehand, I add an artificial ID using Window and row_number so that I can later join some other technical columns to ...
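
A sketch of the artificial-ID pattern described, with assumed column names; note that a window without partitionBy pulls all rows into a single partition, which can hurt on large data.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

# A window with no partitionBy spans the whole DataFrame.
w = Window.orderBy("value")
with_id = df.withColumn("row_id", F.row_number().over(w))
with_id.show()
# Other columns carrying the same row_id can later be joined back on it.
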
0 votes
0 answers
44 views

I need to solve a problem where a company wants to offer k different users free use (a kind of coupon) of their application for two months. The goal is to identify users who are likely to churn (leave ...
0 votes
1 answer
263 views

I have a PySpark DataFrame which needs ordering on a column ("Reference"). The values in the column typically look like: ["AA.1234.56", "AA.1101.88", "AA.904.33"...
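
A sketch of one way to get numeric-aware ordering for such values (the split logic is an assumption); plain string ordering would place AA.1101.88 before AA.904.33 even though 904 < 1101.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("AA.1234.56",), ("AA.1101.88",), ("AA.904.33",)], ["Reference"]
)

# Split on the dots and cast the numeric pieces, so 904 sorts before 1101.
parts = F.split("Reference", r"\.")
ordered = df.orderBy(
    parts.getItem(0),
    parts.getItem(1).cast("int"),
    parts.getItem(2).cast("int"),
)
ordered.show()
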
-1 votes
1 answer
64 views

When trying to map our 6-column PySpark RDD into a 4-tuple, we get a "list index out of range" error for any list element besides 0, which returns the normal result. The dataset is structured like this: X,Y,FID,...
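
A sketch of the usual cause of this error, with file contents invented from the question's header: if each line is split on the wrong delimiter (or not split at all), the resulting list has one element and any index other than 0 raises IndexError.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Stands in for sc.textFile(...); contents invented from the header X,Y,FID,...
lines = sc.parallelize(["1.0,2.0,10,a,b,c", "3.0,4.0,11,d,e,f"])

# Splitting on the actual delimiter first makes every index valid.
tuples = lines.map(lambda line: line.split(",")) \
              .map(lambda f: (float(f[0]), float(f[1]), int(f[2]), f[3]))
print(tuples.collect())
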
