36 questions
1 vote · 1 answer · 70 views
Range between windows function
I am trying to write a util function that gives the min, max, sum, mean, and first of any column, cumulative within a window, but I need to make it time-aware. Should I use rangeBetween or rowsBetween?
For ...
0 votes · 1 answer · 74 views
Number of Tasks - for Window function without PARTITION BY statement
As per the documentation (https://docs.databricks.com/en/optimizations/spark-ui-guide/one-spark-task.html), a window function without a PARTITION BY clause results in a single task on Spark.
Is this ...
1 vote · 1 answer · 197 views
Spark (Scala): Moving average with Window function
The input dataframe looks like this:
+---+----------+----------+--------+-----+-------------------+
| id|product_id|sales_date|quantity|price| timestampCol|
+---+----------+----------+--------...
2 votes · 2 answers · 128 views
PySpark: count over a window with reset
I have a PySpark DataFrame which looks like this:
df = spark.createDataFrame(
    data=[
        (1, "GERMANY", "20230606", True),
        (2, "GERMANY", "20230620", ...
0 votes · 0 answers · 270 views
Parallelizing Spark's Pandas API Operations
Spark's Pandas API allows Pandas functions to be performed on top of a Spark dataframe that looks and behaves like a Pandas DataFrame. Pandas has functions that Spark does not have implementations ...
0 votes · 2 answers · 343 views
How to perform average over months using window function with null values in between?
I have a dataframe like below:
df = spark.createDataFrame(
    [(1,1,10), (2,1,10), (3,1,None), (4,1,10), (5,1,10), (6,1,20),
     (7,1,20), (1,2,10), (2,2,10), (3,2,10), (4,2,20), (5,2,20)],
    ["Month&...
0 votes · 1 answer · 145 views
How to get the other columns values using a window with rangeBetween in Pyspark
I have a table like this. I want to get the product_id of the row with the closest purchase_date (checking all rows before the current row) and assign it to a new column (ref_id) for the current row's value for ...
2 votes · 1 answer · 2k views
Window function ignore nulls not working in Databricks
I am new to Databricks and was required to implement Snowflake code in Databricks.
The Snowflake table, code, and output look like below:
table:
id  | col1 | hn
ee1 | null | 1
ee1 | null | 2
ee1 | test | 3
ee1 | test | ...
0 votes · 1 answer · 31 views
I want to fill in timestamps for a given code based on a window function in pyspark
I have a dataset and its output is in the attached picture.
I want to create 3 new columns called start_time_1, start_time_2, start_time_3 such that I can update the first timestamps of each of the ...
2 votes · 1 answer · 408 views
PySpark group by with rolling window
Suppose I have a table with three columns: dt, id and value.
df_tmp = spark.createDataFrame([('2023-01-01', 1001, 5),
                                ('2023-01-15', 1001, 3),
                                ...
0 votes · 1 answer · 76 views
ADD end of month column Dynamically to spark Dataframe
I have a pyspark DataFrame as follows.
I need to add an EOM column for all the null values for each id, dynamically based on the last non-null EOM value, and it should be continuous.
My output dataframe looks ...
2 votes · 1 answer · 112 views
Spark - Calculating running sum with a threshold
I have a use-case where I need to compute a running sum over a partition, where the running sum does not exceed a certain threshold.
For example:
// Input dataset
| id | created_on | value | ...
0 votes · 0 answers · 107 views
In PySpark (or SQL), can I use the value calculated in the previous observation in the current observation? (row-wise calculation, like SAS RETAIN)
I want to be able to go through a table consecutively, using the value calculated in the previous row in the current row. It seems a window function could do this.
from pyspark.sql import SparkSession
...
0 votes · 1 answer · 75 views
Spark with scala [closed]
Consider 2 dataframes, a holiday df and an everyday df, each with 3 columns as below.
Holiday df (5 records):
Country_code | currency_code | date
Gb           | gbp           | 2022-04-15
Gb           | gbp           | ...
0 votes · 1 answer · 98 views
I want ntile(3) within ntile(3), as in a subdivision within a division by ntile()
I want to create an ntile(3) within an ntile(3).
I have the following table:
Customer | Total_amt | Digital_amt
1        | 100       | 45
2        | 200       | 150
3        | 150       | 23
4        | 300       | 100
5        | 350       | 350
6        | 112       | 10
7        | 312       | 15
8        | 260       | 160
9        | 232       | 150
10       | 190       | ...