Hi, I have a dataset like this:
| id | name |
|---|---|
| 1 | A |
| 1 | null |
| 2 | A |
| 3 | B |
| 4 | null |
| 4 | B |
| 5 | A |
| 6 | null |
I want to remove duplicate rows and keep the row where the name is not null.
This is the expected output:
| id | name |
|---|---|
| 1 | A |
| 2 | A |
| 3 | B |
| 4 | B |
| 5 | A |
| 6 | null |
I tried this:

```python
df.orderBy("name", ascending=False) \
  .dropDuplicates(["id"]) \
  .show(10, False)
```

It removes the duplicate rows, but for some ids I still get null values in the "name" column.
Thanks in advance for your help.
- You can add a filter to keep non-null values after `dropDuplicates`. – s.polam, Apr 29, 2024 at 12:08
- Yes, but for some ids like "6" there is just one row, with a null name, and I don't want to remove that line because it's not a duplicate row. – nbs335, Apr 29, 2024 at 12:44
1 Answer
```python
from pyspark.sql.functions import col, count, lit

schema_list = ['id', 'name']
data_list = [(1, 'A'), (1, 'null'), (2, 'A'), (3, 'B'), (4, 'null'), (4, 'B'), (5, 'A'), (6, 'null')]

# Create the data frame (nulls are represented here as the string 'null')
df = spark.createDataFrame(data_list, schema_list)

# Create a dataframe with the ids that appear more than once
df_mult_id = df.groupBy('id').agg(count('name').alias('count')).filter(col('count') > 1).select('id')

# Create a dataframe with the ids whose name is null
df_null = df.filter(col('name') == 'null').select('id')

# Find the ids common to both (duplicated AND having a null row), and rebuild those null rows
mult_null = df_null.intersect(df_mult_id).withColumn('name', lit('null'))

# Subtract those null rows from the original dataframe to get the result
result = df.subtract(mult_null)
result.orderBy('id').show(20, False)
```