Hi, I have a dataset like this:

id name
1 A
1 null
2 A
3 B
4 null
4 B
5 A
6 null

I want to remove the duplicate rows, keeping for each id the row whose name is not null.

This is the expected output:

id name
1 A
2 A
3 B
4 B
5 A
6 null

I tried this:

(
    df
    .orderBy("name", ascending=False)
    .dropDuplicates(["id"])
    .show(10, False)
)

It removes the duplicate rows, but I still get null values in the "name" column for ids that also have a non-null name.

Thanks in advance for your help.

s.polam
asked Apr 29, 2024 at 11:48
  • You can add a filter to keep only the non-null values after dropDuplicates. Commented Apr 29, 2024 at 12:08
  • Yes, but some ids, like 6, have only one row, with a null name, and I don't want to remove that row because it's not a duplicate. Commented Apr 29, 2024 at 12:44

1 Answer

from pyspark.sql.functions import col, count, lit

schema_list = ['id','name']

data_list = [(1,'A'),(1,'null'),(2,'A'),(3,'B'),(4,'null'),(4,'B'),(5,'A'),(6,'null')]

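Create the DataFrame. Note that 'null' here is the literal string, not a real null value, and the steps below rely on that.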

df = spark.createDataFrame(data_list,schema_list)

Create a DataFrame of the ids that appear more than once

df_mult_id = df.groupBy('id').agg(count('name').alias('count')).filter(col('count')>1).select('id')

Create a DataFrame of the ids that have 'null' as the name

df_null = df.filter(col('name')=='null').select('id')

Find the ids common to both DataFrames and add a name column containing 'null', which reproduces exactly the rows to drop

mult_null = df_null.intersect(df_mult_id).withColumn('name',lit('null'))

Subtract those rows from the original DataFrame to get the result

result = df.subtract(mult_null)

result.orderBy('id').show(20,False)
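
For reference, the selection rule itself (one row per id, preferring a non-null name, but keeping an id whose only row is null) can be sketched in plain Python, independent of Spark. This is only an illustration of the logic, not the code from the answer above: the function name dedupe_prefer_non_null is made up for this example, and it assumes real nulls (None) rather than the string 'null'.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, Optional[str]]


def dedupe_prefer_non_null(rows: List[Row]) -> List[Row]:
    """Keep one row per id, preferring a non-null name when one exists."""
    best: Dict[int, Optional[str]] = {}
    for row_id, name in rows:
        # Record the id the first time it is seen; afterwards only
        # overwrite the stored name if it is still null.
        if row_id not in best or best[row_id] is None:
            best[row_id] = name
    # ids are unique keys, so sorting never compares None with a string.
    return sorted(best.items())


# Sample data from the question, with None standing in for null.
rows = [(1, "A"), (1, None), (2, "A"), (3, "B"),
        (4, None), (4, "B"), (5, "A"), (6, None)]

result = dedupe_prefer_non_null(rows)
for row_id, name in result:
    print(row_id, name)
```

In PySpark, the same rule could probably be expressed by grouping on id and aggregating name with max, since aggregate functions skip nulls and return null when a group has no non-null value. That mapping is an untested suggestion and again assumes real nulls in the column.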

answered Apr 29, 2024 at 15:50