Hi, I have a dataset like this:
| id | name |
|---|---|
| 1 | A |
| 1 | null |
| 2 | A |
| 3 | B |
| 4 | null |
| 4 | B |
| 5 | A |
| 6 | null |
I want to remove duplicate rows and keep the row where the name is not null.
This is the expected output:
| id | name |
|---|---|
| 1 | A |
| 2 | A |
| 3 | B |
| 4 | B |
| 5 | A |
| 6 | null |
I tried this:

```python
df.orderBy("name", ascending=False) \
  .dropDuplicates(["id"]) \
  .show(10, False)
```

It removes the duplicate rows, but for some ids I still get null values in the "name" column.
Thanks in advance for your help.
- You can add a filter to keep non-null values after `dropDuplicates`. – s.polam, Apr 29, 2024 at 12:08
- Yes, but for some ids like "6" there is just one row, with a null name, and I don't want to remove that line because it's not a duplicate row. – nbs335, Apr 29, 2024 at 12:44
1 Answer
```python
from pyspark.sql.functions import col, count, lit

schema_list = ['id', 'name']
data_list = [(1, 'A'), (1, 'null'), (2, 'A'), (3, 'B'), (4, 'null'), (4, 'B'), (5, 'A'), (6, 'null')]

# Create the data frame (nulls are represented here as the string 'null')
df = spark.createDataFrame(data_list, schema_list)

# Create a dataframe with the ids that appear more than once
df_mult_id = df.groupBy('id').agg(count('name').alias('count')).filter(col('count') > 1).select('id')

# Create a dataframe with the ids whose name is null
df_null = df.filter(col('name') == 'null').select('id')

# Find the ids common to both (duplicated AND having a null row), and rebuild those null rows
mult_null = df_null.intersect(df_mult_id).withColumn('name', lit('null'))

# Subtract those null rows from the original dataframe to get the result
result = df.subtract(mult_null)
result.orderBy('id').show(20, False)
```