Use python/pandas to combine rows where duplicate values exist in one column

I want to use Python to check whether the first occurrence of a value in the "Id" column is repeated on a later row in that same column. If it is, I want to take the "Avail" values from the rows that match that first "Id" and append them to the first row's "Avail" value, separated by semicolons. Then I want to delete the rows with the duplicate Ids.

Here's my sample data, from a CSV file:

Id,First,Last,Avail 
abcdefg,John,Smith,4164667a-5dca-4ec6-a495-4be5b135d868=immediate 
dgasgas,Nancy,Adams,f98a8fbd-fb88-49b9-894e-631ba2a6f369=immediate 
gaytrjhu,John,Smith,e24ddf4c-c79f-4a84-a4ed-d92a10cc9e15=immediate 
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-5a10b9261b60=immediate 
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-c5dfe3b1276c=relative|7

Desired output (v1). Please note that I don't care about the "First" or "Last" columns from the duplicate rows; I only care about the "Avail" data from those:

Id,First,Last,Avail
abcdefg,John,Smith,4164667a-5dca-4ec6-a495-4be5b135d868=immediate;3ec0c158-8782-41ff-8388-5a10b9261b60=immediate;3ec0c158-8782-41ff-8388-c5dfe3b1276c=relative|7
dgasgas,Nancy,Adams,f98a8fbd-fb88-49b9-894e-631ba2a6f369=immediate
gaytrjhu,John,Smith,e24ddf4c-c79f-4a84-a4ed-d92a10cc9e15=immediate
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-5a10b9261b60=immediate
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-c5dfe3b1276c=relative|7

Then I'd like to delete the "duplicate" rows, leaving this:

Id,First,Last,Avail
abcdefg,John,Smith,4164667a-5dca-4ec6-a495-4be5b135d868=immediate;3ec0c158-8782-41ff-8388-5a10b9261b60=immediate;3ec0c158-8782-41ff-8388-c5dfe3b1276c=relative|7
dgasgas,Nancy,Adams,f98a8fbd-fb88-49b9-894e-631ba2a6f369=immediate
gaytrjhu,John,Smith,e24ddf4c-c79f-4a84-a4ed-d92a10cc9e15=immediate

asked Apr 5, 2018 at 23:33 by Brock Winfrey

pandas groupby?

– Sphinx, commented Apr 5, 2018 at 23:36

1 Answer

import pandas as pd

# Small stand-in for the question's data: ID 1 appears twice.
df = pd.DataFrame(
    data=[
        [1, 'John', 'Smith', 'a'],
        [1, 'John', 'Smith', 'b'],
        [2, 'Kate', 'Smith', 'c'],
    ],
    columns=['ID', 'First', 'Last', 'Avail'],
)

# Group rows that share ID/First/Last and join their Avail values with ';'.
output = (
    df
    .groupby(['ID', 'First', 'Last'], as_index=False)
    .agg({'Avail': ';'.join})
)

You can use groupby as @Sphinx suggested. An example with the style of output you requested is above.
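Since the question says the "First" and "Last" values on the duplicate rows don't matter, a variation worth considering is to group on "Id" alone and keep the first "First"/"Last" seen for each Id. The sketch below is only one way to do that, under some assumptions: the sample CSV is inlined through `io.StringIO` instead of being read from a file, the name `combined` is just illustrative, and every "Avail" value is assumed to be a non-missing string (`';'.join` would fail on NaN). Passing `sort=False` keeps the Ids in order of first appearance.

```python
import io

import pandas as pd

# The question's sample CSV, inlined so the sketch is self-contained.
# In practice this would be pd.read_csv(<path to your file>).
csv_text = """Id,First,Last,Avail
abcdefg,John,Smith,4164667a-5dca-4ec6-a495-4be5b135d868=immediate
dgasgas,Nancy,Adams,f98a8fbd-fb88-49b9-894e-631ba2a6f369=immediate
gaytrjhu,John,Smith,e24ddf4c-c79f-4a84-a4ed-d92a10cc9e15=immediate
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-5a10b9261b60=immediate
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-c5dfe3b1276c=relative|7
"""

df = pd.read_csv(io.StringIO(csv_text))

# Group on Id only; sort=False preserves the order of first appearance.
# First/Last come from the first row of each group (the 'first'
# aggregation skips missing values), and Avail values are joined with ';'.
combined = (
    df
    .groupby('Id', as_index=False, sort=False)
    .agg({'First': 'first', 'Last': 'first', 'Avail': ';'.join})
)

# Show the result as CSV text; pass a path to to_csv to write a file instead.
print(combined.to_csv(index=False))
```

This leaves one row per Id, so no separate step is needed to delete the duplicate rows.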

answered Apr 6, 2018 at 0:39 by scomes
