I have a pandas dataframe that I want to manipulate. Here's an example of the data:

[image: screenshot of the example dataframe]

As you can see, there are 3 columns. The proteins column holds multiple elements separated by commas, whereas the term description column holds a single element per row. My aim is to reverse this: one column with a single element from proteins per row, and another column with all the matching elements from term description. For example, if the protein CYP51A1 falls under the term descriptions metabolic process and organic substance metabolic process, I want my dataframe to look like this:

protein_name | term description
---------------------------------------------------------------------
CYP51A1 | metabolic process, organic substance metabolic process
etc.

I hope I explained this well enough. Thanks for your help!

asked Feb 3, 2022 at 15:36

1 Answer

You can achieve this with the pandas explode, groupby, and apply methods.

Let's create a sample dataframe first.

import pandas as pd

df1 = pd.DataFrame.from_dict({
    'term description': ['metabolic process', 'organic substance metabolic process', 'metabolic process a', 'metabolic process b', 'metabolic process c'],
    'false discovery rate': [1.01, 1.001, 1.02, 1.03, 1.04],
    'proteins': ['CYP51A1,CPA1,STK10', 'CYP51A1,CPA1,AAA', 'CPA1,AAA,BBB,CCC', 'AAA,BBB,CCC,DDD', 'AAA,CCC,EEE,FFF']
})
# dataframe df1
                      term description  false discovery rate            proteins
0                    metabolic process                 1.010  CYP51A1,CPA1,STK10
1  organic substance metabolic process                 1.001    CYP51A1,CPA1,AAA
2                  metabolic process a                 1.020    CPA1,AAA,BBB,CCC
3                  metabolic process b                 1.030     AAA,BBB,CCC,DDD
4                  metabolic process c                 1.040     AAA,CCC,EEE,FFF

Let's split each value in the proteins column into a list, so that we can explode it into one row per protein.

df1['proteins'] = df1['proteins'].str.split(',')  # 'A,B,C' -> ['A', 'B', 'C']
df1 = df1.explode('proteins')                     # one row per list element
# dataframe df1
                      term description  false discovery rate  proteins
0                    metabolic process                 1.010   CYP51A1
0                    metabolic process                 1.010      CPA1
0                    metabolic process                 1.010     STK10
1  organic substance metabolic process                 1.001   CYP51A1
1  organic substance metabolic process                 1.001      CPA1
1  organic substance metabolic process                 1.001       AAA
2                  metabolic process a                 1.020      CPA1
2                  metabolic process a                 1.020       AAA
2                  metabolic process a                 1.020       BBB
2                  metabolic process a                 1.020       CCC
3                  metabolic process b                 1.030       AAA
3                  metabolic process b                 1.030       BBB
3                  metabolic process b                 1.030       CCC
3                  metabolic process b                 1.030       DDD
4                  metabolic process c                 1.040       AAA
4                  metabolic process c                 1.040       CCC
4                  metabolic process c                 1.040       EEE
4                  metabolic process c                 1.040       FFF
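
If the real data has spaces around the commas or missing values in the proteins column, the split step may need a little cleanup. Here is a minimal sketch of one way to handle that; the frame `raw` and its values are made up for illustration and are not taken from the question.

```python
import pandas as pd

# Hypothetical input with stray spaces and a missing value (not from the question).
raw = pd.DataFrame({
    'term description': ['metabolic process', 'organic substance metabolic process', 'transport'],
    'proteins': ['CYP51A1, CPA1', ' CYP51A1 ,AAA', None],
})

tidy = raw.dropna(subset=['proteins']).copy()         # skip rows without any proteins
tidy['proteins'] = tidy['proteins'].str.split(',')    # string -> list of names
tidy = tidy.explode('proteins').reset_index(drop=True)  # one row per protein
tidy['proteins'] = tidy['proteins'].str.strip()       # remove surrounding spaces

print(tidy)
```

Stripping after the explode keeps the split pattern simple; whether dropping rows with missing proteins is right depends on your data.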

Now we'll combine the 'term description' values that belong to the same protein.

df2 = df1.groupby('proteins')['term description'].apply(list).reset_index()
# dataframe df2
   proteins                                   term description
0       AAA  [organic substance metabolic process, metaboli...
1       BBB         [metabolic process a, metabolic process b]
2       CCC  [metabolic process a, metabolic process b, met...
3      CPA1  [metabolic process, organic substance metaboli...
4   CYP51A1  [metabolic process, organic substance metaboli...
5       DDD                              [metabolic process b]
6       EEE                              [metabolic process c]
7       FFF                              [metabolic process c]
8     STK10                                [metabolic process]
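
If you would rather have one comma-separated string per protein, as in the table in the question, joining the values instead of collecting them into a list should work. This is a sketch on a reduced, hand-built version of the exploded frame; the names `long_df` and `result` are placeholders, and `protein_name` is taken from the desired output in the question.

```python
import pandas as pd

# Reduced stand-in for the exploded frame: two proteins, two terms each.
long_df = pd.DataFrame({
    'term description': ['metabolic process', 'metabolic process',
                         'organic substance metabolic process',
                         'organic substance metabolic process'],
    'proteins': ['CYP51A1', 'CPA1', 'CYP51A1', 'CPA1'],
})

result = (
    long_df.groupby('proteins')['term description']
           .agg(', '.join)                                # one joined string per protein
           .reset_index()
           .rename(columns={'proteins': 'protein_name'})  # column name used in the question
)

print(result)
```

A joined string is easier to read or export, while a list is easier to process further, so the choice depends on what you do next.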

Finally, we apply a function that modifies the 'proteins' values as required. Here is a sample based on what you mentioned; you can add more conditions inside the function as needed. Proteins that match no condition keep their original name.

def modifier(protein, term_descrip):
    if protein == 'CYP51A1' and set(term_descrip).intersection({'metabolic process', 'organic substance metabolic process'}):
        return 'CYP51A1 etc.'
    # add more if conditions as required
    return protein  # no condition matched: keep the name (otherwise the function returns None)
df2['proteins'] = df2.apply(lambda row: modifier(row['proteins'], row['term description']), axis=1)
# dataframe df2
       proteins                                   term description
0           AAA  [organic substance metabolic process, metaboli...
1           BBB         [metabolic process a, metabolic process b]
2           CCC  [metabolic process a, metabolic process b, met...
3          CPA1  [metabolic process, organic substance metaboli...
4  CYP51A1 etc.  [metabolic process, organic substance metaboli...
5           DDD                              [metabolic process b]
6           EEE                              [metabolic process c]
7           FFF                              [metabolic process c]
8         STK10                                [metabolic process]
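
As an alternative to a row-wise function, the same kind of conditional relabelling can be sketched with a boolean mask and `.loc`, which only touches the rows that match. The frame `grouped` is a small hand-built stand-in for the grouped result, and `targets` and `has_target` are illustrative names.

```python
import pandas as pd

# Small stand-in for the grouped frame: one list of term descriptions per protein.
grouped = pd.DataFrame({
    'proteins': ['AAA', 'CYP51A1', 'STK10'],
    'term description': [
        ['organic substance metabolic process', 'metabolic process a'],
        ['metabolic process', 'organic substance metabolic process'],
        ['metabolic process'],
    ],
})

targets = {'metabolic process', 'organic substance metabolic process'}

# True where the row's list contains at least one of the target terms.
has_target = grouped['term description'].apply(lambda terms: not targets.isdisjoint(terms))
mask = (grouped['proteins'] == 'CYP51A1') & has_target

# Relabel only the matching rows; every other name is left as it is.
grouped.loc[mask, 'proteins'] = 'CYP51A1 etc.'

print(grouped)
```

Each extra condition becomes another mask plus one `.loc` assignment.
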
answered Feb 3, 2022 at 16:31