I have a pandas dataframe that I want to manipulate. Here's an example of the data:
As you can see, there are 3 columns. The proteins column holds multiple elements separated by commas, whereas the term description column holds a single element per row. My aim is to reverse this: one column with a single protein per row, and another column with all the term descriptions that protein falls under. For example, if the protein CYP51A1 falls under the term descriptions metabolic process and organic substance metabolic process, I want my dataframe to look like this:
protein_name | term description
---------------------------------------------------------------------
CYP51A1 | metabolic process, organic substance metabolic process
etc.
I hope I explained this well enough! Thanks for your help!
1 Answer
You can achieve this with the pandas explode and groupby methods, followed by apply.
Let's create a sample dataframe first.
import pandas as pd

df1 = pd.DataFrame.from_dict({
    'term description': ['metabolic process', 'organic substance metabolic process', 'metabolic process a', 'metabolic process b', 'metabolic process c'],
    'false discovery rate': [1.01, 1.001, 1.02, 1.03, 1.04],
    'proteins': ['CYP51A1,CPA1,STK10', 'CYP51A1,CPA1,AAA', 'CPA1,AAA,BBB,CCC', 'AAA,BBB,CCC,DDD', 'AAA,CCC,EEE,FFF']
})
# dataframe df1
                      term description  false discovery rate            proteins
0                    metabolic process                 1.010  CYP51A1,CPA1,STK10
1  organic substance metabolic process                 1.001    CYP51A1,CPA1,AAA
2                  metabolic process a                 1.020    CPA1,AAA,BBB,CCC
3                  metabolic process b                 1.030     AAA,BBB,CCC,DDD
4                  metabolic process c                 1.040     AAA,CCC,EEE,FFF
Let's split the proteins column to a list, so that we can explode it.
df1['proteins'] = df1['proteins'].str.split(',')  # 'A,B,C' -> ['A', 'B', 'C']
df1 = df1.explode('proteins')  # one row per protein; the original index is repeated
# dataframe df1
                      term description  false discovery rate  proteins
0                    metabolic process                 1.010   CYP51A1
0                    metabolic process                 1.010      CPA1
0                    metabolic process                 1.010     STK10
1  organic substance metabolic process                 1.001   CYP51A1
1  organic substance metabolic process                 1.001      CPA1
1  organic substance metabolic process                 1.001       AAA
2                  metabolic process a                 1.020      CPA1
2                  metabolic process a                 1.020       AAA
2                  metabolic process a                 1.020       BBB
2                  metabolic process a                 1.020       CCC
3                  metabolic process b                 1.030       AAA
3                  metabolic process b                 1.030       BBB
3                  metabolic process b                 1.030       CCC
3                  metabolic process b                 1.030       DDD
4                  metabolic process c                 1.040       AAA
4                  metabolic process c                 1.040       CCC
4                  metabolic process c                 1.040       EEE
4                  metabolic process c                 1.040       FFF
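One thing to watch for in real data: if the 'proteins' strings have spaces around the commas (an assumption, the sample above does not), the exploded names keep that whitespace and would later be grouped as different proteins. A minimal sketch of stripping it, using a small hypothetical frame called raw:

```python
import pandas as pd

# Hypothetical sample with inconsistent spacing around the commas.
raw = pd.DataFrame({
    'term description': ['metabolic process', 'organic substance metabolic process'],
    'proteins': ['CYP51A1, CPA1 ,STK10', 'CYP51A1,CPA1'],
})

exploded = (
    raw.assign(proteins=raw['proteins'].str.split(','))  # string -> list of names
       .explode('proteins')                               # one row per protein
       .reset_index(drop=True)                            # explode repeats the old index
)

# Remove the whitespace left over from the original separators.
exploded['proteins'] = exploded['proteins'].str.strip()
```

The reset_index(drop=True) call is optional; it only replaces the repeated index values with a fresh 0..n-1 range.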
Now we'll combine the values under 'term description' that belong to the same protein.
df2 = df1.groupby('proteins')['term description'].apply(list).reset_index()
# dataframe df2
   proteins                                   term description
0       AAA  [organic substance metabolic process, metaboli...
1       BBB         [metabolic process a, metabolic process b]
2       CCC  [metabolic process a, metabolic process b, met...
3      CPA1  [metabolic process, organic substance metaboli...
4   CYP51A1  [metabolic process, organic substance metaboli...
5       DDD                              [metabolic process b]
6       EEE                              [metabolic process c]
7       FFF                              [metabolic process c]
8     STK10                                [metabolic process]
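The layout in the question shows the term descriptions as one comma-separated string per protein rather than a Python list. A minimal sketch of that variant, assuming an already exploded frame (rebuilt here from a few of the sample rows) and using protein_name as the column name from the question:

```python
import pandas as pd

# A few rows in the same shape as df1 after explode.
exploded = pd.DataFrame({
    'term description': ['metabolic process', 'metabolic process',
                         'organic substance metabolic process'],
    'proteins': ['CYP51A1', 'CPA1', 'CYP51A1'],
})

result = (
    exploded.groupby('proteins')['term description']
            .agg(', '.join)                               # join each group's terms into one string
            .reset_index()
            .rename(columns={'proteins': 'protein_name'})
)
```

Here the CYP51A1 row of result holds the string 'metabolic process, organic substance metabolic process', which matches the table asked for in the question.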
Finally, apply a function that rewrites the 'proteins' values as required. The sample below is based on what you mentioned; add more conditions inside the function as needed. Rows that match no condition keep their original name.
def modifier(protein, term_descrip):
    # rename CYP51A1 when it falls under at least one of these terms
    if protein == 'CYP51A1' and set(term_descrip).intersection({'metabolic process', 'organic substance metabolic process'}):
        return 'CYP51A1 etc.'
    # add more if conditions as required
    return protein  # no condition matched: leave the name unchanged
df2['proteins'] = df2.apply(lambda row: modifier(row['proteins'], row['term description']), axis=1)
# dataframe df2
       proteins                                   term description
0           AAA  [organic substance metabolic process, metaboli...
1           BBB         [metabolic process a, metabolic process b]
2           CCC  [metabolic process a, metabolic process b, met...
3          CPA1  [metabolic process, organic substance metaboli...
4  CYP51A1 etc.  [metabolic process, organic substance metaboli...
5           DDD                              [metabolic process b]
6           EEE                              [metabolic process c]
7           FFF                              [metabolic process c]
8         STK10                                [metabolic process]
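For a single condition like this one, an alternative sketch is to build a boolean mask and assign through .loc, so that only the rows satisfying the condition are overwritten. The names wanted and mask are introduced here for illustration, and the two-row df2 is a cut-down stand-in for the grouped frame:

```python
import pandas as pd

# Cut-down stand-in for the grouped frame (lists of terms per protein).
df2 = pd.DataFrame({
    'proteins': ['AAA', 'CYP51A1'],
    'term description': [
        ['organic substance metabolic process', 'metabolic process a'],
        ['metabolic process', 'organic substance metabolic process'],
    ],
})

wanted = {'metabolic process', 'organic substance metabolic process'}

# True where the protein matches and shares at least one term with `wanted`.
mask = (df2['proteins'] == 'CYP51A1') & df2['term description'].apply(
    lambda terms: bool(wanted.intersection(terms))
)

# Overwrite only the matching rows; every other row is left as it was.
df2.loc[mask, 'proteins'] = 'CYP51A1 etc.'
```

The per-row set intersection still runs in Python, so this is mainly a readability choice, not a speed-up.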