How to modify large amounts of data using pandas

Question 1

As shown in this code, I need to use certain data in one table as a basis to modify another table and add some information. When this kind of table information scale is large, this violent traversal method is very inefficient. How should I modify it? Moreover, multiple sheets need to be compared, which makes the efficiency even lower.

df = pd.DataFrame({'id': [123, 321, 456, 543], 'name': ['xxx', 'yyy', 'zzz', 'www']})
df.set_index('id', inplace=True)
df_1 = pd.DataFrame({'id': [123, 321, 456, 543], 'name': ['xxx', 'yyy', 'zzz', 'www'], 'complete': ['yes', 'yes', 'yes', 'yes'], 'course_name':['AA', 'BB', 'AA', 'DD'], 'complete_date': ['1.1', '1.2', '1.1', '1.5']})
df_1.set_index('id', inplace=True)
group_df = df_1.groupby('course_name')
info = dict()
for course_name, course_df in group_df:
 info[course_name]=[]
 def process(row):
 info[course_name].append(Subscriber(*row.tolist()))
 get_info = course_df.loc[course_df["complete"] == "yes"]
 get_columns = ['name', 'complete_date']
 finish_df = get_info[get_columns]
 Subscriber = namedtuple('Subscriber', ['name', 'complete_date'])
 finish_df.apply(process, axis = 1)
# print(info)
# {'AA': [Subscriber(name='xxx', complete='yes'), Subscriber(name='zzz', complete='yes')], 'BB': [Subscriber(name='yyy', complete='yes')], 'DD': [Subscriber(name='www', complete='yes')]}
'''modify df'''
names = set(df['name'])
for course in info.keys():
 for name, date in info[course]:
 if name in names:
 df.loc[df['name'] == name, course] = date + 'yes' 
 pass
# print(df)

Question 2

Welcome here, with the current output, it looks like a pivot would do the job, something like

df_1.reset_index().assign(val=lambda x: x['complete_date']+'yes').pivot(index=['id', 'name'], columns='course_name', values='val')

plus a bit of cosmetic. But I assume your real case is a bit more complex, could you elaborate if this solution does not fill your need?

Question 3

Thank you very much and sorry for the late reply. Yes, the requirements will be a little more complicated than this, because the table is used to maintain personnel information, and subsequent information will be used to add to the original table instead of overwriting it. I'll think about how to accomplish it,

Question 4

As Ben.T is pointing out in the comments, a pivot approach would do the job. Here is an example of how you can do this:

import pandas as pd
from collections import namedtuple
df = pd.DataFrame({'id': [123, 321, 456, 543], 'name': ['xxx', 'yyy', 'zzz', 'www']})
df_1 = pd.DataFrame({'id': [123, 321, 456, 543], 'name': ['xxx', 'yyy', 'zzz', 'www'], 'complete': ['yes', 'yes', 'yes', 'yes'], 'course_name': ['AA', 'BB', 'AA', 'DD'], 'complete_date': ['1.1', '1.2', '1.1', '1.5']})
completed_courses = df_1[df_1['complete'] == 'yes']
pivot_df = completed_courses.pivot_table(index='id', columns='course_name', values='complete_date', aggfunc='first')
result_df = df.set_index('id').join(pivot_df, on='id')
result_df = result_df.fillna('')
for col in pivot_df.columns:
 result_df[col] = result_df[col].apply(lambda x: 'yes' if x != '' else x)
result_df.reset_index(inplace=True)
print(result_df)

which gives you

 id name AA BB DD
0 123 xxx yes 
1 321 yyy yes 
2 456 zzz yes 
3 543 www yes

Serge de Gosson de Varennes 11.6k4 gold badges30 silver badges60 bronze badges · Accepted Answer · 2024-05-25 05:38:20Z

As Ben.T is pointing out in the comments, a pivot approach would do the job. Here is an example of how you can do this:

import pandas as pd
from collections import namedtuple
df = pd.DataFrame({'id': [123, 321, 456, 543], 'name': ['xxx', 'yyy', 'zzz', 'www']})
df_1 = pd.DataFrame({'id': [123, 321, 456, 543], 'name': ['xxx', 'yyy', 'zzz', 'www'], 'complete': ['yes', 'yes', 'yes', 'yes'], 'course_name': ['AA', 'BB', 'AA', 'DD'], 'complete_date': ['1.1', '1.2', '1.1', '1.5']})
completed_courses = df_1[df_1['complete'] == 'yes']
pivot_df = completed_courses.pivot_table(index='id', columns='course_name', values='complete_date', aggfunc='first')
result_df = df.set_index('id').join(pivot_df, on='id')
result_df = result_df.fillna('')
for col in pivot_df.columns:
 result_df[col] = result_df[col].apply(lambda x: 'yes' if x != '' else x)
result_df.reset_index(inplace=True)
print(result_df)

which gives you

 id name AA BB DD
0 123 xxx yes 
1 321 yyy yes 
2 456 zzz yes 
3 543 www yes

CollectivesTM on Stack Overflow

How to modify large amounts of data using pandas

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CollectivesTM on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related