Using Pandas to group data based on name and see if column value is greater than or equal to values based on group names

Question 1

As you'll see from the below code, I'm creating separate data frames of a much larger data frame, then updating a column for each one. What I'm doing is looking at the second column and checking to see if it meets a certain threshold value that is associated with the brand name found within the first column that has each individual hotel's name (part of that whole hotel name contains the specific brand name so I use .contains to group each of these hotels by brand name into specific data frame). Finally, I merge all of them together to get the original data frame with the updated column "Meeting/Not Meeting" column. When I've tried making a change only to the "splice" of the data frame associated with the rows that I'm interested in applying a certain threshold comparison to, it overwrites the previous values of the column with blank values, so this was the best solution I could come up with my limited knowledge.

Another thing I want to improve is the repetitiveness of the below code. I'm rewriting so much it seems and I'm sure there is a much better way of doing the following.

Here's an example of the data: sample data

Raw text if that's more helpful:

Property| Overall Experience| Meeting /Not Meeting

Hotel Portland 76
Hotel Suites Los Angeles 49
Hotel Suites Colorado Springs 80
Hotel Premium Denver 82
Hotel Premium Miami Extended Stay 70
Extended Stay Phoenix 65
Hotel Suites Redwood Extended Stay 81
Nexus Santa Barbara 92
Plaza by the Sea 77
Majestic By Hilltop 90
Majestic San Francisco 85
Hotel Seattle 56

import pandas as pd
from google.colab import files
# file uploaded here and name extracted
uploaded = files.upload()
filename = next(iter(uploaded))
df = pd.read_excel(filename)
# separate dfs for each brand name since they have their own threshold value for Overall Experience guest rating
hotel_core_df = df.loc[(df['Property'].str.contains('Hotel')) & ~(df['Property'].str.contains('Hotel Suites')) & ~(df['Property'].str.contains('Hotel Premium'))].copy()
hotel_core_df['Meeting /Not Meeting'] = hotel_core_df['Overall Experience'] >= 72.10
# if Extended Stay in name - treat as Extended Stay brand even if Suites or Premium property
hotel_suites_df = df.loc[df['Property'].str.contains('Hotel Suites') & ~(df['Property'].str.contains('Extended Stay'))].copy()
hotel_suites_df['Meeting /Not Meeting'] = hotel_suites_df['Overall Experience'] >= 74.65
hotel_prem_df = df.loc[df['Property'].str.contains('Hotel Premium') & ~(df['Property'].str.contains('Extended Stay'))].copy()
hotel_prem_df['Meeting /Not Meeting'] = hotel_prem_df['Overall Experience'] >= 75.71
nexus_df = df.loc[df['Property'].str.contains('Nexus')].copy()
nexus_df['Meeting /Not Meeting'] = nexus_df['Overall Experience'] >= 74.82
plaza_df = df.loc[df['Property'].str.contains('Plaza')].copy()
plaza_df['Meeting /Not Meeting'] = plaza_df['Overall Experience'] >= 73.82
extend_df = df.loc[df['Property'].str.contains('Extended Stay')].copy()
extend_df['Meeting /Not Meeting'] = extend_df['Overall Experience'] >= 74.82
majestic_df = df.loc[df['Property'].str.contains('Majestic')].copy()
majestic_df['Meeting /Not Meeting'] = majestic_df['Overall Experience'] >= 75.78
merged_df = pd.concat([hotel_core_df, hotel_suites_df, hotel_prem_df, nexus_df, plaza_df, extend_df, majestic_df])
merged_df.sort_values(by=['Overall Experience'], ascending=False, inplace=True)
merged_df['Meeting /Not Meeting'] = merged_df.apply(lambda x: 'Meeting' if x['Meeting /Not Meeting'] == True else 'Not Meeting',axis=1)
 
merged_df.to_excel('OE threshold.xlsx', index=False)

Question 2

Can you show an example data file? Otherwise we will not be able to run this.

Question 3

Based on the actual data, you may or may not need to break vectorisation and do a looped or mapped regular expression match. But it's impossible to say without seeing the real thing.

Question 4

I added an example of the data above.

Question 5

Plaza Hotel by the Sea is ambiguous. Which record should it match?

Question 6

Good catch. I made a mistake. Obviously I didn't want to use real data and names so I typed that sample out myself. Neither Plaza Hotel or Nexus Hotel should contain Hotel. I updated it above. Hopefully, I didn't make any other errors.

Question 7

You'll need to rethink your required experience criteria as a merge between a criteria table and a hotel records table. For your existing data, the easiest way to do this is a priority substring match. This is less repetitive than what you have now, but I have not tested its performance.

Also, you should probably skip this altogether:

merged_df['Meeting /Not Meeting'] = merged_df.apply(lambda x: 'Meeting' if x['Meeting /Not Meeting'] == True else 'Not Meeting',axis=1)

Real booleans are better than strings.

Suggested

import pandas as pd
df = pd.DataFrame({
 'Property': [
 'Hotel', 'Hotel Suites', 'Hotel Premium', 'Nexus', 'Plaza', 'Extended Stay',
 'Majestic', 'Majestic', 'Hotel Portland', 'Hotel Suites Los Angeles',
 'Hotel Suites Colorado Springs', 'Hotel Premium Denver',
 'Hotel Premium Miami Extended Stay', 'Extended Stay Phoenix',
 'Hotel Suites Redwood Extended Stay', 'Nexus Santa Barbara',
 'Plaza by the Sea', 'Majestic By Hilltop', 'Majestic San Francisco',
 'Hotel Seattle',
 ],
 'Overall Experience': [
 75, 74, 74, 76, 76, 76, 71, 77,
 76, 49, 80, 82, 70, 65, 81, 92,
 77, 90, 85, 56],
})
df.index.name = 'record'
# Decreasing order of priority
criteria = pd.DataFrame({
 'required_experience': [74.82, 74.65, 75.71, 74.82, 73.82, 75.78, 72.10],
 'property_term': ['extended', 'suites', 'premium', 'nexus', 'plaza', 'majestic', 'hotel'],
})
criteria.index.name = 'priority'
# Long-form series of hotel term words, with nulls dropped
terms = df.Property.str.lower().str.split(expand=True).stack()
terms.name = 'property_term'
terms.index.names = ('record', 'term_index')
# Merged dataframe between terms and criteria, with ambiguity resolved by taking first term in priority sequence
keep_cols = ['property_term', 'required_experience']
merged = (
 pd.merge(left=terms.reset_index(), right=criteria.reset_index(), on='property_term')
 .sort_values(by=['record', 'priority'])
 .set_index('record')
 .groupby(level='record').first()
)[keep_cols]
df[keep_cols] = merged
df['Meeting'] = df['Overall Experience'] >= df.required_experience
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)
print(df)

 Property Overall Experience property_term required_experience Meeting
record 
0 Hotel 75 hotel 72.10 True
1 Hotel Suites 74 suites 74.65 False
2 Hotel Premium 74 premium 75.71 False
3 Nexus 76 nexus 74.82 True
4 Plaza 76 plaza 73.82 True
5 Extended Stay 76 extended 74.82 True
6 Majestic 71 majestic 75.78 False
7 Majestic 77 majestic 75.78 True
8 Hotel Portland 76 hotel 72.10 True
9 Hotel Suites Los Angeles 49 suites 74.65 False
10 Hotel Suites Colorado Springs 80 suites 74.65 True
11 Hotel Premium Denver 82 premium 75.71 True
12 Hotel Premium Miami Extended Stay 70 extended 74.82 False
13 Extended Stay Phoenix 65 extended 74.82 False
14 Hotel Suites Redwood Extended Stay 81 extended 74.82 True
15 Nexus Santa Barbara 92 nexus 74.82 True
16 Plaza by the Sea 77 plaza 73.82 True
17 Majestic By Hilltop 90 majestic 75.78 True
18 Majestic San Francisco 85 majestic 75.78 True
19 Hotel Seattle 56 hotel 72.10 False

Question 8

Thank you. I will test this out. This already seems a lot better that what I was doing and will help me avoid repetition with future tasks. I'm sure there are easier ways of doing this directly in Excel, but I haven't worked with it as much so Python seemed like a good alternative. I completely agree with skipping the Boolean to string conversion. The only reason I did that is because the string format is what the rest of the team I'm sharing this data with will likely be expecting.

Reinderien Reinderien 70.9k5 gold badges76 silver badges256 bronze badges · Answer 1 · 2023-07-06 22:04:24Z

You'll need to rethink your required experience criteria as a merge between a criteria table and a hotel records table. For your existing data, the easiest way to do this is a priority substring match. This is less repetitive than what you have now, but I have not tested its performance.

Also, you should probably skip this altogether:

merged_df['Meeting /Not Meeting'] = merged_df.apply(lambda x: 'Meeting' if x['Meeting /Not Meeting'] == True else 'Not Meeting',axis=1)

Real booleans are better than strings.

Suggested

import pandas as pd
df = pd.DataFrame({
 'Property': [
 'Hotel', 'Hotel Suites', 'Hotel Premium', 'Nexus', 'Plaza', 'Extended Stay',
 'Majestic', 'Majestic', 'Hotel Portland', 'Hotel Suites Los Angeles',
 'Hotel Suites Colorado Springs', 'Hotel Premium Denver',
 'Hotel Premium Miami Extended Stay', 'Extended Stay Phoenix',
 'Hotel Suites Redwood Extended Stay', 'Nexus Santa Barbara',
 'Plaza by the Sea', 'Majestic By Hilltop', 'Majestic San Francisco',
 'Hotel Seattle',
 ],
 'Overall Experience': [
 75, 74, 74, 76, 76, 76, 71, 77,
 76, 49, 80, 82, 70, 65, 81, 92,
 77, 90, 85, 56],
})
df.index.name = 'record'
# Decreasing order of priority
criteria = pd.DataFrame({
 'required_experience': [74.82, 74.65, 75.71, 74.82, 73.82, 75.78, 72.10],
 'property_term': ['extended', 'suites', 'premium', 'nexus', 'plaza', 'majestic', 'hotel'],
})
criteria.index.name = 'priority'
# Long-form series of hotel term words, with nulls dropped
terms = df.Property.str.lower().str.split(expand=True).stack()
terms.name = 'property_term'
terms.index.names = ('record', 'term_index')
# Merged dataframe between terms and criteria, with ambiguity resolved by taking first term in priority sequence
keep_cols = ['property_term', 'required_experience']
merged = (
 pd.merge(left=terms.reset_index(), right=criteria.reset_index(), on='property_term')
 .sort_values(by=['record', 'priority'])
 .set_index('record')
 .groupby(level='record').first()
)[keep_cols]
df[keep_cols] = merged
df['Meeting'] = df['Overall Experience'] >= df.required_experience
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)
print(df)

 Property Overall Experience property_term required_experience Meeting
record 
0 Hotel 75 hotel 72.10 True
1 Hotel Suites 74 suites 74.65 False
2 Hotel Premium 74 premium 75.71 False
3 Nexus 76 nexus 74.82 True
4 Plaza 76 plaza 73.82 True
5 Extended Stay 76 extended 74.82 True
6 Majestic 71 majestic 75.78 False
7 Majestic 77 majestic 75.78 True
8 Hotel Portland 76 hotel 72.10 True
9 Hotel Suites Los Angeles 49 suites 74.65 False
10 Hotel Suites Colorado Springs 80 suites 74.65 True
11 Hotel Premium Denver 82 premium 75.71 True
12 Hotel Premium Miami Extended Stay 70 extended 74.82 False
13 Extended Stay Phoenix 65 extended 74.82 False
14 Hotel Suites Redwood Extended Stay 81 extended 74.82 True
15 Nexus Santa Barbara 92 nexus 74.82 True
16 Plaza by the Sea 77 plaza 73.82 True
17 Majestic By Hilltop 90 majestic 75.78 True
18 Majestic San Francisco 85 majestic 75.78 True
19 Hotel Seattle 56 hotel 72.10 False

Thank you. I will test this out. This already seems a lot better that what I was doing and will help me avoid repetition with future tasks. I'm sure there are easier ways of doing this directly in Excel, but I haven't worked with it as much so Python seemed like a good alternative. I completely agree with skipping the Boolean to string conversion. The only reason I did that is because the string format is what the rest of the team I'm sharing this data with will likely be expecting.

Stack Exchange Network

Using Pandas to group data based on name and see if column value is greater than or equal to values based on group names

1 Answer 1

Suggested

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Using Pandas to group data based on name and see if column value is greater than or equal to values based on group names

1 Answer 1

Suggested

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions