In Python with Pandas, I am splitting a dataset's column names into 4 lists based on their suffix. For the 3 known suffixes I use a list comprehension; for the 4th, a set operation that subtracts the 3 lists from the list of all column names:
import pandas as pd

df = pd.DataFrame({
    "alcohol_by_volume": [],
    "barcode": [],
    "calcium_per_hundred": [],
    "calcium_unit": [],
    "carbohydrates_per_hundred": [],
    "carbohydrates_per_portion": [],
    "carbohydrates_unit": [],
    "cholesterol_per_hundred": [],
    "cholesterol_unit": [],
    "copper_cu_per_hundred": [],
    "copper_cu_unit": [],
    "country": [],
    "created_at": [],
    "energy_kcal_per_hundred": [],
    "energy_kcal_per_portion": [],
    "energy_kcal_unit": [],
    "energy_per_hundred": [],
    "energy_per_portion": [],
    "energy_unit": [],
    "fat_per_hundred": [],
    "fat_per_portion": [],
    "fat_unit": [],
    "fatty_acids_total_saturated_per_hundred": [],
    "fatty_acids_total_saturated_unit": [],
    "fatty_acids_total_trans_per_hundred": [],
    "fatty_acids_total_trans_unit": [],
    "fiber_insoluble_per_hundred": [],
    "fiber_insoluble_unit": [],
    "fiber_per_hundred": [],
    "fiber_per_portion": [],
    "fiber_soluble_per_hundred": [],
    "fiber_soluble_unit": [],
    "fiber_unit": [],
    "folate_total_per_hundred": [],
    "folate_total_unit": [],
    "folic_acid_per_hundred": [],
    "folic_acid_unit": [],
    "hundred_unit": [],
    "id": [],
    "ingredients_en": [],
    "iron_per_hundred": [],
    "iron_unit": [],
    "magnesium_per_hundred": [],
    "magnesium_unit": [],
    "manganese_mn_per_hundred": []
})
colnames_all = df.columns.to_list()
colnames_unit = [n for n in colnames_all if n.endswith("_unit")]
colnames_per_hundred = [n for n in colnames_all if n.endswith("_per_hundred")]
colnames_per_portion = [n for n in colnames_all if n.endswith("_per_portion")]
colnames_other = list(
    set(colnames_all) - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)
Expected result (2 examples; the other 2 lists are similar to the 1st):
colnames_unit:
['calcium_unit',
'carbohydrates_unit',
'cholesterol_unit',
'copper_cu_unit',
'energy_kcal_unit',
'energy_unit',
'fat_unit',
'fatty_acids_total_saturated_unit',
'fatty_acids_total_trans_unit',
'fiber_insoluble_unit',
'fiber_soluble_unit',
'fiber_unit',
'folate_total_unit',
'folic_acid_unit',
'hundred_unit',
'iron_unit',
'magnesium_unit']
colnames_other:
['ingredients_en',
'country',
'id',
'created_at',
'barcode',
'alcohol_by_volume']
However, this does not look like the best way to do it. Is there a "better" way, i.e. shorter and/or more elegant/idiomatic?
- It's hard to review this small fragment in isolation. It would be better to present a complete function, with its unit tests (or at least some sample input to illustrate it). (Toby Speight, Jul 3, 2023)
- @TobySpeight Added full code for repro. I did not define a function for this; maybe that is already part of the better way to do it...? The code repetition and the set subtraction don't look good to me. (evilmandarine, Jul 3, 2023)
2 Answers
colnames_all = df.columns.to_list()
I don't see a clear need for this. We could simply refer to df.columns instead.
list(
    set(colnames_all) - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)
That doesn't seem so bad, to me. Certainly the intent is clear.
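One caveat worth noting: a set difference discards the original column order, so colnames_other comes back in arbitrary order. If order matters, a membership test against a set keeps the O(1) lookups while preserving order. A minimal sketch with a made-up column list:

```python
# Made-up column list standing in for the real one
colnames_all = ["calcium_unit", "barcode", "fat_per_hundred", "id"]
colnames_unit = ["calcium_unit"]
colnames_per_hundred = ["fat_per_hundred"]
colnames_per_portion = []

# Membership tests against a set are O(1); iterating colnames_all
# preserves the original column order, unlike a set difference.
matched = set(colnames_unit + colnames_per_hundred + colnames_per_portion)
colnames_other = [n for n in colnames_all if n not in matched]
print(colnames_other)  # ['barcode', 'id']
```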
colnames_unit = [n for n in colnames_all if n.endswith("_unit")]
Consider rephrasing this (after an import re) as
import re

colnames_unit = [n for n in colnames_all if re.search(r'_unit$', n)]
That lets us generalize in this way:
colnames_measured = [n for n in df.columns if re.search(r'_(unit|per_hundred|per_portion)$', n)]
To find the inverse:
colnames_other = [n for n in df.columns if not re.search(r'_(unit|per_hundred|per_portion)$', n)]
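Put together, a runnable sketch of this regex approach (using a small made-up DataFrame and compiling the pattern once rather than on every call):

```python
import re
import pandas as pd

# Small made-up DataFrame standing in for the real one
df = pd.DataFrame(columns=["calcium_unit", "fat_per_hundred",
                           "fat_per_portion", "barcode"])

# Compile once; re.search scans for the anchored suffix alternation
suffix_re = re.compile(r"_(unit|per_hundred|per_portion)$")
colnames_measured = [n for n in df.columns if suffix_re.search(n)]
colnames_other = [n for n in df.columns if not suffix_re.search(n)]
print(colnames_other)  # ['barcode']
```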
- Maybe it is not clear, but I need a different list per suffix, so colnames_measured does not work for me. Also, as it is a constant, known suffix, endswith() seems OK; that is not the issue. The question is: if this were a 10,000-item list with, say, a collection of 100 suffixes, what would be the best way to address it? Thank you for your input, though. (evilmandarine, Jul 3, 2023)
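For the many-suffixes case raised in this comment, one possible sketch (names here are made up) is to collect the per-suffix lists into a dict keyed by suffix, so adding a suffix is one list entry rather than a new variable and a new comprehension:

```python
import pandas as pd

# Made-up DataFrame; suffixes could just as well hold 100 entries
df = pd.DataFrame(columns=["calcium_unit", "fat_per_hundred",
                           "fat_per_portion", "barcode"])
suffixes = ["_unit", "_per_hundred", "_per_portion"]

# One list per suffix, keyed by that suffix
groups = {s: [n for n in df.columns if n.endswith(s)] for s in suffixes}
# Everything matching none of the suffixes
groups["other"] = [n for n in df.columns
                   if not n.endswith(tuple(suffixes))]
print(groups["other"])  # ['barcode']
```

Note that str.endswith accepts a tuple of suffixes, which keeps the "other" bucket to a single test per name.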
Don't use comprehensions. Don't use lists. Don't use sets. Use Pandas string vectorisation:
colnames_all = df.columns
is_unit = colnames_all.str.endswith("_unit")
is_hundred = colnames_all.str.endswith("_per_hundred")
is_portion = colnames_all.str.endswith("_per_portion")
colnames_unit = colnames_all[is_unit]
colnames_per_hundred = colnames_all[is_hundred]
colnames_per_portion = colnames_all[is_portion]
colnames_other = colnames_all[~(is_unit | is_hundred | is_portion)]
print(colnames_other)
Index(['alcohol_by_volume', 'barcode', 'country', 'created_at', 'id',
'ingredients_en'],
dtype='object')
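To address the code repetition the question mentions, this vectorised approach could also be wrapped in a small helper; a sketch (split_by_suffix is a made-up name, not a Pandas API):

```python
import numpy as np
import pandas as pd

def split_by_suffix(columns: pd.Index, suffixes):
    """Return {suffix: matching column names}, plus an 'other' bucket."""
    # One boolean mask per suffix, vectorised over the Index
    masks = {s: columns.str.endswith(s) for s in suffixes}
    groups = {s: list(columns[m]) for s, m in masks.items()}
    # Columns matched by none of the suffixes
    matched = np.logical_or.reduce(list(masks.values()))
    groups["other"] = list(columns[~matched])
    return groups

df = pd.DataFrame(columns=["calcium_unit", "fat_per_hundred", "barcode"])
groups = split_by_suffix(df.columns, ["_unit", "_per_hundred", "_per_portion"])
print(groups["other"])  # ['barcode']
```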
- Why are you advising against lists: is it because of readability, performance, or some other reason? This SO Q&A goes through some detailed and interesting points about this. I like this method, though. I'm thinking a filtering function may be the best way to avoid code repetition. (evilmandarine, Jul 4, 2023)
- It's not in the Pandas style, and (though it matters more for large input) vectorized operations will be faster. (Reinderien, Jul 4, 2023)