1
\$\begingroup\$

In Python using Pandas, I am splitting a dataset column into 4 lists based on the suffix of the values. For the 3 suffixes I am using a list comprehension then for the 4th one, a set operation that substracts the 3 lists from the original list with all values:

import pandas as pd
df = pd.DataFrame({
 "alcohol_by_volume": [],
 "barcode": [],
 "calcium_per_hundred": [],
 "calcium_unit": [],
 "carbohydrates_per_hundred": [],
 "carbohydrates_per_portion": [],
 "carbohydrates_unit": [],
 "cholesterol_per_hundred": [],
 "cholesterol_unit": [],
 "copper_cu_per_hundred": [],
 "copper_cu_unit": [],
 "country": [],
 "created_at": [],
 "energy_kcal_per_hundred": [],
 "energy_kcal_per_portion": [],
 "energy_kcal_unit": [],
 "energy_per_hundred": [],
 "energy_per_portion": [],
 "energy_unit": [],
 "fat_per_hundred": [],
 "fat_per_portion": [],
 "fat_unit": [],
 "fatty_acids_total_saturated_per_hundred": [],
 "fatty_acids_total_saturated_unit": [],
 "fatty_acids_total_trans_per_hundred": [],
 "fatty_acids_total_trans_unit": [],
 "fiber_insoluble_per_hundred": [],
 "fiber_insoluble_unit": [],
 "fiber_per_hundred": [],
 "fiber_per_portion": [],
 "fiber_soluble_per_hundred": [],
 "fiber_soluble_unit": [],
 "fiber_unit": [],
 "folate_total_per_hundred": [],
 "folate_total_unit": [],
 "folic_acid_per_hundred": [],
 "folic_acid_unit": [],
 "hundred_unit": [],
 "id": [],
 "ingredients_en": [],
 "iron_per_hundred": [],
 "iron_unit": [],
 "magnesium_per_hundred": [],
 "magnesium_unit": [],
 "manganese_mn_per_hundred": []
})
colnames_all = df.columns.to_list()
colnames_unit = [n for n in colnames_all if n.endswith("_unit")]
colnames_per_hundred = [n for n in colnames_all if n.endswith("_per_hundred")]
colnames_per_portion = [n for n in colnames_all if n.endswith("_per_portion")]
colnames_other = list(
 set(colnames_all) - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)

Expected result (2 examples, other 2 lists are similar to 1st one):

colnames_unit:
['calcium_unit',
 'carbohydrates_unit',
 'cholesterol_unit',
 'copper_cu_unit',
 'energy_kcal_unit',
 'energy_unit',
 'fat_unit',
 'fatty_acids_total_saturated_unit',
 'fatty_acids_total_trans_unit',
 'fiber_insoluble_unit',
 'fiber_soluble_unit',
 'fiber_unit',
 'folate_total_unit',
 'folic_acid_unit',
 'hundred_unit',
 'iron_unit',
 'magnesium_unit']
 colnames_other:
 ['ingredients_en',
 'country',
 'id',
 'created_at',
 'barcode',
 'alcohol_by_volume']

However this does not look like the best way to do this. Is there a "better" way, i.e. shorter and/or more elegant/idiomatic?

asked Jul 3, 2023 at 8:06
\$\endgroup\$
2
  • 1
    \$\begingroup\$ It's hard to review this small fragment in isolation. It would be better to present a complete function, with its unit tests (or at least some sample input to illustrate it). \$\endgroup\$ Commented Jul 3, 2023 at 9:40
  • 1
    \$\begingroup\$ @TobySpeight Added full code for repro. I did not define a function for this, maybe this is already part of the better way to do it...? The code repetition and the set subtraction don't look good to me. \$\endgroup\$ Commented Jul 3, 2023 at 17:42

2 Answers 2

1
\$\begingroup\$
colnames_all = df.columns.to_list()

I don't see a clear need for this. We could simply refer to df.columns instead.


list(
 set(colnames_all) - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)

That doesn't seem so bad, to me. Certainly the intent is clear.


colnames_unit = [n for n in colnames_all if n.endswith("_unit")]

Consider rephrasing this as

colnames_unit = [n for n in colnames_all if re.search(r'_unit$', n)]

That lets us generalize in this way:

colnames_measured = [n for n in df.columns if re.search(r'_(unit|per_hundred|per_portion)$', n)]

To find the inverse:

colnames_other = [n for n in df.columns if not re.search(r'_(unit|per_hundred|per_portion)$', n)]
answered Jul 3, 2023 at 19:24
\$\endgroup\$
1
  • \$\begingroup\$ Maybe it is not clear but I need a different list by suffix, so colnames_measured does not work for me. Also as it is a constant known suffix, endswith() seems ok, this is not the issue. The question is should this be a 10'000 list with say a collection of 100 suffixes, what would be the best way to address this. Thank you for your input though. \$\endgroup\$ Commented Jul 3, 2023 at 20:03
1
\$\begingroup\$

Don't use comprehensions. Don't use lists. Don't use sets. Use Pandas string vectorisation:

colnames_all = df.columns
is_unit = colnames_all.str.endswith("_unit")
is_hundred = colnames_all.str.endswith("_per_hundred")
is_portion = colnames_all.str.endswith("_per_portion")
colnames_unit = colnames_all[is_unit]
colnames_per_hundred = colnames_all[is_hundred]
colnames_per_portion = colnames_all[is_portion]
colnames_other = colnames_all[~(is_unit | is_hundred | is_portion)]
print(colnames_other)
Index(['alcohol_by_volume', 'barcode', 'country', 'created_at', 'id',
 'ingredients_en'],
 dtype='object')
answered Jul 3, 2023 at 22:29
\$\endgroup\$
2
  • \$\begingroup\$ Why are you advising against lists, is it because of readability, performance, or other reason...? This SO Q&A goes through some detailed and interesting points about this. I like this method, though. I'm thinking a filtering function may be the best way to avoid code repetition. \$\endgroup\$ Commented Jul 4, 2023 at 8:42
  • \$\begingroup\$ It's not in the Pandas style, and (though it matters more for large input) vectorized operations will be faster \$\endgroup\$ Commented Jul 4, 2023 at 12:53

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.