Create new columns based on existing columns

Question 1

The code:

excluded_columns = ['sort_id', 'weekday', 'sort']
for pct_col in sort_df.columns:
 if pct_col in excluded_columns:
 continue
 else:
 sort_df[pct_col + '_pct'] = (sort_df[pct_col] / sort_df['volume'] * 100)

There are no specific concerns I have with this code, it just looks off. I know pandas is quite liberal when it comes to how you choose to achieve the desired outcome, but I just get a funky smell from this code. How could it be more pythonic/pandonic?

Question 2

This code doesn't run. Your brackets are imbalanced. Also, it's probably too small of a snippet to review.

Question 3

Fixed missing bracket. Too small to review? I'm just asking for the method of operating on the dataframe to be reviewed. Should I post the head of the dataframe?

Question 4

The head of the dataframe would help, but really you need to show your whole program for this to be reviewable in context.

Question 5

Posting here is perfectly fine! And it may be difficult to find a more appropriate place on StackExchange, quite honestly. However, this code that you've shown may make sense in some call contexts and far less sense in others. For you to get the most value out of a review, you really should show more context.

Question 6

Some possible input and the expected corresponding output would help reviewers (or better, some of your actual unit tests of the code).

Question 7

Don't loop. Pandas is designed to be vectorised, which means that 99% of the time when you imagine an operation to require a loop, that loop should be baked into some Numpy or Pandas call rather than you writing it out.

In your case, let's invent a raft of totally bogus data since you haven't shown anything real. We then exclude columns via a negated isin, vectorise-divide via div, and concatenate to a new dataframe:

import pandas as pd
from numpy.random import default_rng
rand = default_rng(seed=0)
sort_df = pd.DataFrame(
 rand.uniform(low=0, high=100, size=(100, 7)),
 columns=(
 'sort_id', 'weekday', 'sort', 'volume', 'bananas', 'pears', 'apples',
 )
)
excluded_cols = {'sort_id', 'weekday', 'sort', 'volume'}
to_divide = sort_df.loc[:, ~sort_df.columns.isin(excluded_cols)].add_suffix('_pct')
percents = to_divide.div(sort_df.volume/100, axis='rows')
sort_df = pd.concat((sort_df, percents), axis='columns')

It's likely that volume missing from excluded_cols is an oversight, since it doesn't make sense to divide it by itself and always get 100%.

Question 8

Sorry, I got busy and forgot to post context. This is the perfect answer, very enlightening. Thanks so much.

Question 9

The content of the loop is of the form

if condition:
 continue
else:
 action()

It would probably be easier to read as

if not condition:
 action()

excluded_columns = ['sort_id', 'weekday', 'sort']
for pct_col in sort_df.columns:
 if pct_col not in excluded_columns:
 sort_df[pct_col + '_pct'] = sort_df[pct_col] / sort_df['volume'] * 100

You might want to consider just looping over the set difference of sort_df.columns and excluded_columns, which would eliminate the if inside the loop. That's certainly valuable if that set of columns will be re-used later (which we don't know - that's why a review of such a short fragment of code can be problematic).

Question 10

Loops are not called for here.

Question 11

I've not used numpy in anger, so just addressing readability here. Your advice is better informed than mine!

Question 12

Not a problem; in that context what you've written is certainly useful.

Reinderien Reinderien 70.9k5 gold badges76 silver badges256 bronze badges · Accepted Answer · 2022-05-07 15:47:27Z

Don't loop. Pandas is designed to be vectorised, which means that 99% of the time when you imagine an operation to require a loop, that loop should be baked into some Numpy or Pandas call rather than you writing it out.

In your case, let's invent a raft of totally bogus data since you haven't shown anything real. We then exclude columns via a negated isin, vectorise-divide via div, and concatenate to a new dataframe:

import pandas as pd
from numpy.random import default_rng
rand = default_rng(seed=0)
sort_df = pd.DataFrame(
 rand.uniform(low=0, high=100, size=(100, 7)),
 columns=(
 'sort_id', 'weekday', 'sort', 'volume', 'bananas', 'pears', 'apples',
 )
)
excluded_cols = {'sort_id', 'weekday', 'sort', 'volume'}
to_divide = sort_df.loc[:, ~sort_df.columns.isin(excluded_cols)].add_suffix('_pct')
percents = to_divide.div(sort_df.volume/100, axis='rows')
sort_df = pd.concat((sort_df, percents), axis='columns')

It's likely that volume missing from excluded_cols is an oversight, since it doesn't make sense to divide it by itself and always get 100%.

Sorry, I got busy and forgot to post context. This is the perfect answer, very enlightening. Thanks so much.

Stack Exchange Network

Create new columns based on existing columns

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Create new columns based on existing columns

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions