Create new columns in a DataFrame using functions and reposition the new columns

Question 1

I would like a review regarding the method I use to create the new columns and then reposition them in the correct place where they should be.

The new column called total_matched_vls should always be on the right side of the column called total_matched, which in the original CSV is at position 10

The new column called market_matched_vls should always be on the right side of the column called market_matched, which in the original CSV is at position 12

import pandas as pd
def calc_back(df):
 if (df['result'] == 'WINNER'):
 return (df['odds']-1)*0.935
 elif (df['result'] == 'LOSER'):
 return -1
 else:
 return ''
def calc_lay(df):
 if (df['result'] == 'LOSER'):
 return (1/(df['odds']-1))*0.935
 elif (df['result'] == 'WINNER'):
 return -1
 else:
 return ''
def total_matched(df):
 if (df['total_matched'] < 1000):
 return 'centenas'
 elif (df['total_matched'] >= 1000) and (df['total_matched'] < 100000):
 return 'milhares'
 elif (df['total_matched'] >= 100000) and (df['total_matched'] < 1000000):
 return 'centenas de milhares'
 elif (df['total_matched'] >= 1000000):
 return 'milhões'
 else:
 return ''
def market_matched(df):
 if (df['market_matched'] < 1000):
 return 'centenas'
 elif (df['market_matched'] >= 1000) and (df['market_matched'] < 100000):
 return 'milhares'
 elif (df['market_matched'] >= 100000) and (df['market_matched'] < 1000000):
 return 'centenas de milhares'
 elif (df['market_matched'] >= 1000000):
 return 'milhões'
 else:
 return ''
filials = [
 [
 r'C:\Users\Computador\Desktop\csv\history.csv',
 'history_p&l.csv'
 ],
 [
 r'C:\Users\Computador\Desktop\csv\results.csv',
 'results_p&l.csv'
 ]
]
for a,b in filials:
 df = pd.read_csv(a, dtype={'market_id': str})
 df['back'] = df.apply(calc_back, axis=1)
 df['lay'] = df.apply(calc_lay, axis=1)
 df['total_matched_vls'] = df.apply(total_matched, axis=1)
 df['market_matched_vls'] = df.apply(market_matched, axis=1)
 col = df.pop("total_matched_vls")
 df.insert(11, col.name, col)
 col_2 = df.pop("market_matched_vls")
 df.insert(14, col_2.name, col_2)
 df.to_csv(b, index=False)

A brief example of the original CSV for ease of testing:

open_local_data,country,competition,match_id,match_name,market_id,market_name,runner_id,runner_name,status,total_matched,odds,market_matched,percentage,above_odds,result
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684262,Uruguay Montevideo FC v Progreso,1.202440748,Match Odds,11076801,Uruguay Montevideo FC,OPEN,197.2,2.88,448.52,43.96682422188531,9.24460199966309,WINNER
2022年08月24日 15:00:00,AT,Austrian Matches,31685733,SV Gerasdorf Stammersdorf v Dinamo Helfort,1.202453470,Match Odds,10299781,SV Gerasdorf Stammersdorf,OPEN,15.99,3.05,27.12,58.96017699115043,26.17329174524879,LOSER
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684267,Villa Espanola v Sud America,1.202440560,Match Odds,58805,The Draw,OPEN,458.35,3.5,651.11,70.39517132281796,41.82374275138939,LOSER
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684266,Miramar Misiones v Central Espanol,1.202440654,Match Odds,5300627,Miramar Misiones,OPEN,642.05,2.1,1075.66,59.68893516538684,12.069887546339224,LOSER

Question 2

A typical antipattern is using apply when better alternatives exist (see e.g., this question on StackOverflow). Vectorized operations can be several hundred folds faster than using apply, which processes the dataframe in a row-based fashion.

Looking at your functions, especially total_matched and market_matched, they seem to be the same function but applied on a different column. This is bad practice in general as you would have to maintain the same code in two places. Intuitively, a better idea is to write a function that accepts a column name to perform the same computation on different columns.

You don't say why it is important that columns appear in a specific order. Perhaps this for a human reader only, but it's probably not a good idea to trust on the order of columns. However, if we insist on it and especially if your dataset is large, it feels wasteful to pop a column and then reinsert it. Instead, you can just insert the columns directly into the place where you want them.

I'm omitting a few things and making a few assumptions:

The result column will only either contain WINNER or LOSER. It's easy to handle a DRAW too, but I will let you do that.
It's good practice to explicitly set the type of each column of your dataframe. I don't know or understand your data, so I will not do this. You are already saying that market_id is a string, but keep going and note that categoricals are very useful in this and typically result in smaller dataframes (in terms of memory) and faster processing.

To summarize, we could write your snippet as follows:

# -*- coding: utf-8 -*-
from io import StringIO
import pandas as pd
import numpy as np
def describe_matched(df: pd.DataFrame, inspected: str) -> np.ndarray:
 return np.select(
 [
 df["total_matched"] < 100,
 df[inspected].between(1_000, 100_000, inclusive="left"),
 df[inspected].between(100_000, 1_000_000, inclusive="left"),
 ],
 ["centenas", "milhares", "centenas de milhares"],
 default="milhões",
 )
example_data = StringIO(
 """open_local_data,country,competition,match_id,match_name,market_id,market_name,runner_id,runner_name,status,total_matched,odds,market_matched,percentage,above_odds,result
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684262,Uruguay Montevideo FC v Progreso,1.202440748,Match Odds,11076801,Uruguay Montevideo FC,OPEN,197.2,2.88,448.52,43.96682422188531,9.24460199966309,WINNER
2022年08月24日 15:00:00,AT,Austrian Matches,31685733,SV Gerasdorf Stammersdorf v Dinamo Helfort,1.202453470,Match Odds,10299781,SV Gerasdorf Stammersdorf,OPEN,15.99,3.05,27.12,58.96017699115043,26.17329174524879,LOSER
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684267,Villa Espanola v Sud America,1.202440560,Match Odds,58805,The Draw,OPEN,458.35,3.5,651.11,70.39517132281796,41.82374275138939,LOSER
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684266,Miramar Misiones v Central Espanol,1.202440654,Match Odds,5300627,Miramar Misiones,OPEN,642.05,2.1,1075.66,59.68893516538684,12.069887546339224,LOSER
"""
)
df = pd.read_csv(example_data, dtype={"market_id": str})
df = df.assign(
 back=np.where(df["result"] == "WINNER", (df["odds"] - 1) * 0.935, -1),
 lay=np.where(df["result"] == "LOSER", (1 / (df["odds"] - 1)) * 0.935, -1),
)
df.insert(
 loc=11, 
 column="total_matched_vls", 
 value=describe_matched(df, "total_matched")
)
df.insert(
 loc=14, 
 column="market_matched_vls", 
 value=describe_matched(df, "market_matched")
)

Juho Juho 3,63921 silver badges18 bronze badges · Accepted Answer · 2022-12-09 18:32:52Z

A typical antipattern is using apply when better alternatives exist (see e.g., this question on StackOverflow). Vectorized operations can be several hundred folds faster than using apply, which processes the dataframe in a row-based fashion.

Looking at your functions, especially total_matched and market_matched, they seem to be the same function but applied on a different column. This is bad practice in general as you would have to maintain the same code in two places. Intuitively, a better idea is to write a function that accepts a column name to perform the same computation on different columns.

You don't say why it is important that columns appear in a specific order. Perhaps this for a human reader only, but it's probably not a good idea to trust on the order of columns. However, if we insist on it and especially if your dataset is large, it feels wasteful to pop a column and then reinsert it. Instead, you can just insert the columns directly into the place where you want them.

I'm omitting a few things and making a few assumptions:

The result column will only either contain WINNER or LOSER. It's easy to handle a DRAW too, but I will let you do that.
It's good practice to explicitly set the type of each column of your dataframe. I don't know or understand your data, so I will not do this. You are already saying that market_id is a string, but keep going and note that categoricals are very useful in this and typically result in smaller dataframes (in terms of memory) and faster processing.

To summarize, we could write your snippet as follows:

# -*- coding: utf-8 -*-
from io import StringIO
import pandas as pd
import numpy as np
def describe_matched(df: pd.DataFrame, inspected: str) -> np.ndarray:
 return np.select(
 [
 df["total_matched"] < 100,
 df[inspected].between(1_000, 100_000, inclusive="left"),
 df[inspected].between(100_000, 1_000_000, inclusive="left"),
 ],
 ["centenas", "milhares", "centenas de milhares"],
 default="milhões",
 )
example_data = StringIO(
 """open_local_data,country,competition,match_id,match_name,market_id,market_name,runner_id,runner_name,status,total_matched,odds,market_matched,percentage,above_odds,result
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684262,Uruguay Montevideo FC v Progreso,1.202440748,Match Odds,11076801,Uruguay Montevideo FC,OPEN,197.2,2.88,448.52,43.96682422188531,9.24460199966309,WINNER
2022年08月24日 15:00:00,AT,Austrian Matches,31685733,SV Gerasdorf Stammersdorf v Dinamo Helfort,1.202453470,Match Odds,10299781,SV Gerasdorf Stammersdorf,OPEN,15.99,3.05,27.12,58.96017699115043,26.17329174524879,LOSER
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684267,Villa Espanola v Sud America,1.202440560,Match Odds,58805,The Draw,OPEN,458.35,3.5,651.11,70.39517132281796,41.82374275138939,LOSER
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684266,Miramar Misiones v Central Espanol,1.202440654,Match Odds,5300627,Miramar Misiones,OPEN,642.05,2.1,1075.66,59.68893516538684,12.069887546339224,LOSER
"""
)
df = pd.read_csv(example_data, dtype={"market_id": str})
df = df.assign(
 back=np.where(df["result"] == "WINNER", (df["odds"] - 1) * 0.935, -1),
 lay=np.where(df["result"] == "LOSER", (1 / (df["odds"] - 1)) * 0.935, -1),
)
df.insert(
 loc=11, 
 column="total_matched_vls", 
 value=describe_matched(df, "total_matched")
)
df.insert(
 loc=14, 
 column="market_matched_vls", 
 value=describe_matched(df, "market_matched")
)

Stack Exchange Network

Create new columns in a DataFrame using functions and reposition the new columns

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Create new columns in a DataFrame using functions and reposition the new columns

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions