I would like a review regarding the method I use to create the new columns and then reposition them in the correct place where they should be.
The new column called total_matched_vls
should always be on the right side of the column called total_matched
, which in the original CSV is at position 10
The new column called market_matched_vls
should always be on the right side of the column called market_matched
, which in the original CSV is at position 12
import pandas as pd
def calc_back(df):
if (df['result'] == 'WINNER'):
return (df['odds']-1)*0.935
elif (df['result'] == 'LOSER'):
return -1
else:
return ''
def calc_lay(df):
if (df['result'] == 'LOSER'):
return (1/(df['odds']-1))*0.935
elif (df['result'] == 'WINNER'):
return -1
else:
return ''
def total_matched(df):
if (df['total_matched'] < 1000):
return 'centenas'
elif (df['total_matched'] >= 1000) and (df['total_matched'] < 100000):
return 'milhares'
elif (df['total_matched'] >= 100000) and (df['total_matched'] < 1000000):
return 'centenas de milhares'
elif (df['total_matched'] >= 1000000):
return 'milhões'
else:
return ''
def market_matched(df):
if (df['market_matched'] < 1000):
return 'centenas'
elif (df['market_matched'] >= 1000) and (df['market_matched'] < 100000):
return 'milhares'
elif (df['market_matched'] >= 100000) and (df['market_matched'] < 1000000):
return 'centenas de milhares'
elif (df['market_matched'] >= 1000000):
return 'milhões'
else:
return ''
filials = [
[
r'C:\Users\Computador\Desktop\csv\history.csv',
'history_p&l.csv'
],
[
r'C:\Users\Computador\Desktop\csv\results.csv',
'results_p&l.csv'
]
]
for a,b in filials:
df = pd.read_csv(a, dtype={'market_id': str})
df['back'] = df.apply(calc_back, axis=1)
df['lay'] = df.apply(calc_lay, axis=1)
df['total_matched_vls'] = df.apply(total_matched, axis=1)
df['market_matched_vls'] = df.apply(market_matched, axis=1)
col = df.pop("total_matched_vls")
df.insert(11, col.name, col)
col_2 = df.pop("market_matched_vls")
df.insert(14, col_2.name, col_2)
df.to_csv(b, index=False)
A brief example of the original CSV for ease of testing:
open_local_data,country,competition,match_id,match_name,market_id,market_name,runner_id,runner_name,status,total_matched,odds,market_matched,percentage,above_odds,result
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684262,Uruguay Montevideo FC v Progreso,1.202440748,Match Odds,11076801,Uruguay Montevideo FC,OPEN,197.2,2.88,448.52,43.96682422188531,9.24460199966309,WINNER
2022年08月24日 15:00:00,AT,Austrian Matches,31685733,SV Gerasdorf Stammersdorf v Dinamo Helfort,1.202453470,Match Odds,10299781,SV Gerasdorf Stammersdorf,OPEN,15.99,3.05,27.12,58.96017699115043,26.17329174524879,LOSER
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684267,Villa Espanola v Sud America,1.202440560,Match Odds,58805,The Draw,OPEN,458.35,3.5,651.11,70.39517132281796,41.82374275138939,LOSER
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684266,Miramar Misiones v Central Espanol,1.202440654,Match Odds,5300627,Miramar Misiones,OPEN,642.05,2.1,1075.66,59.68893516538684,12.069887546339224,LOSER
1 Answer 1
A typical antipattern is using apply when better alternatives exist (see e.g., this question on StackOverflow). Vectorized operations can be several hundred folds faster than using apply, which processes the dataframe in a row-based fashion.
Looking at your functions, especially total_matched
and market_matched
, they seem to be the same function but applied on a different column. This is bad practice in general as you would have to maintain the same code in two places. Intuitively, a better idea is to write a function that accepts a column name to perform the same computation on different columns.
You don't say why it is important that columns appear in a specific order. Perhaps this for a human reader only, but it's probably not a good idea to trust on the order of columns. However, if we insist on it and especially if your dataset is large, it feels wasteful to pop a column and then reinsert it. Instead, you can just insert the columns directly into the place where you want them.
I'm omitting a few things and making a few assumptions:
- The
result
column will only either containWINNER
orLOSER
. It's easy to handle aDRAW
too, but I will let you do that. - It's good practice to explicitly set the type of each column of your dataframe. I don't know or understand your data, so I will not do this. You are already saying that
market_id
is a string, but keep going and note that categoricals are very useful in this and typically result in smaller dataframes (in terms of memory) and faster processing.
To summarize, we could write your snippet as follows:
# -*- coding: utf-8 -*-
from io import StringIO
import pandas as pd
import numpy as np
def describe_matched(df: pd.DataFrame, inspected: str) -> np.ndarray:
return np.select(
[
df["total_matched"] < 100,
df[inspected].between(1_000, 100_000, inclusive="left"),
df[inspected].between(100_000, 1_000_000, inclusive="left"),
],
["centenas", "milhares", "centenas de milhares"],
default="milhões",
)
example_data = StringIO(
"""open_local_data,country,competition,match_id,match_name,market_id,market_name,runner_id,runner_name,status,total_matched,odds,market_matched,percentage,above_odds,result
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684262,Uruguay Montevideo FC v Progreso,1.202440748,Match Odds,11076801,Uruguay Montevideo FC,OPEN,197.2,2.88,448.52,43.96682422188531,9.24460199966309,WINNER
2022年08月24日 15:00:00,AT,Austrian Matches,31685733,SV Gerasdorf Stammersdorf v Dinamo Helfort,1.202453470,Match Odds,10299781,SV Gerasdorf Stammersdorf,OPEN,15.99,3.05,27.12,58.96017699115043,26.17329174524879,LOSER
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684267,Villa Espanola v Sud America,1.202440560,Match Odds,58805,The Draw,OPEN,458.35,3.5,651.11,70.39517132281796,41.82374275138939,LOSER
2022年08月24日 15:00:00,UY,Uruguayan Segunda Division,31684266,Miramar Misiones v Central Espanol,1.202440654,Match Odds,5300627,Miramar Misiones,OPEN,642.05,2.1,1075.66,59.68893516538684,12.069887546339224,LOSER
"""
)
df = pd.read_csv(example_data, dtype={"market_id": str})
df = df.assign(
back=np.where(df["result"] == "WINNER", (df["odds"] - 1) * 0.935, -1),
lay=np.where(df["result"] == "LOSER", (1 / (df["odds"] - 1)) * 0.935, -1),
)
df.insert(
loc=11,
column="total_matched_vls",
value=describe_matched(df, "total_matched")
)
df.insert(
loc=14,
column="market_matched_vls",
value=describe_matched(df, "market_matched")
)