This code cleans up a big dataset into a very clean and a flat file that can be further used downstream like visualization.
I am expecting to improve the code to essentially run it faster and cleanup the code to avoid any inefficient functions that should not take further resources than it should.
The problem is that I have written the code as and when I have experienced any functions as needed to be added on the fly and now, I am not very happy with it.
I am seeking help from this community on the best code that can be written which performs the same work that this code does. I would also be learning on the best practices while working with pandas.
There is so much on the internet on Pandas that every function and code seems to be a clean code but when i look at it collectively, its so bad.
My dataframe:
df.head()
Unnamed: 0 game score home_odds draw_odds away_odds country league datetime
0 0 Sport Recife - Imperatriz 2:2 1.36 4.31 7.66 Brazil Copa do Nordeste 2020 2020年02月07日 00:00:00
1 1 ABC - America RN 2:1 2.62 3.30 2.48 Brazil Copa do Nordeste 2020 2020年02月02日 22:00:00
2 2 Frei Paulistano - Nautico 0:2 5.19 3.58 1.62 Brazil Copa do Nordeste 2020 2020年02月02日 00:00:00
3 3 Botafogo PB - Confianca 1:1 2.06 3.16 3.5 Brazil Copa do Nordeste 2020 2020年02月02日 22:00:00
4 4 Fortaleza - Ceara 1:1 2.19 2.98 3.38 Brazil Copa do Nordeste 2020 2020年02月02日 22:00:00
df.describe()
Unnamed: 0
count 1.115767e+06
mean 5.574871e+05
std 3.220941e+05
min 0.000000e+00
25% 2.785455e+05
50% 5.574870e+05
75% 8.364285e+05
max 1.115370e+06
I use this code to cleanup the dataframe:
from datetime import datetime
import pandas as pd
import numpy as np
from tabulate import tabulate
start = datetime.now()
df = pd.read_csv()
#This part essentially splits columns and harmonises the entie dataframe
# This code harmonises the game column e.g. "Talleres (R.E) - Defensores Unidos" should be as "Talleres - "Defensores Unidos" removing any brackets and its values and removes any date values in the column
df['game'] = df['game'].astype(str).str.replace('(\(\w+\))', '', regex=True)
df['game'] = df['game'].astype(str).str.replace('(\s\d+\S\d+)$', '', regex=True)
# This code removes any numerical values in the league column. Many times the league column has years concatenated which is what we don't want e.g "Brazil Copa do Nordeste 2020" should be "Brazil Copa do Nordeste"
df['league'] = df['league'].astype(str).str.replace('(\s\d+\S\d+)$', '', regex=True)
# This part splits the game column into two competing teams i.e. home team and away team by the delimiter "-"
df[['home_team', 'away_team']] = df['game'].str.split(' - ', expand=True, n=1)
# This part splits the score column into two competing teams i.e. home score and away score by the delimiter ":"
df[['home_score', 'away_score']] = df['score'].str.split(':', expand=True)
# This code removes any non numerical values in the home score and away score columns. e.ge scores can't have "aet", "canc", ".", etc. We dont want anything that cannot be identified as filetype:int in pandas
df['away_score'] = df['away_score'].astype(str).str.replace('[a-zA-Z\s\D]', '', regex=True)
df['home_score'] = df['home_score'].astype(str).str.replace('[a-zA-Z\s\D]', '', regex=True)
df = df[df.home_score != "."]
df = df[df.home_score != ".."]
df = df[df.home_score != "."]
df = df[df.home_odds != "-"]
df = df[df.draw_odds != "-"]
df = df[df.away_odds != "-"]
m = df[['home_odds', 'draw_odds', 'away_odds']].astype(str).agg(lambda x: x.str.count('/'), 1).ne(0).all(1)
n = df[['home_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
o = df[['away_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
# I get UserWarning: Boolean Series key will be reindexed to match DataFrame index. at these parts
df = df[~n]
df = df[~m]
df = df[~n]
df = df[~o]
df = df[df.home_score != '']
df = df[df.away_score != '']
df = df.dropna()
# We now would be keeping only the columns we want
df = df.loc[:, df.columns.intersection(
['datetime', 'country', 'league', 'home_team', 'away_team', 'home_odds', 'draw_odds', 'away_odds', 'home_score',
'away_score'])]
#We are making sure that the columns are as per data types that we would want pandas to identify. Pandas does not seem to do a very good job identifying data types correctly.
colt = {
'country': str,
'league': str,
'home_team': str,
'away_team': str,
'home_odds': float,
'draw_odds': float,
'away_odds': float,
'home_score': int,
'away_score': int
}
df = df.astype(colt)
# This part removes any leading and trailing whitespaces in the string columns
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
# Cleaning data where odds are greater than 100 and less than -1 and dropping duplicates
df = df[df['home_odds'] <= 100]
df = df[df['draw_odds'] <= 100]
df = df[df['away_odds'] <= 100]
df = df.drop_duplicates(['datetime', 'home_score', 'away_score', 'country', 'league', 'home_team', 'away_team'],
keep='last')
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df.to_csv()
time_taken = end - start
print('Time taken to complete: ', time_taken)
df.head()
home_odds draw_odds away_odds country league datetime home_team away_team home_score away_score
-- ----------- ----------- ----------- --------- ---------------- ------------------- --------------- ----------- ------------ ------------
0 1.36 4.31 7.66 Brazil Copa do Nordeste 2020年02月07日 00:00:00 Sport Recife Imperatriz 2 2
1 2.62 3.3 2.48 Brazil Copa do Nordeste 2020年02月02日 22:00:00 ABC America RN 2 1
2 5.19 3.58 1.62 Brazil Copa do Nordeste 2020年02月02日 00:00:00 Frei Paulistano Nautico 0 2
3 2.06 3.16 3.5 Brazil Copa do Nordeste 2020年02月02日 22:00:00 Botafogo PB Confianca 1 1
4 2.19 2.98 3.38 Brazil Copa do Nordeste 2020年02月02日 22:00:00 Fortaleza Ceara 1 1
It takes me 9 minutes to run this code with warnings:
sys:1: DtypeWarning: Columns (3,4,5) have mixed types.Specify dtype option on import or set low_memory=False.
G:/My Drive/Odds/Code/5. Creating updated training data.py:33: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df = df[~n]
G:/My Drive/Odds/Code/5. Creating updated training data.py:34: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df = df[~o]
How can I cleanup this code and run it faster using pandas?
Also, I have a GPU so I can exploit cudf however I am using the python 3.7 environment and cudf does not seem to support it and am not familiar with conda yet.
1 Answer 1
Any time that you have a section labelled This part
, that's a good indication that you should have a function. Code like this should not exist in a flat file with no structure. Among many other reasons,
- once one of your subroutines finishes, any intermediate variables that go out of scope like
m
,n
etc. will be eligible for garbage collection - performing profiling will be much more possible - please do this; otherwise you're effectively flying blind
- it will be more legible and self-documenting
- it will be possible for your IDE to perform folding, etc.
Until you have profiling data it's effectively impossible to say what part of your code is the slowest. Some random guesses:
- where you do
astype(str)
, that's likely not necessary - when you run successive predicate selection like
df = df[df.home_score != "."]
df = df[df.home_score != ".."]
you should instead do only one []
indexing operation, and combine your predicates logically using &
- See if there is a performance difference if you prefer in-place operations, including dropping columns and NA, which are both possible in-place
- Does your
colt
actually do anything? Is there any evidence that your column types are non-uniform? Could you, instead of doing a blanket type coerce, target specific columns that need conversion? - Avoid
applymap
. There is a vectorisedstrip
available. You do this twice - why?
Non-performance topics:
- Don't name variables
m
,n
,o
- these are opaque and non-descriptive - Rather than writing "this part harmonises ...", you can just write "harmonise ..." - i.e. use the imperative mood which is common in function documentation.
A first stab, addressing some of the above, looks like
import pandas as pd
from datetime import datetime
def harmonize_game(df: pd.DataFrame) -> pd.DataFrame:
"""
harmonise the game column e.g. "Talleres (R.E) - Defensores Unidos" should
be split as "Talleres - "Defensores Unidos" and removes any date values in
the column
"""
df['game'] = df['game'].astype(str).str.replace(r'(\(\w+\))', '', regex=True)
df['game'] = df['game'].astype(str).str.replace(r'(\s\d+\S\d+)$', '', regex=True)
# Remove any numerical values in the league column. Many times the league
# column has years concatenated which is what we don't want e.g "Brazil
# Copa do Nordeste 2020" should be "Brazil Copa do Nordeste"
df['league'] = df['league'].astype(str).str.replace(r'(\s\d+\S\d+)$', '', regex=True)
# Split the game column into tow competing teams i.e. home team and away team by the delimiter "-"
df[['home_team', 'away_team']] = df['game'].str.split(' - ', expand=True, n=1)
# Split the game column into tow competing teams i.e. home team and away team by the delimiter ":"
df[['home_score', 'away_score']] = df['score'].str.split(':', expand=True)
return df
def numerical_scores(df: pd.DataFrame) -> pd.DataFrame:
"""
Remove any non numerical values in the home score and away score columns.
e.ge scores can have "aet", "canc", ".", etc. We don't want anything that
cannot be identified as filetype:int in pandas
"""
df['away_score'] = df['away_score'].astype(str).str.replace(r'[a-zA-Z\s\D]', '', regex=True)
df['home_score'] = df['home_score'].astype(str).str.replace(r'[a-zA-Z\s\D]', '', regex=True)
df = df[df.home_score != "."]
df = df[df.home_score != ".."]
df = df[df.home_score != "."]
df = df[df.home_odds != "-"]
df = df[df.draw_odds != "-"]
df = df[df.away_odds != "-"]
m = df[['home_odds', 'draw_odds', 'away_odds']].astype(str).agg(lambda x: x.str.count('/'), 1).ne(0).all(1)
n = df[['home_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
o = df[['away_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
df = df[~m]
df = df[~n]
df = df[~o]
df = df[df.home_score != '']
df = df[df.away_score != '']
df = df.dropna()
return df
def coerce_columns(df: pd.DataFrame) -> pd.DataFrame:
"""Keep only the columns we want"""
df = df.loc[
:, df.columns.intersection([
'datetime', 'country', 'league', 'home_team', 'away_team',
'home_odds', 'draw_odds', 'away_odds', 'home_score',
'away_score',
])
]
# Make sure that the columns are as per data types that we would want pandas
# to identify. Pandas does not seem to do a very good job identifying data
# types correctly.
colt = {
'country': str,
'league': str,
'home_team': str,
'away_team': str,
'home_odds': float,
'draw_odds': float,
'away_odds': float,
'home_score': int,
'away_score': int
}
df = df.astype(colt)
return df
def strip_strings(df: pd.DataFrame) -> pd.DataFrame:
"""remove any leading and trailing whitespaces in the string columns"""
return df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
def clean_odds(df: pd.DataFrame) -> pd.DataFrame:
"""Cleaning data where odds are greater than 100 and less than -1 and dropping duplicates"""
df = df[df['home_odds'] <= 100]
df = df[df['draw_odds'] <= 100]
df = df[df['away_odds'] <= 100]
df = df.drop_duplicates(
['datetime', 'home_score', 'away_score', 'country', 'league', 'home_team', 'away_team'],
keep='last',
)
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
return df
def clean(df: pd.DataFrame) -> pd.DataFrame:
df = harmonize_game(df)
df = numerical_scores(df)
df = coerce_columns(df)
df = strip_strings(df)
df = clean_odds(df)
return df
def test() -> None:
df = pd.DataFrame(
(
(0, 'Sport Recife - Imperatriz', '2:2', 1.36, 4.31, 7.66, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-07 00:00:00', '%Y-%m-%d %H:%M:%S')),
(1, 'ABC - America RN', '2:1', 2.62, 3.30, 2.48, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
(2, 'Frei Paulistano - Nautico', '0:2', 5.19, 3.58, 1.62, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 00:00:00', '%Y-%m-%d %H:%M:%S')),
(3, 'Botafogo PB - Confianca', '1:1', 2.06, 3.16, 3.50, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
(4, 'Fortaleza - Ceara', '1:1', 2.19, 2.98, 3.38, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
),
columns=(
'Unnamed: 0', 'game',
'score',
'home_odds',
'draw_odds',
'away_odds', 'country', 'league', 'datetime',
),
)
clean(df)
if __name__ == '__main__':
test()
-
\$\begingroup\$ Thank you. The code is much cleaner and I get that building functions is a better way to approach this. However, I am still getting the same time as my earlier code and I still get the Boolean Reindexing warning. While I understand, the time cannot be improved however, can I resolve the reindexing warning I get at
df = df[~n]
anddf = df[~o]
? \$\endgroup\$leonardo– leonardo2021年11月07日 00:39:52 +00:00Commented Nov 7, 2021 at 0:39 -
\$\begingroup\$ It's not strictly guaranteed that the time can't be improved. Again, you need to profile your code. \$\endgroup\$Reinderien– Reinderien2021年11月07日 04:06:03 +00:00Commented Nov 7, 2021 at 4:06
-
\$\begingroup\$ Can I ask for a solution for the same code but using
cudf
on this platform? How can I ask it? \$\endgroup\$leonardo– leonardo2021年11月11日 09:49:24 +00:00Commented Nov 11, 2021 at 9:49 -
\$\begingroup\$ How do I save the resulting dataframe
to_csv
? \$\endgroup\$leonardo– leonardo2021年11月11日 13:21:54 +00:00Commented Nov 11, 2021 at 13:21
as "Talleres - "Defensores Unidos"
is not well-formed.) \$\endgroup\$problem is [I wrote] code as and when [needed]
One problem may be that you did not write and keep a specification of what to achieve. \$\endgroup\$