
This code cleans a large dataset into a flat file that can be used downstream, e.g. for visualization.

I want to improve the code, essentially to run it faster, and to clean it up so that no inefficient function takes more resources than it should.

The problem is that I wrote the code piecemeal, adding functions on the fly as I needed them, and now I am not happy with it.

I am asking this community for help with the best way to write code that performs the same work. I would also like to learn best practices for working with pandas.

There is so much about pandas on the internet that every function and snippet looks clean on its own, but when I look at my code collectively, it's a mess.

My dataframe:

df.head()
 Unnamed: 0 game score home_odds draw_odds away_odds country league datetime
0 0 Sport Recife - Imperatriz 2:2 1.36 4.31 7.66 Brazil Copa do Nordeste 2020 2020年02月07日 00:00:00
1 1 ABC - America RN 2:1 2.62 3.30 2.48 Brazil Copa do Nordeste 2020 2020年02月02日 22:00:00
2 2 Frei Paulistano - Nautico 0:2 5.19 3.58 1.62 Brazil Copa do Nordeste 2020 2020年02月02日 00:00:00
3 3 Botafogo PB - Confianca 1:1 2.06 3.16 3.5 Brazil Copa do Nordeste 2020 2020年02月02日 22:00:00
4 4 Fortaleza - Ceara 1:1 2.19 2.98 3.38 Brazil Copa do Nordeste 2020 2020年02月02日 22:00:00
df.describe()
 Unnamed: 0
count 1.115767e+06
mean 5.574871e+05
std 3.220941e+05
min 0.000000e+00
25% 2.785455e+05
50% 5.574870e+05
75% 8.364285e+05
max 1.115370e+06

I use this code to cleanup the dataframe:

from datetime import datetime
import pandas as pd
import numpy as np
from tabulate import tabulate
start = datetime.now()
df = pd.read_csv()
#This part essentially splits columns and harmonises the entire dataframe
# This code harmonises the game column e.g. "Talleres (R.E) - Defensores Unidos" should be as "Talleres - "Defensores Unidos" removing any brackets and its values and removes any date values in the column
df['game'] = df['game'].astype(str).str.replace(r'(\(\w+\))', '', regex=True)
df['game'] = df['game'].astype(str).str.replace(r'(\s\d+\S\d+)$', '', regex=True)
# This code removes any numerical values in the league column. Many times the league column has years concatenated which is what we don't want e.g "Brazil Copa do Nordeste 2020" should be "Brazil Copa do Nordeste"
df['league'] = df['league'].astype(str).str.replace(r'(\s\d+\S\d+)$', '', regex=True)
# This part splits the game column into two competing teams i.e. home team and away team by the delimiter "-"
df[['home_team', 'away_team']] = df['game'].str.split(' - ', expand=True, n=1)
# This part splits the score column into two competing teams i.e. home score and away score by the delimiter ":"
df[['home_score', 'away_score']] = df['score'].str.split(':', expand=True)
# This code removes any non-numerical values in the home score and away score columns, e.g. scores can't have "aet", "canc", ".", etc. We don't want anything that cannot be parsed as int in pandas
df['away_score'] = df['away_score'].astype(str).str.replace(r'[a-zA-Z\s\D]', '', regex=True)
df['home_score'] = df['home_score'].astype(str).str.replace(r'[a-zA-Z\s\D]', '', regex=True)
df = df[df.home_score != "."]
df = df[df.home_score != ".."]
df = df[df.home_score != "."]
df = df[df.home_odds != "-"]
df = df[df.draw_odds != "-"]
df = df[df.away_odds != "-"]
m = df[['home_odds', 'draw_odds', 'away_odds']].astype(str).agg(lambda x: x.str.count('/'), 1).ne(0).all(1)
n = df[['home_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
o = df[['away_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
# I get UserWarning: Boolean Series key will be reindexed to match DataFrame index. at these parts
df = df[~n]
df = df[~m]
df = df[~n]
df = df[~o]
df = df[df.home_score != '']
df = df[df.away_score != '']
df = df.dropna()
# We now would be keeping only the columns we want
df = df.loc[:, df.columns.intersection(
 ['datetime', 'country', 'league', 'home_team', 'away_team', 'home_odds', 'draw_odds', 'away_odds', 'home_score',
 'away_score'])]
#We are making sure that the columns are as per data types that we would want pandas to identify. Pandas does not seem to do a very good job identifying data types correctly.
colt = {
 'country': str,
 'league': str,
 'home_team': str,
 'away_team': str,
 'home_odds': float,
 'draw_odds': float,
 'away_odds': float,
 'home_score': int,
 'away_score': int
}
df = df.astype(colt)
# This part removes any leading and trailing whitespaces in the string columns
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
# Cleaning data where odds are greater than 100 and less than -1 and dropping duplicates
df = df[df['home_odds'] <= 100]
df = df[df['draw_odds'] <= 100]
df = df[df['away_odds'] <= 100]
df = df.drop_duplicates(['datetime', 'home_score', 'away_score', 'country', 'league', 'home_team', 'away_team'],
 keep='last')
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df.to_csv()
end = datetime.now()
time_taken = end - start
print('Time taken to complete: ', time_taken)
df.head()
 home_odds draw_odds away_odds country league datetime home_team away_team home_score away_score
-- ----------- ----------- ----------- --------- ---------------- ------------------- --------------- ----------- ------------ ------------
 0 1.36 4.31 7.66 Brazil Copa do Nordeste 2020年02月07日 00:00:00 Sport Recife Imperatriz 2 2
 1 2.62 3.3 2.48 Brazil Copa do Nordeste 2020年02月02日 22:00:00 ABC America RN 2 1
 2 5.19 3.58 1.62 Brazil Copa do Nordeste 2020年02月02日 00:00:00 Frei Paulistano Nautico 0 2
 3 2.06 3.16 3.5 Brazil Copa do Nordeste 2020年02月02日 22:00:00 Botafogo PB Confianca 1 1
 4 2.19 2.98 3.38 Brazil Copa do Nordeste 2020年02月02日 22:00:00 Fortaleza Ceara 1 1

It takes 9 minutes to run this code, and it emits these warnings:

sys:1: DtypeWarning: Columns (3,4,5) have mixed types.Specify dtype option on import or set low_memory=False.
G:/My Drive/Odds/Code/5. Creating updated training data.py:33: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
 df = df[~n]
G:/My Drive/Odds/Code/5. Creating updated training data.py:34: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
 df = df[~o]

How can I clean up this code and make it run faster using pandas?

Also, I have a GPU, so I could exploit cudf; however, I am on a Python 3.7 environment, which cudf does not seem to support, and I am not familiar with conda yet.

asked Nov 5, 2021 at 7:23
  • Welcome to Code Review! The current question title, which states your concerns about the code, is too general to be useful here. Please edit to the site standard, which is for the title to simply state the task accomplished by the code (and in the description, please be a bit more specific about what you mean by "clean up"). Please see How to get the best value out of Code Review: Asking Questions for guidance on writing good question titles. Commented Nov 5, 2021 at 8:02
  • Put the part about optimizing into the body of the question. What is the game? What does the code do with the data frame? Commented Nov 5, 2021 at 11:57
  • Thank you for your feedback. I have updated the body of my question. Is there anything else I should be clarifying? Commented Nov 6, 2021 at 7:16
  • (I don't understand one example from the code: as "Talleres - "Defensores Unidos" is not well-formed.) Commented Nov 10, 2021 at 7:01
  • Re "problem is [I wrote] code as and when [needed]": one problem may be that you did not write and keep a specification of what to achieve. Commented Nov 10, 2021 at 7:05

1 Answer


Any time you have a section labelled "This part", that's a good indication that you should have a function. Code like this should not live as one flat script with no structure. Among many other reasons,

  • once one of your subroutines finishes, any intermediate variables that go out of scope like m, n etc. will be eligible for garbage collection
  • profiling becomes much more feasible - please do this; otherwise you're effectively flying blind
  • it will be more legible and self-documenting
  • it will be possible for your IDE to perform folding, etc.
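Gathering that profiling data needs nothing beyond the standard library; here is a minimal sketch, where pipeline() is a hypothetical stand-in for the real read_csv -> clean -> to_csv sequence:

```python
import cProfile
import io
import pstats

def pipeline():
    # Hypothetical stand-in for the actual cleaning work
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Print the five most expensive calls by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
print(out.getvalue())
```

Run this once on a representative slice of the data; the slowest entries tell you which of the guesses below actually matters.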

Until you have profiling data it's effectively impossible to say what part of your code is the slowest. Some random guesses:

  • where you do astype(str), that's likely not necessary
  • when you run successive predicate selection like
df = df[df.home_score != "."]
df = df[df.home_score != ".."]

you should instead do only one [] indexing operation, and combine your predicates logically using &

  • See if there is a performance difference if you prefer in-place operations, including dropping columns and NA, which are both possible in-place
  • Does your colt actually do anything? Is there any evidence that your column types are non-uniform? Could you, instead of doing a blanket type coerce, target specific columns that need conversion?
  • Avoid applymap. There is a vectorised strip available. You do this twice - why?
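As an illustration of the last three points, the repeated score filters could be collapsed into one mask and the per-cell applymap replaced by a vectorised strip. A sketch on toy data, not the asker's file:

```python
import pandas as pd

df = pd.DataFrame({'home_score': ['2', '.', '..', '1'],
                   'away_score': ['1 ', '0', '3', ' 2']})

# One boolean mask, one indexing pass: each successive df = df[...]
# copies the surviving rows, so combining predicates (here with isin,
# equivalently with &) avoids repeated copies of a million-row frame.
keep = ~df['home_score'].isin(['.', '..'])
df = df[keep].copy()  # .copy() so later column writes are unambiguous

# Vectorised strip on the string column instead of applymap over every cell
df['away_score'] = df['away_score'].str.strip()
print(list(df['away_score']))  # ['1', '2']
```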

Non-performance topics:

  • Don't name variables m, n, o - these are opaque and non-descriptive
  • Rather than writing "this part harmonises ...", you can just write "harmonise ..." - i.e. use the imperative mood which is common in function documentation.

A first stab, addressing some of the above, looks like

import pandas as pd
from datetime import datetime
def harmonize_game(df: pd.DataFrame) -> pd.DataFrame:
 """
 harmonise the game column, e.g. "Talleres (R.E) - Defensores Unidos" should
 become "Talleres - Defensores Unidos", and remove any date values in
 the column
 """
 df['game'] = df['game'].astype(str).str.replace(r'(\(\w+\))', '', regex=True)
 df['game'] = df['game'].astype(str).str.replace(r'(\s\d+\S\d+)$', '', regex=True)
 # Remove any numerical values in the league column. Many times the league
 # column has years concatenated which is what we don't want e.g "Brazil
 # Copa do Nordeste 2020" should be "Brazil Copa do Nordeste"
 df['league'] = df['league'].astype(str).str.replace(r'(\s\d+\S\d+)$', '', regex=True)
 # Split the game column into two competing teams i.e. home team and away team by the delimiter "-"
 df[['home_team', 'away_team']] = df['game'].str.split(' - ', expand=True, n=1)
 # Split the score column into home score and away score by the delimiter ":"
 df[['home_score', 'away_score']] = df['score'].str.split(':', expand=True)
 return df
def numerical_scores(df: pd.DataFrame) -> pd.DataFrame:
 """
 Remove any non-numerical values in the home score and away score columns,
 e.g. scores can't have "aet", "canc", ".", etc. We don't want anything that
 cannot be parsed as int by pandas
 """
 df['away_score'] = df['away_score'].astype(str).str.replace(r'[a-zA-Z\s\D]', '', regex=True)
 df['home_score'] = df['home_score'].astype(str).str.replace(r'[a-zA-Z\s\D]', '', regex=True)
 df = df[df.home_score != "."]
 df = df[df.home_score != ".."]
 df = df[df.home_odds != "-"]
 df = df[df.draw_odds != "-"]
 df = df[df.away_odds != "-"]
 m = df[['home_odds', 'draw_odds', 'away_odds']].astype(str).agg(lambda x: x.str.count('/'), 1).ne(0).all(1)
 n = df[['home_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
 o = df[['away_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
 df = df[~m]
 df = df[~n]
 df = df[~o]
 df = df[df.home_score != '']
 df = df[df.away_score != '']
 df = df.dropna()
 return df
def coerce_columns(df: pd.DataFrame) -> pd.DataFrame:
 """Keep only the columns we want"""
 df = df.loc[
 :, df.columns.intersection([
 'datetime', 'country', 'league', 'home_team', 'away_team',
 'home_odds', 'draw_odds', 'away_odds', 'home_score',
 'away_score',
 ])
 ]
 # Make sure that the columns are as per data types that we would want pandas
 # to identify. Pandas does not seem to do a very good job identifying data
 # types correctly.
 colt = {
 'country': str,
 'league': str,
 'home_team': str,
 'away_team': str,
 'home_odds': float,
 'draw_odds': float,
 'away_odds': float,
 'home_score': int,
 'away_score': int
 }
 df = df.astype(colt)
 return df
def strip_strings(df: pd.DataFrame) -> pd.DataFrame:
 """remove any leading and trailing whitespaces in the string columns"""
 return df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
def clean_odds(df: pd.DataFrame) -> pd.DataFrame:
 """Drop rows where odds are greater than 100, and drop duplicates"""
 df = df[df['home_odds'] <= 100]
 df = df[df['draw_odds'] <= 100]
 df = df[df['away_odds'] <= 100]
 df = df.drop_duplicates(
 ['datetime', 'home_score', 'away_score', 'country', 'league', 'home_team', 'away_team'],
 keep='last',
 )
 df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
 return df
def clean(df: pd.DataFrame) -> pd.DataFrame:
 df = harmonize_game(df)
 df = numerical_scores(df)
 df = coerce_columns(df)
 df = strip_strings(df)
 df = clean_odds(df)
 return df
def test() -> None:
 df = pd.DataFrame(
 (
 (0, 'Sport Recife - Imperatriz', '2:2', 1.36, 4.31, 7.66, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-07 00:00:00', '%Y-%m-%d %H:%M:%S')),
 (1, 'ABC - America RN', '2:1', 2.62, 3.30, 2.48, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
 (2, 'Frei Paulistano - Nautico', '0:2', 5.19, 3.58, 1.62, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 00:00:00', '%Y-%m-%d %H:%M:%S')),
 (3, 'Botafogo PB - Confianca', '1:1', 2.06, 3.16, 3.50, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
 (4, 'Fortaleza - Ceara', '1:1', 2.19, 2.98, 3.38, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
 ),
 columns=(
 'Unnamed: 0', 'game',
 'score',
 'home_odds',
 'draw_odds',
 'away_odds', 'country', 'league', 'datetime',
 ),
 )
 clean(df)
if __name__ == '__main__':
 test()
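Incidentally, the "Boolean Series key will be reindexed" warning appears because m, n and o are built against one index and then applied after earlier filters have already shrunk the frame. Building every mask against the same frame and filtering once avoids it. A sketch on toy data (column names from the question; the all-columns rule for fractional odds is kept from the original m):

```python
import pandas as pd

df = pd.DataFrame({
    'home_odds':  ['1.50', '2/1', '3.00'],
    'draw_odds':  ['3.20', '31/10', '2.90'],
    'away_odds':  ['4.00', '5/2', '2/5'],
    'home_score': ['1', '2', '-1'],
    'away_score': ['0', '1', '2'],
})

# All masks share df's current index, so combining them and indexing
# once never triggers the reindexing warning.
fractional = (df[['home_odds', 'draw_odds', 'away_odds']]
              .apply(lambda s: s.str.contains('/', regex=False))
              .all(axis=1))
dashed = (df['home_score'].str.contains('-', regex=False)
          | df['away_score'].str.contains('-', regex=False))
df = df[~(fractional | dashed)]
print(len(df))  # 1 -> only the first row survives
```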
answered Nov 6, 2021 at 16:16
  • Thank you. The code is much cleaner and I get that building functions is a better way to approach this. However, I am still getting the same time as my earlier code and I still get the Boolean reindexing warning. While I understand the time cannot be improved, can I resolve the reindexing warning I get at df = df[~n] and df = df[~o]? Commented Nov 7, 2021 at 0:39
  • It's not strictly guaranteed that the time can't be improved. Again, you need to profile your code. Commented Nov 7, 2021 at 4:06
  • Can I ask for a solution for the same code but using cudf on this platform? How can I ask it? Commented Nov 11, 2021 at 9:49
  • How do I save the resulting dataframe to_csv? Commented Nov 11, 2021 at 13:21
