Cleanup dataset for visualization

Question 1

This code cleans up a big dataset into a very clean and a flat file that can be further used downstream like visualization.

I am expecting to improve the code to essentially run it faster and cleanup the code to avoid any inefficient functions that should not take further resources than it should.

The problem is that I have written the code as and when I have experienced any functions as needed to be added on the fly and now, I am not very happy with it.

I am seeking help from this community on the best code that can be written which performs the same work that this code does. I would also be learning on the best practices while working with pandas.

There is so much on the internet on Pandas that every function and code seems to be a clean code but when i look at it collectively, its so bad.

My dataframe:

df.head()
 Unnamed: 0 game score home_odds draw_odds away_odds country league datetime
0 0 Sport Recife - Imperatriz 2:2 1.36 4.31 7.66 Brazil Copa do Nordeste 2020 2020年02月07日 00:00:00
1 1 ABC - America RN 2:1 2.62 3.30 2.48 Brazil Copa do Nordeste 2020 2020年02月02日 22:00:00
2 2 Frei Paulistano - Nautico 0:2 5.19 3.58 1.62 Brazil Copa do Nordeste 2020 2020年02月02日 00:00:00
3 3 Botafogo PB - Confianca 1:1 2.06 3.16 3.5 Brazil Copa do Nordeste 2020 2020年02月02日 22:00:00
4 4 Fortaleza - Ceara 1:1 2.19 2.98 3.38 Brazil Copa do Nordeste 2020 2020年02月02日 22:00:00
df.describe()
 Unnamed: 0
count 1.115767e+06
mean 5.574871e+05
std 3.220941e+05
min 0.000000e+00
25% 2.785455e+05
50% 5.574870e+05
75% 8.364285e+05
max 1.115370e+06

I use this code to cleanup the dataframe:

from datetime import datetime
import pandas as pd
import numpy as np
from tabulate import tabulate
start = datetime.now()
df = pd.read_csv()
#This part essentially splits columns and harmonises the entie dataframe
# This code harmonises the game column e.g. "Talleres (R.E) - Defensores Unidos" should be as "Talleres - "Defensores Unidos" removing any brackets and its values and removes any date values in the column
df['game'] = df['game'].astype(str).str.replace('(\(\w+\))', '', regex=True)
df['game'] = df['game'].astype(str).str.replace('(\s\d+\S\d+)$', '', regex=True)
# This code removes any numerical values in the league column. Many times the league column has years concatenated which is what we don't want e.g "Brazil Copa do Nordeste 2020" should be "Brazil Copa do Nordeste"
df['league'] = df['league'].astype(str).str.replace('(\s\d+\S\d+)$', '', regex=True)
# This part splits the game column into two competing teams i.e. home team and away team by the delimiter "-"
df[['home_team', 'away_team']] = df['game'].str.split(' - ', expand=True, n=1)
# This part splits the score column into two competing teams i.e. home score and away score by the delimiter ":"
df[['home_score', 'away_score']] = df['score'].str.split(':', expand=True)
# This code removes any non numerical values in the home score and away score columns. e.ge scores can't have "aet", "canc", ".", etc. We dont want anything that cannot be identified as filetype:int in pandas
df['away_score'] = df['away_score'].astype(str).str.replace('[a-zA-Z\s\D]', '', regex=True)
df['home_score'] = df['home_score'].astype(str).str.replace('[a-zA-Z\s\D]', '', regex=True)
df = df[df.home_score != "."]
df = df[df.home_score != ".."]
df = df[df.home_score != "."]
df = df[df.home_odds != "-"]
df = df[df.draw_odds != "-"]
df = df[df.away_odds != "-"]
m = df[['home_odds', 'draw_odds', 'away_odds']].astype(str).agg(lambda x: x.str.count('/'), 1).ne(0).all(1)
n = df[['home_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
o = df[['away_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
# I get UserWarning: Boolean Series key will be reindexed to match DataFrame index. at these parts
df = df[~n]
df = df[~m]
df = df[~n]
df = df[~o]
df = df[df.home_score != '']
df = df[df.away_score != '']
df = df.dropna()
# We now would be keeping only the columns we want
df = df.loc[:, df.columns.intersection(
 ['datetime', 'country', 'league', 'home_team', 'away_team', 'home_odds', 'draw_odds', 'away_odds', 'home_score',
 'away_score'])]
#We are making sure that the columns are as per data types that we would want pandas to identify. Pandas does not seem to do a very good job identifying data types correctly.
colt = {
 'country': str,
 'league': str,
 'home_team': str,
 'away_team': str,
 'home_odds': float,
 'draw_odds': float,
 'away_odds': float,
 'home_score': int,
 'away_score': int
}
df = df.astype(colt)
# This part removes any leading and trailing whitespaces in the string columns
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
# Cleaning data where odds are greater than 100 and less than -1 and dropping duplicates
df = df[df['home_odds'] <= 100]
df = df[df['draw_odds'] <= 100]
df = df[df['away_odds'] <= 100]
df = df.drop_duplicates(['datetime', 'home_score', 'away_score', 'country', 'league', 'home_team', 'away_team'],
 keep='last')
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df.to_csv()
time_taken = end - start
print('Time taken to complete: ', time_taken)
df.head()
 home_odds draw_odds away_odds country league datetime home_team away_team home_score away_score
-- ----------- ----------- ----------- --------- ---------------- ------------------- --------------- ----------- ------------ ------------
 0 1.36 4.31 7.66 Brazil Copa do Nordeste 2020年02月07日 00:00:00 Sport Recife Imperatriz 2 2
 1 2.62 3.3 2.48 Brazil Copa do Nordeste 2020年02月02日 22:00:00 ABC America RN 2 1
 2 5.19 3.58 1.62 Brazil Copa do Nordeste 2020年02月02日 00:00:00 Frei Paulistano Nautico 0 2
 3 2.06 3.16 3.5 Brazil Copa do Nordeste 2020年02月02日 22:00:00 Botafogo PB Confianca 1 1
 4 2.19 2.98 3.38 Brazil Copa do Nordeste 2020年02月02日 22:00:00 Fortaleza Ceara 1 1

It takes me 9 minutes to run this code with warnings:

sys:1: DtypeWarning: Columns (3,4,5) have mixed types.Specify dtype option on import or set low_memory=False.
G:/My Drive/Odds/Code/5. Creating updated training data.py:33: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
 df = df[~n]
G:/My Drive/Odds/Code/5. Creating updated training data.py:34: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
 df = df[~o]

How can I cleanup this code and run it faster using pandas?

Also, I have a GPU so I can exploit cudf however I am using the python 3.7 environment and cudf does not seem to support it and am not familiar with conda yet.

Question 2

Welcome to Code Review! The current question title, which states your concerns about the code, is too general to be useful here. Please edit to the site standard, which is for the title to simply state the task accomplished by the code (and in the description, please be a bit more specific about what you mean by "clean up"). Please see How to get the best value out of Code Review: Asking Questions for guidance on writing good question titles.

Question 3

Put the part about optimizing into the body of the question. What is the game? What does the code do with the data frame?

Question 4

Thank you for your feedback. I have updated the body of my question. Is there anything else I should be clarifying?

Question 5

(I don't understand one example from the code: as "Talleres - "Defensores Unidos" is not well-formed.)

Question 6

problem is [I wrote] code as and when [needed] One problem may be that you did not write and keep a specification of what to achieve.

Question 7

Any time that you have a section labelled This part, that's a good indication that you should have a function. Code like this should not exist in a flat file with no structure. Among many other reasons,

once one of your subroutines finishes, any intermediate variables that go out of scope like m, n etc. will be eligible for garbage collection
performing profiling will be much more possible - please do this; otherwise you're effectively flying blind
it will be more legible and self-documenting
it will be possible for your IDE to perform folding, etc.

Until you have profiling data it's effectively impossible to say what part of your code is the slowest. Some random guesses:

where you do astype(str), that's likely not necessary
when you run successive predicate selection like

df = df[df.home_score != "."]
df = df[df.home_score != ".."]

you should instead do only one [] indexing operation, and combine your predicates logically using &

See if there is a performance difference if you prefer in-place operations, including dropping columns and NA, which are both possible in-place
Does your colt actually do anything? Is there any evidence that your column types are non-uniform? Could you, instead of doing a blanket type coerce, target specific columns that need conversion?
Avoid applymap. There is a vectorised strip available. You do this twice - why?

Non-performance topics:

Don't name variables m, n, o - these are opaque and non-descriptive
Rather than writing "this part harmonises ...", you can just write "harmonise ..." - i.e. use the imperative mood which is common in function documentation.

A first stab, addressing some of the above, looks like

import pandas as pd
from datetime import datetime
def harmonize_game(df: pd.DataFrame) -> pd.DataFrame:
 """
 harmonise the game column e.g. "Talleres (R.E) - Defensores Unidos" should
 be split as "Talleres - "Defensores Unidos" and removes any date values in
 the column
 """
 df['game'] = df['game'].astype(str).str.replace(r'(\(\w+\))', '', regex=True)
 df['game'] = df['game'].astype(str).str.replace(r'(\s\d+\S\d+)$', '', regex=True)
 # Remove any numerical values in the league column. Many times the league
 # column has years concatenated which is what we don't want e.g "Brazil
 # Copa do Nordeste 2020" should be "Brazil Copa do Nordeste"
 df['league'] = df['league'].astype(str).str.replace(r'(\s\d+\S\d+)$', '', regex=True)
 # Split the game column into tow competing teams i.e. home team and away team by the delimiter "-"
 df[['home_team', 'away_team']] = df['game'].str.split(' - ', expand=True, n=1)
 # Split the game column into tow competing teams i.e. home team and away team by the delimiter ":"
 df[['home_score', 'away_score']] = df['score'].str.split(':', expand=True)
 return df
def numerical_scores(df: pd.DataFrame) -> pd.DataFrame:
 """
 Remove any non numerical values in the home score and away score columns.
 e.ge scores can have "aet", "canc", ".", etc. We don't want anything that
 cannot be identified as filetype:int in pandas
 """
 df['away_score'] = df['away_score'].astype(str).str.replace(r'[a-zA-Z\s\D]', '', regex=True)
 df['home_score'] = df['home_score'].astype(str).str.replace(r'[a-zA-Z\s\D]', '', regex=True)
 df = df[df.home_score != "."]
 df = df[df.home_score != ".."]
 df = df[df.home_score != "."]
 df = df[df.home_odds != "-"]
 df = df[df.draw_odds != "-"]
 df = df[df.away_odds != "-"]
 m = df[['home_odds', 'draw_odds', 'away_odds']].astype(str).agg(lambda x: x.str.count('/'), 1).ne(0).all(1)
 n = df[['home_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
 o = df[['away_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
 df = df[~m]
 df = df[~n]
 df = df[~o]
 df = df[df.home_score != '']
 df = df[df.away_score != '']
 df = df.dropna()
 return df
def coerce_columns(df: pd.DataFrame) -> pd.DataFrame:
 """Keep only the columns we want"""
 df = df.loc[
 :, df.columns.intersection([
 'datetime', 'country', 'league', 'home_team', 'away_team',
 'home_odds', 'draw_odds', 'away_odds', 'home_score',
 'away_score',
 ])
 ]
 # Make sure that the columns are as per data types that we would want pandas
 # to identify. Pandas does not seem to do a very good job identifying data
 # types correctly.
 colt = {
 'country': str,
 'league': str,
 'home_team': str,
 'away_team': str,
 'home_odds': float,
 'draw_odds': float,
 'away_odds': float,
 'home_score': int,
 'away_score': int
 }
 df = df.astype(colt)
 return df
def strip_strings(df: pd.DataFrame) -> pd.DataFrame:
 """remove any leading and trailing whitespaces in the string columns"""
 return df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
def clean_odds(df: pd.DataFrame) -> pd.DataFrame:
 """Cleaning data where odds are greater than 100 and less than -1 and dropping duplicates"""
 df = df[df['home_odds'] <= 100]
 df = df[df['draw_odds'] <= 100]
 df = df[df['away_odds'] <= 100]
 df = df.drop_duplicates(
 ['datetime', 'home_score', 'away_score', 'country', 'league', 'home_team', 'away_team'],
 keep='last',
 )
 df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
 return df
def clean(df: pd.DataFrame) -> pd.DataFrame:
 df = harmonize_game(df)
 df = numerical_scores(df)
 df = coerce_columns(df)
 df = strip_strings(df)
 df = clean_odds(df)
 return df
def test() -> None:
 df = pd.DataFrame(
 (
 (0, 'Sport Recife - Imperatriz', '2:2', 1.36, 4.31, 7.66, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-07 00:00:00', '%Y-%m-%d %H:%M:%S')),
 (1, 'ABC - America RN', '2:1', 2.62, 3.30, 2.48, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
 (2, 'Frei Paulistano - Nautico', '0:2', 5.19, 3.58, 1.62, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 00:00:00', '%Y-%m-%d %H:%M:%S')),
 (3, 'Botafogo PB - Confianca', '1:1', 2.06, 3.16, 3.50, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
 (4, 'Fortaleza - Ceara', '1:1', 2.19, 2.98, 3.38, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
 ),
 columns=(
 'Unnamed: 0', 'game',
 'score',
 'home_odds',
 'draw_odds',
 'away_odds', 'country', 'league', 'datetime',
 ),
 )
 clean(df)
if __name__ == '__main__':
 test()

Question 8

Thank you. The code is much cleaner and I get that building functions is a better way to approach this. However, I am still getting the same time as my earlier code and I still get the Boolean Reindexing warning. While I understand, the time cannot be improved however, can I resolve the reindexing warning I get at df = df[~n] and df = df[~o] ?

Question 9

It's not strictly guaranteed that the time can't be improved. Again, you need to profile your code.

Question 10

Can I ask for a solution for the same code but using cudf on this platform? How can I ask it?

Question 11

How do I save the resulting dataframe to_csv?

Reinderien Reinderien 70.9k5 gold badges76 silver badges256 bronze badges · Accepted Answer · 2021-11-06 16:16:34Z

Any time that you have a section labelled This part, that's a good indication that you should have a function. Code like this should not exist in a flat file with no structure. Among many other reasons,

once one of your subroutines finishes, any intermediate variables that go out of scope like m, n etc. will be eligible for garbage collection
performing profiling will be much more possible - please do this; otherwise you're effectively flying blind
it will be more legible and self-documenting
it will be possible for your IDE to perform folding, etc.

Until you have profiling data it's effectively impossible to say what part of your code is the slowest. Some random guesses:

where you do astype(str), that's likely not necessary
when you run successive predicate selection like

df = df[df.home_score != "."]
df = df[df.home_score != ".."]

you should instead do only one [] indexing operation, and combine your predicates logically using &

See if there is a performance difference if you prefer in-place operations, including dropping columns and NA, which are both possible in-place
Does your colt actually do anything? Is there any evidence that your column types are non-uniform? Could you, instead of doing a blanket type coerce, target specific columns that need conversion?
Avoid applymap. There is a vectorised strip available. You do this twice - why?

Non-performance topics:

Don't name variables m, n, o - these are opaque and non-descriptive
Rather than writing "this part harmonises ...", you can just write "harmonise ..." - i.e. use the imperative mood which is common in function documentation.

A first stab, addressing some of the above, looks like

import pandas as pd
from datetime import datetime
def harmonize_game(df: pd.DataFrame) -> pd.DataFrame:
 """
 harmonise the game column e.g. "Talleres (R.E) - Defensores Unidos" should
 be split as "Talleres - "Defensores Unidos" and removes any date values in
 the column
 """
 df['game'] = df['game'].astype(str).str.replace(r'(\(\w+\))', '', regex=True)
 df['game'] = df['game'].astype(str).str.replace(r'(\s\d+\S\d+)$', '', regex=True)
 # Remove any numerical values in the league column. Many times the league
 # column has years concatenated which is what we don't want e.g "Brazil
 # Copa do Nordeste 2020" should be "Brazil Copa do Nordeste"
 df['league'] = df['league'].astype(str).str.replace(r'(\s\d+\S\d+)$', '', regex=True)
 # Split the game column into tow competing teams i.e. home team and away team by the delimiter "-"
 df[['home_team', 'away_team']] = df['game'].str.split(' - ', expand=True, n=1)
 # Split the game column into tow competing teams i.e. home team and away team by the delimiter ":"
 df[['home_score', 'away_score']] = df['score'].str.split(':', expand=True)
 return df
def numerical_scores(df: pd.DataFrame) -> pd.DataFrame:
 """
 Remove any non numerical values in the home score and away score columns.
 e.ge scores can have "aet", "canc", ".", etc. We don't want anything that
 cannot be identified as filetype:int in pandas
 """
 df['away_score'] = df['away_score'].astype(str).str.replace(r'[a-zA-Z\s\D]', '', regex=True)
 df['home_score'] = df['home_score'].astype(str).str.replace(r'[a-zA-Z\s\D]', '', regex=True)
 df = df[df.home_score != "."]
 df = df[df.home_score != ".."]
 df = df[df.home_score != "."]
 df = df[df.home_odds != "-"]
 df = df[df.draw_odds != "-"]
 df = df[df.away_odds != "-"]
 m = df[['home_odds', 'draw_odds', 'away_odds']].astype(str).agg(lambda x: x.str.count('/'), 1).ne(0).all(1)
 n = df[['home_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
 o = df[['away_score']].agg(lambda x: x.str.count('-'), 1).ne(0).all(1)
 df = df[~m]
 df = df[~n]
 df = df[~o]
 df = df[df.home_score != '']
 df = df[df.away_score != '']
 df = df.dropna()
 return df
def coerce_columns(df: pd.DataFrame) -> pd.DataFrame:
 """Keep only the columns we want"""
 df = df.loc[
 :, df.columns.intersection([
 'datetime', 'country', 'league', 'home_team', 'away_team',
 'home_odds', 'draw_odds', 'away_odds', 'home_score',
 'away_score',
 ])
 ]
 # Make sure that the columns are as per data types that we would want pandas
 # to identify. Pandas does not seem to do a very good job identifying data
 # types correctly.
 colt = {
 'country': str,
 'league': str,
 'home_team': str,
 'away_team': str,
 'home_odds': float,
 'draw_odds': float,
 'away_odds': float,
 'home_score': int,
 'away_score': int
 }
 df = df.astype(colt)
 return df
def strip_strings(df: pd.DataFrame) -> pd.DataFrame:
 """remove any leading and trailing whitespaces in the string columns"""
 return df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
def clean_odds(df: pd.DataFrame) -> pd.DataFrame:
 """Cleaning data where odds are greater than 100 and less than -1 and dropping duplicates"""
 df = df[df['home_odds'] <= 100]
 df = df[df['draw_odds'] <= 100]
 df = df[df['away_odds'] <= 100]
 df = df.drop_duplicates(
 ['datetime', 'home_score', 'away_score', 'country', 'league', 'home_team', 'away_team'],
 keep='last',
 )
 df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
 return df
def clean(df: pd.DataFrame) -> pd.DataFrame:
 df = harmonize_game(df)
 df = numerical_scores(df)
 df = coerce_columns(df)
 df = strip_strings(df)
 df = clean_odds(df)
 return df
def test() -> None:
 df = pd.DataFrame(
 (
 (0, 'Sport Recife - Imperatriz', '2:2', 1.36, 4.31, 7.66, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-07 00:00:00', '%Y-%m-%d %H:%M:%S')),
 (1, 'ABC - America RN', '2:1', 2.62, 3.30, 2.48, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
 (2, 'Frei Paulistano - Nautico', '0:2', 5.19, 3.58, 1.62, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 00:00:00', '%Y-%m-%d %H:%M:%S')),
 (3, 'Botafogo PB - Confianca', '1:1', 2.06, 3.16, 3.50, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
 (4, 'Fortaleza - Ceara', '1:1', 2.19, 2.98, 3.38, 'Brazil', 'Copa do Nordeste 2020', datetime.strptime('2020-02-02 22:00:00', '%Y-%m-%d %H:%M:%S')),
 ),
 columns=(
 'Unnamed: 0', 'game',
 'score',
 'home_odds',
 'draw_odds',
 'away_odds', 'country', 'league', 'datetime',
 ),
 )
 clean(df)
if __name__ == '__main__':
 test()

Thank you. The code is much cleaner and I get that building functions is a better way to approach this. However, I am still getting the same time as my earlier code and I still get the Boolean Reindexing warning. While I understand, the time cannot be improved however, can I resolve the reindexing warning I get at df = df[~n] and df = df[~o] ?
It's not strictly guaranteed that the time can't be improved. Again, you need to profile your code.
Can I ask for a solution for the same code but using cudf on this platform? How can I ask it?

Stack Exchange Network

Cleanup dataset for visualization

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Cleanup dataset for visualization

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions