Modify pandas dataframe imported from csv file

Question 1

I have a large dataset imported from a CSV file (using pd.read_csv). Here is how it looks in the CSV file:

 0 1 2
0 Milan Draw Juventus
1 2.47 3.24 3.03
2 2.45 3.23 3.06
0 Napoli Draw Parma
1 1.45 4.41 7.38
2 1.45 4.40 7.36
3 1.46 4.39 7.33
4 1.47 4.33 7.14
5 1.47 4.33 7.13
6 1.47 4.34 7.10
7 1.43 4.54 7.70
0 Fiorentina Draw Pisa
1 2.86 3.50 2.45
2 2.92 3.51 2.40
3 3.14 3.55 2.25
4 2.79 3.45 2.61

I need the dataframe to look like this:

 0 1 2 3 4
0 Milan Juventus 2.47 3.24 3.03
1 Milan Juventus 2.45 3.23 3.06
2 Napoli Parma 1.45 4.41 7.38
3 Napoli Parma 1.45 4.40 7.36
4 Napoli Parma 1.46 4.39 7.33
5 Napoli Parma 1.47 4.33 7.14
6 Napoli Parma 1.47 4.33 7.13
7 Napoli Parma 1.47 4.34 7.10
8 Napoli Parma 1.43 4.54 7.70
9 Fiorentina Pisa 2.86 3.50 2.45
10 Fiorentina Pisa 2.92 3.51 2.40
11 Fiorentina Pisa 3.14 3.55 2.25
12 Fiorentina Pisa 2.79 3.45 2.61

It's very easy to do in Excel with a formula, but I would like to be able to do it in Python: the CSV file is very big, so handling it with pandas is much faster. I don't know whether it is possible or how to do it. Thanks!

Doing as suggested, df = pd.read_csv(r"G:\PLUTO\odds2.csv", sep=",") gave me this:

 Unnamed: 0 0 1 2
0 0 Milan Draw Juventus
1 1 2.88 3.58 2.46
2 2 2.84 3.56 2.5
3 0 Napoli Draw Parma
4 1 2.44 3.35 3.08
5 2 2.5 3.3 3.03
6 3 2.48 3.31 3.05
7 4 2.49 3.3 3.05
8 5 2.46 3.38 3.02
9 6 2.49 3.37 2.99
10 7 2.48 3.4 2.98
11 0 Fiorentina Draw Pisa
12 1 3.05 3.23 2.53
13 2 3.04 3.24 2.53
14 3 3.22 3.25 2.41
15 4 3.23 3.24 2.41

Both methods worked after adding index_col=0 to the read_csv call:

df = pd.read_csv(r"G:\PLUTO\odds.csv", sep=",", index_col=0)

Thanks to both guys!
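For anyone hitting the same Unnamed: 0 column, here is a minimal sketch of the difference index_col=0 makes. The inline CSV text is a made-up stand-in for the real file (an assumption, since the comma-separated file itself isn't shown), and io.StringIO is used only so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for the comma-separated file. The empty first
# header cell is what pandas labels "Unnamed: 0" by default.
csv_text = """,0,1,2
0,Milan,Draw,Juventus
1,2.47,3.24,3.03
2,2.45,3.23,3.06
0,Napoli,Draw,Parma
1,1.45,4.41,7.38
"""

default_df = pd.read_csv(io.StringIO(csv_text), sep=",")
indexed_df = pd.read_csv(io.StringIO(csv_text), sep=",", index_col=0)

print(list(default_df.columns))  # includes the extra first column
print(list(indexed_df.columns))  # only the three data columns
print(list(indexed_df.index))    # the repeating per-match counter
```

With index_col=0 the repeating counter ends up in the index, which is the layout both answers expect.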

Answer

Here's a way to do what your question asks:

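import pandas as pd

# Rows have one more field than the header, so the first field becomes the index.
df = pd.read_csv('eestlane.txt', sep=r"\s+")
# Move that repeating counter into a column; it is 0 only on name rows.
df = df.reset_index().rename(columns={'index': 'zero_for_names'})
# Forward-fill the values taken from the name rows down to the odds rows.
df[['new1', 'new2']] = df.loc[df['zero_for_names'] == 0, ['0', '1']].reindex(df.index, method='ffill')
# Drop the name rows and the helper column, then renumber the rows.
df = df[df['zero_for_names'] != 0].drop(columns='zero_for_names').reset_index(drop=True)
df = df[['new1', 'new2', '0', '1', '2']]
df.columns = [str(i) for i in range(len(df.columns))]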

Output:

 0 1 2 3 4
0 Milan Draw 2.47 3.24 3.03
1 Milan Draw 2.45 3.23 3.06
2 Napoli Draw 1.45 4.41 7.38
3 Napoli Draw 1.45 4.40 7.36
4 Napoli Draw 1.46 4.39 7.33
5 Napoli Draw 1.47 4.33 7.14
6 Napoli Draw 1.47 4.33 7.13
7 Napoli Draw 1.47 4.34 7.10
8 Napoli Draw 1.43 4.54 7.70
9 Fiorentina Draw 2.86 3.50 2.45
10 Fiorentina Draw 2.92 3.51 2.40
11 Fiorentina Draw 3.14 3.55 2.25
12 Fiorentina Draw 2.79 3.45 2.61

Explanation:

Use read_csv to get a 3-column dataframe whose index contains 0 only for the rows with names.
Use reset_index to get an index without duplicates, and rename to turn the original index into a column named zero_for_names.
Create two new columns, new1 and new2, by masking on zero_for_names and then calling reindex with method='ffill', so that each name row's values are carried down to the odds rows below it. These become the first two columns of the target output.
Use zero_for_names to filter out the original name rows, then drop this column and use reset_index to get a new index without gaps.
Rearrange the columns into the desired order.
Update df.columns to match the desired column names (integers as strings) shown in the question.
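The steps above can be tried without a file using the sketch below. It is a variant, not the answer's exact code: the output above carries the Draw label in the second column, while the question asks for the away team, so this version picks columns '0' and '2' from the name rows. The column names home and away and the inline sample text are assumptions made for the example.

```python
import io
import pandas as pd

# A few rows from the question, whitespace-separated. Each data row has
# one more field than the header, so pandas uses the first field as index.
text = """0 1 2
0 Milan Draw Juventus
1 2.47 3.24 3.03
2 2.45 3.23 3.06
0 Napoli Draw Parma
1 1.45 4.41 7.38
2 1.45 4.40 7.36
"""

df = pd.read_csv(io.StringIO(text), sep=r"\s+")
df = df.reset_index().rename(columns={"index": "zero_for_names"})

# Take home/away names from the name rows and carry them down.
names = df.loc[df["zero_for_names"] == 0, ["0", "2"]]
filled = names.reindex(df.index, method="ffill")
df["home"] = filled["0"]  # hypothetical column name
df["away"] = filled["2"]  # hypothetical column name

# Keep only the odds rows, reorder, and renumber rows and columns.
odds = df[df["zero_for_names"] != 0]
result = odds[["home", "away", "0", "1", "2"]].reset_index(drop=True)
result.columns = [str(i) for i in range(len(result.columns))]
print(result)
```

The odds columns come out as strings here because they shared a column with team names when the file was read; result[["2", "3", "4"]].astype(float) converts them if numbers are needed.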

Comment

Hi, I don't know if I'm doing something wrong, but I got this error: KeyError: "None of [Index(['0', '1'], dtype='object')] are in the [columns]" at the line "df[['new1','new2']] = df.loc[df['ze...". I also noticed that after reading the CSV file with sep=r'\s+', the result is only one column, with all values separated by commas, plus a new index.
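A likely cause, sketched under the assumption that the file is really comma-separated (the read_csv calls with sep="," elsewhere in the thread point that way): with sep=r"\s+" no line contains a separator, so each whole line lands in a single column and a column named '0' never exists. The inline text is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical comma-separated content with no whitespace inside lines.
csv_text = """,0,1,2
0,Milan,Draw,Juventus
1,2.47,3.24,3.03
"""

wrong = pd.read_csv(io.StringIO(csv_text), sep=r"\s+")
right = pd.read_csv(io.StringIO(csv_text), sep=",")

print(wrong.shape)           # one column holding each whole line
print(list(wrong.columns))   # the header line became a single label
print("0" in wrong.columns)  # so df['0'] raises KeyError
print("0" in right.columns)  # present once the separator matches
```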

Answer

Try:

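import numpy as np
import pandas as pd

def isfloat(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

df = pd.read_csv("your_file.csv", sep=r"\s+")  # <-- you may need to adjust sep= accordingly
# make sure the column labels are of type int
df.columns = [int(c) for c in df.columns]
# True where a cell parses as a number (on pandas >= 2.1, DataFrame.map is the non-deprecated name)
mask = df.applymap(isfloat)
x = df[mask].copy()  # numeric cells only; name cells become NaN
df[mask] = np.nan  # keep only the names in columns 0, 1, 2
df[[3, 4, 5]] = x  # append the odds as three new columns
df[[0, 1, 2]] = df[[0, 1, 2]].ffill()  # carry each name row down to its odds rows
# drop the name rows (their odds are NaN) and the "Draw" column
df = df.dropna().reset_index(drop=True).drop(columns=1)
df.columns = range(len(df.columns))
print(df)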

Prints:

 0 1 2 3 4
0 Milan Juventus 2.47 3.24 3.03
1 Milan Juventus 2.45 3.23 3.06
2 Napoli Parma 1.45 4.41 7.38
3 Napoli Parma 1.45 4.40 7.36
4 Napoli Parma 1.46 4.39 7.33
5 Napoli Parma 1.47 4.33 7.14
6 Napoli Parma 1.47 4.33 7.13
7 Napoli Parma 1.47 4.34 7.10
8 Napoli Parma 1.43 4.54 7.70
9 Fiorentina Pisa 2.86 3.50 2.45
10 Fiorentina Pisa 2.92 3.51 2.40
11 Fiorentina Pisa 3.14 3.55 2.25
12 Fiorentina Pisa 2.79 3.45 2.61
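On a very big file, calling a Python function on every cell can be slow. The sketch below is a different technique from the code above, not the answerer's method: it swaps the isfloat mask for pd.to_numeric with errors="coerce", which turns team names into NaN column by column. The inline sample text and the home/away names are assumptions made for the example.

```python
import io
import pandas as pd

text = """0 1 2
0 Milan Draw Juventus
1 2.47 3.24 3.03
2 2.45 3.23 3.06
0 Napoli Draw Parma
1 1.45 4.41 7.38
"""

# Drop the repeating counter so the row index is unique.
df = pd.read_csv(io.StringIO(text), sep=r"\s+").reset_index(drop=True)

# Parse every column as numbers; cells holding names become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")
is_name_row = numeric.isna().all(axis=1)

# Take home/away from the name rows and carry them down to the odds rows.
names = df.loc[is_name_row, ["0", "2"]].reindex(df.index).ffill()
names.columns = ["home", "away"]  # hypothetical column names

result = pd.concat([names, numeric], axis=1).loc[~is_name_row]
result = result.reset_index(drop=True)
result.columns = range(result.shape[1])
print(result)
```

A side effect of this route is that the three odds columns come out as floats rather than strings.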

Comment

With your method I'm also getting this error: "ValueError: invalid literal for int() with base 10: ',Home,Draw,Away'" at the line df.columns = map(int, df.columns)

Comment

@eestlane You probably want to adjust the parameter to sep=","

Comment

Sorry for being dumb. I edited the result after adjusting as you suggested, but I still get the same error, "ValueError: invalid literal for int() with base 10: 'Unnamed: 0'", when executing df.columns = map(int, df.columns)

Comment

@eestlane When loading the dataframe, try to select only the columns you want: df = pd.read_csv("your_file.csv", sep=r",")[['0', '1', '2']]
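A minimal sketch of why this suggestion gets past the int() error, assuming the header row is an empty cell followed by 0, 1, 2 (the inline text is a made-up stand-in for the real file): the label 'Unnamed: 0' cannot be converted to an integer, while the three selected labels can.

```python
import io
import pandas as pd

# Hypothetical comma-separated content with an unnamed first column.
csv_text = """,0,1,2
0,Milan,Draw,Juventus
1,2.47,3.24,3.03
"""

raw = pd.read_csv(io.StringIO(csv_text), sep=",")

# Converting every label fails because of the "Unnamed: 0" column.
try:
    [int(c) for c in raw.columns]
    conversion_failed = False
except ValueError as exc:
    conversion_failed = True
    print("conversion failed:", exc)

# Selecting only the data columns leaves labels that int() accepts.
selected = raw[["0", "1", "2"]].copy()
selected.columns = [int(c) for c in selected.columns]
print(list(selected.columns))
```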

Comment

Thanks, it worked once I added index_col=0 (see the edit to my original post), and your method works too. I appreciate the help!

Accepted answer (the first answer above) by constantstranger, 2022-12-23 19:47:33Z

