
I have a large dataset imported from a CSV file (using pd.read_csv). Here's how it looks in the CSV file:

            0     1         2
0       Milan  Draw  Juventus
1        2.47  3.24      3.03
2        2.45  3.23      3.06
0      Napoli  Draw     Parma
1        1.45  4.41      7.38
2        1.45  4.40      7.36
3        1.46  4.39      7.33
4        1.47  4.33      7.14
5        1.47  4.33      7.13
6        1.47  4.34      7.10
7        1.43  4.54      7.70
0  Fiorentina  Draw      Pisa
1        2.86  3.50      2.45
2        2.92  3.51      2.40
3        3.14  3.55      2.25
4        2.79  3.45      2.61

I need the dataframe to look like this:

             0         1     2     3     4
0        Milan  Juventus  2.47  3.24  3.03
1        Milan  Juventus  2.45  3.23  3.06
2       Napoli     Parma  1.45  4.41  7.38
3       Napoli     Parma  1.45  4.40  7.36
4       Napoli     Parma  1.46  4.39  7.33
5       Napoli     Parma  1.47  4.33  7.14
6       Napoli     Parma  1.47  4.33  7.13
7       Napoli     Parma  1.47  4.34  7.10
8       Napoli     Parma  1.43  4.54  7.70
9   Fiorentina      Pisa  2.86  3.50  2.45
10  Fiorentina      Pisa  2.92  3.51  2.40
11  Fiorentina      Pisa  3.14  3.55  2.25
12  Fiorentina      Pisa  2.79  3.45  2.61

It's very easy to do in Excel with a formula, but I would like to do it in Python: the CSV file is very big, so handling it with pandas is much faster. I don't know whether it is possible or how to do it. Thanks!

Doing as suggested, df = pd.read_csv(r"G:\PLUTO\odds2.csv", sep=",") gave me this:

    Unnamed: 0           0     1         2
0            0       Milan  Draw  Juventus
1            1        2.88  3.58      2.46
2            2        2.84  3.56       2.5
3            0      Napoli  Draw     Parma
4            1        2.44  3.35      3.08
5            2         2.5   3.3      3.03
6            3        2.48  3.31      3.05
7            4        2.49   3.3      3.05
8            5        2.46  3.38      3.02
9            6        2.49  3.37      2.99
10           7        2.48   3.4      2.98
11           0  Fiorentina  Draw      Pisa
12           1        3.05  3.23      2.53
13           2        3.04  3.24      2.53
14           3        3.22  3.25      2.41
15           4        3.23  3.24      2.41

Both methods worked after adding index_col=0 to the read_csv call:

df = pd.read_csv(r"G:\PLUTO\odds.csv", sep=",", index_col=0)
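For illustration, the effect of index_col=0 can be sketched on a small in-memory sample. The CSV text below is an assumed stand-in for odds.csv (comma-separated, saved with its index as an unlabeled first column) and uses only a few of the rows shown above; the variable names are made up for the example.

```python
import io

import pandas as pd

# Assumed layout of the file: the first header field is empty because
# the DataFrame was written to disk together with its index.
csv_text = """,0,1,2
0,Milan,Draw,Juventus
1,2.47,3.24,3.03
2,2.45,3.23,3.06
0,Napoli,Draw,Parma
1,1.45,4.41,7.38
"""

# Default read: the unlabeled first column becomes a regular column.
plain = pd.read_csv(io.StringIO(csv_text), sep=",")
print(list(plain.columns))    # ['Unnamed: 0', '0', '1', '2']

# With index_col=0 that column is used as the index instead,
# leaving only the three data columns '0', '1', '2'.
indexed = pd.read_csv(io.StringIO(csv_text), sep=",", index_col=0)
print(list(indexed.columns))  # ['0', '1', '2']
print(list(indexed.index))    # [0, 1, 2, 0, 1]
```

The extra 'Unnamed: 0' column is what int() could not convert in the errors quoted in the comments further down.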

Thanks to you both!

asked Dec 23, 2022 at 18:12

2 Answers


Here's a way to do what your question asks:

import pandas as pd

df = pd.read_csv('eestlane.txt', sep=r"\s+")
df = df.reset_index().rename(columns={'index':'zero_for_names'})
# take the home and away names (columns '0' and '2') from the name rows and forward-fill them
df[['new1','new2']] = df.loc[df['zero_for_names'] == 0, ['0','2']].reindex(df.index, method='ffill')
df = df[df['zero_for_names'] != 0].drop(columns='zero_for_names').reset_index(drop = True)
df=df[['new1','new2','0','1','2']]
df.columns=[str(i) for i in range(len(df.columns))]

Output:

             0         1     2     3     4
0        Milan  Juventus  2.47  3.24  3.03
1        Milan  Juventus  2.45  3.23  3.06
2       Napoli     Parma  1.45  4.41  7.38
3       Napoli     Parma  1.45  4.40  7.36
4       Napoli     Parma  1.46  4.39  7.33
5       Napoli     Parma  1.47  4.33  7.14
6       Napoli     Parma  1.47  4.33  7.13
7       Napoli     Parma  1.47  4.34  7.10
8       Napoli     Parma  1.43  4.54  7.70
9   Fiorentina      Pisa  2.86  3.50  2.45
10  Fiorentina      Pisa  2.92  3.51  2.40
11  Fiorentina      Pisa  3.14  3.55  2.25
12  Fiorentina      Pisa  2.79  3.45  2.61

Explanation:

  • use read_csv to get a 3-column dataframe with an index that contains 0 only for rows with names
  • use reset_index to get an index without duplicates, and rename to change the original index to a column named zero_for_names
  • create two new columns new1, new2 and use masking on zero_for_names together with reindex and its ffill method arg to prepare these columns to be the first two columns of the target output specified in the question
  • use zero_for_names to filter out original name rows, then drop this column and use reset_index to get a new index without gaps
  • rearrange the columns into the desired order
  • update df.columns to match the desired column names (integers as strings) shown in the question.
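The steps above can be tried end to end on an in-memory sample. What follows is a minimal sketch, not the answer's exact code: it assumes a comma-separated file saved with its index (hence index_col=0), uses Series.where plus ffill instead of reindex, keeps the home and away names as in the question's target output, and the output column names (home, away, odds_home, odds_draw, odds_away) are hypothetical.

```python
import io

import pandas as pd

# Sample rows from the question; the comma-separated layout with an
# unlabeled index column is an assumption about the real file.
csv_text = """,0,1,2
0,Milan,Draw,Juventus
1,2.47,3.24,3.03
2,2.45,3.23,3.06
0,Napoli,Draw,Parma
1,1.45,4.41,7.38
2,1.45,4.40,7.36
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The original index is 0 exactly on the rows that hold team names.
is_name_row = df.index == 0

# Replace the repeating index with a unique 0..n-1 index.
df = df.reset_index(drop=True)

# Keep names only on name rows, then carry them down to the odds rows.
home = df["0"].where(is_name_row).ffill()
away = df["2"].where(is_name_row).ffill()

# Odds rows are everything else; convert their strings to floats.
odds = df.loc[~is_name_row, ["0", "1", "2"]].astype(float)
odds.columns = ["odds_home", "odds_draw", "odds_away"]  # hypothetical names

out = pd.concat(
    [home[~is_name_row].rename("home"), away[~is_name_row].rename("away"), odds],
    axis=1,
).reset_index(drop=True)
print(out)
```

Converting the odds with astype(float) is optional; leaving them as strings reproduces the text formatting of the file (for example 4.40 rather than 4.4).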
answered Dec 23, 2022 at 19:47

1 Comment

Hi, I don't know if I'm doing something wrong, but I got this error: KeyError: "None of [Index(['0', '1'], dtype='object')] are in the [columns]" at the line "df[['new1','new2']] = df.loc[df['ze...". I also noticed that after reading the CSV file with sep=r'\s+' the result is only one column, with all values separated by commas, plus a new index.

Try:

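import numpy as np
import pandas as pd


def isfloat(x):
    """Return True if x can be parsed as a float."""
    try:
        float(x)
        return True
    except ValueError:
        return False


df = pd.read_csv("your_file.csv", sep=r"\s+")  # <-- you may need to adjust sep= accordingly
# make sure the columns are of type int
df.columns = map(int, df.columns)
# True where a cell holds a number (in pandas >= 2.1, DataFrame.map replaces the deprecated applymap)
mask = df.applymap(isfloat)
# keep the numeric cells aside, then blank them out in the original frame
x = df[mask].copy()
df[mask] = np.nan
# put the odds into new columns 3, 4, 5
df[[3, 4, 5]] = x
# carry each row of names down over the odds rows that follow it
df[[0, 1, 2]] = df[[0, 1, 2]].ffill()
# drop the name-only rows and the "Draw" column, then renumber the columns
df = df.dropna().reset_index(drop=True).drop(columns=1)
df.columns = range(len(df.columns))
print(df)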

Prints:

             0         1     2     3     4
0        Milan  Juventus  2.47  3.24  3.03
1        Milan  Juventus  2.45  3.23  3.06
2       Napoli     Parma  1.45  4.41  7.38
3       Napoli     Parma  1.45  4.40  7.36
4       Napoli     Parma  1.46  4.39  7.33
5       Napoli     Parma  1.47  4.33  7.14
6       Napoli     Parma  1.47  4.33  7.13
7       Napoli     Parma  1.47  4.34  7.10
8       Napoli     Parma  1.43  4.54  7.70
9   Fiorentina      Pisa  2.86  3.50  2.45
10  Fiorentina      Pisa  2.92  3.51  2.40
11  Fiorentina      Pisa  3.14  3.55  2.25
12  Fiorentina      Pisa  2.79  3.45  2.61
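A possible variant of the same idea, offered only as a sketch: pd.to_numeric with errors="coerce" can stand in for the hand-written isfloat check, since it turns every non-numeric cell into NaN and yields float columns directly. The sample below assumes the comma-separated layout with a saved index column that came up in the comments, and the output column names are hypothetical.

```python
import io

import pandas as pd

# Assumed stand-in for the real file (comma-separated, index saved as
# the unlabeled first column), using a few rows from the question.
csv_text = """,0,1,2
0,Milan,Draw,Juventus
1,2.47,3.24,3.03
2,2.45,3.23,3.06
0,Napoli,Draw,Parma
1,1.45,4.41,7.38
2,1.45,4.40,7.36
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=0).reset_index(drop=True)

# Convert every column to numbers; team names and "Draw" become NaN.
odds = df.apply(pd.to_numeric, errors="coerce")
odds.columns = ["odds_home", "odds_draw", "odds_away"]  # hypothetical names

# A row with no numeric cell at all is a row of names.
is_name_row = odds.isna().all(axis=1)

# Take home/away names from the name rows and forward-fill them.
teams = df.loc[is_name_row, ["0", "2"]].reindex(df.index).ffill()
teams.columns = ["home", "away"]  # hypothetical names

# Join names and odds side by side, then drop the name-only rows.
out = pd.concat([teams, odds], axis=1).loc[~is_name_row].reset_index(drop=True)
print(out)
```

One caveat of this sketch: a row whose three odds are all missing or unparseable would be mistaken for a row of names, so it suits files where every odds row has at least one valid number.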
answered Dec 23, 2022 at 18:34

5 Comments

With your method I'm also getting an error: "ValueError: invalid literal for int() with base 10: ',Home,Draw,Away'" at the line df.columns = map(int, df.columns)
@eestlane You probably want to adjust the parameter to sep=","
Sorry for being dumb. I edited the question to show the result after adjusting as you suggested, but I still get an error: "ValueError: invalid literal for int() with base 10: 'Unnamed: 0'" when executing df.columns = map(int, df.columns)
@eestlane When loading the dataframe try to select only columns you want: df = pd.read_csv("your_file.csv", sep=r",")[['0', '1', '2']]
Thanks, it worked once I added index_col=0, as noted in my edited question, and that holds for your method too. I appreciate the help!
