I have a big dataset imported from a CSV file (using pd.read_csv). Here's how it looks in the CSV file:
0 1 2
0 Milan Draw Juventus
1 2.47 3.24 3.03
2 2.45 3.23 3.06
0 Napoli Draw Parma
1 1.45 4.41 7.38
2 1.45 4.40 7.36
3 1.46 4.39 7.33
4 1.47 4.33 7.14
5 1.47 4.33 7.13
6 1.47 4.34 7.10
7 1.43 4.54 7.70
0 Fiorentina Draw Pisa
1 2.86 3.50 2.45
2 2.92 3.51 2.40
3 3.14 3.55 2.25
4 2.79 3.45 2.61
I need the dataframe to look like this:
0 1 2 3 4
0 Milan Juventus 2.47 3.24 3.03
1 Milan Juventus 2.45 3.23 3.06
2 Napoli Parma 1.45 4.41 7.38
3 Napoli Parma 1.45 4.40 7.36
4 Napoli Parma 1.46 4.39 7.33
5 Napoli Parma 1.47 4.33 7.14
6 Napoli Parma 1.47 4.33 7.13
7 Napoli Parma 1.47 4.34 7.10
8 Napoli Parma 1.43 4.54 7.70
9 Fiorentina Pisa 2.86 3.50 2.45
10 Fiorentina Pisa 2.92 3.51 2.40
11 Fiorentina Pisa 3.14 3.55 2.25
12 Fiorentina Pisa 2.79 3.45 2.61
It's very easy to do in Excel with a formula, but I would like to do it in Python: the CSV file is very big, so handling it with pandas is much faster. I don't know whether it is possible or how to do it. Thanks!
Doing as suggested, df = pd.read_csv(r"G:\PLUTO\odds2.csv", sep=",") got me this:
Unnamed: 0 0 1 2
0 0 Milan Draw Juventus
1 1 2.88 3.58 2.46
2 2 2.84 3.56 2.5
3 0 Napoli Draw Parma
4 1 2.44 3.35 3.08
5 2 2.5 3.3 3.03
6 3 2.48 3.31 3.05
7 4 2.49 3.3 3.05
8 5 2.46 3.38 3.02
9 6 2.49 3.37 2.99
10 7 2.48 3.4 2.98
11 0 Fiorentina Draw Pisa
12 1 3.05 3.23 2.53
13 2 3.04 3.24 2.53
14 3 3.22 3.25 2.41
15 4 3.23 3.24 2.41
Both methods worked after adding index_col=0 to the read_csv call:
df = pd.read_csv(r"G:\PLUTO\odds.csv", sep=",", index_col=0)
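A small sketch of why index_col=0 makes the difference. The sample text below is hypothetical: it assumes the file is comma-separated with an unnamed first column, which is what the "Unnamed: 0" header above suggests.

```python
import io
import pandas as pd

# Hypothetical comma-separated sample mirroring the layout shown above.
csv_text = """,0,1,2
0,Milan,Draw,Juventus
1,2.88,3.58,2.46
2,2.84,3.56,2.5
0,Napoli,Draw,Parma
1,2.44,3.35,3.08
"""

# Without index_col, the unnamed first column is read as a regular column.
plain = pd.read_csv(io.StringIO(csv_text), sep=",")
print(list(plain.columns))

# With index_col=0, it becomes the index, leaving the columns '0', '1', '2'.
indexed = pd.read_csv(io.StringIO(csv_text), sep=",", index_col=0)
print(list(indexed.columns))
print(list(indexed.index))
```

With the first column as the index, the frame has the same shape that both answers expect: three string-labelled columns and an index that restarts at 0 on every names row.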
Thanks to both guys!
2 Answers
Here's a way to do what your question asks:
df = pd.read_csv('eestlane.txt', sep=r"\s+")
df = df.reset_index().rename(columns={'index':'zero_for_names'})
df[['new1','new2']] = df.loc[df['zero_for_names'] == 0, ['0','2']].reindex(df.index, method='ffill')
df = df[df['zero_for_names'] != 0].drop(columns='zero_for_names').reset_index(drop = True)
df=df[['new1','new2','0','1','2']]
df.columns=[str(i) for i in range(len(df.columns))]
Output:
0 1 2 3 4
0 Milan Juventus 2.47 3.24 3.03
1 Milan Juventus 2.45 3.23 3.06
2 Napoli Parma 1.45 4.41 7.38
3 Napoli Parma 1.45 4.40 7.36
4 Napoli Parma 1.46 4.39 7.33
5 Napoli Parma 1.47 4.33 7.14
6 Napoli Parma 1.47 4.33 7.13
7 Napoli Parma 1.47 4.34 7.10
8 Napoli Parma 1.43 4.54 7.70
9 Fiorentina Pisa 2.86 3.50 2.45
10 Fiorentina Pisa 2.92 3.51 2.40
11 Fiorentina Pisa 3.14 3.55 2.25
12 Fiorentina Pisa 2.79 3.45 2.61
Explanation:
- use read_csv to get a 3-column dataframe with an index that contains 0 only for rows with names
- use reset_index to get an index without duplicates, and rename to change the original index to a column named zero_for_names
- create two new columns new1, new2 and use masking on zero_for_names together with reindex and its ffill method arg to prepare these columns to be the first two columns of the target output specified in the question
- use zero_for_names to filter out original name rows, then drop this column and use reset_index to get a new index without gaps
- rearrange the columns into the desired order
- update df.columns to match the desired column names (integers as strings) shown in the question.
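The steps above can be sketched as one self-contained function. This is only an illustration under a few assumptions: the sample text and the names flatten_odds, home and away are hypothetical, and the team names are taken from columns '0' and '2' so that the result matches the layout requested in the question.

```python
import io
import pandas as pd

# Hypothetical sample in the whitespace-separated layout from the question.
raw = """0 1 2
0 Milan Draw Juventus
1 2.47 3.24 3.03
2 2.45 3.23 3.06
0 Napoli Draw Parma
1 1.45 4.41 7.38
"""

def flatten_odds(df):
    """Copy each match's team names onto the odds rows that follow it."""
    # Move the repeating index (0 marks a names row) into a regular column.
    df = df.reset_index().rename(columns={"index": "zero_for_names"})
    is_names = df["zero_for_names"] == 0
    # Take home/away names from the names rows and forward-fill them down.
    names = df.loc[is_names, ["0", "2"]].reindex(df.index, method="ffill")
    df["home"] = names["0"]
    df["away"] = names["2"]
    # Keep only the odds rows, reorder the columns, and renumber the rows.
    out = df.loc[~is_names, ["home", "away", "0", "1", "2"]].reset_index(drop=True)
    out.columns = [str(i) for i in range(out.shape[1])]
    return out

result = flatten_odds(pd.read_csv(io.StringIO(raw), sep=r"\s+"))
print(result)
```

The odds are still strings at this point, because the names rows force every column to object dtype; result[['2', '3', '4']].astype(float) would convert them if numeric values are needed.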
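Try:
import numpy as np
import pandas as pd

def isfloat(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

df = pd.read_csv("your_file.csv", sep=r"\s+")  # <-- you may need to adjust sep= accordingly
# make sure the column labels are of type int
df.columns = [int(c) for c in df.columns]
# True where a cell parses as a number (on pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap)
mask = df.applymap(isfloat)
x = df[mask].copy()  # numeric cells only, NaN elsewhere
df[mask] = np.nan  # keep only the names in columns 0, 1, 2
df[[3, 4, 5]] = x  # the odds go into new columns 3, 4, 5
df[[0, 1, 2]] = df[[0, 1, 2]].ffill()  # copy the names down onto the odds rows
# drop the names rows (their odds are NaN), then the "Draw" column (label 1)
df = df.dropna().reset_index(drop=True).drop(columns=1)
df.columns = range(len(df.columns))
print(df)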
Prints:
0 1 2 3 4
0 Milan Juventus 2.47 3.24 3.03
1 Milan Juventus 2.45 3.23 3.06
2 Napoli Parma 1.45 4.41 7.38
3 Napoli Parma 1.45 4.40 7.36
4 Napoli Parma 1.46 4.39 7.33
5 Napoli Parma 1.47 4.33 7.14
6 Napoli Parma 1.47 4.33 7.13
7 Napoli Parma 1.47 4.34 7.10
8 Napoli Parma 1.43 4.54 7.70
9 Fiorentina Pisa 2.86 3.50 2.45
10 Fiorentina Pisa 2.92 3.51 2.40
11 Fiorentina Pisa 3.14 3.55 2.25
12 Fiorentina Pisa 2.79 3.45 2.61
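For comparison, here is a rough variant of the same masking idea that swaps the cell-by-cell isfloat check for pd.to_numeric(errors="coerce") on the first column. It is a sketch, not the answer's method: the sample text and the labels c0, c1, c2, home and away are hypothetical, and it assumes team names never parse as numbers.

```python
import io
import pandas as pd

# Hypothetical sample in the whitespace-separated layout from the question.
raw = """0 1 2
0 Milan Draw Juventus
1 2.47 3.24 3.03
2 2.45 3.23 3.06
0 Napoli Draw Parma
1 1.45 4.41 7.38
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+").reset_index(drop=True)
df.columns = ["c0", "c1", "c2"]

# Rows whose first cell does not parse as a number are the names rows.
is_names = pd.to_numeric(df["c0"], errors="coerce").isna()

# Keep names only on the names rows, then forward-fill them down.
df["home"] = df["c0"].where(is_names).ffill()
df["away"] = df["c2"].where(is_names).ffill()

# Keep the odds rows, reorder the columns, and convert the odds to floats.
odds = df.loc[~is_names, ["home", "away", "c0", "c1", "c2"]].reset_index(drop=True)
odds[["c0", "c1", "c2"]] = odds[["c0", "c1", "c2"]].astype(float)
odds.columns = range(odds.shape[1])
print(odds)
```

Checking a single column instead of every cell should be cheaper on a very large file, and the odds end up as floats rather than strings.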
If the file is comma-separated, use sep="," instead:
df = pd.read_csv("your_file.csv", sep=",")[['0', '1', '2']]