
So I am trying to open a CSV file, read its fields, fix some other fields based on them, and then save the data back to CSV. My problem is that the CSV file has 2 million rows. What would be the best way to speed this up?
The CSV file consists of

ID; DATE(d/m/y); SPECIAL_ID; DAY; MONTH; YEAR

I am counting how often a row with the same date appears in my records and then updating SPECIAL_ID based on that count.

Based on my previous research I decided to use pandas. I'll be processing even bigger sets of data in the future (1-2 GB); this one is around 119 MB, so it's crucial that I find a good, fast solution.

My code goes as follows:

import numpy as np
import pandas as pd

df = pd.read_csv(filename, delimiter=';')
df_fixed = pd.DataFrame(columns=stolpci)  # when I process a row in df I append it to df_fixed
d = 31
m = 12
y = 100
s = (d + 1, m + 1, y)  # indexed as [day][month][year offset]
list_dates = np.zeros(s)  # 3-dimensional array of per-date counters
for index, row in df.iterrows():
    # PROCESSING LOGIC GOES HERE
    # IT CONSISTS OF A FEW IF STATEMENTS
    list_dates[row.DAY][row.MONTH][row.YEAR] += 1
    row['special_id'] = list_dates[row.DAY][row.MONTH][row.YEAR]
    df_fixed = df_fixed.append(row.to_frame().T)
df_fixed.to_csv(filename_fixed, sep=';', encoding='utf-8')

I added a print for every thousand rows processed. At first my script needs 3 seconds per 1000 rows, but the longer it runs the slower it gets: at row 43,000 it needs 29 seconds, and so on...
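The slowdown comes from growing df_fixed with DataFrame.append inside the loop: each call copies the whole frame, so the work grows with every row. A minimal sketch of the usual workaround, collecting plain dicts in a list and building the DataFrame once at the end (filename and filename_fixed are the same placeholders as above):

import pandas as pd

df = pd.read_csv(filename, delimiter=';')

rows = []  # plain Python list; appending to it is cheap
for _, row in df.iterrows():
    # ... per-row fix-up logic goes here ...
    rows.append(row.to_dict())

# build the output frame once instead of growing it row by row
df_fixed = pd.DataFrame(rows, columns=df.columns)
df_fixed.to_csv(filename_fixed, sep=';', encoding='utf-8', index=False)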

Thanks for all future help :)

EDIT: I am adding additional information about my CSV and expected output

ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505__-;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505__-;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505__-;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505__-;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505__-;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505__-;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505__-;F;6;1001001;1001001_F_6;16;8;2011

I have to replace the __- suffix in the SPECIAL_ID field with a proper number. For example, for the row with ID = 2 the SPECIAL_ID will be 13012016505001 (__- got replaced by 001); if someone else in the CSV shares the same DAY, MONTH and YEAR, their __- will be replaced by 002, and so on... So the expected output for the above rows would be

ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505001;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505001;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505001;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505001;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505001;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505001;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505002;F;6;1001001;1001001_F_6;16;8;2011

EDIT: I changed my code to something like this: I fill a list of dicts with the data, then convert that list to a DataFrame and save it as CSV. This takes around 30 minutes to complete

import pandas as pd

list_popravljeni = []
df = pd.read_csv(filename, delimiter=';')
# one row per distinct birth date, with its count
df_dates = df.groupby(by=['dan_roj', 'mesec_roj', 'leto_roj']).size().reset_index()
for _, date_row in df_dates.iterrows():
    df_candidates = df.loc[(df['dan_roj'] == date_row['dan_roj']) &
                           (df['mesec_roj'] == date_row['mesec_roj']) &
                           (df['leto_roj'] == date_row['leto_roj'])]
    for counter, (_, row) in enumerate(df_candidates.iterrows(), start=1):
        vrstica = {}
        vrstica['ID'] = row['identifikator']
        vrstica['SPECIAL_ID'] = row['emso'][0:11] + str(counter).zfill(3)
        vrstica['DAY'] = row['dan_roj']
        vrstica['MONTH'] = row['mesec_roj']
        vrstica['YEAR'] = row['leto_roj']
        list_popravljeni.append(vrstica)

df_popravljeni = pd.DataFrame(list_popravljeni, columns=list_popravljeni[0].keys())
df_popravljeni.to_csv(filename_fixed, sep=';', encoding='utf-8', index=False)
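A single-pass alternative to the nested loops above is a plain dict keyed by the date tuple, which mirrors the 3-dimensional array idea from the first attempt but avoids the repeated .loc scans; a sketch, assuming the column names from the sample CSV (day, month, year, SPECIAL_ID) and the same filename placeholders:

from collections import defaultdict

import pandas as pd

df = pd.read_csv(filename, delimiter=';')

seen = defaultdict(int)  # per-date counter keyed by (day, month, year)
fixed_ids = []
for day, month, year, special_id in zip(df['day'], df['month'], df['year'], df['SPECIAL_ID']):
    seen[(day, month, year)] += 1
    # replace the trailing '__-' with the zero-padded counter
    fixed_ids.append(special_id[:-3] + str(seen[(day, month, year)]).zfill(3))

df['SPECIAL_ID'] = fixed_ids
df.to_csv(filename_fixed, sep=';', encoding='utf-8', index=False)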
asked May 22, 2018 at 8:04
  • one way to reduce the search space might be to use pandas slices to replace your IF conditions. E.g. if value > 5 can be expressed as df.loc[df['value'] > 5, :], which will trim your dataset down to only those rows that match your if conditions. This assumes you are looking for one outcome that matches all the conditions. Commented May 22, 2018 at 8:15
  • Can you please include a small example of your input file and the expected output in this case? DataFrames are not optimised for for loops or being appended to millions of times, but it's hard to properly visualise a vectorized approach without an example to check against. Commented May 22, 2018 at 8:21
  • I'm not sure why you can't just use something like df = df.groupby(by=['year', 'month', 'day']).size().reset_index(). This will count occurrences of a particular date. Commented May 22, 2018 at 8:28
  • @roganjosh I did :) check the edited post Commented May 22, 2018 at 8:33
  • okay @roganjosh I tried as you suggested and groupby works perfectly! I didn't even know that option existed in pandas. Now I have to assign those numbers to the actual records.. Commented May 22, 2018 at 8:42

1 Answer


I think this gives what you're looking for and avoids looping. It could potentially be more efficient (I wasn't able to find a way to avoid creating the intermediate counts column), but it should be much faster than your current approach.

# running count of rows seen so far for each (year, month, day), starting at 1
df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
# zero-pad the counter to three digits, e.g. 1 -> '001'
df['counts'] = df['counts'].astype(str)
df['counts'] = df['counts'].str.zfill(3)
# drop the trailing '__-' and append the padded counter
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])

I added a fake record at the end to confirm it does increment properly:

 SPECIAL_ID sex age zone key day month year counts
0 13012016505001 F 1 1001001 1001001_F_1 13 1 2016 001
1 25122013505001 F 4 1001001 1001001_F_4 25 12 2013 001
2 24022012505001 F 5 1001001 1001001_F_5 24 2 2012 001
3 09032012505001 F 5 1001001 1001001_F_5 9 3 2012 001
4 21082011505001 F 6 1001001 1001001_F_6 21 8 2011 001
5 16082011505001 F 6 1001001 1001001_F_6 16 8 2011 001
6 21102011505002 F 6 1001001 1001001_F_6 16 8 2011 002
7 21102012505003 F 6 1001001 1001001_F_6 16 8 2011 003

If you want to get rid of counts, you just need:

df.drop('counts', inplace=True, axis=1)
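For reference, a sketch of how this could slot into the original read/write flow; filename and filename_fixed are the placeholders from the question, and the column names follow the sample CSV:

import pandas as pd

df = pd.read_csv(filename, delimiter=';')

# per-date running counter, 1 for the first row of each (year, month, day)
counts = df.groupby(['year', 'month', 'day']).cumcount() + 1

# swap the trailing '__-' for the zero-padded counter
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(counts.astype(str).str.zfill(3))

df.to_csv(filename_fixed, sep=';', encoding='utf-8', index=False)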
answered May 22, 2018 at 10:38

6 Comments

first of all, thanks for the help! and yes, the count should increment. record#5[SPECIAL_ID] should be 16082011505001 and record#6[SPECIAL_ID] should be 16082011505002
@Kristjan but it increments specifically based on the ordering in the ID column? Because I think groupby will destroy any of the initial ordering so we need something fixed to base this on, otherwise it'll be arbitrary which gets 001 and which gets 002
at this stage it doesn't matter who gets which number. I just need all records for a given day to get a number from 1 to len(records_on_given_day). I managed to speed it up; I think it will take around 30-40 minutes to complete... but still better than my old way. I'll post my solution as an answer when I find some time... If you think an even faster way is possible, my ears are open
@Kristjan I have fixed the issue. I expect my approach to be in the order of seconds, not minutes.
Great, I incorporated your answer into my code and it works perfectly and really fast! I'll be doing a lot of work like that above in the future, so if you could point me to a good learning resource it would be really great.