
So I am trying to open a CSV file, read its fields, fix some other fields based on them, and then save the data back to CSV. My problem is that the CSV file has 2 million rows. What would be the best way to speed this up?
The CSV file consists of

ID; DATE(d/m/y); SPECIAL_ID; DAY; MONTH; YEAR

I am counting how often a row with the same date appears in my records and then updating SPECIAL_ID based on that count.

Based on my previous research I decided to use pandas. I'll be processing even bigger sets of data in the future (1-2 GB); this one is around 119 MB, so it's crucial that I find a good, fast solution.

My code goes as follows:

import numpy as np
import pandas as pd

df = pd.read_csv(filename, delimiter=';')
df_fixed = pd.DataFrame(columns=stolpci)  # when I process a row in df I append it to df_fixed
d = 31
m = 12
y = 100
s = (d + 1, m + 1, y)  # indexed as [day][month][year offset]
list_dates = np.zeros(s)  # 3-dimensional array of per-date counters
for index, row in df.iterrows():
    # PROCESSING LOGIC GOES HERE
    # IT CONSISTS OF A FEW IF STATEMENTS
    list_dates[row.DAY][row.MONTH][row.YEAR] += 1
    row['special_id'] = list_dates[row.DAY][row.MONTH][row.YEAR]
    df_fixed = df_fixed.append(row.to_frame().T)
df_fixed.to_csv(filename_fixed, sep=';', encoding='utf-8')

I added a print for every thousand rows processed. At first my script needs 3 seconds per 1000 rows, but the longer it runs the slower it gets: at row 43,000 it needs 29 seconds, and so on...
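The slowdown comes from growing df_fixed with DataFrame.append inside the loop: each call copies the whole frame, so the work grows with every row. A minimal sketch of the usual workaround, collecting plain dicts in a list and building the DataFrame once at the end (filename and filename_fixed are the same placeholders as above):

import pandas as pd

df = pd.read_csv(filename, delimiter=';')

rows = []  # plain Python list; appending to it is cheap
for _, row in df.iterrows():
    # ... per-row fix-up logic goes here ...
    rows.append(row.to_dict())

# build the output frame once instead of growing it row by row
df_fixed = pd.DataFrame(rows, columns=df.columns)
df_fixed.to_csv(filename_fixed, sep=';', encoding='utf-8', index=False)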

Thanks for all future help :)

EDIT: I am adding additional information about my CSV and expected output

ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505__-;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505__-;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505__-;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505__-;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505__-;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505__-;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505__-;F;6;1001001;1001001_F_6;16;8;2011

I have to replace the __- suffix in the SPECIAL_ID field with a proper number. For example, for the row with ID = 2 the SPECIAL_ID will be 13012016505001 (__- got replaced by 001); if someone else in the CSV shares the same DAY, MONTH and YEAR, their __- will be replaced by 002, and so on... So the expected output for the above rows would be

ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505001;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505001;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505001;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505001;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505001;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505001;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505002;F;6;1001001;1001001_F_6;16;8;2011

EDIT: I changed my code to something like this: I fill a list of dicts with the data, then convert that list to a DataFrame and save it as CSV. This takes around 30 minutes to complete

import pandas as pd

list_popravljeni = []
df = pd.read_csv(filename, delimiter=';')
# one row per distinct birth date, with its count
df_dates = df.groupby(by=['dan_roj', 'mesec_roj', 'leto_roj']).size().reset_index()
for _, date_row in df_dates.iterrows():
    df_candidates = df.loc[(df['dan_roj'] == date_row['dan_roj']) &
                           (df['mesec_roj'] == date_row['mesec_roj']) &
                           (df['leto_roj'] == date_row['leto_roj'])]
    for counter, (_, row) in enumerate(df_candidates.iterrows(), start=1):
        vrstica = {}
        vrstica['ID'] = row['identifikator']
        vrstica['SPECIAL_ID'] = row['emso'][0:11] + str(counter).zfill(3)
        vrstica['DAY'] = row['dan_roj']
        vrstica['MONTH'] = row['mesec_roj']
        vrstica['YEAR'] = row['leto_roj']
        list_popravljeni.append(vrstica)

df_popravljeni = pd.DataFrame(list_popravljeni, columns=list_popravljeni[0].keys())
df_popravljeni.to_csv(filename_fixed, sep=';', encoding='utf-8', index=False)
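A single-pass alternative to the nested loops above is a plain dict keyed by the date tuple, which mirrors the 3-dimensional array idea from the first attempt but avoids the repeated .loc scans; a sketch, assuming the column names from the sample CSV (day, month, year, SPECIAL_ID) and the same filename placeholders:

from collections import defaultdict

import pandas as pd

df = pd.read_csv(filename, delimiter=';')

seen = defaultdict(int)  # per-date counter keyed by (day, month, year)
fixed_ids = []
for day, month, year, special_id in zip(df['day'], df['month'], df['year'], df['SPECIAL_ID']):
    seen[(day, month, year)] += 1
    # replace the trailing '__-' with the zero-padded counter
    fixed_ids.append(special_id[:-3] + str(seen[(day, month, year)]).zfill(3))

df['SPECIAL_ID'] = fixed_ids
df.to_csv(filename_fixed, sep=';', encoding='utf-8', index=False)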
asked May 22, 2018 at 8:04
  • one way to reduce the search space might be to use pandas slices to replace your IF conditions. E.g. if value > 5 can be expressed as df.loc[df['value'] > 5, :], which will trim your dataset down to only those rows that match your if conditions. This assumes you are looking for one outcome that matches all the conditions. Commented May 22, 2018 at 8:15
  • Can you please include a small example of your input file and the expected output in this case? DataFrames are not optimised for for loops or being appended to millions of times, but it's hard to properly visualise a vectorized approach without an example to check against. Commented May 22, 2018 at 8:21
  • I'm not sure why you can't just use something like df = df.groupby(by=['year', 'month', 'day']).size().reset_index(). This will count occurrences of a particular date. Commented May 22, 2018 at 8:28
  • @roganjosh I did :) check the edited post Commented May 22, 2018 at 8:33
  • okay @roganjosh I tried as you suggested and groupby works perfectly! I didn't even know that option existed in pandas. Now I have to assign those numbers to the actual records.. Commented May 22, 2018 at 8:42

1 Answer


I think this gives what you're looking for and avoids looping. It could potentially be more efficient (I wasn't able to find a way to avoid creating the intermediate counts column), but it should be much faster than your current approach.

# running count of rows seen so far for each (year, month, day), starting at 1
df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
# zero-pad the counter to three digits, e.g. 1 -> '001'
df['counts'] = df['counts'].astype(str)
df['counts'] = df['counts'].str.zfill(3)
# drop the trailing '__-' and append the padded counter
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])

I added a fake record at the end to confirm it does increment properly:

 SPECIAL_ID sex age zone key day month year counts
0 13012016505001 F 1 1001001 1001001_F_1 13 1 2016 001
1 25122013505001 F 4 1001001 1001001_F_4 25 12 2013 001
2 24022012505001 F 5 1001001 1001001_F_5 24 2 2012 001
3 09032012505001 F 5 1001001 1001001_F_5 9 3 2012 001
4 21082011505001 F 6 1001001 1001001_F_6 21 8 2011 001
5 16082011505001 F 6 1001001 1001001_F_6 16 8 2011 001
6 21102011505002 F 6 1001001 1001001_F_6 16 8 2011 002
7 21102012505003 F 6 1001001 1001001_F_6 16 8 2011 003

If you want to get rid of counts, you just need:

df.drop('counts', inplace=True, axis=1)
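For reference, a sketch of how this could slot into the original read/write flow; filename and filename_fixed are the placeholders from the question, and the column names follow the sample CSV:

import pandas as pd

df = pd.read_csv(filename, delimiter=';')

# per-date running counter, 1 for the first row of each (year, month, day)
counts = df.groupby(['year', 'month', 'day']).cumcount() + 1

# swap the trailing '__-' for the zero-padded counter
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(counts.astype(str).str.zfill(3))

df.to_csv(filename_fixed, sep=';', encoding='utf-8', index=False)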
answered May 22, 2018 at 10:38

6 Comments

first of all, thanks for the help! and yes, the count should increment. record#5[SPECIAL_ID] should be 16082011505001 and record#6[SPECIAL_ID] should be 16082011505002
@Kristjan but it increments specifically based on the ordering in the ID column? Because I think groupby will destroy any of the initial ordering so we need something fixed to base this on, otherwise it'll be arbitrary which gets 001 and which gets 002
at this stage it doesn't matter who gets which number. I just need all records for a given day to get a number from 1 to len(records_on_given_day). I managed to speed it up; I think it will take around 30-40 minutes to complete... but still better than my old way. I'll post my solution as an answer when I find some time... If you think an even faster way is possible, my ears are open
@Kristjan I have fixed the issue. I expect my approach to be in the order of seconds, not minutes.
Great, I incorporated your answer into my code and it works perfectly and really fast! I'll be doing a lot of work like that above in the future, so if you could point me to a good learning resource it would be really great.