Parsing csv in python

Question 1

I'm trying to parse a CSV file in Python and print the sum of order_total for each day. Below is the sample CSV file:

order_total created_datetime
24.99 2015-06-01 00:00:12
0 2015-06-01 00:03:15
164.45 2015-06-01 00:04:05
24.99 2015-06-01 00:08:01
0 2015-06-01 00:08:23
46.73 2015-06-01 00:08:51
0 2015-06-01 00:08:58
47.73 2015-06-02 00:00:25
101.74 2015-06-02 00:04:11
119.99 2015-06-02 00:04:35
38.59 2015-06-02 00:05:26
73.47 2015-06-02 00:06:50
34.24 2015-06-02 00:07:36
27.24 2015-06-03 00:01:40
82.2 2015-06-03 00:12:21
23.48 2015-06-03 00:12:35

My objective here is to print the sum(order_total) for each day. For example, the result should be:

2015-06-01 -> 261.16
2015-06-02 -> 415.76
2015-06-03 -> 132.92

I have written the code below. It does not perform the summing logic yet; I'm only trying to check that it parses and loops as required by printing some sample statements.

import csv
import datetime

def sum_orders_test(self,start_date,end_date):
    initial_date = datetime.date(int(start_date.split('-')[0]),int(start_date.split('-')[1]),int(start_date.split('-')[2]))
    final_date = datetime.date(int(end_date.split('-')[0]),int(end_date.split('-')[1]),int(end_date.split('-')[2]))
    day = datetime.timedelta(days=1)
    with open("file1.csv", 'r') as data_file:
        next(data_file)
        reader = csv.reader(data_file, delimiter=',')
        if initial_date <= final_date:
            for row in reader:
                if str(initial_date) in row[1]:
                    print('initial_date : ' + str(initial_date))
                    print('Date : ' + row[1])
                else:
                    print('Else')
                    initial_date = initial_date + day

Based on my current logic, I'm running into this issue:

As you can see in the sample CSV, there are 7 rows for 2015-06-01, 6 rows for 2015-06-02 and 3 rows for 2015-06-03.
The output of the above code prints 7 values for 2015-06-01, 5 for 2015-06-02 and 2 for 2015-06-03.

Calling the function using sum_orders_test('2015-06-01','2015-06-03');

I know there is some silly logical issue, but being new to programming and python I'm unable to figure it out.
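
The reported counts (7, 5, 2) are consistent with the first row of each new day being consumed by the else branch without being counted. The following is only a minimal sketch of that behaviour: the in-memory list rows and the helper count_matches are hypothetical stand-ins for the real file and function, and the code assumes the rows are sorted by date.

```python
import datetime

# Hypothetical stand-in for the created_datetime column: 7, 6 and 3 rows per day.
rows = (["2015-06-01 00:00:12"] * 7
        + ["2015-06-02 00:00:25"] * 6
        + ["2015-06-03 00:01:40"] * 3)


def count_matches(rows, start):
    """Mimic the loop in the question: count rows that match the current date."""
    current = start
    day = datetime.timedelta(days=1)
    counts = {}
    for value in rows:
        if str(current) in value:
            counts[str(current)] = counts.get(str(current), 0) + 1
        else:
            # This row belongs to the next day, but it is only used to
            # advance the date and is never counted itself.
            current += day
    return counts


counts = count_matches(rows, datetime.date(2015, 6, 1))
print(counts)
```

In this sketch the first day keeps all 7 rows, while each later day loses exactly one row to the date increment.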

Question 2

delimiter=','... Please tell me where the commas in the file are.

Question 3

It's a CSV file, hence I used ',', but there are no commas in the file.

Question 4

Have you tried using pandas?

Question 5

That's exactly your problem... Python does not care about file extensions. Change the delimiter so you can actually read the data correctly.
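
One way to find out which delimiter a file really uses is to inspect its first line. The sketch below is an illustration only: guess_delimiter is a hypothetical helper, not part of the csv module, and the candidate list and sample lines are assumptions.

```python
def guess_delimiter(line, candidates=(",", "\t", ";", "|")):
    """Return the candidate that occurs most often in line, or None if none occur."""
    counts = {c: line.count(c) for c in candidates}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


# The sample lines are made up for illustration. With a real file, read its
# first line instead, e.g.:
#     with open("file1.csv") as f:
#         first_line = f.readline()
print(repr(guess_delimiter("order_total\tcreated_datetime")))
print(repr(guess_delimiter("order_total,created_datetime")))
print(repr(guess_delimiter("order_total created_datetime")))
```

Printing with repr makes a tab visible as '\t', which is easy to miss when the file is viewed in an editor or spreadsheet. The value found can then be passed as delimiter= to csv.reader or as sep= to pandas.read_csv.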

Question 6

I've re-read the question, and if your data is really tab-separated, here's code that does the job (using pandas):

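import pandas as pd
# header=0 makes the names below replace the file's own header row
# instead of treating that row as data
df = pd.read_csv('file.csv', names=['order_total', 'created_datetime'], sep='\t', header=0)
# Keep only the date part so that all orders of one day share a key
df['created_datetime'] = pd.to_datetime(df['created_datetime']).dt.date
df = df.groupby('created_datetime').sum()
print(df)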

Gives the following result:

                  order_total
created_datetime
2015-06-01             261.16
2015-06-02             415.76
2015-06-03             132.92

Less code, and probably lower algorithmic complexity.

Question 7

It looks much easier, but my file is a CSV file, although there isn't any tab or comma in it. It's a normal Excel file saved as CSV. When I replace the '\t' with ',' and run it, I get the error below:

df['created_datetime'] = pd.to_datetime(df.created_datetime).dt.date
File "/Library/Python/2.7/site-packages/pandas/core/tools/datetimes.py", line 509, in to_datetime
 values = _convert_listlike(arg._values, False, format)
File "/Library/Python/2.7/site-packages/pandas/core/tools/datetimes.py", line 447, in _convert_listlike
 raise e
ValueError: Unknown string format

@Abien

Question 8

Will you please give a link to a sample of your data?

Question 9

It certainly is :)

Question 10

This one should do the job.

The csv module has DictReader, which accepts fieldnames, so instead of accessing columns by index (row[0]) you can use predefined column names (row['date']).

import csv
from collections import defaultdict
from datetime import datetime

def sum_orders_test(self, start_date, end_date):
    FIELDNAMES = ['orders', 'date']
    # float, not int: order totals such as 24.99 are not whole numbers
    sum_of_orders = defaultdict(float)
    initial_date = datetime.strptime(start_date, '%Y-%m-%d').date()
    final_date = datetime.strptime(end_date, '%Y-%m-%d').date()
    with open("file1.csv", 'r') as data_file:
        next(data_file)  # Skip the headers
        reader = csv.DictReader(data_file, fieldnames=FIELDNAMES)
        for row in reader:
            # Use each row's own date rather than stepping a counter,
            # so the first row of a new day is not skipped
            row_date = datetime.strptime(row['date'].strip(), '%Y-%m-%d %H:%M:%S').date()
            if initial_date <= row_date <= final_date:
                sum_of_orders[str(row_date)] += float(row['orders'])
    return sum_of_orders

Question 11

How does defaultdict work? When I try to print sum_of_orders, it shows defaultdict(<type 'int'>, {}) @Pythonist

Question 12

Simply put, it lets you add new keys to a dictionary, with values of a given type, without first checking whether they are already in it. The docs explain it better than I can.
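
As a small illustration of that behaviour (the day strings below are made up, and counting rows is just an example use), a defaultdict(int) creates a missing key with the value 0 the first time it is accessed:

```python
from collections import defaultdict

counts = defaultdict(int)  # int() returns 0, the value given to missing keys
print(counts)              # no keys yet, so the dictionary part is empty

# Hypothetical sample of day strings; each += creates the key on first use.
for day in ["2015-06-01", "2015-06-01", "2015-06-02"]:
    counts[day] += 1

print(dict(counts))
print(counts["2015-06-09"])  # reading a missing key also inserts it with 0
```

An empty {} in the printed defaultdict, as in the comment above, would therefore suggest that no key was ever added, for example because no row matched the date check.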

Question 13

You might have a .csv file extension, but your file actually seems to be tab-separated.

You can load it as a pandas DataFrame by specifying the separator.

import pandas as pd
data = pd.read_csv('file.csv', sep='\t')

Then split the datetime column into date and time

# Add the new columns to the existing frame so that order_total is kept
data[['date', 'time']] = data['created_datetime'].str.split(' ', n=1, expand=True)

Then, for each unique date, compute its order_total sum

for i in data['date'].unique():
    print('{} {}'.format(i, data[data['date'] == i].order_total.sum()))

Accepted answer: the pandas answer above that begins "I've re-read the question", posted by afagarap on 2017-09-03 08:27:15Z (edited Sep 3, 2017 at 9:05).


Comments on the accepted answer (quoted in full above as Questions 7 to 9):

Firstname, 2017-09-03T09:29:09.9Z
afagarap, 2017-09-03T10:04:27.823Z
OneCricketeer, 2017-09-03T16:12:51.647Z
