Parsing a CSV file with Python

Question 1

My goal is to parse this CSV data:

Time,Tank,Product,Volume,TC Vol,Ullage,90 Ul,Height,Water,Temp
2017年10月19日T18:52:41.118408,1,UNLEADED,4406,4393,7221,6058.3,37.49,0,64.15
2017年10月19日T18:52:41.118408,3,SUPER,8317,8278,3310,2147.3,61.4,0,66.74
2017年10月19日T18:52:41.118408,4,ADSL2,6807,6774,4820,3657.3,51.98,0,70.46
2017年10月19日T18:53:13.894066,1,UNLEADED,4406,4393,7221,6058.3,37.49,0,64.15
2017年10月19日T18:53:13.894066,3,SUPER,8313,8273,3314,2151.3,61.37,0,66.74
2017年10月19日T18:53:13.894066,4,ADSL2,6805,6772,4822,3659.3,51.97,0,70.46

Given the a list of Tank numbers, a primary key, and a data column:

>>>tank_numbers = [1, 3]
>>>primary_key = 'Time'
>>>data_column = 'Volume'
>>>
>>>parse_csv('csv_file.csv', tank_numbers, primary_key, data_column)
[
 {'Time': '2017-10-19T18:52:41.118408', 'UNLEADED': '4406', 'SUPER': '8317'}
 {'Time': '2017-10-19T18:53:13.894066', 'UNLEADED': '4406', 'SUPER': '8317'}
]

I have a few questions about the following code;

Using only the standard library, is there a more simple way? What I have seems like too much just to get the needed information.
Should I be breaking up the parse_csv function into smaller pieces similar to _parse_csv_to_dicts and _get_tank_names.

import csv
def _parse_csv_to_dicts(file):
 with open(file, 'r') as f:
 return list(csv.DictReader(f))
def _get_tank_names(tanks=None, data=None):
 names = list()
 for n in tanks:
 for tank_dict in data:
 if tank_dict['Tank'] == str(n):
 names.append(tank_dict['Product'])
 break
 return names
def parse_csv(file, tanks, primary_key, data_key):
 """
 :param file: The raw csv data file
 :param tanks: A list of tank numbers, as labeled in the raw csv
 :param key: a list of the keys needed from the raw csv
 :return: a list of dictionaries
 """
 d1 = _parse_csv_to_dicts(file)
 # Remove unneeded tanks from data
 d2 = [row for row in d1 if int(row['Tank']) in tanks]
 # Remove unneeded keys from rows
 keys = [primary_key, data_key, 'Product']
 d3 = [{key:row[key] for key in keys} for row in d2]
 # Create new row template
 tank_names = _get_tank_names(tanks=tanks, data=d1)
 row_template = {key:None for key in (tank_names + [primary_key])}
 # New rows from row template
 d4 = []
 for row in d3:
 # update new row with available keys
 new_row = {key:row.get(key) for key in row_template}
 # update new row with matching values
 for key in new_row:
 if key in row.values():
 new_row[key] = row[data_key]
 # remove keys with None value
 new_row = {k:v for k,v in new_row.items() if v is not None}
 d4.append(new_row)
 # Merge all rows based on Time key
 merged = {}
 for row in d4:
 if row[primary_key] in merged:
 merged[row[primary_key]].update(row)
 else:
 merged[row[primary_key]] = row
 return [value for value in merged.values()]

Question 2

In general I think this code looks excellent. You have made good use of comprehensions. I have two suggestions:

More Comprehensions:

This:

def _get_tank_names(tanks=None, data=None):
 names = list()
 for n in tanks:
 for tank_dict in data:
 if tank_dict['Tank'] == str(n):
 names.append(tank_dict['Product'])
 break
return names

Can be expressed as:

def _get_tank_names(tanks=None, data=None):
 tanks = {str(n) for n in tanks}
 return {t['Product'] for t in data if t['Tank'] in tanks}

This converts tanks to a set while performing the str() conversion, to clean things up a bit. Then the result can be generated in a single statement.

Gathering like elements

So this entire section:

# Create new row template
tank_names = _get_tank_names(tanks=tanks, data=d1)
row_template = {key: None for key in (tank_names + [primary_key])}
# New rows from row template
d4 = []
for row in d3:
 # update new row with available keys
 new_row = {key: row.get(key) for key in row_template}
 # update new row with matching values
 for key in new_row:
 if key in row.values():
 new_row[key] = row[data_key]
 # remove keys with None value
 new_row = {k: v for k, v in new_row.items() if v is not None}
 d4.append(new_row)
# Merge all rows based on Time key
merged = {}
for row in d4:
 if row[primary_key] in merged:
 merged[row[primary_key]].update(row)
 else:
 merged[row[primary_key]] = row
return [value for value in merged.values()]

Can be reduced down to:

merged = {}
for row in d3:
 merged.setdefault(row['Time'], {'Time': row['Time']}) \
 [row['Product']] = row['Volume']
return [v for k, v in sorted(merged.items())]

The cornerstone here is using dict.setdefault to accumulate the records that match the timestamp.

So how does the setdefault work?

We create an empty dict, the keys to which will be the timestamps we will be accumulating data for.

merged = {}

Loop through the data, a row at a time:

for row in d3:

Look up the element in our dict corresponding to the timestamp for this row:

merged.setdefault(row['Time'], {})

This is most of the magic, it will populate merged[this rows timestamp] with an empty dict if one is not already there, and then return the dict, so that we can put something in it. An empty dict would often be sufficient, but in this case you also need the timestamp in the final dict, so we can extend the statement to be:

merged.setdefault(row['Time'], {'Time': row['Time']})

This will return a dict with the timestamp already populated, the first time we see that timestamp. In subsequent loops when we see the timestamp, it will return the dict that was built the previous time we saw the timestamp.

To finish the dict creation code we set the Product and Volume data into the dict with:

merged.setdefault(row['Time'], {'Time': row['Time']}) \
 [row['Product']] = row['Volume']

Lastly, return the list of dicts we built, sorted by the timestamp with:

return [v for k, v in sorted(merged.items())]

Ain't Python fun? :-)

Question 3

Exactly what I was looking for. Thank you for your edit explaining what you were doing with the setdefault method. Definitely a useful tool. Thank you for your time! Yes, it sure is fun!

Stephen Rauch Stephen Rauch 4,31412 gold badges24 silver badges36 bronze badges · Accepted Answer · 2017-10-22 16:26:23Z

In general I think this code looks excellent. You have made good use of comprehensions. I have two suggestions:

More Comprehensions:

This:

def _get_tank_names(tanks=None, data=None):
 names = list()
 for n in tanks:
 for tank_dict in data:
 if tank_dict['Tank'] == str(n):
 names.append(tank_dict['Product'])
 break
return names

Can be expressed as:

def _get_tank_names(tanks=None, data=None):
 tanks = {str(n) for n in tanks}
 return {t['Product'] for t in data if t['Tank'] in tanks}

This converts tanks to a set while performing the str() conversion, to clean things up a bit. Then the result can be generated in a single statement.

Gathering like elements

So this entire section:

# Create new row template
tank_names = _get_tank_names(tanks=tanks, data=d1)
row_template = {key: None for key in (tank_names + [primary_key])}
# New rows from row template
d4 = []
for row in d3:
 # update new row with available keys
 new_row = {key: row.get(key) for key in row_template}
 # update new row with matching values
 for key in new_row:
 if key in row.values():
 new_row[key] = row[data_key]
 # remove keys with None value
 new_row = {k: v for k, v in new_row.items() if v is not None}
 d4.append(new_row)
# Merge all rows based on Time key
merged = {}
for row in d4:
 if row[primary_key] in merged:
 merged[row[primary_key]].update(row)
 else:
 merged[row[primary_key]] = row
return [value for value in merged.values()]

Can be reduced down to:

merged = {}
for row in d3:
 merged.setdefault(row['Time'], {'Time': row['Time']}) \
 [row['Product']] = row['Volume']
return [v for k, v in sorted(merged.items())]

The cornerstone here is using dict.setdefault to accumulate the records that match the timestamp.

So how does the setdefault work?

We create an empty dict, the keys to which will be the timestamps we will be accumulating data for.

merged = {}

Loop through the data, a row at a time:

for row in d3:

Look up the element in our dict corresponding to the timestamp for this row:

merged.setdefault(row['Time'], {})

This is most of the magic, it will populate merged[this rows timestamp] with an empty dict if one is not already there, and then return the dict, so that we can put something in it. An empty dict would often be sufficient, but in this case you also need the timestamp in the final dict, so we can extend the statement to be:

merged.setdefault(row['Time'], {'Time': row['Time']})

This will return a dict with the timestamp already populated, the first time we see that timestamp. In subsequent loops when we see the timestamp, it will return the dict that was built the previous time we saw the timestamp.

To finish the dict creation code we set the Product and Volume data into the dict with:

merged.setdefault(row['Time'], {'Time': row['Time']}) \
 [row['Product']] = row['Volume']

Lastly, return the list of dicts we built, sorted by the timestamp with:

return [v for k, v in sorted(merged.items())]

Ain't Python fun? :-)

Exactly what I was looking for. Thank you for your edit explaining what you were doing with the setdefault method. Definitely a useful tool. Thank you for your time! Yes, it sure is fun!

Stack Exchange Network

Parsing a CSV file with Python

1 Answer 1

More Comprehensions:

Gathering like elements

So how does the setdefault work?

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Parsing a CSV file with Python

1 Answer 1

More Comprehensions:

Gathering like elements

So how does the setdefault work?

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions