I am given a CSV file of stations, with data like this:
station_id,date,temperature_c
68,2000.375,10.500
68,2000.542,5.400
68,2000.958,23.000
68,2001.125,20.400
68,2001.292,13.300
68,2001.375,10.400
68,2001.958,21.800
68,2002.208,15.500
and so on for many different station_ids.
I want to create a Python program that (1) gives the minimum reading (the third column) and (2) gives the station with the maximum "travel distance" along with its readings. Thus if a station has 3 readings of -5, 0, 8, that would mean a travel distance of 13. The travel query can take an optional date range. Here is what I did.
#!/usr/bin/python
from collections import defaultdict
import csv
import random
import sys

# In order to track each station's statistics, we'll create a Station class
# to hold the data on a per-station basis.
class Station:
  def __init__(self):
    self.readings = []
    self.minimum = 99999999.0
    self.travel = 0

  # travel holds the change in temperature reading-by-reading
  def get_travel(self):
    return self.travel

  def set_travel(self, n):
    self.travel += abs(n)

  # getter & setter for station minimums
  def get_minimum(self):
    return self.minimum

  def set_minimum(self, n):
    self.minimum = n

  # infrastructure for future code expansion
  def get_readings(self):
    return self.readings

  def set_readings(self, date, temp):
    self.readings.append({ "date" : date, "temp" : temp })

"""
Reporter class handles a list of Stations
"""
class Reporter:
  def __init__(self):
    # stations dict with entries for holding the specified stats.
    self.stations = defaultdict(Station)
    self.global_minimum = { "station" : "default", "date" : 1, "temp" : 9999999 }
    self.longest_travel = { "station" : "default", "range" : 0 }

  """
  Determines which station recorded the coldest temperature
  args: CSV file
  returns: dict with data
  """
  def minimum_temperature(self, filename):
    with open(filename, 'r') as datafile:
      try:
        csv_reader = csv.reader(datafile)
        next(csv_reader)
        # reading line-by-line since CSV could be a big file
        for row in csv_reader:
          station, date, temp = row
          # save the station's readings
          self.stations[station].set_readings(date, temp)
          temp = float(temp)
          if (temp < self.stations[station].get_minimum()):
            self.stations[station].set_minimum(temp)
          if (temp < self.global_minimum["temp"]):
            self.global_minimum = { "station" : station, "temp" : temp, "date" : date }
          # The specs state that in the event that a tie occurs simply return
          # one pair at random.
          if (temp == self.global_minimum["temp"]):
            if (random.randint(1, 100) % 2 == 0):
              self.global_minimum = { "station" : station, "date" : date, "temp" : temp }
      except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(filename, reader.line_num, e))
    return self.global_minimum

  """
  Determines which station "traveled" the most
  args: CSV file, begin date (optional), end date (optional)
  returns: dict with data
  """
  def max_travel(self, filename, begin=1.0, end=9999.9):
    with open(filename, 'r') as datafile:
      try:
        csv_reader = csv.reader(datafile)
        next(csv_reader)
        # reading line-by-line since CSV could be a big file
        for row in csv_reader:
          station, date, temp = row
          # save for future expansion
          self.stations[station].set_readings(date, temp)
          date = float(date)
          if date > begin and date < end:
            temp = float(temp)
            self.stations[station].set_travel(temp)
            travel = self.stations[station].get_travel()
            if (travel > self.longest_travel["range"]):
              self.longest_travel = { "station" : station, "range" : travel }
      except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(filename, reader.line_num, e))
    return self.longest_travel

if __name__ == "__main__":
  csv_file = sys.argv[1]
  # fetch lowest temperature
  reporter = Reporter()
  global_minimum = reporter.minimum_temperature(csv_file)
  print("station {} had the global minimum on {}".format(global_minimum["station"], global_minimum["date"]))
  # fetch maximum travel overall
  longest_travel = reporter.max_travel(csv_file)
  print("station {} had the greatest travel at {}".format(longest_travel["station"], longest_travel["range"]))
  # now try a date range
  reporter2 = Reporter()
  begin = 2001.0
  end = 2006.0
  longest_travel = reporter2.max_travel(csv_file, begin, end)
  print("for {} to {}, station {} had the greatest travel at {}".format(begin, end, longest_travel["station"], longest_travel["range"]))
I'm particularly interested in speeding it up and reducing memory usage, but also in how to deal with getters/setters Pythonically.
2 Answers
Getters and Setters
If you are to use this pattern, the @property decorator will accomplish this in a more pythonic way:
class Station:
    def __init__(self):
        self.readings = []
        self._minimum = 99999999.0
        self._travel = 0

    @property
    def minimum(self):
        return self._minimum

    @minimum.setter
    def minimum(self, value):
        self._minimum = value

    @property
    def travel(self):
        return self._travel

    @travel.setter
    def travel(self, value):
        self._travel = abs(value)

    def add_reading(self, date, temp):
        self.readings.append({ "date" : date, "temp" : temp })

# access the elements this way
>>> station = Station()
>>> print(station.travel)
0
>>> station.travel += 1
>>> print(station.travel)
1
However, the setter pattern doesn't really make sense for your readings, because you aren't really setting, you're appending. So keep a function that appends a reading to the readings list; it's more explicit that way. If readings were a property whose setter appends, I wouldn't expect
    station.readings = ('date', 'reading')
to do anything other than set the attribute, but it would actually have the surprising side-effect of appending.
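To make that concrete, here is a small sketch contrasting the two spellings (SurprisingStation and ExplicitStation are illustrative names, not from the code above):

class SurprisingStation:
    def __init__(self):
        self._readings = []

    @property
    def readings(self):
        return self._readings

    @readings.setter
    def readings(self, reading):
        # surprising: assignment syntax silently performs an append
        self._readings.append(reading)

class ExplicitStation:
    def __init__(self):
        self.readings = []

    def add_reading(self, date, temp):
        # the method name says exactly what happens
        self.readings.append({ "date" : date, "temp" : temp })

surprising = SurprisingStation()
surprising.readings = ("2000.375", "10.500")   # reads like replacement, actually appends
explicit = ExplicitStation()
explicit.add_reading("2000.375", "10.500")     # unambiguous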
Spacing
Your indentation should be four spaces, not two.
- Probably a typo in your except clauses; I suspect it's supposed to say csv_reader.line_num.
- What is going on with the "travel"? What you've actually written is "sum of absolute values", but I assume what you actually want is either "max minus min" or "sum of day-to-day deltas" (see the first sketch after this list).
- My opinion is that getters/setters/properties are inherently un-pythonic. More importantly, they're opaque. They look like attributes but they're functions that can have side-effects. Don't hide the possibility of side-effects from the audience; just use an actual attribute.
- I like the DictReader provided by the csv module. It's not everything we might ask for, but it'll save us accidentally handling malformed data.
- You're doing a lot of work to handle the file in a stream, but you're also reading the file multiple times and you're keeping all the data in memory once it's read.
  - This could get buggy. For example, in your main, you re-use reporter for the second pass, which means it will contain every datapoint in duplicate.
  - You should pick one: either get everything you want in a single pass, or read all the data into an appropriate structure and run queries on that. Since you specify speed and memory usage as concerns, I'll assume you want the first choice (but I suspect that's the wrong decision).
- You probably don't need a Reporter class at all. It's not helping you manage any state (that you want to keep) across functions, so just declare your functions in the top namespace (see the second sketch after this list).
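First, the two candidate definitions of "travel", as a minimal sketch (the function names are mine, not yours); both assume the readings are already ordered by date:

def travel_as_range(temps):
    # "max minus min": the span between the extreme readings
    return max(temps) - min(temps)

def travel_as_deltas(temps):
    # "sum of day-to-day deltas": total distance covered reading-by-reading
    return sum(abs(b - a) for a, b in zip(temps, temps[1:]))

readings = [-5, 0, 8]
print(travel_as_range(readings))   # 13
print(travel_as_deltas(readings))  # 13 here, but the two differ for e.g. [0, 8, -5]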
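Second, a rough sketch of the single-pass, class-free shape these points suggest. This is illustrative only: the report function and its return format are my own invention, it interprets travel as the sum of deltas between consecutive readings per station, it assumes a non-empty file, and it skips the random tie-breaking for brevity.

import csv
import sys
from collections import defaultdict

def report(filename, begin=float("-inf"), end=float("inf")):
    """Single pass over the CSV: track the global minimum and, per station,
    only the running travel and the previous temperature (no full history)."""
    global_minimum = None                 # (temp, station, date)
    travel = defaultdict(float)           # station -> accumulated travel
    previous = {}                         # station -> last temperature seen in range
    with open(filename, newline="") as datafile:
        for row in csv.DictReader(datafile):
            station = row["station_id"]
            date = float(row["date"])
            temp = float(row["temperature_c"])
            if global_minimum is None or temp < global_minimum[0]:
                global_minimum = (temp, station, row["date"])
            if begin <= date <= end:
                if station in previous:
                    travel[station] += abs(temp - previous[station])
                previous[station] = temp
    longest = max(travel, key=travel.get) if travel else None
    return global_minimum, longest, travel.get(longest, 0.0)

if __name__ == "__main__":
    (min_temp, min_station, min_date), station, distance = report(sys.argv[1])
    print("station {} had the global minimum on {}".format(min_station, min_date))
    print("station {} had the greatest travel at {}".format(station, distance))

A date-ranged query is then just report(csv_file, 2001.0, 2006.0) on a fresh call, with no shared state to accidentally double-count.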