save tuple of simple data types to disk (low memory footprint)

Sat Oct 29 13:47:42 EDT 2011

On 10/29/11 11:44, Gelonida N wrote:
> I would like to save many dicts with a fixed (and known) amount of keys
> in a memory efficient manner (no random, but only sequential access is
> required) to a file (which can later be sent over a slow expensive
> network to other machines)
>
> Example:
> Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue',
> 'message1', 'message2'
> 'timestamp' is an integer
> 'floatvalue' is a float
> 'intvalue' an int
> 'message1' is a string with a length of max 2000 characters, but can
> often be very short
> 'message2' the same as message1
>
> so a typical dict will look like
> { 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
> 'message1' : '', 'message2' : '=' * 1999 }
>
>> What do you call "many"? Fifty? A thousand? A thousand million? How many
>> items in each dict? Ten? A million?
>
> File size can be between 100kb and over 100Mb per file. Files will be
> accumulated over months.

If Steven's pickle-protocol2 solution doesn't quite do what you 
need, you can do something like the code below. Gzip is pretty 
good at addressing...
>> Or have you considered simply compressing the files?
> Compression makes sense but the initial file format should be
> already rather 'compact'

...by compressing out a lot of the duplicate aspects, which also
mitigates some of the verbosity of CSV.
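
As a rough, standalone check of that claim (separate from the full example further down), the sketch below writes rows as CSV in memory and compares the raw size with the gzipped size. The helper name csv_sizes is made up for illustration, and the sample rows are as repetitive as the test data in this post, so real data will most likely compress less well.

```python
import csv
import gzip
import io


def csv_sizes(rows):
    """Return (raw_bytes, gzipped_bytes) for rows written as CSV.

    Hypothetical helper, not part of the csv or gzip modules.
    """
    text = io.StringIO(newline='')
    csv.writer(text).writerows(rows)
    raw = text.getvalue().encode('utf-8')
    return len(raw), len(gzip.compress(raw))


# Same shape as the sample dict in this thread, as plain tuples.
rows = [(12, 3.14159, 42, 'hello world', '=' * 1999)] * 10000
raw_size, gz_size = csv_sizes(rows)
print('raw CSV bytes:', raw_size)
print('gzipped bytes:', gz_size)
```

Running the same measurement on a sample of real records would give a more honest picture of how much gzip actually buys.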

The code below serializes the data to a gzipped CSV file and then
deserializes it. Just point it at the appropriate data source and adjust
the column names and data types.

-tkc

import csv
import gzip

data = [  # use your real data here
    {
        'timestamp': 12,
        'floatvalue': 3.14159,
        'intvalue': 42,
        'message1': 'hello world',
        'message2': '=' * 1999,
    },
] * 10000

# Text mode ('wt'/'rt') with newline='' is what the csv module expects
# on Python 3; the file is closed automatically when the block exits.
with gzip.open('data.gz', 'wt', newline='') as f:
    w = csv.writer(f)
    for row in data:
        w.writerow([
            row[name] for name in (
                # use your real col-names here
                'timestamp',
                'floatvalue',
                'intvalue',
                'message1',
                'message2',
            )])

output = []
with gzip.open('data.gz', 'rt', newline='') as f:
    for row in csv.reader(f):
        # CSV fields come back as strings, so convert each one by column
        d = dict(
            (name, convert(row[i]))
            for i, (convert, name) in enumerate((
                # adjust for your column-names/data-types
                (int, 'timestamp'),
                (float, 'floatvalue'),
                (int, 'intvalue'),
                (str, 'message1'),
                (str, 'message2'),
            )))
        output.append(d)

# or, as a single list comprehension
with gzip.open('data.gz', 'rt', newline='') as f:
    output = [
        dict(
            (name, convert(row[i]))
            for i, (convert, name) in enumerate((
                # adjust for your column-names/data-types
                (int, 'timestamp'),
                (float, 'floatvalue'),
                (int, 'intvalue'),
                (str, 'message1'),
                (str, 'message2'),
            )))
        for row in csv.reader(f)
    ]
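
Both variants above build the whole list in memory. Since only sequential access is required, a generator that yields one dict at a time may be a better fit for the larger files. The following is a minimal sketch under that assumption; COLUMNS, write_records and iter_records are names invented for this illustration, and the temporary file only exists to keep the demo self-contained.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical column spec: (name, converter) pairs, adjust as needed.
COLUMNS = (
    ('timestamp', int),
    ('floatvalue', float),
    ('intvalue', int),
    ('message1', str),
    ('message2', str),
)


def write_records(path, records, columns=COLUMNS):
    """Write an iterable of dicts to a gzipped CSV file, one row each."""
    with gzip.open(path, 'wt', newline='') as f:
        w = csv.writer(f)
        for rec in records:
            w.writerow([rec[name] for name, _ in columns])


def iter_records(path, columns=COLUMNS):
    """Yield one dict per row; only the current row is held in memory."""
    with gzip.open(path, 'rt', newline='') as f:
        for row in csv.reader(f):
            yield {name: conv(value)
                   for (name, conv), value in zip(columns, row)}


# Small round trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'data.gz')
sample = {
    'timestamp': 12,
    'floatvalue': 3.14159,
    'intvalue': 42,
    'message1': '',
    'message2': '=' * 1999,
}
write_records(path, [sample] * 3)
total = sum(rec['intvalue'] for rec in iter_records(path))
print(total)
```

Because write_records also accepts any iterable, the writing side can be fed from a generator too, so neither end needs the full data set in memory at once.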