Structuring .csv in Python

Question 1

I'm wondering how I could build a .csv file with a proper structure. As an example, my data has the form:

(index, latitude, longitude, value)

- 0 - lat=-51.490000 lon=264.313000 value=7.270077
- 1 - lat=-51.490000 lon=264.504000 value=7.231014
- 2 - lat=-51.490000 lon=264.695000 value=21.199764
- 3 - lat=-51.490000 lon=264.886000 value=49.176327
- 4 - lat=-51.490000 lon=265.077000 value=91.160702
- 5 - lat=-51.490000 lon=265.268000 value=147.152889
- 6 - lat=-51.490000 lon=265.459000 value=217.152889
- 7 - lat=-51.490000 lon=265.650000 value=301.160702
- 8 - lat=-51.490000 lon=265.841000 value=399.176327
- 9 - lat=-51.490000 lon=266.032000 value=511.199764
- 10 - lat=-51.490000 lon=266.223000 value=637.231014
- 11 - lat=-51.490000 lon=266.414000 value=777.270077
- 12 - lat=-51.490000 lon=266.605000 value=931.316952
- 13 - lat=-51.490000 lon=266.796000 value=1099.371639
- 14 - lat=-51.490000 lon=266.987000 value=1281.434139
- 15 - lat=-51.490000 lon=267.178000 value=1477.504452
- 16 - lat=-51.490000 lon=267.369000 value=1687.582577
- 17 - lat=-51.490000 lon=267.560000 value=1911.668514
- 18 - lat=-51.490000 lon=267.751000 value=2149.762264
- 19 - lat=-51.490000 lon=267.942000 value=2401.863827
- 20 - lat=-51.490000 lon=268.133000 value=2667.973202
- 21 - lat=-51.490000 lon=268.324000 value=2948.090389

I would like to be able to save this data in a .csv file with the format:

          | longitude |
 latitude | value     |

That is, all the values with the same latitude would be on the same line and all the values with the same longitude would be in the same column. I know how to write a .csv file in Python; what I'm wondering is how to perform this transformation properly.

Thank you in advance.


Question 2

You will first have to loop over the data to collect all longitudes. Those will be your columns. Then I would probably create a dictionary for each latitude which contains longitude/value pairs. Then you can write a line for each latitude. You should take a look at the csv.DictWriter class.
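A minimal sketch of that suggestion, assuming the records have already been parsed into (lat, lon, value) tuples. The `records` sample and the `pivot_to_csv` helper name are illustrative and not part of the original post.

```python
import csv
import io


def pivot_to_csv(records, out):
    """Write (lat, lon, value) records as one CSV row per latitude."""
    # First pass: collect every longitude; these become the columns.
    lons = sorted({lon for _, lon, _ in records})
    # Second pass: one dict per latitude holding longitude -> value pairs.
    rows = {}
    for lat, lon, value in records:
        rows.setdefault(lat, {"latitude": lat})[lon] = value
    # restval fills the cells of lat/lon pairs that have no value.
    writer = csv.DictWriter(out, fieldnames=["latitude"] + lons, restval="")
    writer.writeheader()
    for lat in sorted(rows):
        writer.writerow(rows[lat])


records = [(-51.49, 264.313, 7.270077),
           (-51.49, 264.504, 7.231014),
           (-52.49, 264.504, 2948.090389)]
buf = io.StringIO()
pivot_to_csv(records, buf)
print(buf.getvalue())
```

To write a real file instead of an in-memory buffer, pass something like `open('out.csv', 'w', newline='')` as `out`.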

Question 3

I'd break up the lines with a regex and then use nested dicts to record the values: mydict[latitude][longitude] = value. I'd also make a set of longitudes. The size of this set is the number of columns; turn it into a sorted list to get an indexer into the nested dicts. Sort the latitude keys and off you go.

Question 4

What happens if there is more than one value per lat/lon pair? What if there are two latitudes or longitudes which are almost the same but not exactly?
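Both concerns can be dealt with before pivoting. The sketch below shows one possible policy, not the only one: it assumes that rounding coordinates to three decimal places is an acceptable way to merge near-identical positions, and that repeated values for the same pair should be averaged. The `normalize` name and the `ndigits` parameter are hypothetical.

```python
from collections import defaultdict


def normalize(records, ndigits=3):
    """Group values by rounded (lat, lon) and average duplicates."""
    grouped = defaultdict(list)
    for lat, lon, value in records:
        # Rounding merges coordinates that differ only by tiny amounts.
        key = (round(lat, ndigits), round(lon, ndigits))
        grouped[key].append(value)
    # Averaging is an assumption; min, max or "last one wins" may suit other data.
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


records = [(-51.49, 264.313, 10.0),
           (-51.49, 264.3130001, 20.0),   # almost the same longitude
           (-51.49, 264.504, 5.0)]
cells = normalize(records)
print(cells)
```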

Question 5

I wrote a little program for you :) see below.

I'm assuming for now that your data is stored as a list of dicts, but if it is a list of lists the code shouldn't be too hard to fix.

#!/usr/bin/env python
import csv

data = [
    dict(lat=1, lon=1, val=10),
    dict(lat=1, lon=2, val=20),
    dict(lat=2, lon=1, val=30),
    dict(lat=2, lon=2, val=40),
    dict(lat=3, lon=1, val=50),
    dict(lat=3, lon=2, val=60),
]

# get a sorted list of all unique longitudes
headers = sorted({d['lon'] for d in data})

# make a dict of latitudes
data_as_dict = {}
for item in data:
    # default value: a list of empty strings
    lst = data_as_dict.setdefault(item['lat'], [''] * len(headers))
    # get the longitude for this item
    lon = item['lon']
    # where in the line should it be?
    idx = headers.index(lon)
    # save value in the list
    lst[idx] = item['val']

# in the actual file, we start with an extra header for the latitude
headers.insert(0, 'latitude')

# newline='' stops the csv module from writing blank lines on Windows (Python 3)
with open('latitude.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ',
                        quotechar='|', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(headers)
    # dict.keys() cannot be sorted in place on Python 3, so use sorted()
    for latitude in sorted(data_as_dict):
        # a line starts with the latitude, followed by the list of values
        row = [latitude] + data_as_dict[latitude]
        writer.writerow(row)

output:

latitude 1 2
1 10 20
2 30 40
3 50 60

Granted, it's not the prettiest code, but I hope you get the idea.

Question 6

Hi @rje. Thank you for your answer. One little thing that I forgot to ask: is it possible to order by lat and lon? My data is ordered, but with this code the result isn't. Thank you.

Question 7

Nope, ordering is not necessary, it'll work with unordered data too!

Question 8

Yes, but I guess I wasn't clear. It worked; however, my data isn't ordered in the output file as it was in the input. Trying to sort this out here.

Question 9

Ah, I see. Changed the code a bit to sort the headers and keys :)

Question 10

I'm assuming you have this data in a text file. Let's use regular expressions to parse the data (though string splitting looks like it could work if your format stays the same).

import re

data = list()
with open('path/to/data/file', 'r') as infile:
    for line in infile:
        # note: this pattern captures the lat= and value= fields only
        matches = re.match(r".*(?<=lat=)(?P<lat>(?:\+|-)?[\d.]+).*(?<=value=)(?P<longvalue>(?:\+|-)?[\d.]+)", line)
        data.append((matches.group('lat'), matches.group('longvalue')))

To unroll that nasty regex:

pat = re.compile(r"""
    .*                  # Match anything any number of times
    (?<=lat=)           # assert that the last 4 characters are "lat="
    (?P<lat>            # begin named capturing group "lat"
        (?:\+|-)?       # allow one or none of either + or -
        [\d.]+          # and one or more digits or decimal points
    )                   # end named capturing group "lat"
    .*                  # Another wildcard
    (?<=value=)         # assert that the last 6 characters are "value="
    (?P<longvalue>      # begin named capturing group "longvalue"
        (?:\+|-)?       # allow one or none of either + or -
        [\d.]+          # and one or more digits or decimal points
    )                   # end named capturing group "longvalue"
""", re.X)

# and a terser way of writing the code, since we've compiled the pattern above:
with open('path/to/data/file', 'r') as infile:
    data = [(matches.group('lat'), matches.group('longvalue')) for line in infile for
            matches in (pat.match(line),)]
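Note that the pattern above extracts only the latitude and the value. Building the grid also needs the longitude, so a variant with three named groups may be more useful. This is a sketch that assumes every line follows the `lat=... lon=... value=...` layout from the question; the `parse_line` helper and the `NUMBER` and `LINE_RE` names are illustrative.

```python
import re

# One named group per field; a sign and a decimal part are optional.
NUMBER = r"[+-]?\d+(?:\.\d+)?"
LINE_RE = re.compile(
    rf"lat=(?P<lat>{NUMBER})\s+lon=(?P<lon>{NUMBER})\s+value=(?P<value>{NUMBER})"
)


def parse_line(line):
    """Return (lat, lon, value) as floats, or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return tuple(float(match.group(name)) for name in ("lat", "lon", "value"))


print(parse_line("- 0 - lat=-51.490000 lon=264.313000 value=7.270077"))
```

Returning `None` for lines that do not match lets the caller skip headers or blank lines instead of failing on `None.group(...)`.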

Question 11

Given your input data, I came up with the following:

from __future__ import print_function


def decode(line):
    # drop the "- " separators so only whitespace-separated fields remain
    line = line.replace('- ', ' ')
    fields = line.split()
    index = fields[0]
    # turn the "key=value" fields into a dict
    data = dict([_.split('=') for _ in fields[1:]])
    return index, data


def transform(filename):
    transformed = {}
    columns = set()
    with open(filename) as infile:
        for line in infile:
            index, data = decode(line.strip())
            element = transformed.setdefault(data['lat'], {})
            element[data['lon']] = data['value']
            columns.add(data['lon'])
    return columns, transformed


def main(filename):
    columns, transformed = transform(filename)
    columns = sorted(columns)
    print(',', ','.join(columns))
    for lat, data in transformed.items():
        print(lat, ',', ', '.join([data.get(_, 'NULL') for _ in columns]))


if __name__ == '__main__':
    main('so.txt')

To cover the case where the data contains more than one latitude, I added one extra line to the example, so my input data (so.txt) contained this:

- 0 - lat=-51.490000 lon=264.313000 value=7.270077
- 1 - lat=-51.490000 lon=264.504000 value=7.231014
- 2 - lat=-51.490000 lon=264.695000 value=21.199764
- 3 - lat=-51.490000 lon=264.886000 value=49.176327
- 4 - lat=-51.490000 lon=265.077000 value=91.160702
- 5 - lat=-51.490000 lon=265.268000 value=147.152889
- 6 - lat=-51.490000 lon=265.459000 value=217.152889
- 7 - lat=-51.490000 lon=265.650000 value=301.160702
- 8 - lat=-51.490000 lon=265.841000 value=399.176327
- 9 - lat=-51.490000 lon=266.032000 value=511.199764
- 10 - lat=-51.490000 lon=266.223000 value=637.231014
- 11 - lat=-51.490000 lon=266.414000 value=777.270077
- 12 - lat=-51.490000 lon=266.605000 value=931.316952
- 13 - lat=-51.490000 lon=266.796000 value=1099.371639
- 14 - lat=-51.490000 lon=266.987000 value=1281.434139
- 15 - lat=-51.490000 lon=267.178000 value=1477.504452
- 16 - lat=-51.490000 lon=267.369000 value=1687.582577
- 17 - lat=-51.490000 lon=267.560000 value=1911.668514
- 18 - lat=-51.490000 lon=267.751000 value=2149.762264
- 19 - lat=-51.490000 lon=267.942000 value=2401.863827
- 20 - lat=-51.490000 lon=268.133000 value=2667.973202
- 21 - lat=-51.490000 lon=268.324000 value=2948.090389
- 22 - lat=-52.490000 lon=268.324000 value=2948.090389

(note the last line)

With that input file, the above program creates the following output:

, 264.313000,264.504000,264.695000,264.886000,265.077000,265.268000,265.459000,265.650000,265.841000,266.032000,266.223000,266.414000,266.605000,266.796000,266.987000,267.178000,267.369000,267.560000,267.751000,267.942000,268.133000,268.324000
-51.490000 , 7.270077, 7.231014, 21.199764, 49.176327, 91.160702, 147.152889, 217.152889, 301.160702, 399.176327, 511.199764, 637.231014, 777.270077, 931.316952, 1099.371639, 1281.434139, 1477.504452, 1687.582577, 1911.668514, 2149.762264, 2401.863827, 2667.973202, 2948.090389
-52.490000 , NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, 2948.090389
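One caveat: `sorted(columns)` in the program above compares the longitudes as strings. That happens to give the right order here because all of them have the same number of digits. A hedged sketch of a numeric sort, assuming every key can be parsed as a float (the sample values are made up for illustration):

```python
columns = {"99.500000", "264.313000", "100.250000"}

# Plain sorting compares character by character, so "99..." lands last.
print(sorted(columns))

# Sorting with key=float orders the strings by numeric value while
# keeping the original text for the output.
by_value = sorted(columns, key=float)
print(by_value)
```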

Question 12

You can pull lat/lon/value from each line using a regex. You'll want to look up lat and lon later, so use a nested dict of the form d[lat][lon] = value to track it all. Add a set to keep track of the unique longitudes you see, and it's pretty straightforward to generate the csv.

I sorted it in the example, but you may not care about that.

import re
import collections

data = """- 0 - lat=-51.490000 lon=264.313000 value=7.270077
- 1 - lat=-51.490000 lon=264.504000 value=7.231014
- 2 - lat=-51.490000 lon=264.695000 value=21.199764
- 3 - lat=-51.490000 lon=264.886000 value=49.176327
- 4 - lat=-51.490000 lon=265.077000 value=91.160702"""

regex = re.compile(r'- \d+ - lat=([\+\-]?[\d\.]+) lon=([\+\-]?[\d\.]+) value=([\+\-]?[\d\.]+)')

# lat/lon index will hold lats[latitude][longitude] = value
lats = collections.defaultdict(dict)
# longitude columns
lonset = set()

for line in data.split('\n'):
    match = regex.match(line)
    if match:
        lat, lon, val = match.groups()
        lats[lat][lon] = val
        lonset.add(lon)

latkeys = sorted(lats.keys())
lonkeys = sorted(list(lonset))

header = ['latitude'] + lonkeys
print(header)  # print() call so this also runs on Python 3
for lat in latkeys:
    lons = lats[lat]
    row = [lat] + [lons.get(lon, '') for lon in lonkeys]
    print(row)
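The example prints Python lists; to produce an actual CSV, the same rows can be handed to `csv.writer`. A minimal sketch, assuming a nested dict shaped like `lats` above. The sample values and the `write_grid` name are illustrative.

```python
import csv
import io


def write_grid(lats, out):
    """Write a {lat: {lon: value}} mapping as a latitude-by-longitude grid."""
    # Keys are strings, so sort them by their numeric value.
    lonkeys = sorted({lon for row in lats.values() for lon in row}, key=float)
    writer = csv.writer(out)
    writer.writerow(["latitude"] + lonkeys)
    for lat in sorted(lats, key=float):
        # Missing lat/lon combinations become empty cells.
        writer.writerow([lat] + [lats[lat].get(lon, "") for lon in lonkeys])


lats = {"-51.490000": {"264.313000": "7.270077", "264.504000": "7.231014"},
        "-52.490000": {"264.504000": "2948.090389"}}
buf = io.StringIO()
write_grid(lats, buf)
print(buf.getvalue())
```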

Accepted answer: rje, 2014-09-16 15:54:49Z (the program under Question 5 above)