CSV reader ignore brackets

MRAB python at mrabarnett.plus.com
Tue Sep 24 19:50:40 EDT 2019


On 2019-09-25 00:09, Cameron Simpson wrote:
> On 24Sep2019 15:55, Mihir Kothari <mihir.kothari at gmail.com> wrote:
>>I am using python 3.4. I have a CSV file as below:
>>
>>ABC,PQR,(TEST1,TEST2)
>>FQW,RTE,MDE
>
> Really? No quotes around the (TEST1,TEST2) column value? I would have
> said this is invalid data, but that does not help you.
>
>>Basically comma-separated rows, where some rows have a data in column which
>>is array like i.e. in brackets.
>>So I need to read the file and treat such columns as one i.e. do not
>>separate based on comma if it is inside the bracket.
>>
>>In short I need to read a CSV file where separator inside the brackets
>>needs to be ignored.
>>
>>Output:
>>Column: 1    2    3
>>Row1:   ABC  PQR  (TEST1,TEST2)
>>Row2:   FQW  RTE  MDE
>>
>>Can you please help with the snippet?
>
> I would be reaching for a regular expression. If you partition your
> values into 2 types: those starting and ending in a bracket, and those
> not, you could write a regular expression for the former:
>
>     \([^)]*\)
>
> which matches a string like (.....) (with, importantly, no embedded
> brackets, only those at the beginning and end).
>
> And you can write a regular expression like:
>
>     [^,]*
>
> for a value containing no commas i.e. all the other values.
>
> Test the bracketed one first, because the second one always matches
> something.
>
> Then you would not use the CSV module (which expects better formed data
> than you have) and instead write a simple parser for a line of text
> which tries to match one of these two expressions repeatedly to consume
> the line. Something like this (UNTESTED):
>
>     import re
>
>     bracketed_re = re.compile(r'\([^)]*\)')
>     no_commas_re = re.compile(r'[^,]*')
>
>     def split_line(line):
>         line = line.rstrip()  # drop trailing whitespace/newline
>         fields = []
>         offset = 0
>         while offset < len(line):
>             m = bracketed_re.match(line, offset)
>             if m:
>                 field = m.group()
>             else:
>                 m = no_commas_re.match(line, offset)  # this always matches
>                 field = m.group()
>             fields.append(field)
>             offset += len(field)
>             if line.startswith(',', offset):
>                 # another column
>                 offset += 1
>             elif offset < len(line):
>                 raise ValueError(
>                     "incomplete parse at offset %d, line=%r" % (offset, line))
>         return fields
>
> Then read the lines of the file and split them into fields:
>
>     rows = []
>     with open(datafilename) as f:
>         for line in f:
>             fields = split_line(line)
>             rows.append(fields)
>
> So basically you're writing a little parser. If you have nested brackets
> things get harder.
>

You can simplify that somewhat to this:

import re

rows = []
with open(datafilename) as f:
    for line in f:
        rows.append(re.findall(r'(\([^)]*\)|(?=.)[^,\n]*),?', line))
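
To try the pattern without a file on disk, something like the following should work. It is only a sketch: the io.StringIO object stands in for the real file, the sample text is the two rows from the original question, and the name FIELD_RE is mine, not part of anyone's code above.

```python
import io
import re

# Same pattern as in the snippet above: either a bracketed group or a run
# of non-comma characters, followed by an optional separating comma.
FIELD_RE = re.compile(r'(\([^)]*\)|(?=.)[^,\n]*),?')

# Stand-in for the real file (assumption: the two rows from the question).
sample = io.StringIO("ABC,PQR,(TEST1,TEST2)\nFQW,RTE,MDE\n")

rows = [FIELD_RE.findall(line) for line in sample]
for row in rows:
    print(row)
# ['ABC', 'PQR', '(TEST1,TEST2)']
# ['FQW', 'RTE', 'MDE']
```

The (?=.) lookahead is what stops the pattern from matching an empty field at the newline or at the end of the line. One consequence is that an empty field in the middle of a row is kept, but a trailing empty field (a line ending in a comma) is dropped.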

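As for the nested brackets mentioned in the quoted text: a single regular expression of this kind cannot track nesting depth, so one option is to count brackets by hand. A minimal sketch, assuming only round brackets and no quoting rules; split_nested is a hypothetical helper name, not something from the csv or re modules:

```python
def split_nested(line):
    """Split on commas that are not inside (possibly nested) brackets."""
    fields = []
    current = []
    depth = 0  # how many '(' are currently open
    for ch in line.rstrip('\n'):
        if ch == '(':
            depth += 1
        elif ch == ')':
            if depth == 0:
                raise ValueError("unbalanced ')' in %r" % line)
            depth -= 1
        if ch == ',' and depth == 0:
            # A top-level comma ends the current field.
            fields.append(''.join(current))
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unbalanced '(' in %r" % line)
    fields.append(''.join(current))
    return fields

print(split_nested("ABC,(X,(Y,Z)),MDE\n"))
# ['ABC', '(X,(Y,Z))', 'MDE']
```

Unlike the regular expression version, this keeps a trailing empty field and returns [''] for an empty line, so pick whichever behaviour suits the data.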
