I am extracting data frome a given text file into a python dictionary for further processing
Input:
---------------------------- SECTION-A ----------------------------
Parameter1 : Text 1
Parameter2 : Text 2
Parameter3 : Text 3
Parameter4 : Text 4
Parameter5 : Text 5
Parameter6 : Text 6
Parameter7 : Text 7
https://www.google.com
---------------------------- SECTION-B ----------------------------
Parameter8 : Text 8
Parameter9 : Text 9
---------------------------- SECTION-C ----------------------------
Item C1
Item C2
Item C3
Item C4
---------------------------- SECTION-D ----------------------------
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Gravida cum sociis natoque penatibus et magnis dis parturient montes.
Metus dictum at tempor commodo ullamcorper a lacus vestibulum sed.
---------------------------- END ----------------------------
The number of elements in section C and D may vary. I process the data section by section so that i can easily add parameters if there is an extension to the layout.
Code:
import re
from pprint import pprint
def parser(lines):
SECTIONS = [
"SECTION-A",
"SECTION-B",
"SECTION-C",
"SECTION-D",
"END",
]
parameter = {
"Parameter1": "",
"Parameter2": "",
"Parameter3": "",
"Parameter4": "",
"Parameter5": "",
"Parameter6": "",
"Parameter7": "",
"Parameter8": "",
"Parameter9": "",
}
items_c = []
text_d = []
lines = [l.strip() for l in lines]
def find_section(line):
for s in SECTIONS:
if line.find(s) > 0:
return s
return None
def assign_parameter(line):
for p in parameter:
if line.startswith(p):
parameter[p] = line.split(":")[1]
act_section = ""
for line in lines:
line = re.sub(" +", " ", line) # Remove excess spaces
# Keep track of actual section
if (s := find_section(line)) is not None:
act_section = s
# Process lines according to section
if act_section == SECTIONS[0] or act_section == SECTIONS[1]:
assign_parameter(line)
if line.startswith("https://"):
url = line
elif act_section == SECTIONS[2]:
if len(line) > 1 and not line.startswith("--"):
items_c.append(line)
elif act_section == SECTIONS[3]:
if len(line) > 1 and not line.startswith("--"):
text_d.append(line)
# Append data in dictionary
parameter.update({"URL": url})
parameter.update({"Items C": items_c})
parameter.update({"Description D": " ".join(text_d)})
return parameter
if __name__ == "__main__":
with open("test.txt", "r", encoding="utf8") as f:
lines = f.readlines()
pprint(parser(lines))
Output:
{'Description D': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, '
'sed do eiusmod tempor incididunt ut labore et dolore magna '
'aliqua. Gravida cum sociis natoque penatibus et magnis dis '
'parturient montes. Metus dictum at tempor commodo '
'ullamcorper a lacus vestibulum sed.',
'Items C': ['Item C1', 'Item C2', 'Item C3', 'Item C4'],
'Parameter1': ' Text 1',
'Parameter2': ' Text 2',
'Parameter3': ' Text 3',
'Parameter4': ' Text 4',
'Parameter5': ' Text 5',
'Parameter6': ' Text 6',
'Parameter7': ' Text 7',
'Parameter8': ' Text 8',
'Parameter9': ' Text 9',
'URL': 'https://www.google.com'}
My code is working but what could be improved or optimized?
-
\$\begingroup\$ Where does the text file actually come from? \$\endgroup\$Reinderien– Reinderien2022年08月24日 12:51:15 +00:00Commented Aug 24, 2022 at 12:51
-
\$\begingroup\$ It is an automated text mail with measurment values and a google maps URL. \$\endgroup\$HaWe– HaWe2022年08月24日 13:43:58 +00:00Commented Aug 24, 2022 at 13:43
-
1\$\begingroup\$ Are the sections expected to always occur in the same order? What should happen if sections are missing, or new sections appear? \$\endgroup\$AJNeufeld– AJNeufeld2022年08月24日 15:30:26 +00:00Commented Aug 24, 2022 at 15:30
-
\$\begingroup\$ Why is the sample text all indented by one space? I'm going to assume that that isn't intentional. \$\endgroup\$Reinderien– Reinderien2022年08月25日 02:04:20 +00:00Commented Aug 25, 2022 at 2:04
-
\$\begingroup\$ @AJNeufeld The sections are always in the same order and there should be no missing or new sections, but i think i will raise an exception when this happens. Good Point! \$\endgroup\$HaWe– HaWe2022年08月25日 05:41:48 +00:00Commented Aug 25, 2022 at 5:41
2 Answers 2
Dictionary updates
parameter.update({"URL": url})
is a very verbose way of writing parameter["URL"] = url
. Ditto for "Items C"
and "Description D"
fields.
Unbounded split
parameter[p] = line.split(":")[1]
ignores the possibility of the multiple colons in the key-value pair. If you ever have a parameter line which reads:
url : http://google.com/
Then you'll end up assigning http
to the url
key, not the remainder of the line. Limit the expected number of splits. parameter[p] = line.split(":", 1)[1]
Accidental key-value updates
for p in parameter:
if line.startswith(p):
parameter[p] = line.split(":")[1]
If you have "parameter1"
and "parameter10"
, both values will be updated when a parameter line with the longer key is encountered, like parameter10 : value
.
Better would be:
def assign_parameters(line):
if ':' in line:
name, value = line.split(":", 1) # Maximum 1 split
name = name.rstrip() # Remove trailing whitespace
if name in parameters:
parameter[name] = value
#else:
# LOG.warning("Unknown parameter %r", name)
This uses the entire name when checking if the parameter is a known. It also eliminates the time wasted looping over every parameter name.
Since you're already including the re
module, why not ...
def assign_parameter(line):
m = re.match(r"(?P<name>[^ :]+)\s*:(?P<value>.*)", line)
if m:
name = m['name']
if name in parameters:
parameters[name] = m['value']
#else:
# LOG.warning("Unknown parameter %r", name)
Section collisions
If a new section "SECTION-A1" appears in your input file, find_section(line)
may unexpectedly return "SECTION-A"
!
Unknown sections
If the input format changed, adding a new section after SECTION-C, ...
...
---------------------------- SECTION-C ----------------------------
Item C1
Item C2
Item C3
Item C4
---------------------------- SECTION-NEW ------------------------------
Not a C item
Another not a C item
...
... the additional lines will continue to be added to items_c
!
Based on if len(line) > 1 and not line.startswith("--"):
, it seems clear that sections are expected to start with two or more hyphens. You should explicitly detect for section breaks, not just the start of any known section.
if line.startswith("--") and line.endswith("--"):
act_section = line.strip("- ")
#if act_section not in SECTIONS:
# LOG.warning("Unknown section %r", act_section)
elif act_section == SECTION[0] or or act_section == SECTIONS[1]:
...
elif act_section == SECTIONS[2]:
if len(line) > 1:
items_c.append(line)
elif act_section == SECTIONS[3]:
if len(line) > 1:
text_d.append(line)
Notice that since --... SECTION-NAME ...--
lines are parsed separately, you no longer need to guard against lines starting with --
inside SECTIONS[2]
or SECTIONS[3]
.
Named Sections
What the heck does elif act_section == SECTIONS[2]:
mean, anyway? If a new section gets added before "SECTION-C", is it going to be added to the SECTIONS
list before that item? If so, you'll have to find everywhere there is SECTIONS[2]
and change that to SECTIONS[3]
, but first you'll have to change all SECTIONS[3]
to SECTIONS[4]
!
You're using SECTIONS[#]
to avoid typing out the whole string multiple times to reduce the chance of typos and ensure it only needs to be changed in one spot if it does require changing, but it is obfuscating the code such that the reader has to refer elsewhere and count to determine what section SECTIONS[2]
actually means! It needs to be commented better, but comments aren't actually tested so there is no guarantee the comments are actually correct.
You need:
def parser(lines):
SECTION_A = "SECTION-A"
SECTION_B = "SECTION-B"
SECTION_C = "SECTION-C"
SECTION_D = "SECTION-D"
SECTION_END = "END"
SECTIONS = {
SECTION_A,
SECTION_B,
SECTION_C,
SECTION_D,
SECTION_END,
}
...
elif act_section == SECTION_A or act_section == SECTION_B:
...
elif act_section == SECTION_C:
...
elif act_section == SECTION_D:
...
No more counting is necessary (or even possible because SECTIONS
is now a set).
At this point, it may be worth considering an Enum.
from enum import Enum
class Section(Enum):
A = "SECTION-A"
B = "SECTION-B"
C = "SECTION-C"
D = "SECTION-D"
END = "END"
Efficiency
There is no need to read and store the entire file into memory. You can easily process the file line by line.
In one sense, your code is "too special". You hard-code parameter names in your parameter
dict when you should dynamically form that dict per section from the input.
As AJNeufeld says, it's wholly possible to parse this in a streamed manner instead of putting all lines
in memory.
Each of your example sections has a different format. You might as well write a parser for each.
A more generic approach
- defines a sub-parser for every section type possible;
- maps those parsers to section titles;
- passes the map to a top-level parser which iterates through your sections; and
- produces an output map of section title to parse results.
Suggested
import re
from itertools import chain
from pprint import pprint
from typing import Protocol, Iterable, Iterator, Any
PARAM_PAT = re.compile(r'^\s*(\S+)\s*:\s*(.+?)\s*$')
class SectionParser(Protocol):
def __call__(self, lines: Iterable[str]) -> Any:
...
def parse_dict_with_url(lines: Iterable[str]) -> tuple[
dict[str, str],
str,
]:
params = {}
for line in lines:
match = PARAM_PAT.match(line)
if match:
params[match.group(1)] = match.group(2)
else:
url = next(lines).strip()
return params, url
def parse_dict(lines: Iterable[str]) -> dict[str, str]:
params = {}
for line in lines:
match = PARAM_PAT.match(line)
if match:
params[match.group(1)] = match.group(2)
else:
return params
def parse_list(lines: Iterable[str]) -> list[str]:
results = []
for line in lines:
line = line.strip()
if line == '':
return results
results.append(line)
def parse_text(lines: Iterable[str]) -> str:
def non_blank():
for line in lines:
line = line.strip()
if line == '':
return
yield line
return '\n'.join(non_blank())
def parse(sections: dict[str, SectionParser], lines: Iterable[str]) -> Iterator[tuple[str, Any]]:
header_pat = re.compile(r'^\s*-{20,} +(\S+) +-{20,}$')
for line in lines:
match = header_pat.match(line)
if match:
break
else:
return
header_name = match.group(1)
while header_name != 'END':
def lines_for_section():
for line in lines:
if line.strip() != '':
break
else:
raise ValueError('Short section')
nonlocal match
for line in chain((line,), lines):
match = header_pat.match(line)
if match:
break
else:
yield line
else:
raise ValueError('No next section header')
parser = sections[header_name]
section_lines = lines_for_section()
result = parser(section_lines)
yield header_name, result
for _ in section_lines: pass # consume remaining
header_name = match.group(1)
def main() -> None:
sections = {
'SECTION-A': parse_dict_with_url,
'SECTION-B': parse_dict,
'SECTION-C': parse_list,
'SECTION-D': parse_text,
}
with open('279116.txt') as f:
section_results = dict(parse(sections, f))
pprint(section_results)
if __name__ == '__main__':
main()
Output
{'SECTION-A': ({'Parameter1': 'Text 1',
'Parameter2': 'Text 2',
'Parameter3': 'Text 3',
'Parameter4': 'Text 4',
'Parameter5': 'Text 5',
'Parameter6': 'Text 6',
'Parameter7': 'Text 7'},
'https://www.google.com'),
'SECTION-B': {'Parameter8': 'Text 8', 'Parameter9': 'Text 9'},
'SECTION-C': ['Item C1', 'Item C2', 'Item C3', 'Item C4'],
'SECTION-D': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit,\n'
'sed do eiusmod tempor incididunt ut labore et dolore magna '
'aliqua.\n'
'Gravida cum sociis natoque penatibus et magnis dis parturient '
'montes.\n'
'Metus dictum at tempor commodo ullamcorper a lacus vestibulum '
'sed.'}