Extract data from formatted text file

Question 1

I am extracting data frome a given text file into a python dictionary for further processing

Input:

 ---------------------------- SECTION-A ----------------------------
 Parameter1 : Text 1
 Parameter2 : Text 2
 Parameter3 : Text 3
 Parameter4 : Text 4
 Parameter5 : Text 5
 Parameter6 : Text 6
 Parameter7 : Text 7
 https://www.google.com
 ---------------------------- SECTION-B ----------------------------
 Parameter8 : Text 8
 Parameter9 : Text 9
 ---------------------------- SECTION-C ----------------------------
 Item C1
 Item C2
 Item C3
 Item C4
 ---------------------------- SECTION-D ----------------------------
 Lorem ipsum dolor sit amet, consectetur adipiscing elit,
 sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
 Gravida cum sociis natoque penatibus et magnis dis parturient montes.
 Metus dictum at tempor commodo ullamcorper a lacus vestibulum sed.
 ---------------------------- END ----------------------------

The number of elements in section C and D may vary. I process the data section by section so that i can easily add parameters if there is an extension to the layout.

Code:

import re
from pprint import pprint
def parser(lines):
 SECTIONS = [
 "SECTION-A",
 "SECTION-B",
 "SECTION-C",
 "SECTION-D",
 "END",
 ]
 parameter = {
 "Parameter1": "",
 "Parameter2": "",
 "Parameter3": "",
 "Parameter4": "",
 "Parameter5": "",
 "Parameter6": "",
 "Parameter7": "",
 "Parameter8": "",
 "Parameter9": "",
 }
 items_c = []
 text_d = []
 lines = [l.strip() for l in lines]
 def find_section(line):
 for s in SECTIONS:
 if line.find(s) > 0:
 return s
 return None
 def assign_parameter(line):
 for p in parameter:
 if line.startswith(p):
 parameter[p] = line.split(":")[1]
 act_section = ""
 for line in lines:
 line = re.sub(" +", " ", line) # Remove excess spaces
 # Keep track of actual section
 if (s := find_section(line)) is not None:
 act_section = s
 # Process lines according to section
 if act_section == SECTIONS[0] or act_section == SECTIONS[1]:
 assign_parameter(line)
 if line.startswith("https://"):
 url = line
 elif act_section == SECTIONS[2]:
 if len(line) > 1 and not line.startswith("--"):
 items_c.append(line)
 elif act_section == SECTIONS[3]:
 if len(line) > 1 and not line.startswith("--"):
 text_d.append(line)
 # Append data in dictionary
 parameter.update({"URL": url})
 parameter.update({"Items C": items_c})
 parameter.update({"Description D": " ".join(text_d)})
 return parameter
if __name__ == "__main__":
 with open("test.txt", "r", encoding="utf8") as f:
 lines = f.readlines()
 pprint(parser(lines))

Output:

{'Description D': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, '
 'sed do eiusmod tempor incididunt ut labore et dolore magna '
 'aliqua. Gravida cum sociis natoque penatibus et magnis dis '
 'parturient montes. Metus dictum at tempor commodo '
 'ullamcorper a lacus vestibulum sed.',
 'Items C': ['Item C1', 'Item C2', 'Item C3', 'Item C4'],
 'Parameter1': ' Text 1',
 'Parameter2': ' Text 2',
 'Parameter3': ' Text 3',
 'Parameter4': ' Text 4',
 'Parameter5': ' Text 5',
 'Parameter6': ' Text 6',
 'Parameter7': ' Text 7',
 'Parameter8': ' Text 8',
 'Parameter9': ' Text 9',
 'URL': 'https://www.google.com'}

My code is working but what could be improved or optimized?

Question 2

Where does the text file actually come from?

Question 3

It is an automated text mail with measurment values and a google maps URL.

Question 4

Are the sections expected to always occur in the same order? What should happen if sections are missing, or new sections appear?

Question 5

Why is the sample text all indented by one space? I'm going to assume that that isn't intentional.

Question 6

@AJNeufeld The sections are always in the same order and there should be no missing or new sections, but i think i will raise an exception when this happens. Good Point!

Question 7

Dictionary updates

parameter.update({"URL": url}) is a very verbose way of writing parameter["URL"] = url. Ditto for "Items C" and "Description D" fields.

Unbounded split

parameter[p] = line.split(":")[1] ignores the possibility of the multiple colons in the key-value pair. If you ever have a parameter line which reads:

url : http://google.com/

Then you'll end up assigning http to the url key, not the remainder of the line. Limit the expected number of splits. parameter[p] = line.split(":", 1)[1]

Accidental key-value updates

 for p in parameter:
 if line.startswith(p):
 parameter[p] = line.split(":")[1]

If you have "parameter1" and "parameter10", both values will be updated when a parameter line with the longer key is encountered, like parameter10 : value.

Better would be:

 def assign_parameters(line):
 if ':' in line:
 name, value = line.split(":", 1) # Maximum 1 split
 name = name.rstrip() # Remove trailing whitespace
 if name in parameters:
 parameter[name] = value
 #else:
 # LOG.warning("Unknown parameter %r", name)

This uses the entire name when checking if the parameter is a known. It also eliminates the time wasted looping over every parameter name.

Since you're already including the re module, why not ...

 def assign_parameter(line):
 m = re.match(r"(?P<name>[^ :]+)\s*:(?P<value>.*)", line)
 if m:
 name = m['name']
 if name in parameters:
 parameters[name] = m['value']
 #else:
 # LOG.warning("Unknown parameter %r", name)

Section collisions

If a new section "SECTION-A1" appears in your input file, find_section(line) may unexpectedly return "SECTION-A"!

Unknown sections

If the input format changed, adding a new section after SECTION-C, ...

...
 ---------------------------- SECTION-C ----------------------------
 Item C1
 Item C2
 Item C3
 Item C4
 ---------------------------- SECTION-NEW ------------------------------
 Not a C item
 Another not a C item
...

... the additional lines will continue to be added to items_c!

Based on if len(line) > 1 and not line.startswith("--"):, it seems clear that sections are expected to start with two or more hyphens. You should explicitly detect for section breaks, not just the start of any known section.

 if line.startswith("--") and line.endswith("--"):
 act_section = line.strip("- ")
 #if act_section not in SECTIONS:
 # LOG.warning("Unknown section %r", act_section)
 elif act_section == SECTION[0] or or act_section == SECTIONS[1]:
 ...
 elif act_section == SECTIONS[2]:
 if len(line) > 1:
 items_c.append(line)
 elif act_section == SECTIONS[3]:
 if len(line) > 1:
 text_d.append(line)

Notice that since --... SECTION-NAME ...-- lines are parsed separately, you no longer need to guard against lines starting with -- inside SECTIONS[2] or SECTIONS[3].

Named Sections

What the heck does elif act_section == SECTIONS[2]: mean, anyway? If a new section gets added before "SECTION-C", is it going to be added to the SECTIONS list before that item? If so, you'll have to find everywhere there is SECTIONS[2] and change that to SECTIONS[3], but first you'll have to change all SECTIONS[3] to SECTIONS[4]!

You're using SECTIONS[#] to avoid typing out the whole string multiple times to reduce the chance of typos and ensure it only needs to be changed in one spot if it does require changing, but it is obfuscating the code such that the reader has to refer elsewhere and count to determine what section SECTIONS[2] actually means! It needs to be commented better, but comments aren't actually tested so there is no guarantee the comments are actually correct.

You need:

def parser(lines):
 SECTION_A = "SECTION-A"
 SECTION_B = "SECTION-B"
 SECTION_C = "SECTION-C"
 SECTION_D = "SECTION-D"
 SECTION_END = "END"
 
 SECTIONS = {
 SECTION_A,
 SECTION_B,
 SECTION_C,
 SECTION_D,
 SECTION_END,
 }
 ...
 elif act_section == SECTION_A or act_section == SECTION_B:
 ...
 elif act_section == SECTION_C:
 ...
 elif act_section == SECTION_D:
 ...

No more counting is necessary (or even possible because SECTIONS is now a set).

At this point, it may be worth considering an Enum.

from enum import Enum
class Section(Enum):
 A = "SECTION-A"
 B = "SECTION-B"
 C = "SECTION-C"
 D = "SECTION-D"
 END = "END"

Efficiency

There is no need to read and store the entire file into memory. You can easily process the file line by line.

Question 8

In one sense, your code is "too special". You hard-code parameter names in your parameter dict when you should dynamically form that dict per section from the input.

As AJNeufeld says, it's wholly possible to parse this in a streamed manner instead of putting all lines in memory.

Each of your example sections has a different format. You might as well write a parser for each.

A more generic approach

defines a sub-parser for every section type possible;
maps those parsers to section titles;
passes the map to a top-level parser which iterates through your sections; and
produces an output map of section title to parse results.

Suggested

import re
from itertools import chain
from pprint import pprint
from typing import Protocol, Iterable, Iterator, Any
PARAM_PAT = re.compile(r'^\s*(\S+)\s*:\s*(.+?)\s*$')
class SectionParser(Protocol):
 def __call__(self, lines: Iterable[str]) -> Any:
 ...
def parse_dict_with_url(lines: Iterable[str]) -> tuple[
 dict[str, str],
 str,
]:
 params = {}
 for line in lines:
 match = PARAM_PAT.match(line)
 if match:
 params[match.group(1)] = match.group(2)
 else:
 url = next(lines).strip()
 return params, url
def parse_dict(lines: Iterable[str]) -> dict[str, str]:
 params = {}
 for line in lines:
 match = PARAM_PAT.match(line)
 if match:
 params[match.group(1)] = match.group(2)
 else:
 return params
def parse_list(lines: Iterable[str]) -> list[str]:
 results = []
 for line in lines:
 line = line.strip()
 if line == '':
 return results
 results.append(line)
def parse_text(lines: Iterable[str]) -> str:
 def non_blank():
 for line in lines:
 line = line.strip()
 if line == '':
 return
 yield line
 return '\n'.join(non_blank())
def parse(sections: dict[str, SectionParser], lines: Iterable[str]) -> Iterator[tuple[str, Any]]:
 header_pat = re.compile(r'^\s*-{20,} +(\S+) +-{20,}$')
 for line in lines:
 match = header_pat.match(line)
 if match:
 break
 else:
 return
 header_name = match.group(1)
 while header_name != 'END':
 def lines_for_section():
 for line in lines:
 if line.strip() != '':
 break
 else:
 raise ValueError('Short section')
 nonlocal match
 for line in chain((line,), lines):
 match = header_pat.match(line)
 if match:
 break
 else:
 yield line
 else:
 raise ValueError('No next section header')
 parser = sections[header_name]
 section_lines = lines_for_section()
 result = parser(section_lines)
 yield header_name, result
 for _ in section_lines: pass # consume remaining
 header_name = match.group(1)
def main() -> None:
 sections = {
 'SECTION-A': parse_dict_with_url,
 'SECTION-B': parse_dict,
 'SECTION-C': parse_list,
 'SECTION-D': parse_text,
 }
 with open('279116.txt') as f:
 section_results = dict(parse(sections, f))
 pprint(section_results)
if __name__ == '__main__':
 main()

Output

{'SECTION-A': ({'Parameter1': 'Text 1',
 'Parameter2': 'Text 2',
 'Parameter3': 'Text 3',
 'Parameter4': 'Text 4',
 'Parameter5': 'Text 5',
 'Parameter6': 'Text 6',
 'Parameter7': 'Text 7'},
 'https://www.google.com'),
 'SECTION-B': {'Parameter8': 'Text 8', 'Parameter9': 'Text 9'},
 'SECTION-C': ['Item C1', 'Item C2', 'Item C3', 'Item C4'],
 'SECTION-D': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit,\n'
 'sed do eiusmod tempor incididunt ut labore et dolore magna '
 'aliqua.\n'
 'Gravida cum sociis natoque penatibus et magnis dis parturient '
 'montes.\n'
 'Metus dictum at tempor commodo ullamcorper a lacus vestibulum '
 'sed.'}

AJNeufeld AJNeufeld 35.3k5 gold badges41 silver badges103 bronze badges · Accepted Answer · 2022-08-24 17:19:40Z

Dictionary updates

parameter.update({"URL": url}) is a very verbose way of writing parameter["URL"] = url. Ditto for "Items C" and "Description D" fields.

Unbounded split

parameter[p] = line.split(":")[1] ignores the possibility of the multiple colons in the key-value pair. If you ever have a parameter line which reads:

url : http://google.com/

Then you'll end up assigning http to the url key, not the remainder of the line. Limit the expected number of splits. parameter[p] = line.split(":", 1)[1]

Accidental key-value updates

 for p in parameter:
 if line.startswith(p):
 parameter[p] = line.split(":")[1]

If you have "parameter1" and "parameter10", both values will be updated when a parameter line with the longer key is encountered, like parameter10 : value.

Better would be:

 def assign_parameters(line):
 if ':' in line:
 name, value = line.split(":", 1) # Maximum 1 split
 name = name.rstrip() # Remove trailing whitespace
 if name in parameters:
 parameter[name] = value
 #else:
 # LOG.warning("Unknown parameter %r", name)

This uses the entire name when checking if the parameter is a known. It also eliminates the time wasted looping over every parameter name.

Since you're already including the re module, why not ...

 def assign_parameter(line):
 m = re.match(r"(?P<name>[^ :]+)\s*:(?P<value>.*)", line)
 if m:
 name = m['name']
 if name in parameters:
 parameters[name] = m['value']
 #else:
 # LOG.warning("Unknown parameter %r", name)

Section collisions

If a new section "SECTION-A1" appears in your input file, find_section(line) may unexpectedly return "SECTION-A"!

Unknown sections

If the input format changed, adding a new section after SECTION-C, ...

...
 ---------------------------- SECTION-C ----------------------------
 Item C1
 Item C2
 Item C3
 Item C4
 ---------------------------- SECTION-NEW ------------------------------
 Not a C item
 Another not a C item
...

... the additional lines will continue to be added to items_c!

Based on if len(line) > 1 and not line.startswith("--"):, it seems clear that sections are expected to start with two or more hyphens. You should explicitly detect for section breaks, not just the start of any known section.

 if line.startswith("--") and line.endswith("--"):
 act_section = line.strip("- ")
 #if act_section not in SECTIONS:
 # LOG.warning("Unknown section %r", act_section)
 elif act_section == SECTION[0] or or act_section == SECTIONS[1]:
 ...
 elif act_section == SECTIONS[2]:
 if len(line) > 1:
 items_c.append(line)
 elif act_section == SECTIONS[3]:
 if len(line) > 1:
 text_d.append(line)

Notice that since --... SECTION-NAME ...-- lines are parsed separately, you no longer need to guard against lines starting with -- inside SECTIONS[2] or SECTIONS[3].

Named Sections

What the heck does elif act_section == SECTIONS[2]: mean, anyway? If a new section gets added before "SECTION-C", is it going to be added to the SECTIONS list before that item? If so, you'll have to find everywhere there is SECTIONS[2] and change that to SECTIONS[3], but first you'll have to change all SECTIONS[3] to SECTIONS[4]!

You're using SECTIONS[#] to avoid typing out the whole string multiple times to reduce the chance of typos and ensure it only needs to be changed in one spot if it does require changing, but it is obfuscating the code such that the reader has to refer elsewhere and count to determine what section SECTIONS[2] actually means! It needs to be commented better, but comments aren't actually tested so there is no guarantee the comments are actually correct.

You need:

def parser(lines):
 SECTION_A = "SECTION-A"
 SECTION_B = "SECTION-B"
 SECTION_C = "SECTION-C"
 SECTION_D = "SECTION-D"
 SECTION_END = "END"
 
 SECTIONS = {
 SECTION_A,
 SECTION_B,
 SECTION_C,
 SECTION_D,
 SECTION_END,
 }
 ...
 elif act_section == SECTION_A or act_section == SECTION_B:
 ...
 elif act_section == SECTION_C:
 ...
 elif act_section == SECTION_D:
 ...

No more counting is necessary (or even possible because SECTIONS is now a set).

At this point, it may be worth considering an Enum.

from enum import Enum
class Section(Enum):
 A = "SECTION-A"
 B = "SECTION-B"
 C = "SECTION-C"
 D = "SECTION-D"
 END = "END"

Efficiency

There is no need to read and store the entire file into memory. You can easily process the file line by line.

Stack Exchange Network

Extract data from formatted text file

2 Answers 2

Dictionary updates

Unbounded split

Accidental key-value updates

Section collisions

Unknown sections

Named Sections

Efficiency

Suggested

Output

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Extract data from formatted text file

2 Answers 2

Dictionary updates

Unbounded split

Accidental key-value updates

Section collisions

Unknown sections

Named Sections

Efficiency

Suggested

Output

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions