Background
A forcefield is a collection of functions and parameters that is used to calculate the potential energy of a complex system. I have text files which contain data about the parameters for a forcefield. The text file is split into many sections, with each section following the same format:
- A section header, enclosed in square brackets
- On the next line, the word `indices:` followed by a list of integers
- This is then followed by 1 or more lines of parameters associated with the section
Here is a made-up example file to showcase the format.
```
############################################
# Comments begin with '#'
############################################

[lj_pairs]   # Section 1
indices: 0 2
# ID   eps     sigma
  1    2.344   1.234   5
  2    4.423   5.313   5
  3    1.573   6.321   5
  4    1.921   11.93   5

[bonds]
indices: 0 1
  2    4.234e-03   11.2
  6   -0.134545     5.7
```
The goal is to parse such files and store all of the information in a `dict`.
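For the example file above, the finished reader would produce a mapping along these lines (sketched here with plain dicts for readability; the actual code stores `FFSections` dataclasses, and the `indices`/`coeffs` labels below are just illustrative names):

```python
# Illustrative shape of the parsed result for the example file above.
# (Plain dicts for readability; the real reader uses FFSections dataclasses.)
parsed = {
    "lj_pairs": {
        "indices": [0, 2],
        "coeffs": {
            1: [2.344, 1.234, 5.0],
            2: [4.423, 5.313, 5.0],
            3: [1.573, 6.321, 5.0],
            4: [1.921, 11.93, 5.0],
        },
    },
    "bonds": {
        "indices": [0, 1],
        "coeffs": {2: [4.234e-03, 11.2], 6: [-0.134545, 5.7]},
    },
}
```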
Code
Main function for review
""" Force-field data reader """
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, TextIO, Tuple, Union, Any
def ff_reader(fname: Union[str, TextIO]) -> Dict[str, "FFSections"]:
""" Reads data from a force-field file """
try:
if _is_string(fname):
fh = open(fname, mode="r")
own = True
else:
fh = iter(fname)
except TypeError:
raise ValueError("fname must be a string or a file handle")
# All the possible section headers
keywords = ("lj_pairs", "bonds") # etc... Long list of possible sections
# Removed for brevity
re_sections = re.compile(r"^\[(%s)\]$" % "|".join(keywords))
ff_data = _strip_comments(fh)
# Empty dict that'll hold all the data.
final_ff_data = {key: FFSections() for key in keywords}
# Get first section header
for line in ff_data:
match = re.match(re_sections, line)
if match:
section = match.group(1)
in_section_for_first_time = True
break
else:
raise FFReaderError("A valid section header must be the first line in file")
else:
raise FFReaderError("No force-field sections exist")
# Read the rest of the file
for line in ff_data:
match = re.match(re_sections, line)
# If we've encounted a section header the next line must be an index list.
if in_section_for_first_time:
if line.split()[0] != "indices:":
raise FFReaderError(f"Missing index list for section: {section}")
idx = _validate_indices(line)
final_ff_data[section].use_idx = idx
in_section_for_first_time = False
in_params_for_first_time = True
continue
if match and in_params_for_first_time:
raise FFReaderError(
f"Section {section} missing parameters"
+ "Sections must contain atleast one type coefficients"
)
if match: # and not in_section_for_first_time and in_params_for_first_time
section = match.group(1)
in_section_for_first_time = True
continue
params = _validate_params(line)
final_ff_data[section].coeffs.update([params])
in_params_for_first_time = False
# Close the file if we opened it
if own:
fh.close()
for section in final_ff_data.values():
# coeff must exist if use_idx does
if section.use_idx is not None:
assert section.coeffs
return final_ff_data
Other stuff for the code to work
```python
def _strip_comments(
    instream: TextIO, comments: Union[str, Iterable[str], None] = "#"
) -> Iterable[str]:
    """ Strip comments from a text IO stream """
    if comments is not None:
        if isinstance(comments, str):
            comments = [comments]
        comments_re = re.compile("|".join(map(re.escape, comments)))
    try:
        for lines in instream.readlines():
            line = re.split(comments_re, lines, 1)[0].strip()
            if line != "":
                yield line
    except AttributeError:
        raise TypeError("instream must be a `TextIO` stream") from None


def _is_string(obj: object) -> bool:
    # Assumed one-line implementation; this helper was omitted from the post.
    return isinstance(obj, str)


@dataclass(eq=False)
class FFSections:
    """
    FFSections(coeffs, use_idx)

    Container for forcefield information
    """

    coeffs: Dict[int, List[float]] = field(default_factory=dict)
    use_idx: Optional[List[int]] = field(default=None)


class FFReaderError(Exception):
    """ Incorrect or badly formatted force-field data """

    def __init__(self, message: str, badline: Optional[str] = None) -> None:
        if badline:
            message = f"{message}\nError parsing --> ({badline})"
        super().__init__(message)


def _validate_indices(line: str) -> List[int]:
    """
    Check that the given line contains only a whitespace-separated
    list of integers
    """
    # Split on "indices:", then on whitespace
    split = line.split("indices:")[1].split()
    if not set(s.isdecimal() for s in split) == {True}:
        raise FFReaderError(
            "Indices should be integers separated by whitespace", line
        )
    return [int(x) for x in split]


def _validate_params(line: str) -> Tuple[int, List[float]]:
    """
    Check that the given line is a valid param line: an integer
    followed by one or more floats, separated by whitespace
    """
    split = line.split()
    id_ = split[0]
    coeffs = split[1:]
    if not id_.isdecimal():
        raise FFReaderError("Invalid params", line)
    try:
        coeffs = [float(x) for x in coeffs]
    except (TypeError, ValueError):
        raise FFReaderError("Invalid params", line) from None
    return (int(id_), coeffs)
```
I consider myself a beginner in Python and this is my first substantive project. I'd like the review to focus on the `ff_reader` function, but feel free to comment on the other parts too if there are better ways to do some things. I feel like the way I've written `ff_reader` is kind of ugly and inelegant. I'd be especially interested if there is a better way to read such files, perhaps parsing the whole file instead of line by line.
- What does the 5 in the lj_pairs section mean? It doesn't have a column label. – Roland Illig, May 5, 2019
- For this particular example, it is the cut-off distance beyond which the Lennard-Jones potential is 0. I don't think the physical meaning of the values is relevant here, but if it is needed I can provide additional context. Also, I just made these values up; they don't correspond to any real systems. – void life, May 5, 2019
- Thanks. You're right that the physical meaning is not relevant here. I was just missing the name of the column, since the other columns are named so nicely. – Roland Illig, May 5, 2019
- Possible duplicate of codereview.stackexchange.com/questions/219740/… – Justin, May 5, 2019
- @Graipher huh thanks, that is very weird, especially considering someone else said that they saw my question on SO. I'm pretty sure I just made an account here and posted the question. I have no idea how I could have posted it multiple times. – void life, May 6, 2019
2 Answers
I have written parsers for several similar file formats, and one time I started with the same idea as you: iterate over the lines and record the current state in some boolean variables. Over time, these parsers got too large to understand. Therefore I switched to a different strategy: instead of recording the current state in variables, record it implicitly in the code that is currently executed. I structured the parser like this:
```python
def parse_file(lines: Lines):
    sections = []
    while not lines.at_end():
        section = parse_section(lines)
        if section is None:
            break
        sections.append(section)
    return sections


def parse_section(lines: Lines):
    name = lines.must_match(r"^\[(\w+)\]$")[1]
    indices_str = lines.must_match(r"^\s*indices:\s*(\d+(\s+\d+)*)$")[1]
    data = []
    while not lines.at_end():
        row = parse_row(lines)
        if row is None:
            break
        data.append(row)
    indices = list(map(int, indices_str.split()))
    return Section(name, indices, data)
```
As you can see, each part of the file structure gets its own parsing function. Thereby the code matches the structure of the file format. Each of the functions is relatively small.
To make these functions useful, they need a source of lines, which I called `Lines`. This would be another class that defines useful functions such as `must_match`, which makes sure the "current line" matches the regular expression and throws a parse error if it doesn't. Using these functions as building blocks, writing and modifying the parser stays manageable, even when the file format becomes more complicated.
Another benefit of having these small functions is that you can test them individually: prepare a `Lines` object, pass it to the function, and see what it returns. This allows for good unit tests.

The `Lines` class consists of a list of lines and the index of the current line. As you parse the file, the index advances until you reach the end of the lines.
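As a sketch, such a `Lines` class could look like this (an assumed implementation based on the description above, not necessarily the exact class the answer has in mind):

```python
import re
from typing import List


class Lines:
    """A list of lines plus the index of the current line.

    Assumed implementation sketched from the description above.
    """

    def __init__(self, lines: List[str]) -> None:
        self.lines = lines
        self.index = 0

    def at_end(self) -> bool:
        return self.index >= len(self.lines)

    def must_match(self, pattern: str):
        """Match the current line against `pattern` and advance,
        or raise a parse error naming the offending line."""
        if self.at_end():
            raise ValueError(f"unexpected end of input, expected {pattern!r}")
        match = re.match(pattern, self.lines[self.index])
        if match is None:
            raise ValueError(
                f"line {self.index + 1}: {self.lines[self.index]!r} "
                f"does not match {pattern!r}"
            )
        self.index += 1
        return match


# Tiny demonstration: consume a section header, then an indices line.
lines = Lines(["[bonds]", "indices: 0 1"])
name = lines.must_match(r"^\[(\w+)\]$")[1]
indices = [int(s) for s in
           lines.must_match(r"^indices:\s*(\d+(?:\s+\d+)*)$")[1].split()]
```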
Regarding your code:
I don't like the union types very much. They make the code more complicated than necessary. For example, when stripping the comments, you actually only need the single comment marker `#`. Therefore all the list handling can be removed, and the comment character doesn't need to be a parameter at all.

Stripping the comments at the very beginning is a good strategy, since otherwise you would have to repeat that code in several other places.

In that comment-removal function you declared that the comment may also be `None`, but actually passing `None` will throw an exception.

Be careful when opening files. Every file that is opened must be closed again when it is no longer needed, even in case of exceptions. Your current code does not close the file when a parse error occurs. This is another reason against union types: it would be easier to have separate functions, one that parses from a list of strings and one that parses from a file. How big are the files? Does it hurt to load them into memory as a single block? If they got larger than 10 MB, that would be a valid concern.
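As a sketch of that split (hypothetical function names; the comment stripping here merely stands in for the real parsing work): a core function that parses from in-memory lines, plus a thin wrapper that opens the file with a `with` statement so the file is closed even when a parse error is raised.

```python
from typing import List


def parse_lines(lines: List[str]) -> List[str]:
    """Hypothetical core parser working on already-loaded lines.
    Here it only strips comments and blank lines to keep the sketch
    short; the real section parsing would build on top of this."""
    stripped = (line.split("#", 1)[0].strip() for line in lines)
    return [line for line in stripped if line]


def parse_file(path: str) -> List[str]:
    # The with-statement guarantees the file is closed even if
    # parse_lines raises, unlike a manual open()/close() pair.
    with open(path, mode="r") as fh:
        return parse_lines(fh.readlines())
```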
- Thank you for the review. Recording the current state in some boolean variables is exactly what seemed messy to me, but I'm not quite sure I follow your alternative approach, especially how the current state is recorded implicitly. Are you saying that instead of parsing it line by line, I should try to parse an entire section as a whole? Or that each part should get its own function, like `parse_name`, `parse_indices` and `parse_data`, etc.? I'm not sure how this implicitly records the current state; or have I misunderstood, and it is the `lines` object that somehow records the state? – void life, May 5, 2019
- You're right that I don't need any other marker than `#`; I made it a parameter because I was worried that someone might decide to use something else as comments in the future. I thought it'd be easier to make it a parameter rather than change the function. The files are typically just 5-8 MB, so I guess I could load them into memory. But to fix the files not closing if there is an exception, should I add a `finally` block after every `try`, or use a context manager? – void life, May 5, 2019
- You're right, I didn't write the part about the "implicit" state clearly enough. By this implicit state I mean: instead of having a boolean variable that saves the state, the code itself contains this information already. If the computer is currently executing the `while not lines.at_end()` loop, you already know what the current state is: the parser is either at the beginning of the file, or it just finished a section. Regarding `parse_name`: you could do that, or if it is simple enough, just leave it as a one-liner like in the code I suggested. That's up to you. – Roland Illig, May 5, 2019
- Regarding the `try ... finally`: it's really difficult with your current code, since sometimes you need to close the stream and sometimes you don't. The easiest way of closing files is by using the `with` statement. – Roland Illig, May 5, 2019
- Ah ok, I see what you mean now. Thank you! – void life, May 5, 2019
See, I saw your question on Stack Overflow the other day before it was moved here, but thought to answer nevertheless.

Imo, the way to go is to write yourself a grammar/parser and a `NodeVisitor` class. That is: formulate little parts in a first step and then glue them all together afterwards.
```python
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

data = """
############################################
# Comments begin with '#'
############################################

[lj_pairs] # Section 1
indices: 0 2
# ID eps sigma
1 2.344 1.234 5
2 4.423 5.313 5
3 1.573 6.321 5
4 1.921 11.93 5

[bonds]
indices: 0 1
2 4.234e-03 11.2
6 -0.134545 5.7
"""

grammar = Grammar(
    r"""
    expr      = (entry / garbage)+
    entry     = section garbage indices (valueline / garbage)*
    section   = lpar word rpar
    indices   = ws? "indices:" values+
    garbage   = ((comment / hs)* newline?)*
    word      = ~"\w+"
    values    = float+
    valueline = values newline?
    float     = hs? ~"[-.e\d]+" hs?
    lpar      = "["
    rpar      = "]"
    comment   = ~"#.+"
    ws        = ~"\s*"
    hs        = ~"[\t\ ]*"
    newline   = ~"[\r\n]"
    """
)

tree = grammar.parse(data)


class DataVisitor(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_int(self, node, visited_children):
        _, value, _ = visited_children
        return int(value.text)

    def visit_float(self, node, visited_children):
        _, value, _ = visited_children
        return value.text

    def visit_section(self, node, visited_children):
        _, section, _ = visited_children
        return section.text

    def visit_indices(self, node, visited_children):
        *_, values = visited_children
        return values[0]

    def visit_valueline(self, node, visited_children):
        values, _ = visited_children
        return values

    def visit_garbage(self, node, visited_children):
        return None

    def visit_entry(self, node, visited_children):
        section, _, indices, lst = visited_children
        values = [item[0] for item in lst if item[0]]
        return (section, {'indices': indices, 'values': values})

    def visit_expr(self, node, visited_children):
        return dict([item[0] for item in visited_children if item[0]])


d = DataVisitor()
out = d.visit(tree)
print(out)
```
Which will yield

```python
{'lj_pairs': {'indices': ['0', '2'], 'values': [['1', '2.344', '1.234', '5'], ['2', '4.423', '5.313', '5'], ['3', '1.573', '6.321', '5'], ['4', '1.921', '11.93', '5']]},
 'bonds': {'indices': ['0', '1'], 'values': [['2', '4.234e-03', '11.2'], ['6', '-0.134545', '5.7']]}}
```
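One caveat: the grammar captures everything as strings. If numeric types are wanted, a small post-processing pass can convert indices and row IDs to `int` and the coefficients to `float` (a sketch, assuming exactly the output shape printed above):

```python
def to_numeric(parsed):
    """Convert the all-string parser output into numbers:
    indices and row IDs become ints, coefficients become floats."""
    result = {}
    for section, entry in parsed.items():
        result[section] = {
            "indices": [int(i) for i in entry["indices"]],
            "values": [
                [int(row[0])] + [float(x) for x in row[1:]]
                for row in entry["values"]
            ],
        }
    return result


sample = {"bonds": {"indices": ["0", "1"],
                    "values": [["2", "4.234e-03", "11.2"]]}}
converted = to_numeric(sample)
```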
If you - or anybody else - are interested, I'd add some explanation as well.
- This would be a great answer on SO. Is there any way you can address the user's code to improve it, as well as suggesting an alternate implementation? Keep in mind we review code here and don't provide how-to answers. – May 6, 2019
- Thanks for the answer. I'm not familiar with `parsimonious`, so I'd be more than grateful for an explanation, especially as this seems somewhat cleaner. If you can't reformulate the answer to be more in line with CR standards, I can ask the same question on SO (if I can modify it to be in line for SO) and you can answer there. – void life, May 6, 2019
- @voidlife: Post it on SO, add the link here and I'd add some explanation then. – Jan, May 6, 2019