Background
A forcefield is a collection of functions and parameters that is used to calculate the potential energy of a complex system. I have text files which contain data about the parameters for a forcefield. The text file is split into many sections, with each section following the same format:
- A section header, enclosed in square brackets
- On the next line, the word `indices:` followed by a list of integers
- This is then followed by 1 or more lines of parameters associated with the section
Here is a made-up example file to showcase the format.
```
############################################
# Comments begin with '#'
############################################

[lj_pairs]   # Section 1
indices: 0 2
# ID   eps     sigma
  1    2.344   1.234   5
  2    4.423   5.313   5
  3    1.573   6.321   5
  4    1.921   11.93   5

[bonds]
indices: 0 1
  2    4.234e-03   11.2
  6   -0.134545     5.7
```
The goal is to parse such files and store all of the information in a `dict`.
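For the example file above, the finished reader would produce a mapping along these lines (sketched here with plain dicts for readability; the actual code stores `FFSections` dataclasses, and the `indices`/`coeffs` labels below are just illustrative names):

```python
# Illustrative shape of the parsed result for the example file above.
# (Plain dicts for readability; the real reader uses FFSections dataclasses.)
parsed = {
    "lj_pairs": {
        "indices": [0, 2],
        "coeffs": {
            1: [2.344, 1.234, 5.0],
            2: [4.423, 5.313, 5.0],
            3: [1.573, 6.321, 5.0],
            4: [1.921, 11.93, 5.0],
        },
    },
    "bonds": {
        "indices": [0, 1],
        "coeffs": {2: [4.234e-03, 11.2], 6: [-0.134545, 5.7]},
    },
}
```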
Code
Main function for review
""" Force-field data reader """
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, TextIO, Tuple, Union, Any
def ff_reader(fname: Union[str, TextIO]) -> Dict[str, "FFSections"]:
""" Reads data from a force-field file """
try:
if _is_string(fname):
fh = open(fname, mode="r")
own = True
else:
fh = iter(fname)
except TypeError:
raise ValueError("fname must be a string or a file handle")
# All the possible section headers
keywords = ("lj_pairs", "bonds") # etc... Long list of possible sections
# Removed for brevity
re_sections = re.compile(r"^\[(%s)\]$" % "|".join(keywords))
ff_data = _strip_comments(fh)
# Empty dict that'll hold all the data.
final_ff_data = {key: FFSections() for key in keywords}
# Get first section header
for line in ff_data:
match = re.match(re_sections, line)
if match:
section = match.group(1)
in_section_for_first_time = True
break
else:
raise FFReaderError("A valid section header must be the first line in file")
else:
raise FFReaderError("No force-field sections exist")
# Read the rest of the file
for line in ff_data:
match = re.match(re_sections, line)
# If we've encounted a section header the next line must be an index list.
if in_section_for_first_time:
if line.split()[0] != "indices:":
raise FFReaderError(f"Missing index list for section: {section}")
idx = _validate_indices(line)
final_ff_data[section].use_idx = idx
in_section_for_first_time = False
in_params_for_first_time = True
continue
if match and in_params_for_first_time:
raise FFReaderError(
f"Section {section} missing parameters"
+ "Sections must contain atleast one type coefficients"
)
if match: # and not in_section_for_first_time and in_params_for_first_time
section = match.group(1)
in_section_for_first_time = True
continue
params = _validate_params(line)
final_ff_data[section].coeffs.update([params])
in_params_for_first_time = False
# Close the file if we opened it
if own:
fh.close()
for section in final_ff_data.values():
# coeff must exist if use_idx does
if section.use_idx is not None:
assert section.coeffs
return final_ff_data
Other stuff for the code to work
```python
def _strip_comments(
    instream: TextIO, comments: Union[str, Iterable[str], None] = "#"
) -> Iterable[str]:
    """ Strip comments from a text IO stream """
    if comments is not None:
        if isinstance(comments, str):
            comments = [comments]
        comments_re = re.compile("|".join(map(re.escape, comments)))
    try:
        for lines in instream.readlines():
            line = re.split(comments_re, lines, 1)[0].strip()
            if line != "":
                yield line
    except AttributeError:
        raise TypeError("instream must be a `TextIO` stream") from None


def _is_string(obj: object) -> bool:
    # Assumed one-line implementation; this helper was omitted from the post.
    return isinstance(obj, str)


@dataclass(eq=False)
class FFSections:
    """
    FFSections(coeffs, use_idx)

    Container for forcefield information
    """

    coeffs: Dict[int, List[float]] = field(default_factory=dict)
    use_idx: Optional[List[int]] = field(default=None)


class FFReaderError(Exception):
    """ Incorrect or badly formatted force-field data """

    def __init__(self, message: str, badline: Optional[str] = None) -> None:
        if badline:
            message = f"{message}\nError parsing --> ({badline})"
        super().__init__(message)


def _validate_indices(line: str) -> List[int]:
    """
    Check that the given line contains only a whitespace-separated
    list of integers
    """
    # Split on "indices:", then on whitespace
    split = line.split("indices:")[1].split()
    if not set(s.isdecimal() for s in split) == {True}:
        raise FFReaderError(
            "Indices should be integers separated by whitespace", line
        )
    return [int(x) for x in split]


def _validate_params(line: str) -> Tuple[int, List[float]]:
    """
    Check that the given line is a valid param line: an integer
    followed by one or more floats, separated by whitespace
    """
    split = line.split()
    id_ = split[0]
    coeffs = split[1:]
    if not id_.isdecimal():
        raise FFReaderError("Invalid params", line)
    try:
        coeffs = [float(x) for x in coeffs]
    except (TypeError, ValueError):
        raise FFReaderError("Invalid params", line) from None
    return (int(id_), coeffs)
```
I consider myself a beginner in Python and this is my first substantive project. I'd like the review to focus on the `ff_reader` function, but feel free to comment on the other parts too if there are better ways to do some things. I feel like the way I've written `ff_reader` is kind of ugly and inelegant. I'd be especially interested if there is a better way to read such files, perhaps parsing the whole file instead of line by line.
- What does the 5 in the lj_pairs section mean? It doesn't have a column label. – Roland Illig, May 5, 2019
- For this particular example, it is the cut-off distance beyond which the Lennard-Jones potential is 0. I don't think the physical meaning of the values is relevant here, but if it is needed I can provide additional context. Also, I just made these values up; they don't correspond to any real systems. – void life, May 5, 2019
- Thanks. You're right that the physical meaning is not relevant here. I was just missing the name of the column, since the other columns are named so nicely. – Roland Illig, May 5, 2019
- Possible duplicate of codereview.stackexchange.com/questions/219740/… – Justin, May 5, 2019
- @Graipher huh thanks, that is very weird, especially considering someone else said that they saw my question on SO. I'm pretty sure I just made an account here and posted the question. I have no idea how I could have posted it multiple times. – void life, May 6, 2019
2 Answers
I have written parsers for several similar file formats, and one time I started with the same idea as you: iterate over the lines and record the current state in some boolean variables. Over time, these parsers got too large to understand. Therefore I switched to a different strategy: instead of recording the current state in variables, record it implicitly in the code that is currently executed. I structured the parser like this:
```python
def parse_file(lines: Lines):
    sections = []
    while not lines.at_end():
        section = parse_section(lines)
        if section is None:
            break
        sections.append(section)
    return sections


def parse_section(lines: Lines):
    name = lines.must_match(r"^\[(\w+)\]$")[1]
    indices_str = lines.must_match(r"^\s*indices:\s*(\d+(\s+\d+)*)$")[1]
    data = []
    while not lines.at_end():
        row = parse_row(lines)
        if row is None:
            break
        data.append(row)
    indices = list(map(int, indices_str.split()))
    return Section(name, indices, data)
```
As you can see, each part of the file structure gets its own parsing function. Thereby the code matches the structure of the file format. Each of the functions is relatively small.
To make these functions useful, they need a source of lines, which I called `Lines`. This would be another class that defines useful functions such as `must_match`, which makes sure the "current line" matches the regular expression and throws a parse error if it doesn't. Using these functions as building blocks, writing and modifying the parser stays manageable, even when the file format becomes more complicated.
Another benefit of having these small functions is that you can test them individually: prepare a `Lines` object, pass it to the function, and see what it returns. This allows for good unit tests.

The `Lines` class consists of a list of lines and the index of the current line. As you parse the file, the index advances until you reach the end of the lines.
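As a sketch, such a `Lines` class could look like this (an assumed implementation based on the description above, not necessarily the exact class the answer has in mind):

```python
import re
from typing import List


class Lines:
    """A list of lines plus the index of the current line.

    Assumed implementation sketched from the description above.
    """

    def __init__(self, lines: List[str]) -> None:
        self.lines = lines
        self.index = 0

    def at_end(self) -> bool:
        return self.index >= len(self.lines)

    def must_match(self, pattern: str):
        """Match the current line against `pattern` and advance,
        or raise a parse error naming the offending line."""
        if self.at_end():
            raise ValueError(f"unexpected end of input, expected {pattern!r}")
        match = re.match(pattern, self.lines[self.index])
        if match is None:
            raise ValueError(
                f"line {self.index + 1}: {self.lines[self.index]!r} "
                f"does not match {pattern!r}"
            )
        self.index += 1
        return match


# Tiny demonstration: consume a section header, then an indices line.
lines = Lines(["[bonds]", "indices: 0 1"])
name = lines.must_match(r"^\[(\w+)\]$")[1]
indices = [int(s) for s in
           lines.must_match(r"^indices:\s*(\d+(?:\s+\d+)*)$")[1].split()]
```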
Regarding your code:
I don't like the union types very much. They make the code more complicated than necessary. For example, when stripping the comments, you actually only need the single comment marker `#`. Therefore all the list handling can be removed, and the comment character doesn't need to be a parameter at all.

Stripping the comments at the very beginning is a good strategy, since otherwise you would have to repeat that code in several other places.

In that comment-removal function you declared that the comment may also be `None`, but actually passing `None` will throw an exception.

Be careful when opening files. Every file that is opened must be closed again when it is no longer needed, even in case of exceptions. Your current code does not close the file when a parse error occurs. This is another reason against union types: it would be easier to have separate functions, one that parses from a list of strings and one that parses from a file. How big are the files? Does it hurt to load them into memory as a single block? If they got larger than 10 MB, that would be a valid concern.
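As a sketch of that split (hypothetical function names; the comment stripping here merely stands in for the real parsing work): a core function that parses from in-memory lines, plus a thin wrapper that opens the file with a `with` statement so the file is closed even when a parse error is raised.

```python
from typing import List


def parse_lines(lines: List[str]) -> List[str]:
    """Hypothetical core parser working on already-loaded lines.
    Here it only strips comments and blank lines to keep the sketch
    short; the real section parsing would build on top of this."""
    stripped = (line.split("#", 1)[0].strip() for line in lines)
    return [line for line in stripped if line]


def parse_file(path: str) -> List[str]:
    # The with-statement guarantees the file is closed even if
    # parse_lines raises, unlike a manual open()/close() pair.
    with open(path, mode="r") as fh:
        return parse_lines(fh.readlines())
```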
- Thank you for the review. Recording the current state in some boolean variables is exactly what seemed messy to me, but I'm not quite sure I follow your alternative approach, especially how the current state is recorded implicitly. Are you saying that instead of parsing it line by line, I should try to parse an entire section as a whole? Or that each part should get its own function, like `parse_name`, `parse_indices` and `parse_data`, etc.? I'm not sure how this implicitly records the current state; or have I misunderstood, and it is the `lines` object that somehow records the state? – void life, May 5, 2019
- You're right that I don't need any other marker than `#`; I made it a parameter because I was worried that someone might decide to use something else as comments in the future. I thought it'd be easier to make it a parameter rather than change the function. The files are typically just 5-8 MB, so I guess I could load them into memory. But to fix the files not closing if there is an exception, should I add a `finally` block after every `try`, or use a context manager? – void life, May 5, 2019
- You're right, I didn't write the part about the "implicit" state clearly enough. By this implicit state I mean: instead of having a boolean variable that saves the state, the code itself contains this information already. If the computer is currently executing the `while not lines.at_end()` loop, you already know what the current state is: the parser is either at the beginning of the file, or it just finished a section. Regarding `parse_name`: you could do that, or if it is simple enough, just leave it as a one-liner like in the code I suggested. That's up to you. – Roland Illig, May 5, 2019
- Regarding the `try ... finally`: it's really difficult with your current code, since sometimes you need to close the stream and sometimes you don't. The easiest way of closing files is by using the `with` statement. – Roland Illig, May 5, 2019
- Ah ok, I see what you mean now. Thank you! – void life, May 5, 2019
See, I saw your question on Stack Overflow the other day before it was moved here, but thought to answer nevertheless.

Imo, the way to go is to write yourself a grammar/parser and a `NodeVisitor` class. That is: formulate little parts in a first step and then glue them all together afterwards.
```python
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

data = """
############################################
# Comments begin with '#'
############################################

[lj_pairs] # Section 1
indices: 0 2
# ID eps sigma
1 2.344 1.234 5
2 4.423 5.313 5
3 1.573 6.321 5
4 1.921 11.93 5

[bonds]
indices: 0 1
2 4.234e-03 11.2
6 -0.134545 5.7
"""

grammar = Grammar(
    r"""
    expr      = (entry / garbage)+
    entry     = section garbage indices (valueline / garbage)*
    section   = lpar word rpar
    indices   = ws? "indices:" values+
    garbage   = ((comment / hs)* newline?)*
    word      = ~"\w+"
    values    = float+
    valueline = values newline?
    float     = hs? ~"[-.e\d]+" hs?
    lpar      = "["
    rpar      = "]"
    comment   = ~"#.+"
    ws        = ~"\s*"
    hs        = ~"[\t\ ]*"
    newline   = ~"[\r\n]"
    """
)

tree = grammar.parse(data)


class DataVisitor(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_int(self, node, visited_children):
        _, value, _ = visited_children
        return int(value.text)

    def visit_float(self, node, visited_children):
        _, value, _ = visited_children
        return value.text

    def visit_section(self, node, visited_children):
        _, section, _ = visited_children
        return section.text

    def visit_indices(self, node, visited_children):
        *_, values = visited_children
        return values[0]

    def visit_valueline(self, node, visited_children):
        values, _ = visited_children
        return values

    def visit_garbage(self, node, visited_children):
        return None

    def visit_entry(self, node, visited_children):
        section, _, indices, lst = visited_children
        values = [item[0] for item in lst if item[0]]
        return (section, {'indices': indices, 'values': values})

    def visit_expr(self, node, visited_children):
        return dict([item[0] for item in visited_children if item[0]])


d = DataVisitor()
out = d.visit(tree)
print(out)
```
Which will yield

```python
{'lj_pairs': {'indices': ['0', '2'], 'values': [['1', '2.344', '1.234', '5'], ['2', '4.423', '5.313', '5'], ['3', '1.573', '6.321', '5'], ['4', '1.921', '11.93', '5']]},
 'bonds': {'indices': ['0', '1'], 'values': [['2', '4.234e-03', '11.2'], ['6', '-0.134545', '5.7']]}}
```
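One caveat: the grammar captures everything as strings. If numeric types are wanted, a small post-processing pass can convert indices and row IDs to `int` and the coefficients to `float` (a sketch, assuming exactly the output shape printed above):

```python
def to_numeric(parsed):
    """Convert the all-string parser output into numbers:
    indices and row IDs become ints, coefficients become floats."""
    result = {}
    for section, entry in parsed.items():
        result[section] = {
            "indices": [int(i) for i in entry["indices"]],
            "values": [
                [int(row[0])] + [float(x) for x in row[1:]]
                for row in entry["values"]
            ],
        }
    return result


sample = {"bonds": {"indices": ["0", "1"],
                    "values": [["2", "4.234e-03", "11.2"]]}}
converted = to_numeric(sample)
```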
If you - or anybody else - are interested, I'd add some explanation as well.
- This would be a great answer on SO. Is there any way you can address the user's code to improve it, as well as suggesting an alternate implementation? Keep in mind we review code here and don't provide how-to answers. – May 6, 2019
- Thanks for the answer. I'm not familiar with `parsimonious`, so I'd be more than grateful for an explanation, especially as this seems somewhat cleaner. If you can't reformulate the answer to be more in line with CR standards, I can ask the same question on SO (if I can modify it to be in line for SO) and you can answer there. – void life, May 6, 2019
- @voidlife: Post it on SO, add the link here and I'd add some explanation then. – Jan, May 6, 2019