
Our testing suite is configured using a hierarchical config file format similar to .ini or .toml files. Each 'block' is a test that is run on our cluster. I'm working on adding a unique id to each test to help map results to tests over time. A small example of a config file would be:

[Tests]
  [test1]
    type = CSVDIFF
    input = test1.i
    output = 'test1_chkfile.csv'
    max_tol = 1.0e-10
    unique_id = fa3acd397ae0d633194702ba6982ee93da09b835945845771256f19f44816f31
  []
  [test2]
    type = CSVDIFF
    input = test2.i
    output = 'test2_chkfile.csv'
  []
[]

The idea is to check all of these files and, for any test that doesn't have an id, create one and write it to the file. I would like a quick code review of my style, conventions, and logic. Here is the script that will be run as part of the CI precheck:

#!/usr/bin/env python3
import hashlib
import sys
from glob import glob
from textwrap import dedent
from time import time
from collections import UserDict
from typing import List, AnyStr, Any

import pyhit  # type: ignore


class StrictDict(UserDict):
    """Custom dictionary class that raises ValueError on duplicate keys.

    This class inherits from collections.UserDict, which is the proper
    way to create a subclass inheriting from `dict`. This dictionary
    will raise an error if it is given a `unique_id` key that it has
    previously indexed. Otherwise, it provides all methods and features
    of a standard dictionary.
    """

    def __setitem__(self, key: Any, value: Any) -> None:
        try:
            current_vals = self.__getitem__(key)
            raise ValueError(
                dedent(
                    f"""\
                    Duplicate key '{key}' found!
                    First id found in {current_vals[0]}, line {current_vals[1]}.
                    Duplicate id found in {value[0]}, line {value[1]}.\
                    """
                )
            )
        except KeyError:
            self.data[key] = value


def hashnode(node: pyhit.Node) -> str:
    """Return a sha256 hash of a spec block to be used as a unique id."""
    # time() returns the number of seconds since Jan. 1st, 1970 in UTC.
    hash_str = node.fullpath + str(time()) + node.render()
    sha_signature = hashlib.sha256(hash_str.encode()).hexdigest()
    return sha_signature


def fetchnodes(root: pyhit.Node) -> List[pyhit.Node]:
    """Return a list of children nodes that will either have or need ids."""
    nodes = []
    for node in root.children[0].descendants:
        # Ensure we only grab blocks that contain specification vars.
        if node.get("type") is None:
            continue
        nodes.append(node)
    return nodes


def indexnodes(file_paths: List[AnyStr]) -> StrictDict:
    """Return a dictionary containing a list of nodes for every file."""
    node_dict = StrictDict()
    for file_path in file_paths:
        root = pyhit.load(file_path)
        node_dict[(file_path, root)] = fetchnodes(root)
    return node_dict


def indexids(node_dict: StrictDict) -> StrictDict:
    """Return a dictionary of ids containing file and line info."""
    id_dict = StrictDict()
    for (file_path, _), nodes in node_dict.items():
        for node in nodes:
            unique_id = node.get("unique_id")
            if unique_id is None:
                continue
            id_dict[unique_id] = (file_path, node.line("unique_id"))
    return id_dict


def writeids(node_dict: StrictDict, id_dict: StrictDict) -> int:
    """Return the number of files written that needed a hashed id."""
    num = 0
    for (file_path, root), nodes in node_dict.items():
        # Assume we won't need to write any new files.
        write_p = False
        for node in nodes:
            if node.get('unique_id') is None:
                hash_str = hashnode(node)
                node['unique_id'] = hash_str
                id_dict[hash_str] = (file_path, node.line("unique_id"))
                write_p = True
        if write_p:
            pyhit.write(file_path, root)
            num += 1
    return num


def main():
    """Driving function for the script."""
    # Make sure to run this script in the root of BISON.
    assessment_specs = glob("./assessment/**/*/assessment", recursive=True)
    spec_dict = indexnodes(assessment_specs)
    id_dict = indexids(spec_dict)
    num_files_written = writeids(spec_dict, id_dict)
    if num_files_written > 0:
        print("Your code requires assessment file changes.")
        print("You can run ./scripts/unique_assessment_id.py in the top level of your repository.")
        print("Then commit the changes and resubmit.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
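For reference, here is a minimal standalone sketch of the duplicate-key behaviour I'm after, trimmed down to just the check (without the file/line message formatting), so the intent is clear even without pyhit installed:

```python
from collections import UserDict


class StrictDict(UserDict):
    """Trimmed-down sketch: reject any key that already exists."""

    def __setitem__(self, key, value):
        if key in self.data:
            raise ValueError(f"Duplicate key '{key}' found!")
        self.data[key] = value


ids = StrictDict()
ids["abc123"] = ("tests/a/assessment", 12)  # first insert is fine
try:
    ids["abc123"] = ("tests/b/assessment", 34)  # second insert raises
except ValueError as exc:
    print(exc)
```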
Reinderien
asked Jun 23, 2020 at 15:47

1 Answer


Generators

In this:

nodes = []
for node in root.children[0].descendants:
    # Ensure we only grab blocks that contain specification vars.
    if node.get("type") is None:
        continue
    nodes.append(node)
return nodes

you do not need to construct a list like this. Instead, perhaps

return (
    node for node in root.children[0].descendants
    if node.get('type') is not None
)

This remains a generator until it is materialized into a list, tuple, etc., which you might not need to do at all if you only iterate over the results once.
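For example, a small sketch (unrelated to pyhit, just to show the lazy, one-shot behaviour of generator expressions):

```python
nums = [1, 2, 3, 4, 5]

# Generator expression: nothing is evaluated at this point.
evens = (n for n in nums if n % 2 == 0)

first = list(evens)   # materializing runs the filter once
second = list(evens)  # the generator is now exhausted, so this is empty
print(first, second)
```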

answered Jun 23, 2020 at 16:56
  • Suppose I'm trying to modify those nodes placed in the generator. If I loop through the generator and modify those nodes, will the changes be kept in memory when I later write the modifications to the file? Commented Jun 23, 2020 at 19:56
  • I'm not completely clear on what you're asking, but you can either cast to a list, modify it, and then write to a file; or you can iterate into a second generator with your modifications and then write to a file. The first method requires O(n) memory and the second O(1). Commented Jun 23, 2020 at 19:58
  • Great, I followed your advice and was able to refactor so I only iterate once. I totally forgot that generators get "used up" after looping through them. Thanks. Commented Jun 24, 2020 at 16:14
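The one-pass refactor discussed in the comments above can be sketched roughly as follows; `Node` here is a hypothetical stand-in for a pyhit node, and the id value is a placeholder rather than a real hash:

```python
class Node:
    """Hypothetical stand-in for a pyhit node; only the attribute we need."""

    def __init__(self, name, unique_id=None):
        self.name = name
        self.unique_id = unique_id


nodes = [Node("test1", "abc"), Node("test2"), Node("test3")]

# Filter lazily, then mutate each matching node in a single pass.
# The mutations persist because the generator yields the very same
# objects that live in `nodes`; no copies are made.
for node in (n for n in nodes if n.unique_id is None):
    node.unique_id = "freshly-hashed-id"  # placeholder for hashnode(node)
```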
