I am a physics PhD student who co-developed a Python package for the conversion of standard file formats used in Scanning Probe Microscopy (SPM) and for archiving data manipulations for traceability. If you want to know more about the package, an article was published here: https://doi.org/10.1016/j.ultramic.2021.113345
In summary, the goal of the project is to have a well-defined file structure inside an hdf5 file, which will contain three root folders (h5py.Group): datasets, metadata and process. datasets will contain raw data extracted from an existing file in a standard SPM format; metadata will contain the metadata of that original file; and process will contain all the data manipulations that have been done on the raw data (or on existing processed data), with attributes saving all the information necessary for data-manipulation traceability.
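For illustration, creating that root layout with h5py would look roughly like this (a minimal sketch):

import h5py

with h5py.File("example.hdf5", "a") as f:
    f.require_group("datasets")  # raw data extracted from the original SPM file
    f.require_group("metadata")  # metadata copied over from the original file
    f.require_group("process")   # processed data, with traceability attributes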
The initial version of the code was built without thinking too much about best practices and extensibility, so I am currently doing a "big rewrite".
I am almost done with a class called hyFile, which is a wrapper class around h5py.File and which will handle:
- The file conversion (from a standard SPM format to hdf5)
- The reading and writing of data from the hdf5 file (using the existing package h5py)
- The traceability of processed data through the apply function, which takes an arbitrary function and the path to the data that needs processing, and saves the result into the process folder.

It contains all the necessary dunder methods to write/read data in the hdf5 file and to be a context manager. It also contains an internal class which allows the manipulation of hdf5 attributes.
import h5py
import pathlib
import warnings
from . import ibw_files, gsf_files
import inspect
import numpy as np
from datetime import datetime
import types


class hyFile:
    class Attributes:
        def __init__(self, file):
            self.file = file

        def __contains__(self, path: str) -> bool:
            return path in self.file

        def __getitem__(self, path: str | None = None) -> dict:
            if path is None:
                f = self.file[""]
            if path != "":
                f = self.file[path]
            return {key: f.attrs[key] for key in f.attrs.keys()}

        def __setitem__(self, path: str, attribute: tuple[str, any]) -> None:
            self.file[path].attrs[attribute[0]] = attribute[1]

    def __init__(self, path, mode="r+"):
        if isinstance(path, str):
            self.path = pathlib.Path(path)
        else:
            self.path = path

        if not self.path.is_file():
            self.file = h5py.File(self.path, "a")
            for group in ["datasets", "metadata", "process"]:
                self._require_group(group)
            if mode != "a":
                self.file.close()
                self.file = h5py.File(self.path, mode)
        else:
            self.file = h5py.File(self.path, mode)
            root_struct = set(self.file.keys())
            if root_struct != {"datasets", "metadata", "process"}:
                warnings.warn(
                    f"Root structure of the hdf5 file is not composed of 'datasets', 'metadata' and 'process'. \n It may not have been created with Hystorian."
                )

        self.attrs = self.Attributes(self.file)

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback) -> None:
        if self.file:
            self.file.close
        if value is not None:
            warnings.warn(f"File closed with the following error: {type} - {value} \n {traceback}")

    def __getitem__(self, path: str = "") -> h5py.Group | h5py.Dataset:
        if path == "":
            return self.file
        else:
            return self.file[path]

    def __setitem__(self, data: tuple[str, any], overwrite=True) -> None:
        self._create_dataset(data, overwrite)

    def __delitem__(self, path: str) -> None:
        if path not in self.file:
            raise KeyError("Path does not exist in the file.")
        del self.file[path]

    def __contains__(self, path: str) -> bool:
        return path in self.file

    def read(self, path: str = ""):
        if path == "":
            return self.file.keys()
        else:
            if isinstance(self.file[path], h5py.Group):
                return self.file[path].keys()
            else:
                return self.file[path][()]

    def extract_data(self, path: str):
        conversion_fcts = {
            ".gsf": gsf_files.extract_gsf,
            ".ibw": ibw_files.extract_ibw,
        }

        if isinstance(path, str):
            path = pathlib.Path(path)

        if path.suffix in conversion_fcts:
            data, metadata, attributes = conversion_fcts[path.suffix](path)
            self._write_extracted_data(path, data, metadata, attributes)
        else:
            raise TypeError(f"{path.suffix} file don't have a conversion function.")

    def apply(
        self,
        function: callable,
        inputs: list[str] | str,
        output_names: list[str] | str | None = None,
        increment_proc: bool = True,
        **kwargs,
    ):
        def convert_to_list(inputs):
            if isinstance(inputs, list):
                return inputs
            return [inputs]

        inputs = convert_to_list(inputs)
        if output_names is None:
            output_names = inputs[0].rsplit("/", 1)[1]
        output_names = convert_to_list(output_names)

        result = function(*inputs, **kwargs)
        if result is None:
            return None
        if not isinstance(result, tuple):
            result = tuple([result])

        if len(output_names) != len(result):
            raise Exception("Error: Unequal amount of outputs and output names")

        num_proc = len(self.read("process"))
        if increment_proc or f"{str(num_proc).zfill(3)}-{function.__name__}" not in self.read("process"):
            num_proc += 1

        out_folder_location = f"{'process'}/{str(num_proc).zfill(3)}-{function.__name__}"

        for name, data in zip(output_names, result):
            self._create_dataset((f"{out_folder_location}/{name}", data))
            self._write_generic_attributes(f"{out_folder_location}/{name}", inputs, name, function)
            self._write_kwargs_as_attributes(f"{out_folder_location}/{name}", function, kwargs, first_kwarg=len(inputs))

    def _write_generic_attributes(
        self, out_folder_location: str, in_paths: list[str] | str, output_name: str, function: callable
    ) -> None:
        if not isinstance(in_paths, list):
            in_paths = [in_paths]

        operation_name = out_folder_location.split("/")[1]
        new_attrs = {
            "path": out_folder_location + output_name,
            "shape": np.shape(self[out_folder_location]),
            "name": output_name,
        }

        if function.__module__ is None:
            new_attrs["operation name"] = "None." + function.__name__
        else:
            new_attrs["operation name"] = function.__module__ + "." + function.__name__

        if function.__module__ == "__main__":
            new_attrs["function code"] = inspect.getsource(function)

        new_attrs["operation number"] = operation_name.split("-")[0]
        new_attrs["time"] = str(datetime.datetime.now())
        new_attrs["source"] = in_paths

        self.attrs[out_folder_location] = new_attrs

    def _write_kwargs_as_attributes(
        self, path: str, func: callable, all_variables: dict[str, any], first_kwarg: int = 1
    ) -> None:
        attr_dict = {}
        if isinstance(func, types.BuiltinFunctionType):
            attr_dict["BuiltinFunctionType"] = True
        else:
            signature = inspect.signature(func).parameters
            var_names = list(signature.keys())[first_kwarg:]
            for key in var_names:
                if key in all_variables:
                    value = all_variables[key]
                elif isinstance(signature[key].default, np._globals._NoValueType):
                    value = "None"
                else:
                    value = signature[key].default

                if callable(value):
                    value = value.__name__
                elif value is None:
                    value = "None"

                try:
                    attr_dict[f"kwargs_{key}"] = value
                except RuntimeError:
                    RuntimeWarning("Attribute was not able to be saved, probably because the attribute is too large")
                    attr_dict["kwargs_" + key] = "None"

        self.attrs[path] = attr_dict

    def _require_group(self, name: str):
        self.file.require_group(name)

    def _create_dataset(self, data: tuple[str, any], overwrite=True) -> None:
        if data[0] in self.file:
            if overwrite:
                del self.file[data[0]]
            else:
                raise KeyError("Key already exist and overwriste is set to False.")

        self.file.create_dataset(data[0], data=data[1])

    def _write_extracted_data(self, path, data, metadata, attributes) -> None:
        self._require_group(f"datasets/{path.stem}")

        for d_key, d_value in data.items():
            self._create_dataset((f"datasets/{path.stem}/{d_key}", d_value), overwrite=True)
            for attr in attributes[d_key].items():
                self.attrs[f"datasets/{path.stem}/{d_key}"] = attr

        if isinstance(metadata, dict):
            for key in metadata:
                self._create_dataset((f"metadata/{path.stem}/{key}", metadata[key]))
        else:
            self._create_dataset((f"metadata/{path.stem}", metadata))
Here are a few of the questions that I have.
- As it is right now, the class is a context manager, which opens the hdf5 file at the creation of the object; the file then stays open as long as the object exists. It simplifies the manipulation a lot, since I can just pass the file object along. However, would this possibly create conflicts with the OS trying to access the file?
- I have the feeling that the class is doing too much at the same time: creating files, read/write, conversion, data manipulation. However, I am not sure how I would split it into smaller chunks.
- I am not very happy with the apply function. As it is now, it takes a function, the inputs for it, the names of the outputs to be saved, and **kwargs that are passed to the function. I am not sure how to make it as general as possible without bloating it: in theory any kind of function could be passed to it; in practice most of the ones passed will manipulate some kind of numpy array, and also give a numpy array as a result.
- I am also not sure about extract_data: it contains a dict of functions that extract data from multiple file formats and always return three dictionaries: data, metadata and attributes. Is this a good way to do it?
1 Answer
Thank you for your work to advance how folks take images of tiny things!
I'd like to comment on something outside of the submission:
def m_apply(filename, function, ...):
    """
    Take any function and handles the inputs of the function by looking into the hdf5 file
    Also write the output of the function into the hdf5.
I found the project documentation frustrating, including those two sentences. At first blush it appeared to me that m_apply was central to the project, a core concept that I should understand at once before moving on to other details. I eventually formed the opinion that I should probably be mentally pronouncing it "matrix apply", that is, applying some function to a named input matrix. But I am sad that neither the tutorial, the "readthedocs" text, nor this docstring ever mentions that.
Moving on, both the tutorial and the (short!) source code explained to me that l_apply should be read as "list apply", which repeatedly invokes m_apply. So that much is kind of cool. Though again, the """docstring""" could be improved.
Ok, back to the submitted code.
You have a bunch of imports up front. Please run isort *.py to clean them up. Why does conforming to pep-8 matter? Because it makes it easier for the Gentle Reader to see what's going on. Plus, it reduces merge conflicts when two separate feature branches each add their own new deps.
It would be more convenient to phrase it this way:
from pathlib import Path
class hyFile:
No. Please spell it HyFile. Do not create a Public API with a misspelled class name.
def __init__(self, file):
This is astonishing. Perhaps file is the appropriate identifier, I'm not certain about that topic yet. But if you're going to call it file, then you absolutely must add an "optional" type annotation to this signature. As written, I expected that it meant something like
def __init__(self, file: Path):
or
def __init__(self, file: str):
or conceivably some open file handle. Explaining it in a """docstring""" would not be enough -- you need to annotate the file: h5py.File type if you want clear communication with your readers.
def __contains__(self, path: str) -> bool:
I am a little sad we don't see path: Path, but OK, whatever.
def __getitem__(self, path: str | None = None) -> dict:
This is lovely. Note that str | None is the newer PEP 604 spelling; Optional[str] is the older typing-module form. GvR claims "there should be one obvious way to do it!", so pick one form and use it consistently across the codebase.
if path is None:
    f = self.file[""]
if path != "":
    f = self.file[path]
return {key: f.attrs[key] for key in f.attrs.keys()}
Ok, the technical term for that is "pretty bad". You seem to contemplate that the caller might pass in the empty string. And then, when that happens, you dereference the uninitialized local f, giving a NameError. I imagine that linting with mypy might have caught that? Haven't tried it yet.
self.file[path].attrs[attribute[0]] = attribute[1]
This is nice enough. Certainly it works. It would be nicer if you introduced names for those, via tuple unpack:
k, v = attribute
self.file[path].attrs[k] = v
(Or key, val, whatever.)
The __init__ constructor is doing kind of a lot of stuff there. Maybe banish some of it to a helper? Or two? Think about adding some method .foo() to this class, whether it's you or another maintainer doing it. And think about authoring a new unit test. How hard do we want to make it to add such a test? We need an H5 file before we can even begin to think about testing some utility function. :-(
The sticking point seems to be having a valid self.file. Maybe require the caller to pass it in? A utility function outside the class could help callers with that. Alternatively, we might have a pair of similar classes, one which accepts a str name, the other accepting an existing file Path.
if isinstance(path, str):
    self.path = pathlib.Path(path)
else:
    self.path = path
This seems tediously long-winded. Just unconditionally assign self.path = Path(path) and be done with it. And similarly in .extract_data(). Path happily accepts an existing Path, just as a redundant call of str("abc") will return those three characters unchanged.
We have if NOT .is_file(): A else: B. Consider rephrasing it as if .is_file(): B else: A. Humans are better at reasoning about code that lacks negation. (And double negatives? Don't even go there.)
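Concretely, a sketch of the reordered branch, preserving the current behavior:

if self.path.is_file():
    self.file = h5py.File(self.path, mode)
    ...  # the existing root-structure check
else:
    self.file = h5py.File(self.path, "a")
    for group in ["datasets", "metadata", "process"]:
        self._require_group(group)
    if mode != "a":
        self.file.close()
        self.file = h5py.File(self.path, mode)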
This .is_file() test does not seem robust, in the sense that H5 and FS have different namespaces, and there could accidentally be a name collision between them. So again, allowing the caller to choose his own adventure might be the better route to follow.
A maintainer would probably be grateful if you mentioned the value of f"{root_struct}" when you .warn().
def __enter__(self):
    return self
Oh, nice, we have a context manager! The class """docstring""" probably should have spelled that out. Certainly this definition is correct, though it's not always necessary to write such methods by hand; for simple cases the contextlib.contextmanager decorator typically suffices.
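For reference, the decorator form looks something like this (a sketch; the open_hyfile name is my invention):

from contextlib import contextmanager

@contextmanager
def open_hyfile(path, mode="r+"):
    f = hyFile(path, mode)  # the class under review
    try:
        yield f
    finally:
        f.file.close()

With that form, __enter__ and __exit__ come for free.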
if self.file:
    self.file.close
I don't understand what's going on here at all. It seems like the Author's Intent must have been .close()? As written, the bare attribute lookup evaluates the bound method and discards it, so the file is never actually closed.
def __exit__(self, type, value, traceback) -> None:
I don't understand this signature. Or rather, I don't agree with it; we could easily make it more closely match the documentation. Please don't shadow the builtin type. Prefer to "be boring"; prefer the exc_type mentioned in the docs.
if value is not None:
This is actually pretty bad. I have participated in many code reviews where hopelessly vague identifiers like this were criticized, though in some contexts a name of value makes perfect sense. Here? Absolutely not! You must use the conventional name of exc_value so it is clear we're dealing with propagation of an EXCeption.
Ok, let's come back to that signature. It's annotated to return None, that is, we evaluate for side effects. What?!? The documentation's mention of "should return a true value" makes it clear that ...) -> bool: is the appropriate return type. Did mypy really agree with this code? Sigh! (Yes, I understand, dropping off the end of this method implicitly returns None, which is similar to return False. Typically a decorator glosses over these details.)
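Putting the naming and return-type remarks together, a sketch of the conventional shape:

def __exit__(self, exc_type, exc_value, traceback) -> bool:
    if self.file:
        self.file.close()
    if exc_value is not None:
        warnings.warn(f"File closed with the following error: {exc_type} - {exc_value} \n {traceback}")
    return False  # let any exception propagate to the caller

A decorator-based manager would make these details moot, but if you keep the handwritten method, that is the shape to aim for.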
def __getitem__(self, path: str = "") -> h5py.Group | h5py.Dataset:
Kudos, that's a very helpful signature, thank you.
if path == "":
Ok, I'm starting to get the feeling that the empty string is special to this class, in the way that None is often a special input parameter to lots of other classes. And that's fine, it may be good design.
But it has to be explicitly documented. I haven't seen any """docstring""" call out its special meaning.
I don't know the right answer, but it is possible that you might find it convenient and self-documenting to nominate some other special reserved string to play that role. This is all part of the design of a Public API, and I will be the first to confess that (A) it ain't easy, and (B) likely v2 will get it "more" right than the v1 design.
raise KeyError("Path does not exist in the file.")
Some poor maintainer will read a bug report and then curse you. Prefer:
raise KeyError(f"Path {path} does not exist in the file.")
return self.file[path][()]
I get the sense we will be returning an h5py.Dataset here. But I'd feel better about this code if an if or an assert was explicit about that. Also, I guess I'm not really understanding what's going on with the optional type annotations in this class. I see them on some methods. Indeed, we see that this method assures us path shall be a str. But what of the return type? I would like to be able to better reason about path and its de-reference. In general, DbC tells us that every method adheres to a loose or a tight contract. Here, I am left wondering what the read() contract is.
conversion_fcts = {
I mentally read that as "conversion facts". Had I authored the code I likely would have named it conversion_fns. Consider spelling out conversion_functions.
It seems pretty clear that extract_data() seeks to impose some post-condition. It is important to explicitly write down that promise.
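For example, a """docstring""" along these lines (a sketch; adjust it to whatever the conversion functions actually guarantee):

def extract_data(self, path: str) -> None:
    """Convert a standard SPM file (.gsf or .ibw) into this HDF5 file.

    Post-condition: datasets/<stem> and metadata/<stem> are populated from
    the data, metadata and attributes dicts returned by the format-specific
    conversion function. Raises TypeError for an unsupported suffix.
    """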
nit, typo: "doesn't" for "...file don't have..."
def apply(
    self,
    function: callable,
    inputs: list[str] | str,
    output_names: list[str] | str | None = None,
Thank you for the type annotations. They are clear, beautiful, very helpful.
As far as design of Public API goes, I'm not super fond of the whole "could be a list, or maybe a str" thing. Maybe that's the right design, maybe you have it exactly right, and I should pipe down, that's fine. Or maybe your users would be OK with having a singular form, and a plural form, of that method.
The concern is this. We want to deliberately design a Public API that is "easy to use", which also means it is "hard to misuse", hard to misinterpret the result if bad things happened along the way. When we admit inputs of diverse types, maybe we are doing the caller a kindness. For example, viewing path: str as equivalent to Path(path) seems harmless enough. The flip side is, insisting on a specific type can reveal bugs as soon as they're authored. For example, a caller of def foo(a, b, c, d, e, f, path=None, g=None): might have 1 too many args, or 1 too few. Insisting that path be of type Path can be helpful to an app engineer who cares about calling it correctly, perhaps by running mypy.
In any event the type annotations, such as for output_names, look lovely as written, and I thank you.
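And if you do go the singular/plural route, the split might look like this (a sketch; apply_one and apply_many are names I invented):

def apply_one(self, function: callable, input_path: str, output_name: str | None = None, **kwargs):
    names = None if output_name is None else [output_name]
    return self.apply_many(function, [input_path], names, **kwargs)

def apply_many(self, function: callable, input_paths: list[str], output_names: list[str] | None = None, **kwargs):
    ...  # the current apply() body, minus the str-or-list juggling

Either way, the plural form would always receive genuine lists.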
if len(output_names) != len(result):
    raise Exception("Error: Unequal amount of outputs and output names")
The """docstring""" does not anticipate this possibility, and does not warn the caller of it. How do I know? Because there is no docstring! This is an essential aspect of the Public API, one that must be documented.
When a maintainer chases a bug report on this, the "unequal" remark will be less helpful than seeing the two integer lengths within the diagnostic.
Do not raise Exception. Either use something more specific like ValueError, or define an app-specific exception class.
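Something like this (a sketch) addresses both remarks at once:

if len(output_names) != len(result):
    raise ValueError(
        f"{function.__name__} returned {len(result)} outputs, "
        f"but {len(output_names)} output names were given"
    )

A ValueError also gives callers something specific to catch.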
if increment_proc or f"{str(num_proc).zfill(3)}-{function.__name__}" not in self.read("process"):
That's a nice expression.
It needs to be extracted into a helper. I can't imagine that's the only place it appears in this codebase.
out_folder_location = f"{'process'}/{str(num_proc).zfill(3)}-{function.__name__}"
Oh, look! I was right! Same expression, sure enough.
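A sketch of such a helper (the name is my invention):

def _proc_name(self, num_proc: int, function: callable) -> str:
    # e.g. "003-gaussian_filter"; or use f"{num_proc:03d}" if you prefer format specs
    return f"{str(num_proc).zfill(3)}-{function.__name__}"

The membership test then reads self._proc_name(num_proc, function) not in self.read("process"), and the location becomes f"process/{self._proc_name(num_proc, function)}".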
for name, data in zip(output_names, result):
When we assigned result = function(*inputs, **kwargs) I had no idea what was going on, it was pretty generic. But now it appears the identifier should have been results, since the API contract seems to imply some container of more than one result value. If (e.g. with annotations) you can encourage function to return a certain kind of result value, then please do that. Being specific is better than being vague. It helps us to reason about the code.
You define out_folder_location, but then you only reference that plus /name. Consider changing how you assign local vars.
"path": out_folder_location + output_name,
I don't understand that expression. My reading, which likely is wrong, is that the location does not end with a / slash, and the name contains no slash. It seems like we want a slash between those two. Sadly this submission contains no unit tests. An example input/output pair would go a long way toward documenting the intended behavior of this line of code.
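If my reading is right, the fix is a one-character separator:

"path": f"{out_folder_location}/{output_name}",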
DRY. In _write_generic_attributes, for the pair of assignments ending with
new_attrs["operation name"] = function.__module__ + "." + function.__name__
prefer this single assignment:
new_attrs["operation name"] = (function.__module__ or "None") + "." + function.__name__
Some of this method's behavior seems "quirky", such as an assignment conditional on this:
if function.__module__ == "__main__":
Describing such behavior in a """docstring""" would go a long way toward making this class easy-to-use for newly onboarded engineers just starting to work with this project.
Embrace sphinx! Publish a "readthedocs" site.
if isinstance(func, types.BuiltinFunctionType):
I fail to see why this is of interest. But I'm just flogging the same old horse that carries a banner saying "there's no docs!". You really need to document this code's behavior.
value = "None"
That str seems a slightly odd choice, compared to assigning None. Ok, whatever. (It's not like a type hint on attr_dict told us to expect str values.)
RuntimeWarning("Attribute was not able to be saved, probably because the attribute is too large")
I don't understand that line at all. We were storing to a variable of type dict; I'm not aware of a way to provoke such an error. An illuminating # comment would be helpful. (Note also that the bare RuntimeWarning(...) in the except clause merely constructs a warning object and discards it; presumably warnings.warn(...) was intended.)
attr_dict[f"kwargs_{key}"] = value
...
attr_dict["kwargs_" + key] = "None"
Gratuitously using two ways to express the same thing is not a kindness to the Gentle Reader.
def _create_dataset(self, data: tuple[str, any], overwrite=True) -> None:
    if data[0] in self.file:
        ...
    self.file.create_dataset(data[0], data=data[1])
The identifier data is probably the right name to use, but it is rather vague. And it has structure, as the type annotation explains. Please use an x, y = data tuple unpack, so you can name the elements. (I would offer multi-character names, except that I don't know what appropriate names would be.)
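Perhaps something like this (path and value are my guesses at appropriate names):

def _create_dataset(self, data: tuple[str, any], overwrite=True) -> None:
    path, value = data  # tuple unpack names the elements
    if path in self.file:
        ...
    self.file.create_dataset(path, data=value)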
In _write_extracted_data, you again might want to assign a path temp var. Here it might be f"datasets/{path.stem}".
- context manager, ... would this possibly create conflict with the os trying to access the file?
No, there's no conflict.
But I agree with you that this class seems to be doing more than one thing. If the submission included a short test suite or other example calls we might be able to tease out two (or more!) distinct use cases that several classes could specialize in.
- I have the feeling that the class is doing too much
Yes, I'm inclined to agree.
But it's hard to evaluate without seeing the typical call pattern. Unit tests would go a long way toward illuminating that. Possibly a pair of similar classes would make sense, but it's hard for me to tell based on just the OP.
- not very happy with the apply function. ... in practice most of the ones passed will manipulate some kind of numpy array, and give as a result also a numpy array.
I completely agree with that sentiment. Again, it comes down to examining how callers actually use the API.
You proposed a pair of restrictions: that a matrix be passed in, and that a matrix be emitted as the result. I agree. Run your test suite, then impose such restrictions, and verify the tests still run Green.
- not sure about the extract_data ... Is this a good way to do it?
Yes, it looked like a good way to parameterize things to me.
Again, """docstrings""" and example calls in a test suite will go a long way toward advising new callers about how to take advantage of what your class offers. In particular, you should carefully explain what the contract for a user-defined function is -- how the input matrix relates to the output matrix, must it be a pure function, when to complain with a fatal raise, those sorts of concerns.
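For instance, apply's """docstring""" might spell the contract out along these lines (a sketch; the exact promises are yours to decide):

def apply(self, function: callable, inputs: list[str] | str, output_names=None, increment_proc=True, **kwargs):
    """Apply function and record the result under process/.

    Contract for function: it is called as function(*inputs, **kwargs),
    should be pure (no hidden state), and must return one numpy array
    per output name, as a tuple if there are several. Raises if the
    number of returned outputs does not match output_names.
    """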
This code achieves some of its design goals. Adding unit tests wouldn't hurt. Adding """docstrings""" is essential. I would not be willing to delegate or accept maintenance tasks on this codebase in its current form. But it goes deeper than that. I would be reluctant to work on code which uses the HyFile class as a dependency. Why? Low-quality documentation. As a caller, it is unclear to me what contracts my inputs must adhere to, and what output behavior I can rely on.
EDIT
It turns out some unit tests are lurking within a feature branch.
An attempt to import sample (predictably) fails. Altering sys.path at import time is not a Best Practice. Instead, chdir to the right spot before running, and use the PYTHONPATH env var to affect sys.path.
After $ poetry add pytest-cov I got this to work:
cd hystorian/tests/ &&
env PYTHONPATH=.. poetry run pytest --cov --cov-report=term-missing test_io.py
====== test session starts =======
platform darwin -- Python 3.10.10, pytest-7.3.1, pluggy-1.0.0
rootdir: hystorian
plugins: typeguard-2.13.3, cov-4.0.0
collected 9 items
test_io.py .........
--------- coverage: platform darwin, python 3.10.10-final-0 ----------
Name                                  Stmts   Miss  Cover   Missing
-------------------------------------------------------------------
hystorian/hystorian/__init__.py           0      0   100%
hystorian/hystorian/io/__init__.py        0      0   100%
hystorian/hystorian/io/gsf_files.py      34      4    88%   23-24, 59, 61
hystorian/hystorian/io/hyFile.py        154     72    53%   38, 41-46, 74, 97-98, 109, 113, 135, 139, 141, 152, 158, 168-202, 207, 212-231, 236-261, 272
hystorian/hystorian/io/ibw_files.py      44     39    11%   12-65
test_io.py                               83      3    96%   115-116, 133
-------------------------------------------------------------------
TOTAL                                   315    118    63%
===== 9 passed in 1.45s =====
I would typically record such a command in a Makefile, so $ make test will conveniently run it, and so newly onboarded engineers will see it and learn from it. Similarly, $ make lint in any of my projects will tend to run something like isort . && black -S . && flake8, perhaps followed by a mypy invocation.
Sadly, the usual command of $ pytest tests/*.py does not work, since the tests want to be run from one level down.
You will want to add tests/test_files/test.hdf5 to .gitignore.
I briefly looked at a few tests and I like them, as they are short, boring, simple, and direct, just what we want to see in tests.
You might wish to assign a popular constant, such as "datasets/test_data", to a variable. (Or not. Copy-n-paste in test code is perfectly fine; it's the one place you can get away with that.)
I have not examined the uncovered lines, e.g. lines 23 and 59, and I would not necessarily shoot for 100% coverage. It's up to you. We spend time writing unit tests in order to save time, both to save an individual contributor time and to save the project time. If you feel covering a line of target code is worthwhile, go for it. If not, that's fine, as well.
I tend to view coverage through this lens: Do I still need that line of target code? Why is it "hard" to call it, to cover it? Could I alter the Public API so testing is easier?
- Thanks a lot for your very thorough, constructive comment! You will be happy to know that I am writing unit tests. You can find them here: github.com/Coilm/hystorian/tree/rework. However, I am pretty sure the current tests I wrote could deserve their own post on this forum, since it is the first time in my life that I have written tests. I'll implement all your comments and keep them in mind for the future. I am new to StackExchange: what is the best practice if I want to discuss more about some of your comments? (CoilM, Apr 29, 2023)
- Adding comments here is a fine practice, at least for a couple of back-n-forths. Unlike SO, the CodeReview site tends to frown on edits once an answer has been submitted, so that readers a year from now can see matching review + reviewed code. If you feel it would be helpful, feel free to request review of specific test(s) in a new posting. (J_H, Apr 29, 2023)