I wanted to have something more convenient to work with nested dictionaries that come from yaml
files so I wrote this DictIndexter
. Its __getitem__
and __setitem__
accept a path instead of a single key. Additionally the getter can also take a tuple with path and default value. The setter on the other hand is able to create new branches, but only when it's a clean branch. This is a branch assigned to a non-existing key. It wont overwrite another value.
Implementation
class DictIndexer:
def __init__(self, source: Dict[str, Any], set_creates_branches: bool = True):
self.source = source
self.set_creates_branches = set_creates_branches
def __getitem__(self, path: str | Tuple[str, Any]) -> Any | None:
path, default = path if isinstance(path, tuple) else (path, None)
item = self.source
for key in DictIndexer._split(path):
item = item.get(key, None)
if item is None:
return default
else:
return item
def __setitem__(self, path: str, value: Any):
class Missing:
pass
names = DictIndexer._split(path)
prev = self.source
item = self.source
for key in names[:-1]:
item = item.get(key, Missing())
match item:
case dict():
prev = item
case Missing():
if self.set_creates_branches:
prev[key] = {}
else:
raise KeyError(f"Cannot create branch '{key}. Disabled")
case _:
raise KeyError(f"Cannot overwrite '{key}.")
else:
item[names[-1]] = value
@staticmethod
def _split(path: str) -> List[str]:
return re.split(r"[/\\]", path)
Tests
With pytest
I made sure that the following features work:
def test_can_get_item_by_path():
data = {
"foo": {
"bar": "baz"
}
}
assert DictIndexer(data)["foo/bar"] == "baz"
def test_gets_null_when_last_name_does_not_exist():
data = {
"foo": {
"bar": "baz"
}
}
assert DictIndexer(data)["foo/baz"] is None
def test_gets_null_or_default_when_middle_name_does_not_exist():
data = {
"foo": {
"bar": {
"baz": "baz"
}
}
}
assert DictIndexer(data)["foo/baz/baz"] is None
assert DictIndexer(data)[("foo/baz/baz", "qux")] == "qux"
def test_can_set_item():
data = {
"foo": {
"bar": {
"baz": "baz"
}
}
}
DictIndexer(data)["foo/bar/baz"] = "qux"
assert DictIndexer(data)["foo/bar/baz"] == "qux"
def test_can_set_item_on_new_branches():
data = {
"foo": {
"bar": {
"baz": {
"foo": "bar"
}
}
}
}
DictIndexer(data)["foo/bar/baz/qux"] = "qux"
assert DictIndexer(data)["foo/bar/baz/qux"] == "qux"
def test_does_not_create_new_branches_when_disabled():
data = {
"foo": {
"bar": {
}
}
}
with pytest.raises(KeyError):
DictIndexer(data, set_creates_branches=False)["foo/bar/baz/qux"] = "qux"
def test_does_not_overwrite_value_with_dict():
data = {
"foo": {
"bar": {
"baz": "baz"
}
}
}
with pytest.raises(KeyError):
DictIndexer(data)["foo/bar/baz/qux"] = "qux"
Example
I currently use it to customize the logger cofiguration:
def _initialize_logging():
import logging.config
config = DictIndexer(assets.cfg("wiretap"))
config["formatters/auto/()"] = wiretap.MultiFormatter
config["formatters/auto/./values"] = [f"v{app.VERSION}-{os.environ['mode'][0]}"]
config["handlers/file/filename"] = assets.home("log", f"{app.ID}.log")
config["handlers/sqlserver/insert"] = assets.sql("wiretap", "insert log.sql")
logging.config.dictConfig(config.source)
Why do you think? What would you improve?
-
\$\begingroup\$ You offer both {getter, setter}, cool! FYI, here is a nice package focused on the getter use case: pypi.org/project/glom \$\endgroup\$J_H– J_H2023年03月29日 16:30:29 +00:00Commented Mar 29, 2023 at 16:30
3 Answers 3
Overview
The code looks good, properly typed, tested. This is a great starting point.
Here are various details can could be improved.
Documentation
Your code could definitely be improved with docstrings.
Imports
This is more about the StackExchange question than about the code itself but providing the imports can make testing/reviewing easier for others.
from typing import Dict, Any, Tuple, List
import re
Tests: splitting the setup from the logic being tested
This is mostly a matter of personal preference but I find a line such as assert DictIndexer(data)[("foo/baz/baz", "qux")] == "qux"
a bit hard to grasp. The test cases can probably be made easier to read if the DictIndexer
was created as an initial step and then used to perform queries.
We'd get something like:
def test_gets_null_or_default_when_middle_name_does_not_exist():
data = DictIndexer({
"foo": {
"bar": {
"baz": "baz"
}
}
})
assert data["foo/baz/baz"] is None
assert data[("foo/baz/baz", "qux")] == "qux"
This is even more relevant when with pytest.raises(KeyError)
is used as you definitely do not want the constructor of DictIndexer
to be the part raising the error.
Splitting
Your method to split relies on regexp. It seems slightly overkill in this particular case as you are using a single character for splitting and str.split
would do the trick.
Also, you could make your code slightly more generic by providing the separator in the constructor.
def __init__(self, source: Dict[str, Any], set_creates_branches: bool = True, separator='/'):
self.source = source
self.set_creates_branches = set_creates_branches
self.separator = separator
(...)
def _split(self, path: str) -> List[str]:
return path.split(self.separator)
Getting elements with default value
I find the way the default value can be provided to __getitem__
slightly awkward.
My preference would be to rely on how things are done for dict
:
d[key]
: Return the item of d with key key. Raises a KeyError if key is not in the map.get(key[, default])
: Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.
def get(self, path, default=None):
try:
return self[path]
except KeyError:
return default
def __getitem__(self, path: str) -> Any | None:
item = self.source
for key in self._split(path):
item = item.get(key, None)
if item is None:
raise KeyError(f"Key '{key} not found.")
else:
return item
else
for for
loop
The optional else
for for
loops is a nice idiom which deserves some love and can be used to perform an operation in the case where break
did not occur in the loop and the loops stops in a "normal" way.
In our particular case, there is no break
involved at all so I'm not so sure that it is relevant.
Tuple unpacking using asterisks
Instead of messing with indices with names
in __setitem__
, you could directly unpack it into the parts you are interested in: *names, last_name = self._split(path)
.
Final code
At this stage, the code looks like this:
from typing import Dict, Any, Tuple, List
class DictIndexer:
def __init__(self, source: Dict[str, Any], set_creates_branches: bool = True, separator='/'):
self.source = source
self.set_creates_branches = set_creates_branches
self.separator = separator
def get(self, path, default=None):
try:
return self[path]
except KeyError:
return default
def __getitem__(self, path: str) -> Any | None:
item = self.source
for key in self._split(path):
item = item.get(key, None)
if item is None:
raise KeyError(f"Key '{key} not found.")
return item
def __setitem__(self, path: str, value: Any):
class Missing:
pass
*names, last_name = self._split(path)
prev = self.source
item = self.source
for key in names:
item = item.get(key, Missing())
match item:
case dict():
prev = item
case Missing():
if self.set_creates_branches:
prev[key] = {}
else:
raise KeyError(f"Cannot create branch '{key}. Disabled")
case _:
raise KeyError(f"Cannot overwrite '{key}.")
item[last_name] = value
def _split(self, path: str) -> List[str]:
return path.split(self.separator)
def test_can_get_item_by_path():
data = DictIndexer({
"foo": {
"bar": "baz"
}
})
assert data["foo/bar"] == "baz"
def test_gets_null_when_last_name_does_not_exist():
data = DictIndexer({
"foo": {
"bar": "baz"
}
})
try:
data["foo/baz"]
except KeyError:
pass
assert data.get("foo/baz") is None
def test_gets_null_or_default_when_middle_name_does_not_exist():
data = DictIndexer({
"foo": {
"bar": {
"baz": "baz"
}
}
})
try:
data["foo/baz/baz"]
except KeyError:
pass
assert data.get("foo/baz/baz", "qux") == "qux"
def test_can_set_item():
data = DictIndexer({
"foo": {
"bar": {
"baz": "baz"
}
}
})
assert data["foo/bar/baz"] != "qux"
data["foo/bar/baz"] = "qux"
assert data["foo/bar/baz"] == "qux"
def test_can_set_item_on_new_branches():
data = DictIndexer({
"foo": {
"bar": {
"baz": {
"foo": "bar"
}
}
}
})
data["foo/bar/baz/qux"] = "qux"
assert data["foo/bar/baz/qux"] == "qux"
def test_does_not_create_new_branches_when_disabled():
data = DictIndexer({
"foo": {
"bar": {
}
}
}, set_creates_branches=False)
try:
data["foo/bar/baz/qux"] = "qux"
except KeyError:
pass
def test_does_not_overwrite_value_with_dict():
data = DictIndexer({
"foo": {
"bar": {
"baz": "baz"
}
}
})
try:
data["foo/bar/baz/qux"] = "qux"
except KeyError:
pass
Edit
Something that I had forgotten: now that __getitem__
throws, one could directly use:
def __getitem__(self, path: str) -> Any | None:
item = self.source
for key in self._split(path):
item = item[key]
return item
Similarly, we could imagine getting rid of the Missing
class and work with exceptions directly:
def __setitem__(self, path: str, value: Any):
*names, last_name = self._split(path)
prev = self.source
item = self.source
for key in names:
try:
item = item[key]
except KeyError:
if self.set_creates_branches:
prev[key] = {}
continue
else:
raise KeyError(f"Cannot create branch '{key}. Disabled")
match item:
case dict():
prev = item
case _:
raise KeyError(f"Cannot overwrite '{key}.")
item[last_name] = value
-
\$\begingroup\$ These are great advices! I've learnt some new tricks and better practices. Thanks! \$\endgroup\$t3chb0t– t3chb0t2023年03月28日 08:56:17 +00:00Commented Mar 28, 2023 at 8:56
-
1\$\begingroup\$ @t3chb0t I'm glad you like it! FYI, I've added an additional suggestion. \$\endgroup\$SylvainD– SylvainD2023年03月28日 09:58:30 +00:00Commented Mar 28, 2023 at 9:58
-
\$\begingroup\$ I like it very much. Code must be like art and now it is ;-] \$\endgroup\$t3chb0t– t3chb0t2023年03月28日 16:07:23 +00:00Commented Mar 28, 2023 at 16:07
Bug
I'm not liking some code that I see in __setitem__
, so let's run a few more tests:
>>> data = {"foo": {}}
>>> DictIndexer(data)["foo/bar/baz/qux"] = "qux"
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
DictIndexer(data)["foo/bar/baz/qux"] = "qux"
File "C:/Users/aneufeld/Documents/Experiments/DictIndex.py", line 27, in __setitem__
item = item.get(key, Missing())
AttributeError: 'Missing' object has no attribute 'get'
Yup. Creating more than one missing path segment is definitely a problem.
>>> data
{'foo': {'bar': {}}}
It has only created the first missing key & dictionary. So let's try a shorter test:
>>> data = {"foo": {}}
>>> DictIndexer(data)["foo/bar/qux"] = "qux"
Traceback (most recent call last):
File "<pyshell#8>", line 1, in <module>
DictIndexer(data)["foo/bar/qux"] = "qux"
File "C:/Users/aneufeld/Documents/Experiments/DictIndex.py", line 39, in __setitem__
item[names[-1]] = value
TypeError: 'Missing' object does not support item assignment
Interesting. That fails in a slightly different way.
Let's look at the code:
prev = self.source
item = self.source
for key in names[:-1]:
item = item.get(key, Missing())
match item:
case dict():
prev = item
case Missing():
if self.set_creates_branches:
prev[key] = {}
else:
raise KeyError(f"Cannot create branch '{key}. Disabled")
case _:
raise KeyError(f"Cannot overwrite '{key}.")
Yup. We initializes prev
, then we have a loop the update the value of prev
at each iteration ... except when a missing key is encountered, something can be done and the loop goes on to the next iteration without prev
being updated.
Wait ... no ... the error messages are talking about the Missing
object! This is not the bug I was expecting; it is an entirely different bug!
item
gets left as Missing()
and,
- if it was the last iteration we can't set the final key value into
item
, - if it wasn't the last iteration, we can't look up the next path key from
item
.
When you create a new branch, you need to update prev
to the newly created dictionary (the bug I thought I saw), but you also need to set item
to that same newly created dictionary ...
...
item = item.get(key, Missing()) # Intended action: update item
match item:
case dict():
prev = item # Prep for next loop
case Missing():
if self.set_creates_branches:
prev[key] = {} # Create missing dictionary
item = prev.get(key) # <-- Fix: Replicate intended action
prev = item # <-- Fix: prep for next loop
...
We can re-work this to remove inefficiencies. Note that before and after this code block, prev
and item
are the same, so we can move prev
entirely inside the loop.
item = self.source
for key in names[:-1]:
prev = item
item = item.get(key, Missing())
match item:
case dict():
pass
case Missing():
if self.set_creates_branches:
item = prev[key] = {}
else:
raise KeyError(f"Cannot create branch '{key}. Disabled")
case _:
raise KeyError(f"Cannot overwrite '{key}.")
else:
item[names[-1]] = value
Testing
Tests are good! I like testing!
But, are you really testing the right things?
def test_can_set_item():
data = {
"foo": {
"bar": {
"baz": "baz"
}
}
}
DictIndexer(data)["foo/bar/baz"] = "qux"
assert DictIndexer(data)["foo/bar/baz"] == "qux"
Have any other keys/subkeys been added to data
? Any been accidentally removed? Yes, the key you are setting is tested to verify getting it returns the value you just set, but is that the only effect?
assert data == {"foo": {"bar": {"baz": "qux"}}}
Now we're testing data
is properly modified.
-
3\$\begingroup\$ This review contains solid advice that the OP would be wise to follow: when testing, don't assert so narrowly that you effectively assume your code's correctness. I re-learned this lesson recently. A small project I'm working on had accumulated a couple dozen tests that were too repetitive. I wrote an assertion-making helper function that most of the tests were able to use. I thought I was just getting rid of code repetition, but there was an unanticipated side effect: for several tests, use of the helper widened their assertion-making considerably – exposing a couple of edge-case bugs! \$\endgroup\$FMc– FMc2023年03月29日 16:45:36 +00:00Commented Mar 29, 2023 at 16:45
Your __getitem__
function indicates two conflicting ideas.
You call out
__getitem__
can returnNone
.def __getitem__(...) -> Any | None:
Your function cannot return
None
unlessdefault
isNone
.item = item.get(key, None) if item is None: return default
>>> DictIndexer({"a": None})["a", "default value"] default value
Because of the code in the second item, your class doesn't work in a similar way to dict.get
.
A fairly large Python idiom is EAFP.
To easily get the same interface as dict.get
is really simple with EAFP.
>>> {"a": None}.get("a", "default value")
None
def __getitem__(self, path: str | Tuple[str, Any]) -> Any | None:
path, default = path if isinstance(path, tuple) else (path, None)
try:
item = self.source
for key in self._split(path):
item = item[key]
return item
except KeyError:
return default
I can also see a potential for conflicting ideas on type hints - Any | None
is Any
.
I did things like Any | None
when I wanted to make the class generic but didn't fully know how.
So I'm going to go through an over-kill entirely generic class.
First we can define a recursive generic alias for a dictionary of dictionaries or values. Basically the dict tree we're interacting with.
TKey = TypeVar("TKey")
TValue = TypeVar("TValue")
ATree: TypeAlias = dict[TKey, TValue] | TValue | dict[TKey, "ATree[TKey, TValue]"]
From here we can define a generic class which takes a Tree
as input.
And we'll define _split
's type hints to remove errors with mypy.
class DictIndexer(Generic[TKey, TValue]):
def __init__(self, source: Tree[TKey, TValue]) -> None:
self.source = source
@classmethod
def _split(cls, path: TKey) -> Iterable[TKey]:
raise NotImplementedError(f"{cls.__name__}._split has not been defined")
And then finally use overload
to completely type __getitem__
.
There are two paths:
- no default specified so the type is
None
, or - a default type is provided which we should union with.
TDefault = TypeVar("TDefault")
...
@overload
def __getitem__(self, path: TKey | tuple[TKey]) -> TValue | None: ...
@overload
def __getitem__(self, path: tuple[TKey, TDefault]) -> TValue | TDefault: ...
def __getitem__(self, path: Any) -> Any:
From here we can see the generic types for DictIndexer
are inferred by the dictionary literal.
And we can see __getitem__
correctly sets the correct return types.
tree = DictIndexer({"a": None})
reveal_type(tree)
reveal_type(tree["a"])
reveal_type(tree["a", "default value"])
>>> mypy foo.py
foo.py:36: note: Revealed type is "foo.DictIndexer[builtins.str, None]"
foo.py:37: note: Revealed type is "None"
foo.py:38: note: Revealed type is "Union[None, builtins.str]"
Your __setitem__
is far less EAFP than __getitem__
.
If we ignore the "able to create new branches, but only when it's a clean branch" part we can make a very simple EAFP function.
Lets look at the edge-cases:
-
>>> {}["a"] KeyError: 'a'
-
>>> None["a"] TypeError: 'NoneType' object is not subscriptable
So we can quite easily rewrite __setitem__
in the exact same way we did with __getitem__
.
def __setitem__(self, path: TKey, value: TValue) -> None:
*head, tail = key, *_ = self._split(path)
try:
item = self.source
for key in head:
item = item[key]
except KeyError:
raise KeyError(f"Cannot create branch '{key}. Disabled") from None
except TypeError:
raise KeyError(f"Cannot overwrite '{key}.") from None
else:
item[tail] = value
Now we can change __setitem__
to support the dict.setdefault
like behaviour.
try:
item = self.source
if not self.set_creates_branches:
for key in head:
item = item[key]
else:
for key in head:
try:
item = item[key]
except KeyError:
item[key] = item = {}
except KeyError:
from typing import Any, Generic, Iterable, TypeAlias, TypeVar, overload
TKey = TypeVar("TKey")
TValue = TypeVar("TValue")
TDefault = TypeVar("TDefault")
ATree: TypeAlias = dict[TKey, TValue] | TValue | dict[TKey, "ATree[TKey, TValue]"]
class BaseDictIndexer(Generic[TKey, TValue]):
def __init__(
self,
source: ATree[TKey, TValue],
set_creates_branches: bool = True,
) -> None:
self.source = source
self.set_creates_branches = set_creates_branches
@classmethod
def _split(cls, path: TKey) -> Iterable[TKey]:
raise NotImplementedError(f"{cls.__name__}._split has not been defined")
@overload
def __getitem__(self, path: TKey | tuple[TKey]) -> TValue | None: ...
@overload
def __getitem__(self, path: tuple[TKey, TDefault]) -> TValue | TDefault: ...
def __getitem__(self, path: Any) -> Any:
path, default = path if isinstance(path, tuple) else (path, None)
try:
item = self.source
for key in DictIndexer._split(path):
item = item[key] # type: ignore
return item
except KeyError:
return default
def __setitem__(self, path: TKey, value: TValue) -> None:
*head, tail = key, *_ = self._split(path)
try:
item = self.source
if not self.set_creates_branches:
for key in head:
item = item[key] # type: ignore
else:
for key in head:
try:
item = item[key] # type: ignore
except KeyError:
item[key] = item = {} # type: ignore
except KeyError:
raise KeyError(f"Cannot create branch '{key}'. Disabled") from None
except TypeError:
raise KeyError(f"Cannot overwrite '{key}'.") from None
else:
item[tail] = value # type: ignore
class DictIndexer(BaseDictIndexer[str, TValue]):
@staticmethod
def _split(path: str) -> list[str]:
return path.split("/")
# tree = DictIndexer({"a": None}, set_creates_branches=False)
tree = DictIndexer({"a": None})
print(tree["a"])
tree["b/c/d"] = None
print(tree["b/c/d"])
print(tree.source)
-
\$\begingroup\$ This is rocket-science for me, but I now see how powerful generics can be so I'll dig into that. \$\endgroup\$t3chb0t– t3chb0t2023年03月29日 11:03:55 +00:00Commented Mar 29, 2023 at 11:03
-
\$\begingroup\$ I'm studying your suggestions and I'm currently stuck on
Tree[TKey, TValue]
that I wasn't able to crack. You defined it as aTypeAlias
(with a typo asATree
) from what I understand it can be either one of the tree types separated by the pipe|
. What I don't understand is why isTree
taking two generic arguments? Is it because you use the first of the types in the list this isdict[TKey, TValue]
as the source or is it becuase the types within the alias use generic types so you need to specify them in the declaration. I tend to think this is the latter. \$\endgroup\$t3chb0t– t3chb0t2023年04月01日 10:06:08 +00:00Commented Apr 1, 2023 at 10:06 -
\$\begingroup\$ Then in the last type
Dict[TKey, "Tree[TKey, TValue]"]
you use a string. I guess this is because of the recursion or self-reference. I'm typing it step by step in PyCharm and it recognizes this correctly. I think I've found it in the docs. It's called a forward reference, right? \$\endgroup\$t3chb0t– t3chb0t2023年04月01日 10:13:47 +00:00Commented Apr 1, 2023 at 10:13 -
1\$\begingroup\$ 1. I prefix all
TypeAlias
' withA
. I also prefixProtocol
s withI
(like C#) and suffix ABCs withBase
. 2. Lets pretendATree
isATree = dict[TKey, TValue]
. When we're usingATree
we're effectively just usingdict
. So when you pass the types todict
you need both the key and the value types. So because we've defined two generic types in the type alias we need to specify all the generics. -- An interesting side note is if youATree = TValue | dict ...
the arguments toATree
will be flipped toATree[TValue, TKey]
. Yes the string is a forward reference. \$\endgroup\$2023年04月01日 16:39:28 +00:00Commented Apr 1, 2023 at 16:39 -
1\$\begingroup\$ @t3chb0t 4. Multiple assignment acts from left to right.
item[key] = item = {}
assignsitem[key]
first, then assignsitem
. If you instead doitem = item[key] = {}
then you're doingitem = {}
,item[key] = item
, which should obviously become the infinitely recursive dict you see. 3. Your result doesn't seem right, I gethead=[1, 2, 3, 4, 5], tail=6, key=1, x=[2, 3, 4, 5, 6]
. Noticetail=6
nottail=1
. As mentioned before multiple assignments act right to left. Additionally the resulting object in tuple unpacking is the entire tuple. We haven't sliced the tuple. \$\endgroup\$2023年04月01日 16:48:24 +00:00Commented Apr 1, 2023 at 16:48