This is an actual work problem we had to solve. Put simply: given a structure (e.g. nested dictionaries) and a mapping of old dictionary keys to new ones, produce a new structure that is anatomically identical to the original, uses the new dictionary keys, and preserves every other value.
- How to encode this mapping?
- How to go about the translation?
Context
We receive these dictionaries in the form of JSON files through an API and, because of external constraints, the sender doesn't have access to our internal nomenclature system. So we need to convert the names ourselves.
Assembling the mappings is quite a laborious manual effort, as it involves figuring out semantics and talking to people. We are obviously working on a better solution, but these constraints will hold us for a while longer.
Details
Suppose a system that receives JSON messages such as
msg = {
"id": 1,
"summary": {
"origin": {
"url": "url",
"slug": "slug"
},
"tags": ["a", "b"]
},
"items": [
{
"id": "abc",
"price": 50
},
{
"id": "def",
"price": 110,
"discount": 50
}
]
}
But in order to move the data forward, the names of the dictionary keys must follow a specific nomenclature. So they must be translated, like so:
translated_msg = {
"IDENTIF": 1,
"SUMM": {
"ORIG": {
"WEBADDRESS": "url",
"LOCATOR": "slug"
},
"TAGS": ["a", "b"]
},
"PURCHASEDGOODS": [
{
"GOODSID": "abc",
"GOODSPRICE": 50
},
{
"GOODSID": "def",
"GOODSPRICE": 110,
"GIVENDISCOUNT": 50
}
]
}
The new terminology comes from a translation dictionary that has to be manually built by someone who is familiar with the data and with the nomenclature to be followed. This field map must also encode the anatomy of the original structure, because there may be multiple fields with the same name at different depths. Notice the two id fields above.
Solution
With all this in mind, here is a field map structure which fits the criteria. Its syntax is part of the solution I came up with and can be modified.
field_map = {
"/id": "IDENTIF",
"/summary": "SUMM",
"/summary/origin": "ORIG",
"/summary/origin/url": "WEBADDRESS",
"/summary/origin/slug": "LOCATOR",
"/summary/tags": "TAGS",
"/items": "PURCHASEDGOODS",
"/items//id": "GOODSID",
"/items//price": "GOODSPRICE",
"/items//discount": "GIVENDISCOUNT",
}
Notice that /items//discount has two slashes in the middle. Each slash represents going one level deeper within the structure; the empty segment between the two slashes corresponds to the list level, since list elements have no names of their own.
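To make the path convention concrete, here is a small helper I sketched purely for illustration (it is not part of the solution below): it walks a structure and yields every dictionary-key path in this syntax, which is handy for checking a field map for missing entries.

def iter_key_paths(struct, prefix=""):
    # Yield every dictionary-key path of `struct` in the field-map syntax.
    if isinstance(struct, dict):
        for key, value in struct.items():
            path = f"{prefix}/{key}"
            yield path
            yield from iter_key_paths(value, path)
    elif isinstance(struct, (list, tuple)):
        for item in struct:
            # a list level contributes an empty segment, hence the double slash
            yield from iter_key_paths(item, prefix + "/")

# sorted(set(iter_key_paths(msg))) yields exactly the keys of field_map above:
# '/id', '/items', '/items//discount', '/items//id', '/items//price',
# '/summary', '/summary/origin', '/summary/origin/slug', ...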
Inspired by https://stackoverflow.com/a/40857703/10504841, here is a recursive solution that, given a structure and a field map, walks through the entire structure and builds a translated copy:
from typing import Iterable, Union
def is_valid_iterable(struct):
return isinstance(struct, Iterable) and not isinstance(
struct, (str, bytes)
)
def is_key_in_dict(key, dict_):
try:
_ = dict_[key]
return True
except KeyError:
return False
def translate_nested_structure(
structure: Union[dict, list, tuple], trans_dict: dict, prefix: str = ""
) -> Union[dict, list, tuple]:
"""
Translate dictionary keys in a nested structure using a translation
dictionary. Maintains the same structure and primitive values.
Useful for translating JSON and Avro messages
If a key is present in the structure but not in the translation dictionary,
it is understood as undesired and removed from the output structure
If a (sub)structure is made of only lists or tuples, the output
is simply a copy of the given (sub)structure
Supported types and content limitation for dictionary (sub)structures
------------------------------------------------------
Key fields can be of any primitive type or None.
Tuple keys are somewhat supported, but not fully tested and not documented.
"/" are not allowed inside string keys, see translation dictionary syntax
Value field can be lists, tuples, dicts, any primitive or None
Translation dictionary syntax
------------------------------
The translation dictionary must capture the anatomy of the nested
structure, as different nested keys may share the same name.
The syntax for the translation dictionary keys is made of
"/"s and `orig_key`s.
"/" are used to indicate going deeper whithin the strucure,
so "/" may not be present inside string keys in the structure.
Also, the number of preceding "/" should match the nesting level
of the (sub)structure
`orig_key`s are pieces of string which contain
the name of the original key in the structure.
The syntax for the keys is easier to understand if read backwards:
every key must end with an `orig_key`, since those are
what need to be translated. A single preceding "/"
indicates `orig_key` is a key inside another dictionary
(e.g. "/start/in_a_dict"). In this case,
unless `orig_key` is the first key (e.g. "/test"), the "/"
must be preceded by another `orig_key` (e.g. "/start/test").
Multiple preceding "/" indicate `orig_key` is inside a
list or tuple (e.g. "/start//in_a_list", "//start").
Since the translation dictionary values contain the desired
new translated (sub)structure keys, the syntax and supported types are
the same as the original structure syntax for keys. See above
Parameters
----------
structure: [dict | list | tuple]
Nested dict, list or tuple.
trans_dict: dict
Translation dictionary, see example below.
prefix: str
Prefix used to find keys in the translation dictionary, leave blank
Returns
-------
translated_structure: [dict, list, tuple]
Same structure, but with translated dictionary keys
Examples
--------
>>> sample_msg = {
... "a": {
... "b": ["c", "d"],
... "e": [
... {
... "f": {"g": "h"},
... },
... {
... "f": {"g": "h", "g2": "h2"},
... },
... ],
... "i": None,
... "j": [],
... },
... }
>>> sample_translated_msg = {
... "aaaa": {
... "bbbb": ["c", "d"],
... "eeee": [
... {
... "ffff": {"gggg": "h"},
... },
... {
... "ffff": {"gggg": "h", "gggg2222": "h2"},
... },
... ],
... "iiii": None,
... "jjjj": [],
... },
... }
>>> sample_field_map = {
... "/a": "aaaa",
... "/a/b": "bbbb",
... "/a/e": "eeee",
... "/a/e//f": "ffff",
... "/a/e//f/g": "gggg",
... "/a/e//f/g2": "gggg2222",
... "/a/i": "iiii",
... "/a/j": "jjjj",
... }
>>> translated_msg = translate_nested_structure(
... sample_msg, sample_field_map
... )
>>> translated_msg == sample_translated_msg
True
TODO
----
- Improve the trans dict syntax?
"""
def translate_dict(dict_struct, trans_dict, prefix=""):
if not isinstance(dict_struct, dict):
raise TypeError("Expect dict, received %s", type(dict_struct))
new_dict = dict()
for key, value in dict_struct.items():
new_prefix = "/".join([prefix, str(key)])
if not is_key_in_dict(new_prefix, trans_dict):
continue
new_key = trans_dict[new_prefix]
if is_valid_iterable(value):
new_value = translate_nested_structure(
value, trans_dict, new_prefix
)
else:
new_value = value
new_dict[new_key] = new_value
return new_dict
def translate_simple_struct(simple_struct, trans_dict, prefix=""):
if not isinstance(simple_struct, (list, tuple)):
raise TypeError(
"Expect list or tuple, received %s", type(simple_struct)
)
cls_ = type(simple_struct)
new_simple_struct = cls_([])
for item in simple_struct:
new_prefix = "/".join([prefix, ""])
if is_valid_iterable(item):
new_item = translate_nested_structure(
item, trans_dict, new_prefix
)
else:
new_item = item
new_simple_struct += cls_([new_item])
return new_simple_struct
if isinstance(structure, dict):
return translate_dict(structure, trans_dict, prefix)
else:
return translate_simple_struct(structure, trans_dict, prefix)
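For reference, applying the function to the msg and field_map from the top of the question should reproduce translated_msg:

result = translate_nested_structure(msg, field_map)
assert result == translated_msg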
About tuples as dictionary keys: I tested a bit and it is possible to encode tuples in the current version of the field map encoding, but the syntax can become quite complicated, so I decided to leave them out for now. The encoding should be as human-friendly as possible.
- What are your thoughts on the code itself?
- Do you have any suggestions on how to improve the encoding syntax?
- What about increasing the level of abstraction and supporting more structures, such as sets, classes or custom Iterables?
- I'd also like to hear if other people face similar problems. How often, if at all, do people need to translate dictionary keys like this?
3 Answers
How to encode this mapping?
Not the way you've done it, I think. Zen says explicit is better than implicit, and your current mapping is highly implicit. You have a magic double-slash to indicate a list level, and you have an O(n²) problem with your key expressions. These are avoidable problems: don't think of your mapping as being flat, over-the-wire JSON data; think of it as well-typed, well-structured in-memory data. There's no reason for you to write a parsing layer if you don't need it.
Aside: translating from one dict-lasagna domain to another is evidence of a broader, more severe problem with lack of good models (or perhaps no models at all), but you have not shown enough other code for this to be talked about meaningfully.
If what you say is true and these data come directly from JSON, then you need to drop the code that cares about tuples because these will never happen.
Picking up on a few granular review issues (though perhaps these are moot since I'm suggesting that you throw all of the existing code away):
- is_valid_iterable should only need isinstance(struct, (dict, list)) (sketched below)
- is_key_in_dict needs to die, and the call needs to be replaced with key in some_dict
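In code, those two points amount to roughly this (a sketch reusing the question's names):

def is_valid_iterable(struct):
    # JSON payloads can only nest dicts and lists, so nothing else needs special-casing
    return isinstance(struct, (dict, list))

# ...and the membership test reads naturally inline:
trans_dict = {"/id": "IDENTIF"}
if "/id" in trans_dict:          # instead of is_key_in_dict("/id", trans_dict)
    print(trans_dict["/id"])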
Suggested
A re-thought mapping could make use of simple polymorphism, with nary an isinstance
in sight:
from dataclasses import dataclass, field
from typing import Any, Union, Optional
Payload = Union[dict[str, Any], list[Any]]
@dataclass
class Node:
replacement: Optional[str] = None
def translate(self, structure: Payload) -> Payload:
return structure
@dataclass
class DictNode(Node):
nodes: dict[str, 'Node'] = field(default_factory=dict)
def translate(self, structure: Payload) -> Payload:
translated = {}
for key, value in structure.items():
translator = self.nodes.get(key)
if translator:
key = translator.replacement or key
value = translator.translate(value)
translated[key] = value
return translated
class ListNode(DictNode):
def translate(self, structure: Payload) -> Payload:
return [
super(ListNode, self).translate(item)
for item in structure
]
def test() -> None:
from pprint import pprint
msg = {
'id': 1,
'items': [{'id': 'abc', 'price': 50},
{'discount': 50, 'id': 'def', 'price': 110}],
'summary': {'origin': {'slug': 'slug', 'url': 'url'}, 'tags': ['a', 'b']}
}
field_map = DictNode(nodes={
'id': Node('IDENTIF'),
'summary': DictNode('SUMM', {
'origin': DictNode('ORIG', {
'url': Node('WEBADDRESS'),
'slug': Node('LOCATOR'),
}),
'tags': Node('TAGS'),
}),
'items': ListNode('PURCHASEDGOODS', {
'id': Node('GOODSID'),
'price': Node('GOODSPRICE'),
'discount': Node('GIVENDISCOUNT'),
}),
})
pprint(field_map.translate(msg))
if __name__ == '__main__':
test()
Output
{'IDENTIF': 1,
'PURCHASEDGOODS': [{'GOODSID': 'abc', 'GOODSPRICE': 50},
{'GIVENDISCOUNT': 50, 'GOODSID': 'def', 'GOODSPRICE': 110}],
'SUMM': {'ORIG': {'LOCATOR': 'slug', 'WEBADDRESS': 'url'}, 'TAGS': ['a', 'b']}}
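One behavioural difference worth flagging: in this design, keys that have no entry in nodes pass through untranslated, whereas the question's implementation drops them. If the dropping behaviour is actually required, a variant along these lines would restore it (a sketch building on the classes above; StrictDictNode is my own name):

class StrictDictNode(DictNode):
    """Variant of DictNode that drops unmapped keys, matching the question's behaviour."""
    def translate(self, structure: Payload) -> Payload:
        translated = {}
        for key, value in structure.items():
            translator = self.nodes.get(key)
            if translator is None:
                continue  # keys without a mapping are removed from the output
            translated[translator.replacement or key] = translator.translate(value)
        return translated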
I'd also like to hear if other people face problems similar to these. How often do people need to translate dictionary keys like this?
I'd say it's very unusual. In my experience, such dicts are either constructed from JSON or similar formats to give users/admins a friendly way to script without needing any programming knowledge - and internally changing the keys makes no sense, except to increase complexity.
The other way dicts are used is in a programming context where associated data must be stored together. In this context, mostly constants are used as the keys, or input that stays constant. Again, internally changing the keys makes no sense.
The one case where I would see dicts used in this way is when the dict is used as a control mechanism, similar to a script engine but completely defined and used by developers. It can make some actual code look extremely neat and clean; however, in my opinion it goes against the principle of making code explicit - and therefore decreases readability and understandability.
- pbsb (Jun 17, 2022): I see your point but I am not familiar with script engines. Can you point me in the right direction? This is what comes up when I google it: stackoverflow.com/q/1691201/10504841 and docs.oracle.com/javase/7/docs/api/javax/script/…
- pbsb (Jun 17, 2022): I think I also failed to contextualize properly. This is an actual work problem we had to solve. I added some more details to the question.
- lukstru (Jun 18, 2022): @pbsb I don't know the actual terminology, but I had a project once where I noticed that the behaviour of my program was highly dependent on and similar to the data I input and my configs. So much that I had lambdas in dicts and could 'script' entire behaviours with JSON files only. The code mostly worked with the dicts to transition between states and execute calls to other programs that were defined in said JSON. I named it a script engine since it took in some 'script' - JSON - and executed code dependent on it. Not entirely like an interpreter, but a sized-down, specialized version.
- lukstru (Jun 18, 2022): TLDR: A sized-down, specialized version of an interpreter.
- lukstru (Jun 18, 2022): And to point you in the right direction, I'd suggest learning more about compilers (and interpreters, but they're included in compilers). We had very good courses in university, but I don't know how to get that good information outside university. The course was called introduction to compiler construction and conveyed the basics very well. EDIT: I don't know how much they fit your problem though, I don't think it's what you're searching for. It doesn't hurt though, it was fun getting to know the magic behind compilers!
How certain are you that you will only need key renaming? If this is a real project, your current needs are likely a simplification of your eventual needs. That's just how living software projects behave: you need something, you build something, and the experience with that built thing causes you to need other or different things. Currently, you seem to be performing a simple task: preserving the structural characteristics of the data while renaming the keys. What is the probability that you will need other things in the future: for example, value conversion (eg, int to float) or full-blown data restructuring?
Your need is not novel: do more research to learn how others have dealt with the problem. The Python ecosystem has libraries to perform different kinds of data remappings: here is one called jsonbender. I've never used it and cannot comment on its quality, but a quick scan through the README points to some issues you might want to consider -- notably, dealing with lists, configuring optionality, and building in support for callables to handle computation needs that cannot be easily expressed via a simple configuration syntax (in my own professional experience, the latter has been especially powerful on projects having some overlap with your needs).
Your implementation seems backwards and is thus too limiting. As noted in one
of the other reviews, your remapping (in field_map) strikes me as backwards: it maps
old paths/keys to new paths/keys. But that is limiting because it provides no
mechanism for controlling the output structure. It also seems less intuitive
than the alternative -- namely, declaring the structure you want and then, at
the leaf nodes, defining how/where to retrieve values from the source. I would
encourage you to define the remapping from the perspective of the desired data.
For example, if we focus just on the IDENTIF
and SUMM
keys (plus a FOO
key added for illustration), one could define a remapping as follows. Each leaf
value can be obtained by diving down though the hierarchy based on the keys
declared in each tuple. Even though this example handles only the easy
situations in your current problem, it does illustrate -- at least to my eye --
the intuitiveness of defining the remapping from the perspective of the desired
output, as well as its greater flexibility in terms of data restructuring,
should that need ever arise.
remapping = {
# Simple dict-to-dict key renaming via data-diving tuples.
"IDENTIF": ('id',),
"SUMM": {
"ORIG": {
"WEBADDRESS": ('summary', 'origin', 'url'),
"LOCATOR": ('summary', 'origin', 'slug'),
},
"TAGS": ('summary', 'tags'),
},
# Restructuring and even reuse of source nodes is possible.
"FOO": {
"BAR": ('id',),
},
}
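To make the idea concrete, here is a minimal sketch of how such a remapping could be consumed; dive and build are my own illustrative names, not an existing API.

from functools import reduce

def dive(source, keys):
    """Walk down `source` following a tuple of keys."""
    return reduce(lambda node, key: node[key], keys, source)

def build(spec, source):
    """Materialise the desired output: dicts in the spec become dicts in the
    output, tuples are resolved against the source data."""
    if isinstance(spec, dict):
        return {key: build(sub, source) for key, sub in spec.items()}
    if isinstance(spec, tuple):
        return dive(source, spec)
    raise TypeError(f"Unsupported spec node: {spec!r}")

# build(remapping, msg) would yield
# {"IDENTIF": 1, "SUMM": {...}, "FOO": {"BAR": 1}}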
Dealing with pesky lists. That simple plan falters when it comes to lists.
Your workaround was a double-slash convention, and one reviewer suggests using
explicit types like Node
, DictNode
, and ListNode
to configure the needed
remappings. A middle-ground is to continue with the simplicity of your
convention-based approach but to make it a bit more rigorous. The illustration
above relies on the convention that a dict in the remapping configuration
produces a dict in the output data. We could do the same with lists. The
example below would be interpreted as follows: PURCHASEDGOODS
will hold a
list; we obtain the source data for that list from the key(s) declared inside
the list; and the final element of the configuration-list will contain the
specification for how to build individual values composing the list. I'm not
necessarily advocating this approach, but it does illustrate a low-tech,
convention-based approach with greater intuitiveness and flexibility than your
current idea.
remapping = {
...
"PURCHASEDGOODS": ['items',
{
"GOODSID": ('id',),
"GOODSPRICE": ('price',),
"GIVENDISCOUNT": ('discount',),
}
],
}
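Interpreting that convention is a one-case extension of the previous sketch: a list in the spec names the source key(s) to dive into, and its last element describes how each output item is built (again just an illustrative sketch, not a finished implementation).

from functools import reduce

def dive(source, keys):                     # same helper as in the previous sketch
    return reduce(lambda node, key: node[key], keys, source)

def build(spec, source):
    if isinstance(spec, dict):
        return {key: build(sub, source) for key, sub in spec.items()}
    if isinstance(spec, tuple):
        return dive(source, spec)
    if isinstance(spec, list):              # e.g. ['items', {...per-item spec...}]
        *path, item_spec = spec
        return [build(item_spec, item) for item in dive(source, tuple(path))]
    raise TypeError(f"Unsupported spec node: {spec!r}")

# Note: a plain ('discount',) still raises KeyError for items without a
# discount; making that optional is exactly what the Diver idea below addresses.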
Making that approach a bit more formal via explicit types. Another
middle-ground is something like the following. It still relies on some
conventional behavior relating to dicts, but it does have explicit types to
distinguish the two primary ways to retrieve data from the source: (1) simple
data-diving via a tuple of keys or (2) data-diving over a source list to
produce an output list. One benefit of at least adding two types like these is
that they provide a mechanism to configure optionality: for example,
Diver('discount', default = 0)
. It would also provide a way to pass in
callables to handle more complex needs or even simple value-conversion behavior
you might want in the future: for example, Diver('discount', default = 0, convert = float)
.
remapping = {
"IDENTIF": Diver('id'),
"SUMM": {
"ORIG": {
"WEBADDRESS": Diver('summary', 'origin', 'url'),
"LOCATOR": Diver('summary', 'origin', 'slug'),
},
"TAGS": Diver('summary', 'tags'),
},
"PURCHASEDGOODS": ListDiver('items',
{
"GOODSID": Diver('id'),
"GOODSPRICE": Diver('price'),
"GIVENDISCOUNT": Diver('discount'),
},
),
}
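A possible shape for those two types, matching the configuration above (a sketch under my own assumptions; remap and the _MISSING sentinel are illustrative names):

_MISSING = object()

class Diver:
    """Retrieve one value by diving through the source with a sequence of keys."""
    def __init__(self, *keys, default=_MISSING, convert=None):
        self.keys = keys
        self.default = default
        self.convert = convert

    def __call__(self, source):
        node = source
        for key in self.keys:
            try:
                node = node[key]
            except (KeyError, IndexError, TypeError):
                if self.default is _MISSING:
                    raise
                return self.default
        return self.convert(node) if self.convert else node

class ListDiver(Diver):
    """Dive to a source list, then build each output element from a per-item spec."""
    def __init__(self, *args, **kwargs):
        *keys, item_spec = args          # last positional argument is the item spec
        super().__init__(*keys, **kwargs)
        self.item_spec = item_spec

    def __call__(self, source):
        return [remap(self.item_spec, item) for item in super().__call__(source)]

def remap(spec, source):
    """Dicts in the spec become dicts in the output; Divers pull source values."""
    if isinstance(spec, dict):
        return {key: remap(sub, source) for key, sub in spec.items()}
    return spec(source)

# remap(remapping, msg) works end to end once GIVENDISCOUNT is given a default,
# e.g. Diver('discount', default=0); without one, the first item raises KeyError
# because it has no discount.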
Other possibilities. The next obvious extension is to formalize the
dict-related configuration more explicitly (eg DictDiver
). Whether that's
worth the trouble depends on your expectations for the future of the project.
To my mind, that step seems the least compelling: at a certain point, every
project must adopt a variety of conventions and it's no crime to embrace them
if they are intuitive and reasonable. If you were to take that step, you would
end up with an approach similar to the substantive review you already have, but
with the reversed orientation discussed above. Finally, I'll re-emphasize the
recommendation to research other libraries that perform this kind of data
conversion. Even if you end up adopting a low-tech, convention-based solution,
your decision-making should be guided by how others have thought about this
topic. And you might get lucky and find a library that already does
exactly what you need.
- (comment) A nested structure like "a": [[{"b":0}]] can be mapped with /a///b