I've looked through a lot of solutions on this topic, but I have been unable to adapt any of them into a performant solution for my case. Suppose I have a list of dictionaries stored as:
db_data = [
    {
        "start_time": "2020-04-20T17:55:54.000-00:00",
        "results": {
            "key_1": ["a","b","c","d"],
            "key_2": ["a","b","c","d"],
            "key_3": ["a","b","c","d"]
        }
    },
    {
        "start_time": "2020-04-20T18:32:27.000-00:00",
        "results": {
            "key_1": ["a","b","c","d"],
            "key_2": ["a","b","e","f"],
            "key_3": ["a","e","f","g"]
        }
    },
    {
        "start_time": "2020-04-21T17:55:54.000-00:00",
        "results": {
            "key_1": ["a","b","c"],
            "key_2": ["a"],
            "key_3": ["a","b","c","d"]
        }
    },
    {
        "start_time": "2020-04-21T18:32:27.000-00:00",
        "results": {
            "key_1": ["a","b","c"],
            "key_2": ["b"],
            "key_3": ["a"]
        }
    }
]
I am trying to aggregate this list by date into a dictionary: the keys of each results object become the keys of the output, and each of those maps a date to the number of unique values recorded for that key on that day.
Expected output is something like:
{
    "key_1": {
        "2020-04-20": 4,
        "2020-04-21": 3
    },
    "key_2": {
        "2020-04-20": 6,
        "2020-04-21": 2
    },
    "key_3": {
        "2020-04-20": 7,
        "2020-04-21": 4
    }
}
What I have tried so far is using defaultdict and loops to aggregate the data. Unfortunately, this takes a very long time:
from datetime import datetime
from collections import defaultdict

grouped_data = defaultdict(dict)

for item in db_data:
    # start_time is an ISO-8601 string, so parse it before formatting the date
    group = datetime.fromisoformat(item['start_time']).strftime('%Y-%m-%d')
    for k, v in item['results'].items():
        if group not in grouped_data[k]:
            grouped_data[k][group] = []
        grouped_data[k][group] = v + grouped_data[k][group]

for k, v in grouped_data.items():
    grouped_data[k] = {x: len(set(y)) for x, y in v.items()}

print(grouped_data)
Any help or guidance is appreciated. I have read that pandas might help here, but I am not quite sure how to adapt it to this use case.
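For reference, the direction I imagine a pandas version taking is something like the following (a rough sketch only; I have not verified that it would actually be faster):

import pandas as pd

# Sketch: flatten db_data into (key, date, value) rows,
# then count distinct values per (key, date) group.
rows = [
    (key, item['start_time'][:10], value)
    for item in db_data
    for key, values in item['results'].items()
    for value in values
]
df = pd.DataFrame(rows, columns=['key', 'date', 'value'])
counts = df.groupby(['key', 'date'])['value'].nunique()
grouped_data = {
    key: counts.loc[key].to_dict()
    for key in counts.index.get_level_values('key').unique()
}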
- How many entries, approximately, are in your input data? – Reinderien, Apr 20, 2020 at 19:31
- there are about 500 now, but that will continue to grow pretty steadily – Jasper Sardonicus, Apr 20, 2020 at 19:49
- And how slow is "too slow"? How long does this take to execute currently? And what is the prior system that is generating these data? – Reinderien, Apr 20, 2020 at 19:50
- too slow is that those 500 records took ~50 seconds to process – Jasper Sardonicus, Apr 21, 2020 at 2:49
- Can you guarantee that the inner values are one character as you've shown? – Reinderien, Apr 21, 2020 at 3:46
2 Answers
Try:
from collections import defaultdict
from datetime import date, datetime
from typing import DefaultDict, Set, List, Dict

DefaultSet = DefaultDict[date, Set[str]]

def default_set() -> DefaultSet:
    return defaultdict(set)

aggregated: DefaultDict[str, DefaultSet] = defaultdict(default_set)

for entry in db_data:
    start_date: date = datetime.fromisoformat(entry['start_time']).date()
    result: Dict[str, List[str]] = entry['results']
    for k, v in result.items():
        aggregated[k][start_date].update(v)

grouped_data: Dict[str, Dict[date, int]] = {
    k: {gk: len(gv) for gk, gv in group.items()}
    for k, group in aggregated.items()
}
Notes:
- I do not know if this is faster, but it's certainly simpler.
- If you're able, maintain the output with actual date keys.
- Your data are better-modeled by a defaultdict of defaultdicts of sets.
- I used a bunch of type hints to make sure that I'm doing the right thing.
Hmm, I tested it out and it was much better than mine, but still a bit slow. This is what I finally went with, taken from another forum post:
>>> from collections import defaultdict
>>> from functools import partial
>>>
>>> flat_list = ((key, db_item['start_time'][:10], results)
...              for db_item in db_data
...              for key, results in db_item['results'].items())
>>>
>>> d = defaultdict(partial(defaultdict, set))
>>>
>>> for key, date, li in flat_list:
...     d[key][date].update(li)
It works really well! It improved processing time from 50 seconds to 2 seconds.
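For completeness, turning those sets into the expected counts only needs one more pass, along these lines (a sketch; in my actual display code I call len at runtime instead):

>>> grouped_data = {key: {day: len(values) for day, values in by_date.items()}
...                 for key, by_date in d.items()}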
- This does not look complete, though. After the update step there is still a len required somewhere. – Reinderien, Apr 21, 2020 at 15:06
- That's true, though in the actual display of this data (I am using plotly), I can iterate the output of the above to build the plotly data and call len(d[key][date]) to get the values at runtime. – Jasper Sardonicus, Apr 22, 2020 at 17:12