I've looked through a lot of solutions on this topic, but I have been unable to adapt any of them into a performant solution for my case. Suppose I have a list of dictionaries stored as:
db_data = [
    {
        "start_time": "2020-04-20T17:55:54.000-00:00",
        "results": {
            "key_1": ["a","b","c","d"],
            "key_2": ["a","b","c","d"],
            "key_3": ["a","b","c","d"]
        }
    },
    {
        "start_time": "2020-04-20T18:32:27.000-00:00",
        "results": {
            "key_1": ["a","b","c","d"],
            "key_2": ["a","b","e","f"],
            "key_3": ["a","e","f","g"]
        }
    },
    {
        "start_time": "2020-04-21T17:55:54.000-00:00",
        "results": {
            "key_1": ["a","b","c"],
            "key_2": ["a"],
            "key_3": ["a","b","c","d"]
        }
    },
    {
        "start_time": "2020-04-21T18:32:27.000-00:00",
        "results": {
            "key_1": ["a","b","c"],
            "key_2": ["b"],
            "key_3": ["a"]
        }
    }
]
I am trying to aggregate this list by date into a dictionary: the keys of each results object become the keys of the output, and each of those maps a date to the number of unique values recorded for that key on that day.
Expected output is something like:
{
    "key_1": {
        "2020-04-20": 4,
        "2020-04-21": 3
    },
    "key_2": {
        "2020-04-20": 6,
        "2020-04-21": 2
    },
    "key_3": {
        "2020-04-20": 7,
        "2020-04-21": 4
    }
}
What I have tried so far is using defaultdict and loops to aggregate the data. Unfortunately, this takes a very long time:
from datetime import datetime
from collections import defaultdict

grouped_data = defaultdict(dict)

for item in db_data:
    # start_time is an ISO-8601 string, so parse it before formatting the date
    group = datetime.fromisoformat(item['start_time']).strftime('%Y-%m-%d')
    for k, v in item['results'].items():
        if group not in grouped_data[k]:
            grouped_data[k][group] = []
        grouped_data[k][group] = v + grouped_data[k][group]

for k, v in grouped_data.items():
    grouped_data[k] = {x: len(set(y)) for x, y in v.items()}

print(grouped_data)
Any help or guidance is appreciated. I have read that pandas might help here, but I am not quite sure how to adapt it to this use case.
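For reference, the direction I imagine a pandas version taking is something like the following (a rough sketch only; I have not verified that it would actually be faster):

import pandas as pd

# Sketch: flatten db_data into (key, date, value) rows,
# then count distinct values per (key, date) group.
rows = [
    (key, item['start_time'][:10], value)
    for item in db_data
    for key, values in item['results'].items()
    for value in values
]
df = pd.DataFrame(rows, columns=['key', 'date', 'value'])
counts = df.groupby(['key', 'date'])['value'].nunique()
grouped_data = {
    key: counts.loc[key].to_dict()
    for key in counts.index.get_level_values('key').unique()
}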
- How many entries, approximately, are in your input data? – Reinderien, Apr 20, 2020 at 19:31
- there are about 500 now, but that will continue to grow pretty steadily – Jasper Sardonicus, Apr 20, 2020 at 19:49
- And how slow is "too slow"? How long does this take to execute currently? And what is the prior system that is generating these data? – Reinderien, Apr 20, 2020 at 19:50
- too slow is that those 500 records took ~50 seconds to process – Jasper Sardonicus, Apr 21, 2020 at 2:49
- Can you guarantee that the inner values are one character as you've shown? – Reinderien, Apr 21, 2020 at 3:46
2 Answers
Try:
from collections import defaultdict
from datetime import date, datetime
from typing import DefaultDict, Set, List, Dict

DefaultSet = DefaultDict[date, Set[str]]

def default_set() -> DefaultSet:
    return defaultdict(set)

aggregated: DefaultDict[str, DefaultSet] = defaultdict(default_set)

for entry in db_data:
    start_date: date = datetime.fromisoformat(entry['start_time']).date()
    result: Dict[str, List[str]] = entry['results']
    for k, v in result.items():
        aggregated[k][start_date].update(v)

grouped_data: Dict[str, Dict[date, int]] = {
    k: {gk: len(gv) for gk, gv in group.items()}
    for k, group in aggregated.items()
}
Notes:
- I do not know if this is faster, but it's certainly simpler.
- If you're able, maintain the output with actual date keys.
- Your data are better-modeled by a defaultdict of defaultdicts of sets.
- I used a bunch of type hints to make sure that I'm doing the right thing.
Hmm, I tested it out and it was much better than mine, but still a bit slow. This is what I finally went with, taken from another forum post:
>>> from collections import defaultdict
>>> from functools import partial
>>>
>>> flat_list = ((key, db_item['start_time'][:10], results)
...              for db_item in db_data
...              for key, results in db_item['results'].items())
>>>
>>> d = defaultdict(partial(defaultdict, set))
>>>
>>> for key, date, li in flat_list:
...     d[key][date].update(li)
It works really well! It improved processing time from 50 seconds to 2 seconds.
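For completeness, turning those sets into the expected counts only needs one more pass, along these lines (a sketch; in my actual display code I call len at runtime instead):

>>> grouped_data = {key: {day: len(values) for day, values in by_date.items()}
...                 for key, by_date in d.items()}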
- This does not look complete, though. After the update step there is still a len required somewhere. – Reinderien, Apr 21, 2020 at 15:06
- That's true, though in the actual display of this data (I am using plotly), I can iterate the output of the above to build the plotly data and call len(d[key][date]) to get the values at runtime. – Jasper Sardonicus, Apr 22, 2020 at 17:12