\$\begingroup\$

I've looked through a lot of solutions on this topic, but I haven't been able to adapt any of them into a performant solution for my case. Suppose I have a list of dictionaries stored as:

db_data = [
    {
        "start_time": "2020-04-20T17:55:54.000-00:00",
        "results": {
            "key_1": ["a","b","c","d"],
            "key_2": ["a","b","c","d"],
            "key_3": ["a","b","c","d"]
        }
    },
    {
        "start_time": "2020-04-20T18:32:27.000-00:00",
        "results": {
            "key_1": ["a","b","c","d"],
            "key_2": ["a","b","e","f"],
            "key_3": ["a","e","f","g"]
        }
    },
    {
        "start_time": "2020-04-21T17:55:54.000-00:00",
        "results": {
            "key_1": ["a","b","c"],
            "key_2": ["a"],
            "key_3": ["a","b","c","d"]
        }
    },
    {
        "start_time": "2020-04-21T18:32:27.000-00:00",
        "results": {
            "key_1": ["a","b","c"],
            "key_2": ["b"],
            "key_3": ["a"]
        }
    }
]

I am trying to aggregate this list by date: the output should be a dictionary whose keys are the keys of the results objects, and whose values map each date to the number of unique values seen for that key on that day. For example, key_2 on 2020-04-20 has the lists ["a","b","c","d"] and ["a","b","e","f"], which together contain 6 unique values.

Expected output is something like:

{
    "key_1": {
        "2020-04-20": 4,
        "2020-04-21": 3
    },
    "key_2": {
        "2020-04-20": 6,
        "2020-04-21": 2
    },
    "key_3": {
        "2020-04-20": 7,
        "2020-04-21": 4
    }
}

What I have tried so far is using defaultdict and loops to aggregate the data. This takes a very long time unfortunately:

from datetime import datetime
from collections import defaultdict

grouped_data = defaultdict(dict)

for item in db_data:
    # start_time is a string, so parse it before extracting the date key
    group = datetime.fromisoformat(item['start_time']).date().isoformat()
    for k, v in item['results'].items():
        if group not in grouped_data[k]:
            grouped_data[k][group] = []
        grouped_data[k][group] = v + grouped_data[k][group]

for k, v in grouped_data.items():
    grouped_data[k] = {x: len(set(y)) for x, y in v.items()}

print(grouped_data)

Any help or guidance is appreciated. I have read that pandas might help here, but I am not quite sure how to adapt this use case.
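For reference, a pandas-based approach (a sketch, not tested against the real data set) could flatten each list into one row per value and then use groupby(...).nunique(); the sample data from the question is repeated here so the snippet is self-contained:

```python
import pandas as pd

# Sample data from the question (formatting condensed).
db_data = [
    {"start_time": "2020-04-20T17:55:54.000-00:00",
     "results": {"key_1": ["a","b","c","d"], "key_2": ["a","b","c","d"], "key_3": ["a","b","c","d"]}},
    {"start_time": "2020-04-20T18:32:27.000-00:00",
     "results": {"key_1": ["a","b","c","d"], "key_2": ["a","b","e","f"], "key_3": ["a","e","f","g"]}},
    {"start_time": "2020-04-21T17:55:54.000-00:00",
     "results": {"key_1": ["a","b","c"], "key_2": ["a"], "key_3": ["a","b","c","d"]}},
    {"start_time": "2020-04-21T18:32:27.000-00:00",
     "results": {"key_1": ["a","b","c"], "key_2": ["b"], "key_3": ["a"]}},
]

# Flatten to one (date, key, value) row per list element.
rows = [
    (item["start_time"][:10], key, value)
    for item in db_data
    for key, values in item["results"].items()
    for value in values
]
df = pd.DataFrame(rows, columns=["date", "key", "value"])

# Count distinct values per (key, date) group.
counts = df.groupby(["key", "date"])["value"].nunique()

# Reshape the MultiIndex Series into the nested-dict output shape.
grouped_data = {
    key: sub.droplevel("key").to_dict()
    for key, sub in counts.groupby(level="key")
}
print(grouped_data)
```

Whether this beats a plain-Python loop at ~500 records is an open question; pandas tends to pay off at much larger sizes.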

asked Apr 20, 2020 at 18:03
\$\endgroup\$
  • \$\begingroup\$ How many entries, approximately, are in your input data? \$\endgroup\$ Commented Apr 20, 2020 at 19:31
  • \$\begingroup\$ there are about 500 now, but that will continue to grow pretty steadily \$\endgroup\$ Commented Apr 20, 2020 at 19:49
  • \$\begingroup\$ And how slow is "too slow"? How long does this take to execute currently? And what is the prior system that is generating these data? \$\endgroup\$ Commented Apr 20, 2020 at 19:50
  • \$\begingroup\$ too slow is that those 500 records took ~50 seconds to process \$\endgroup\$ Commented Apr 21, 2020 at 2:49
  • \$\begingroup\$ Can you guarantee that the inner values are one character as you've shown? \$\endgroup\$ Commented Apr 21, 2020 at 3:46

2 Answers

\$\begingroup\$

Try:

from collections import defaultdict
from datetime import date, datetime
from typing import DefaultDict, Set, List, Dict

DefaultSet = DefaultDict[date, Set[str]]

def default_set() -> DefaultSet:
    return defaultdict(set)

aggregated: DefaultDict[str, DefaultSet] = defaultdict(default_set)

for entry in db_data:
    start_date: date = datetime.fromisoformat(entry['start_time']).date()
    result: Dict[str, List[str]] = entry['results']
    for k, v in result.items():
        aggregated[k][start_date].update(v)

grouped_data: Dict[str, Dict[date, int]] = {
    k: {gk: len(gv) for gk, gv in group.items()}
    for k, group in aggregated.items()
}

Notes:

  • I do not know if this is faster, but it's certainly simpler.
  • If you're able, keep actual date objects (rather than strings) as the keys in your output.
  • Your data are better modeled by a defaultdict of defaultdicts of sets.
  • I used a bunch of type hints to make sure that I'm doing the right thing.
answered Apr 20, 2020 at 20:36
\$\endgroup\$
\$\begingroup\$

Hmm, I tested it out and it was much better than mine, but still a bit slow. This is what I finally went with though from another forum post:

>>> from collections import defaultdict
>>> from functools import partial
>>>
>>> flat_list = ((key, db_item['start_time'][:10], results)
...              for db_item in db_data
...              for key, results in db_item['results'].items())
>>>
>>> d = defaultdict(partial(defaultdict, set))
>>>
>>> for key, date, li in flat_list:
...     d[key][date].update(li)

It works really well! It improved processing time from 50 seconds to 2 seconds.

answered Apr 21, 2020 at 3:02
\$\endgroup\$
  • \$\begingroup\$ This does not look complete, though. After the update step there is still a len required somewhere. \$\endgroup\$ Commented Apr 21, 2020 at 15:06
  • \$\begingroup\$ That's true, though in the actual display of this data (I am using plotly), I can iterate the output of the above to build the plotly data and call len(d[key][date]) to get the values at runtime. \$\endgroup\$ Commented Apr 22, 2020 at 17:12
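As the comment above notes, a final len pass is still needed to turn the sets into counts. For completeness, a self-contained sketch of the full pipeline (sample data from the question repeated so it runs on its own) might look like:

```python
from collections import defaultdict
from functools import partial

# Sample data from the question (formatting condensed).
db_data = [
    {"start_time": "2020-04-20T17:55:54.000-00:00",
     "results": {"key_1": ["a","b","c","d"], "key_2": ["a","b","c","d"], "key_3": ["a","b","c","d"]}},
    {"start_time": "2020-04-20T18:32:27.000-00:00",
     "results": {"key_1": ["a","b","c","d"], "key_2": ["a","b","e","f"], "key_3": ["a","e","f","g"]}},
    {"start_time": "2020-04-21T17:55:54.000-00:00",
     "results": {"key_1": ["a","b","c"], "key_2": ["a"], "key_3": ["a","b","c","d"]}},
    {"start_time": "2020-04-21T18:32:27.000-00:00",
     "results": {"key_1": ["a","b","c"], "key_2": ["b"], "key_3": ["a"]}},
]

# Build {key: {date: set_of_values}}, slicing the ISO timestamp
# string for the date rather than parsing it.
d = defaultdict(partial(defaultdict, set))
for db_item in db_data:
    day = db_item["start_time"][:10]
    for key, values in db_item["results"].items():
        d[key][day].update(values)

# The final step the answer omits: collapse each set to its size.
grouped_data = {
    key: {day: len(vals) for day, vals in by_date.items()}
    for key, by_date in d.items()
}
print(grouped_data)
```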
