I've spent the last day trying to get an aggregation over a time series from my db. I tried to use the Django ORM but quickly gave up and went running back to raw SQL. I don't think there's a way to use PostgreSQL's generate_series with it; I assume they'd prefer you to use itertools or another method in Python.
I have a model much like this:
class Vote(models.Model):
    value = models.IntegerField(default=0)
    timestamp = models.DateTimeField('date voted', auto_now_add=True)
    location = models.ForeignKey('location', on_delete=models.CASCADE)
What I want to do is show a series of metrics over time -- for now, an aggregation per hour of the current day for the current user. The user has a timezone set (defaults to 'America/Chicago'). I've been fiddling with the Postgres query, inserting tons of AT TIME ZONE casts in an effort to wrangle the bounds and return values of the query. I had it returning the correct results late last night, but this morning it's off again. I know it's got to be something very dumb that I'm doing. I even resorted to double-casting timestamps because of the weird way Postgres handles AT TIME ZONE (correcting TO UTC instead of FROM).
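For reference, the two directions of AT TIME ZONE can be mimicked in plain Python. This is only a sketch of the behaviour as I understand it from the Postgres docs, assuming Python 3.9+ for zoneinfo. The helper name at_time_zone and the sample date are made up for illustration; none of this is a Django or Postgres API.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def at_time_zone(value, zone_name):
    """Mimic PostgreSQL's AT TIME ZONE for a single value (hypothetical helper).

    naive in -> read as wall-clock time in the zone, aware instant (UTC) out
    aware in -> converted to the zone, naive wall-clock time out
    """
    zone = ZoneInfo(zone_name)
    if value.tzinfo is None:
        # timestamp without time zone -> timestamp with time zone
        return value.replace(tzinfo=zone).astimezone(timezone.utc)
    # timestamp with time zone -> timestamp without time zone
    return value.astimezone(zone).replace(tzinfo=None)


# Made-up sample values; Chicago is on UTC-5 (CDT) on this date.
naive = datetime(2016, 4, 20, 12, 0)
aware = datetime(2016, 4, 20, 12, 0, tzinfo=timezone.utc)

as_instant = at_time_zone(naive, "America/Chicago")     # noon in Chicago, as a UTC instant
as_wall_clock = at_time_zone(aware, "America/Chicago")  # noon UTC, as Chicago wall-clock time
```

The same operator flips direction depending on whether its input carries a time zone, which is why chaining it twice can look like a correction "TO" UTC rather than "FROM" it.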
Again, I'd like to show buckets of aggregates for each hour of the user's current day up to/including 'now'.
This is my current query:
WITH hour_intervals AS (
    SELECT * FROM generate_series(
        date_trunc('day', (SELECT TIMESTAMP 'today' AT TIME ZONE 'UTC' AT TIME ZONE %s)),
        (LOCALTIMESTAMP AT TIME ZONE 'UTC' AT TIME ZONE %s),
        '1 hour'
    ) start_time
)
SELECT f.start_time,
       COUNT(id) total,
       COUNT(CASE WHEN value > 0 THEN 1 END) AS positive_votes,
       COUNT(CASE WHEN value = 0 THEN 1 END) AS indifferent_votes,
       COUNT(CASE WHEN value < 0 THEN 1 END) AS negative_votes,
       SUM(CASE WHEN value > 0 THEN 2 WHEN value = 0 THEN 1 WHEN value < 0 THEN -4 END) AS score
FROM votes_vote m
RIGHT JOIN hour_intervals f
    ON m.timestamp AT TIME ZONE %s >= f.start_time
   AND m.timestamp AT TIME ZONE %s < f.start_time + '1 hour'::interval
   AND m.location_id = %s
GROUP BY f.start_time
ORDER BY f.start_time
DEBUGGING INFO
Django 1.9.2 and my settings.py has USE_TZ=True
Postgres 9.5.2 and my login role for django has
ALTER ROLE yesno_django
SET client_encoding = 'utf8';
ALTER ROLE yesno_django
SET default_transaction_isolation = 'read committed';
ALTER ROLE yesno_django
SET TimeZone = 'UTC';
UPDATE: After fiddling with the query some more, this is now a working query for today's votes:
WITH hour_intervals AS (
    SELECT * FROM generate_series(
        (SELECT TIMESTAMP 'today' AT TIME ZONE 'UTC'),
        (LOCALTIMESTAMP AT TIME ZONE 'UTC' AT TIME ZONE %s),
        '1 hour'
    ) start_time
)
SELECT f.start_time,
       COUNT(id) total,
       COUNT(CASE WHEN value > 0 THEN 1 END) AS positive_votes,
       COUNT(CASE WHEN value = 0 THEN 1 END) AS indifferent_votes,
       COUNT(CASE WHEN value < 0 THEN 1 END) AS negative_votes,
       SUM(CASE WHEN value > 0 THEN 2 WHEN value = 0 THEN 1 WHEN value < 0 THEN -4 END) AS score
FROM votes_vote m
RIGHT JOIN hour_intervals f
    ON m.timestamp AT TIME ZONE %s >= f.start_time
   AND m.timestamp AT TIME ZONE %s < f.start_time + '1 hour'::interval
   AND m.location_id = %s
GROUP BY f.start_time
ORDER BY f.start_time
How come the query I had earlier worked perfectly from 7pm to 10pmish last night but then fails today? Should I expect this new query to fall down as well?
Can someone explain where I went wrong the first time (or every time)?
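One way to probe this is to replay the start bound of the first query in plain Python for two different moments. The sketch below assumes a session TimeZone of UTC (as in the role settings above), so that TIMESTAMP 'today' means midnight of the current UTC date, and assumes Python 3.9+ for zoneinfo. The function names and sample times are hypothetical, chosen only for illustration.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def first_query_start(now_utc, zone_name):
    """Replay date_trunc('day', TIMESTAMP 'today' AT TIME ZONE 'UTC' AT TIME ZONE zone),
    assuming the session TimeZone is UTC. Returns naive wall-clock time."""
    # TIMESTAMP 'today' AT TIME ZONE 'UTC': midnight of the current UTC date, as an instant
    utc_midnight = datetime.combine(now_utc.date(), time.min, tzinfo=timezone.utc)
    # ... AT TIME ZONE zone: that instant as wall-clock time in the user's zone
    local_wall = utc_midnight.astimezone(ZoneInfo(zone_name)).replace(tzinfo=None)
    # date_trunc('day', ...): midnight of whatever local date that landed on
    return datetime.combine(local_wall.date(), time.min)


def intended_start(now_utc, zone_name):
    """Midnight of the user's current local day, as naive wall-clock time."""
    local_now = now_utc.astimezone(ZoneInfo(zone_name)).replace(tzinfo=None)
    return datetime.combine(local_now.date(), time.min)


zone = "America/Chicago"
evening = datetime(2016, 4, 21, 1, 30, tzinfo=timezone.utc)  # 8:30pm on Apr 20 in Chicago
morning = datetime(2016, 4, 21, 14, 0, tzinfo=timezone.utc)  # 9:00am on Apr 21 in Chicago

evening_ok = first_query_start(evening, zone) == intended_start(evening, zone)
morning_ok = first_query_start(morning, zone) == intended_start(morning, zone)
```

Under these assumptions the two bounds agree only while the UTC date is one day ahead of the Chicago date, roughly 7pm to midnight local time during daylight saving time; for the rest of the day the replayed bound falls on the previous local day. The updated query also anchors its start on the UTC date, so it may be worth checking it in that same evening window.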
2 Answers
First, add related_name='votes' to your foreign key to location, for better control. Then, using the location model, you can do:
from django.db.models import Count, Case, Sum, When, IntegerField
from django.db.models.expressions import DateTime

queryset = location.objects.annotate(
    datetimes=DateTime('votes__timestamp', 'hour', tz),
    positive_votes=Count(Case(
        When(votes__value__gt=0, then=1),
        default=None,
        output_field=IntegerField())),
    indifferent_votes=Count(Case(
        When(votes__value=0, then=1),
        default=None,
        output_field=IntegerField())),
    negative_votes=Count(Case(
        When(votes__value__lt=0, then=1),
        default=None,
        output_field=IntegerField())),
    score=Sum(Case(
        When(votes__value__lt=0, then=-4),
        When(votes__value=0, then=1),
        When(votes__value__gt=0, then=2),
        output_field=IntegerField())),
).values_list(
    'datetimes', 'positive_votes', 'indifferent_votes', 'negative_votes', 'score'
).distinct().order_by('datetimes')
That will generate statistics for each location. You can, of course, filter it to any location or time range.
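A query grouped by hour only returns hours that actually contain votes, so the time range and the empty hours have to come from somewhere else. The following is a minimal sketch in plain Python, assuming Python 3.9+ for zoneinfo. The helpers hour_buckets and fill_missing_hours are hypothetical names, the sample data is made up, and reporting a score of 0 for an empty hour is a choice (the SQL SUM would give NULL there).

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

# Stats reported for an hour without votes (a choice made for this sketch).
EMPTY = {"positive_votes": 0, "indifferent_votes": 0, "negative_votes": 0, "score": 0}


def hour_buckets(now_utc, zone_name):
    """Aware (UTC) start times for each hour of the user's current local day, up to now."""
    zone = ZoneInfo(zone_name)
    local_today = now_utc.astimezone(zone).date()
    # Local midnight expressed as a UTC instant; stepping in UTC sidesteps DST gaps.
    current = datetime.combine(local_today, time.min, tzinfo=zone).astimezone(timezone.utc)
    buckets = []
    while current <= now_utc:
        buckets.append(current)
        current += timedelta(hours=1)
    return buckets


def fill_missing_hours(buckets, rows):
    """rows maps an aware hour start to its stats dict; hours with no row get EMPTY."""
    by_hour = {start.astimezone(timezone.utc): stats for start, stats in rows.items()}
    return [(start, by_hour.get(start, EMPTY)) for start in buckets]


# Made-up example: 9:30am in Chicago on Apr 21, 2016, with votes only in the 3am hour.
now_utc = datetime(2016, 4, 21, 14, 30, tzinfo=timezone.utc)
buckets = hour_buckets(now_utc, "America/Chicago")
rows = {buckets[3]: {"positive_votes": 1, "indifferent_votes": 0, "negative_votes": 0, "score": 2}}
filled = fill_missing_hours(buckets, rows)
```

The first bucket and now_utc could then serve as the bounds of a range filter such as votes__timestamp__gte and votes__timestamp__lte, keeping the time zone arithmetic in Python instead of in the query.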
5 Comments
I get "ValueError: Database returned an invalid value in QuerySet.datetimes(). Are time zone definitions for your database and pytz installed?" when calling from Django. Looks like there might be a bug: code.djangoproject.com/ticket/25937#comment:1
If tz is None and you have timezone support globally enabled in Django, it will throw that error, so you must set the time zone every time. And yes, one drawback of that query is omitting hours without any votes.
If you have the USE_TZ setting set to True, you must pass a time zone object as the third parameter of DateTime. If you have USE_TZ set to False, try passing None instead.
If the datetime fields you are dealing with allow nulls, you can work around https://code.djangoproject.com/ticket/25937 with the following:
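from datetime import datetime
from django.db.models import Count, Value
from django.db.models.functions import Coalesce, TruncMonth
from django.utils import timezone
Potato.objects.annotate(
    time=Coalesce(
        TruncMonth('removed', tzinfo=timezone.UTC()),
        Value(datetime.min.replace(tzinfo=timezone.UTC())),  # sentinel for NULL times
    ),
).values('time').annotate(c=Count('pk'))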
This replaces the NULL times with an easy-to-spot sentinel. If you were already using datetime.min, you'll have to come up with something else.
I'm using this in production, but I've found that where TruncMonth() on its own would give you local time, once you put Coalesce() around it you can only get naive or UTC datetimes.
DATE_TRUNC? Django has a built-in option for using it.
votes = Vote.objects.filter(location=l).filter(timestamp__date=timezone.now().date()).extra({"hour":"date_trunc('hour',timestamp)"}).values("hour").order_by().annotate(score=score_annotation, count=Count('id'))
I think it's close -- I'm going to play with this method a bit more. Thanks!