I have a log table. It logs visitor data (time of visit, browser, window size ...).
I want to create reports. I use a query like this (MySQL):
SELECT
COUNT(*) as raw_views ...
FROM logs
WHERE timestamp >= CURDATE()
GROUP BY DATE(timestamp)
The problem is DATE(timestamp)
. I am using a function on this field, so the index won't work. But the table is huge. Index are very important. But how else can I create hourly, daily, yearly, monthly reports?
I tried to find out how others are doing it. For example Piwik (an open source Google Analytics like script). But I don't really get it.
4 Answers 4
Looking at the original query
SELECT
COUNT(*) as raw_views ...
FROM logs
WHERE timestamp >= CURDATE()
GROUP BY DATE(timestamp)
It appears that you are looking for counts for just today. In that instance, why use GROUP BY
at all ? You should run
SELECT
COUNT(*) as raw_views ...
FROM logs
WHERE timestamp >= CURDATE();
OK, let's get a little more granular. You said you have reports per user_id
Perhaps the query could be adjusted per user
SELECT
COUNT(*) as raw_views
FROM logs
WHERE userid = <some number>
AND timestamp >= CURDATE();
In that case, you can index like this:
ALTER TABLE logs ADD INDEX userid_timestamp_index (user_id,timestamp);
If you are getting all the user raw count in one shot, you would query like this
SELECT
user_id,COUNT(*) as raw_views
FROM logs
WHERE timestamp >= CURDATE()
GROUP BY user_id
;
In that case, you can index like this:
ALTER TABLE logs ADD INDEX timestamp_userid_index (timestamp,user_id);
This is all I can suggest with what you have given. Hope this help !!!
You could create an additional column, date, which stores the function date(timestamp) on insertion. This won't make the group by extremely efficient, but it can avoid the temporary table.
The second problem is the range + GROUP BY, which would make an index on (timestamp, date) useless (BTREE limitations). You can create just an index on (date) or better, on (date, timestamp) if you are using 5.6 with Index Condition Pushdown and do:
SELECT
COUNT(*) as raw_views ...
FROM logs
WHERE date >= CURDATE() and timestamp >= CURDATE()
GROUP BY date
Again, this will be better than before, but it will not make the group by radically efficient.
There are several options for that, the most direct is the one that @Remus Rusanu mentioned, calculating aggregates. If you need them "in real time", you can update them on every insert, adding +1 with a trigger, for example, but that will reduce quite significantly your insertion rate. Usually, those are cached for several usages, as @ypercube mentions, and only recalculated every some time.
Maybe you are using the wrong tool: MySQL is not the best tool for analytics/aggregate results. Think of column-based storages- there are even some propietary implementations as MySQL storage engines.
-
Some, such as InfiniDB are Open Source - Infobright is also, but less so. I think that InfiniDB is an interesting project - it's even starting to put sorely needed windowing functions into MySQL. Even better for this sort of work would be PostgreSQL IF you can change - or use a second server as a data warehouse.Vérace– Vérace2014年07月25日 12:21:28 +00:00Commented Jul 25, 2014 at 12:21
-
Well, I actually used proprietary as in "exclusive/non common/owned" rather than non-free.jynus– jynus2014年07月25日 12:23:11 +00:00Commented Jul 25, 2014 at 12:23
First create a Calendar table (a table of dates) as described in this answer; then extend it by adding and populating Year, MonthName, DayOfMonth, and Tomorrow columns in that table and add a unique index on it by (Year, MonthName and DayOfMonth), and another unique index on it by the base column (Date) - the Primary Key.
Now you can generate your reports by joining to the Calendar table (assume it is called Calendar) like this:
SELECT
Year,MonthName,
count(*) as raw_views
from logs
join Calendar on _data <= Timestamp and TimeStamp < Tomorrow
group by Year,MonthName
This might work:
select sum(N), the_date from (
SELECT COUNT(*) as N, timestamp, DATE(timestamp) as the_date
FROM logs
WHERE timestamp >= CURDATE()
group by timestamp
) as A
GROUP BY the_date -- or: GROUP BY DATE(timestamp)
Your first problem is MySQL, for many reasons, here because function(column)
shouldn't preclude the use of an index on column
. The above query might engender consultation of the column at the price of two aggregations.
GROUP BY
can only be used if you have an aggregate, of which you have none.