2

We are generating logs like the following example (this is a table; the pipes are not actually in the data):

2014-06-10 09:00:03.457 | Channel1 | Operation3 | Function15 | 15ms
2014-06-10 09:00:08.245 | Channel2 | Operation5 | Function10 | 22ms
2014-06-10 09:00:22.005 | Channel1 | Operation3 | Function15 | 48ms

Now imagine this with at least 25-30 columns of time-series application log data. Every row is a transaction, so I aggregate the identical ones and get the sum (and also the average of the durations). After the aggregation, my unique clustered key is all of the columns except the metrics such as transaction duration or sum.

For example, between 2014-06-10 09:00:00 and 2014-06-10 09:01:00 the aggregated output will be:

2014-06-10 09:00:00 | Channel1 | Operation3 | Function15 | 2 | 31.5ms
2014-06-10 09:00:00 | Channel2 | Operation5 | Function10 | 1 | 22ms

Is there a better way to do this? Processing this data also costs me a lot when I present it to users for monitoring and analysis purposes.

UPDATE-1

I think this question needs more clarification. The raw log data looks like the first example. I'm running an ETL agent that picks it up at one-minute intervals and aggregates it into another table, as in the second example.

The second table has a primary key made up of all columns except the metric columns (Count, ResponseTime), because in the end 2014-06-10 09:00:00 | Channel1 | Operation3 | Function15 is the only combination that gives me uniqueness.
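
To make that concrete, the aggregated table looks roughly like this (a simplified sketch using only the example's dimensions; the real table has 25-30 columns):

-- Simplified, illustrative sketch of the per-minute aggregate table
CREATE TABLE TransactionSummary (
    Interval_Start DATETIME,        -- start of the one-minute bucket
    Channel        VARCHAR(10),
    Operation      VARCHAR(10),
    Function       VARCHAR(10),
    Txn_Count      INT,             -- the "Count" metric
    ResponseTime   DECIMAL(10,2),   -- average duration in ms
    PRIMARY KEY (Interval_Start, Channel, Operation, Function)
);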

When users want to analyze it, they choose values from each of the columns, which I call "Dimensions". They want to see "Function15 transactions which are on Channel1" or "Operation5 response times on Channel1", and so on. I'm storing the data like this to serve those requests.
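
For example (using the illustrative table sketched above), "Function15 transactions which are on Channel1" boils down to something like:

-- Illustrative query: Function15 transactions on Channel1, per minute
SELECT Interval_Start,
       SUM(Txn_Count)    AS Transactions,
       AVG(ResponseTime) AS AvgResponseTime
FROM TransactionSummary
WHERE Channel = 'Channel1'
  AND Function = 'Function15'
GROUP BY Interval_Start
ORDER BY Interval_Start;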

Regards

asked Jun 10, 2014 at 7:04
  • How frequently do you have to calculate this and make it available to end users? How soon after the event occurs do these calculated metrics need to be available? Commented Jun 12, 2014 at 20:10
  • About every 1 minute. It needs to be available to the end user instantly. Commented Jun 13, 2014 at 10:05
  • What do you mean "After the aggregation"? Are you aggregating the raw data into a different table? If so, why not just use a view to aggregate it instead? Commented Jun 13, 2014 at 20:42
  • Yes, I am aggregating into a different table. I have a lot of logs for different applications, and there is a lot of data in them that I don't need for monitoring. The source tables can be in Oracle, DB2 or MSSQL; I'm summarizing them and loading the results into another table. Commented Jun 16, 2014 at 6:33
  • Is there a hierarchy of Channel/Operation/Function, or is any combination valid? I'm tempted to suggest you scoop the data from the raw log every minute and load it into a star schema. Commented Jun 16, 2014 at 9:54

2 Answers

2
+50

I did the following

CREATE TABLE L(
Time_Series_TS TIMESTAMP, 
Channel VARCHAR(10), 
Operation VARCHAR(10), 
Function VARCHAR(10), 
Duration INT);

Then

INSERT INTO L VALUES('2014-06-10 09:00:03.457', 'Channel1', 'Operation3', 'Function15', 15);
INSERT INTO L VALUES('2014-06-10 09:00:08.245', 'Channel2', 'Operation5', 'Function10', 22);
INSERT INTO L VALUES('2014-06-10 09:00:22.005', 'Channel1', 'Operation3', 'Function15', 48);
INSERT INTO L VALUES('2014-06-10 09:01:03.457', 'Channel2', 'Operation3', 'Function15', 296);
INSERT INTO L VALUES('2014-06-10 09:01:08.245', 'Channel2', 'Operation5', 'Function10', 225);
INSERT INTO L VALUES('2014-06-10 09:01:22.005', 'Channel1', 'Operation3', 'Function15', 7);
INSERT INTO L VALUES('2014-06-10 09:01:16.245', 'Channel2', 'Operation5', 'Function10', 10);
INSERT INTO L VALUES('2014-06-10 09:01:47.005', 'Channel1', 'Operation3', 'Function15', 20);

I added a few records to your sample for checking, then ran this query:

SELECT MINUTE(Time_Series_TS) AS Minute, Channel, Operation, Function, 
COUNT(*) AS "Count/min", SUM(Duration) AS Duration 
FROM L
GROUP BY Minute, Channel, Operation, Function
ORDER BY Minute, Channel, Operation, Function;

Which gave

+--------+----------+------------+------------+-----------+----------+
| Minute | Channel  | Operation  | Function   | Count/min | Duration |
+--------+----------+------------+------------+-----------+----------+
|      0 | Channel1 | Operation3 | Function15 |         2 |       63 |
|      0 | Channel2 | Operation5 | Function10 |         1 |       22 |
|      1 | Channel1 | Operation3 | Function15 |         2 |       27 |
|      1 | Channel2 | Operation3 | Function15 |         1 |      296 |
|      1 | Channel2 | Operation5 | Function10 |         2 |      235 |
+--------+----------+------------+------------+-----------+----------+

This appears to be the result you want (note 63 as the first duration, as per my earlier comment). Is this the result you wanted? You can then use HOUR(), DAYOFMONTH() and even YEAR() in the same way to aggregate at those levels as well.
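
For instance, a sketch of the same aggregation at hourly granularity (same table and columns as above):

-- Sketch: aggregate per hour instead of per minute
SELECT DAYOFMONTH(Time_Series_TS) AS Day, HOUR(Time_Series_TS) AS Hour,
       Channel, Operation, Function,
       COUNT(*) AS "Count/hour", SUM(Duration) AS Duration
FROM L
GROUP BY Day, Hour, Channel, Operation, Function
ORDER BY Day, Hour, Channel, Operation, Function;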

For performance, I did create an index

CREATE INDEX L_Index ON L(Channel, Operation, Function) USING BTREE;

and explained the query before and after creating it, but there was no difference. This is hardly a surprise, since the optimiser almost certainly decided there was no point in using an index on such a small table. Obviously I can't test with your data, but there are a couple of points. If you are performing this operation over a large number of records with a large number of fields, you may run into issues, and if you create many indexes, your insert performance will suffer. Is it possible to categorise your data in some way to reduce the number of fields, i.e. split your big table into several tables with fewer fields? Work through the different scenarios, test, and see what happens with your data, your queries, your application and your hardware.

[EDIT]

For something more human readable, you might like to try something like

SELECT TIME(FROM_UNIXTIME(UNIX_TIMESTAMP(Time_Series_TS) - MOD(UNIX_TIMESTAMP(Time_Series_TS), 60))) AS Minute,
..
..

for your first field.
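
Putting that together with the earlier query gives something like the following sketch (drop the outer TIME() call if you also want to keep the date part of the bucket):

-- Sketch: full query with the timestamp truncated to the start of its minute
SELECT TIME(FROM_UNIXTIME(UNIX_TIMESTAMP(Time_Series_TS)
            - MOD(UNIX_TIMESTAMP(Time_Series_TS), 60))) AS Minute,
       Channel, Operation, Function,
       COUNT(*) AS "Count/min", SUM(Duration) AS Duration
FROM L
GROUP BY Minute, Channel, Operation, Function
ORDER BY Minute, Channel, Operation, Function;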

[EDIT - Response to UPDATE-1]

OK - so in my schema, you are indexing by (Minute, Channel, Operation, Function)? See here for the documentation on composite indexes in MySQL. If your queries have a predominantly left-to-right orientation, i.e. you [always | usually] filter on Channel first, then Operation, then Function, you could try a single composite index on Minute plus the usual three columns. If the combinations are fairly arbitrary, you could try using 6 indexes, but this will hit insert performance. How much, I can't say, but if this is a DW-type app performing the analysis, you can batch the inserts and only occasionally take the hit for that.

You'll have to do a few tests with realistic data and EXPLAIN your queries - with realistic sample data because, as I said earlier, with just a few records the optimiser ignores indexes since the table is too small. Interestingly, on the MySQL manual page linked above, there's a hashing strategy which looks interesting: take MD5 hashes of CONCAT(Your_Column_List_Here) - there's a rough sketch of this at the end of this edit. One other thing that I can suggest is that instead of using

SELECT TIME(FROM_UNIXTIME(UNIX_TIMESTAMP(Time_Series_TS) - MOD(UNIX_TIMESTAMP(Time_Series_TS), 60))) AS Minute,...

just drop the TIME() and FROM_UNIXTIME() wrappers and keep the raw UNIX_TIMESTAMP arithmetic - then you'll be storing INTs, which appears to be better than indexing DATETIMEs - see here for a benchmark. Also, as previously mentioned, you should move this data off Production and perform the OLAP/DW work on another machine. You could also test out the InfiniDB solution that I suggested; it's drop-in compatible with MySQL (no learning curve). Then there are all the NoSQL solutions - we could be here all day :-). Take a look at a few scenarios, evaluate and test, and then choose what best fits your budget and requirements. Forgot: make your OLAP/DW system read-only for performing queries - no transactional overhead! Make the OLAP/DW tables MyISAM? This last one is controversial - again, test and see.
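
As a rough sketch of the hashing strategy mentioned above (table and column names are purely illustrative and not tested against your schema): store a hash of the concatenated dimension values next to each row, index the hash instead of a wide composite key, and search on the hash.

-- Illustrative only: aggregate table carrying an MD5 hash of its dimensions
CREATE TABLE L_Agg (
    Minute      INT,            -- unix time rounded down to the minute
    Channel     VARCHAR(10),
    Operation   VARCHAR(10),
    Function    VARCHAR(10),
    Txn_Count   INT,
    Duration    INT,
    Dim_Hash    CHAR(32)        -- MD5(CONCAT(Channel, '#', Operation, '#', Function))
);
CREATE INDEX L_Agg_Hash ON L_Agg (Minute, Dim_Hash);

-- Look up by hash rather than by the individual dimension columns
SELECT *
FROM L_Agg
WHERE Minute = UNIX_TIMESTAMP('2014-06-10 09:00:00')
  AND Dim_Hash = MD5(CONCAT('Channel1', '#', 'Operation3', '#', 'Function15'));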

answered Jun 17, 2014 at 1:11
  • I have 200,000-300,000 records per one-minute interval. This query will be slow in my opinion. What is your comment on this? Commented Jun 17, 2014 at 11:34
  • Not knowing your app, hardware and disk config, it's difficult to begin. The first thing I'd do is get the data to be queried off production onto another machine, set read-only (see here). There are lots of strategies you can employ - master-slave, incremental dumps (XtraBackup (Percona) if InnoDB). Your data may suit InfiniDB, a columnar store for OLAP. There are many ways to skin this cat and we'd need to know a lot more about your config to really help. Commented Jun 17, 2014 at 13:51
  • I've added some clarification to the question. Commented Jun 17, 2014 at 14:29
-1

Try discretizing your data. You can do this by using staging tables. The result will look like this:

time | channel | operation | function | timecost | count
2014-06-10 09:00:00 | Channel1 | Operation3 | Function15 | >30ms | 1
2014-06-10 09:00:00 | Channel2 | Operation3 | Function15 | 0-5ms | 1
2014-06-10 09:00:00 | Channel1 | Operation3 | Function15 | >30ms | 1
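
One way to produce those timecost buckets (the thresholds here are only illustrative, and this borrows the table L from the other answer) is a CASE expression in the query that loads the staging table:

-- Sketch: bucket raw durations into discrete timecost ranges per minute
SELECT TIME(FROM_UNIXTIME(UNIX_TIMESTAMP(Time_Series_TS)
            - MOD(UNIX_TIMESTAMP(Time_Series_TS), 60))) AS Minute,
       Channel, Operation, Function,
       CASE
           WHEN Duration <= 5  THEN '0-5ms'
           WHEN Duration <= 30 THEN '6-30ms'
           ELSE '>30ms'
       END AS timecost,
       COUNT(*) AS "count"
FROM L
GROUP BY Minute, Channel, Operation, Function, timecost;
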
answered Jun 16, 2014 at 9:20
