Here is another database problem I stumbled upon.
I have a date-range-partitioned MyISAM lookup table with 200M records and ~150 columns. On this table I need to perform cascading SELECT statements to filter the data. Output:
filter 1: 126M rows
filter 2: 110M rows
filter 3: 40M rows
filter 4: 5M rows
filter 5: 100k rows
Every single SELECT is highly complex, with regexes (so no index is possible) and multiple comparisons, which is why I want each one to query the fewest rows possible.
There are about 500 unique filters and around 200 constant users. Every filter needs to be run for each user, in total around 100k combinations.
Big question: Is there a way for each subsequent SELECT statement to query only the previous subset?
Example: Filter #5 should only have to query the 5M rows produced by filter #4 to get its 100k results. At the moment it has to scan all 200M records.
EDIT: my current approach is a cache table:
CREATE TABLE IF NOT EXISTS `cache` (
`filter_id` int(11) NOT NULL,
`user_id` int(11) NOT NULL,
`lookup_id` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
ALTER TABLE `cache`
  -- lookup_id must be part of the key; (filter_id, user_id) alone would allow only one row per combination
  ADD PRIMARY KEY (`filter_id`,`user_id`,`lookup_id`);
This would contain the relation between individual data rows from the lookup table and the filters. PLUS, I'd be able to use the primary key to fetch all of the lookup_ids from the previous filter.
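For illustration, a minimal sketch of how the first (base) filter's result set could be written into the cache; @user_id, regex_column and the pattern are placeholders, not names from the original schema:

-- filter 1 still has to scan the full lookup table once
SET @user_id = 42;
INSERT INTO cache (filter_id, user_id, lookup_id)
SELECT 1, @user_id, l.id
FROM lookup_table l
WHERE l.regex_column RLIKE '<pattern for filter 1>';  -- plus the other filter-1 conditions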
Query for subsequent filters:
SELECT SUM(l.column), COUNT(*)
FROM cache c
-- JOIN instead of LEFT JOIN: the WHERE condition on l discards non-matches anyway
JOIN lookup_table l ON c.lookup_id = l.id
WHERE c.filter_id = 1
  AND c.user_id = x
  AND l.regex_column RLIKE ...
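To continue the cascade inside the cache table itself, filter 2 could be materialized from filter 1's rows; a minimal sketch under the same assumptions (filter ids, @user_id and the pattern are placeholders):

INSERT INTO cache (filter_id, user_id, lookup_id)
SELECT 2, c.user_id, c.lookup_id
FROM cache c
JOIN lookup_table l ON c.lookup_id = l.id
WHERE c.filter_id = 1
  AND c.user_id = @user_id
  AND l.regex_column RLIKE '<pattern for filter 2>';
-- filter 2 now reads only filter 1's 126M cached rows instead of all 200M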
1 Answer
If filter 110M is exactly a subset of filter 126M, then appending more AND conditions to the WHERE clause would have done the job.
$sql1 = "SELECT ..... WHERE ...";
$sql2 = $sql1 . " AND column_name = ....";
$sql3 = $sql2 . " AND column_name = ....";
If that is complex to accomplish, try to CREATE VIEW from the previous SELECT statement, and let the next SELECT statement query the view.
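For example, a minimal sketch of the view approach; the view name, column name and patterns are placeholders:

CREATE VIEW filter1_subset AS
SELECT * FROM lookup_table
WHERE regex_column RLIKE '<pattern 1>';   -- filter-1 conditions

SELECT SUM(column), COUNT(*)
FROM filter1_subset
WHERE regex_column RLIKE '<pattern 2>';   -- filter-2 conditions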
Yes, each subsequent query is a subset of the previous one. The dynamic AND structure is working right now, but it scans the whole table (200M rows) each time, which is painfully slow. I would need a view for every filter/user combination, which totals about 100k additional tables. Each row needs 0.5 kB of space and the average row count per view would be 10M, so that still comes out to more than 500 TB of data; no way my system could handle that :( – C-D Apr 23, 2015 at 8:53
"Every single SELECT is highly complex, with regexes (so no index is possible) and multiple comparisons, which is why I want each one to query the fewest rows possible." – have you looked at indexing GENERATED columns? You can perhaps index regexes that way. Not 100% on this!
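A hedged sketch of that idea, assuming MySQL 5.7+ (generated columns) and that an RLIKE expression is accepted in a stored generated column; matches_filter1 and the pattern are made-up names. Since this needs one column per regex, it would likely only pay off for the most expensive filters:

ALTER TABLE lookup_table
  ADD COLUMN matches_filter1 TINYINT
    AS (regex_column RLIKE '<pattern 1>') STORED,
  ADD INDEX idx_matches_filter1 (matches_filter1);

-- the filter can then use the index instead of evaluating the regex per row
SELECT ... FROM lookup_table WHERE matches_filter1 = 1;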