I have a SQLite table named hashbands
that looks like this:
hashband | file_id | window_id
------------------------------
potato | 0 | 0
potato | 1 | 0
The table has 100M+ rows (lol). I need to find all hashbands that have two or more distinct file_id values, and then I need to fetch all rows that contain those hashbands and order the results by hashband. Right now I'm using this:
WITH file_id_counts AS (
  SELECT hashband, COUNT(DISTINCT file_id) AS count
  FROM hashbands
  GROUP BY hashband
  HAVING count > 1
)
SELECT hashband, file_id, window_id
FROM hashbands
WHERE hashband IN (SELECT hashband FROM file_id_counts)
ORDER BY hashband
Does anyone see ways to expedite this query? Any pointers would be helpful!
-
Is this question off topic here? – duhaime, Mar 30, 2021 at 12:04
2 Answers
Likely the majority of the performance issue you're seeing is just down to the normal constraints of SQLite at this data size. Your query is logically sound, and I don't believe there's much you can do to optimize it other than rewriting the WHERE ... IN predicate into a more efficient relational form with an INNER JOIN, like this:
WITH file_id_counts AS (
  SELECT hashband, COUNT(DISTINCT file_id) AS count
  FROM hashbands
  GROUP BY hashband
  HAVING count > 1
)
SELECT hashbands.hashband, file_id, window_id
FROM hashbands
INNER JOIN file_id_counts
  ON hashbands.hashband = file_id_counts.hashband
ORDER BY hashbands.hashband
This is logically equivalent (note that hashband must now be qualified in the SELECT and ORDER BY, since the column exists on both sides of the join), and the reason it could be faster is that the IN clause is syntactic sugar for a bunch of OR comparisons, which can be less efficient than the single direct equality operator the INNER JOIN uses above.
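If you want to verify whether the rewrite changes anything in practice, you can prefix either version with SQLite's EXPLAIN QUERY PLAN and compare the output (a quick sketch against the table from the question; the exact plan text varies by SQLite version):
EXPLAIN QUERY PLAN
WITH file_id_counts AS (
  SELECT hashband, COUNT(DISTINCT file_id) AS count
  FROM hashbands
  GROUP BY hashband
  HAVING count > 1
)
SELECT hashbands.hashband, file_id, window_id
FROM hashbands
INNER JOIN file_id_counts
  ON hashbands.hashband = file_id_counts.hashband
ORDER BY hashbands.hashband;
-- A plan line like "SCAN hashbands" indicates a full table scan; a line like
-- "SEARCH hashbands USING INDEX ..." indicates an index is being used.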
Additionally, ensuring your hashbands table has an index on at least (hashband), or possibly on (hashband, file_id), should help (if one doesn't exist already).
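For example, a minimal sketch (the index name here is arbitrary, and you should test both column lists against your workload):
CREATE INDEX IF NOT EXISTS idx_hashbands_hashband_file_id
  ON hashbands (hashband, file_id);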
Finally, if possible, removing your ORDER BY clause and instead doing the sorting in your consuming application would probably help a small amount as well. This mostly just moves the onus of the sort to a different part of the call stack, but sorting in the database adds a little extra work to the query that can sometimes be done slightly more efficiently in the consuming application. Plus, sorting is really presentation logic in my opinion (at least when it serves no functional purpose in the query).
Where does this SQLite database live? A mobile application, or its own server?
-
Thank you for these comments! This DB is for identifying text reuse and will live on the personal machines of those who use my lab's little software package. Ultimately we'll make the DB configurable, so users who are capable of installing & managing a more robust DB will be able to do so, while those who are new to programming can use the hassle-free SQLite DB. Should I remove the default ROWID and use hashband as the primary key on the hashbands table? Seems like that should help... – duhaime, Mar 29, 2021 at 15:38
-
@duhaime I don't think you'll be able to create a primary key on the hashband field, since it's not unique. But you'll want to create an index on the hashband field regardless of whether you use the default ROWID or not, which will depend on whether you cluster on hashband. You can read more on that in the SQLite docs, but my recommendation is to start small and just add a nonclustered index on (hashband) or (hashband, file_id) (you should test your use case both ways to see which one is more performant). Then you can see if clustering on it helps. – J.D., Mar 29, 2021 at 15:54
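In SQLite terms, "clustering on hashband" means a WITHOUT ROWID table whose primary key leads with hashband, so rows sharing a hashband are stored physically together. A hedged sketch of that option, assuming (hashband, file_id, window_id) is unique in your data (the table name is illustrative):
CREATE TABLE hashbands_clustered (
  hashband  TEXT    NOT NULL,
  file_id   INTEGER NOT NULL,
  window_id INTEGER NOT NULL,
  -- WITHOUT ROWID requires a PRIMARY KEY; leading with hashband clusters on it
  PRIMARY KEY (hashband, file_id, window_id)
) WITHOUT ROWID;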
-
@duhaime Best of luck! Btw, I'd be curious what your before and after look like in terms of average runtime, if you care to share. (I'm a nerd when it comes to optimization metrics, especially for larger-sized data. :) Also, with the proper optimizations, SQLite should be able to handle 100 million rows well enough on actual computers (as opposed to a mobile device), so you should be able to get this to work. – J.D., Mar 29, 2021 at 22:13
-
Ah, thanks for the heads up! I think we'll quickly bump our heads against that 10 GB limit in this project, but this is good to know about for future projects! – duhaime, Mar 29, 2021 at 23:43
Using a window aggregate would presumably be substantially faster than the self-join you currently have. Window functions are available in SQLite from version 3.25.
Unfortunately, in SQLite, DISTINCT cannot be used in an OVER aggregate, but we can simulate it with MAX of DENSE_RANK (if file_id were nullable it would be a little more complex).
The calculation is quite simple: DENSE_RANK gives us a ranking in which tied values share a rank, e.g. 1,1,2,3,3,3,4. Then we use MAX to get the highest of those ranks, which equals the number of distinct file_id values. PARTITION BY works like GROUP BY, but for window aggregates; in other words, the function is calculated separately over each partition.
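To make the mechanics concrete, here is a small self-contained illustration (assumes SQLite 3.25+; the demo CTE and its data are made up for the example):
WITH demo(hashband, file_id) AS (
  VALUES ('potato', 0), ('potato', 0), ('potato', 1)
)
SELECT hashband, file_id,
  -- tied file_ids share a rank, so the max rank within a hashband
  -- equals its count of distinct file_id values
  DENSE_RANK() OVER (PARTITION BY hashband ORDER BY file_id) AS rnk
FROM demo;
-- rnk comes back as 1, 1, 2, so MAX(rnk) = 2 distinct file_ids for 'potato'.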
WITH file_id_ranked AS (
SELECT *,
DENSE_RANK() OVER (PARTITION BY hashband ORDER BY file_id) rnk
FROM hashbands
),
file_id_counts AS (
SELECT *,
MAX(rnk) OVER (PARTITION BY hashband) AS count
FROM file_id_ranked
)
SELECT hashband, file_id, window_id
FROM file_id_counts
WHERE count > 1
ORDER BY hashband;
I strongly recommend an index on (hashband, file_id, window_id), in that order.
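For example (a sketch; the index name is illustrative). With all three columns present, this is a covering index for the query above, so SQLite can satisfy it from the index alone:
CREATE INDEX IF NOT EXISTS idx_hashbands_hb_file_window
  ON hashbands (hashband, file_id, window_id);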
-
This looks great! I've never used DENSE_RANK or PARTITION, so I'll need to research them. But file_id is not nullable :) – duhaime, Mar 30, 2021 at 12:04
-
I've elaborated a little on how the calculation works. – Charlieface, Mar 30, 2021 at 12:11