
I have a SQLite table named hashbands that looks like this:

hashband | file_id | window_id
------------------------------
potato | 0 | 0
potato | 1 | 0

The table has 100M+ rows (lol). I need to find all hashbands that have two or more distinct file_id values, and then I need to fetch all rows that contain those hashbands and order the results by hashband. Right now I'm using this:

 WITH file_id_counts AS (
   SELECT hashband, COUNT(DISTINCT file_id) AS count
   FROM hashbands
   GROUP BY hashband
   HAVING count > 1
 )
 SELECT hashband, file_id, window_id
 FROM hashbands
 WHERE hashband IN (SELECT hashband FROM file_id_counts)
 ORDER BY hashband

Does anyone see ways to expedite this query? Any pointers would be helpful!

asked Mar 29, 2021 at 14:25
  • Is this question off topic here? Commented Mar 30, 2021 at 12:04

2 Answers


Most of the performance issue you're seeing is likely just the normal constraints of SQLite at this data size. Your query is sound, and I don't believe there's much you can do to optimize it, other than rewriting the WHERE predicate as a more efficiently relational INNER JOIN, like this:

WITH file_id_counts AS (
  SELECT hashband, COUNT(DISTINCT file_id) AS count
  FROM hashbands
  GROUP BY hashband
  HAVING count > 1
)
SELECT hashbands.hashband, file_id, window_id
FROM hashbands
INNER JOIN file_id_counts
  ON hashbands.hashband = file_id_counts.hashband
ORDER BY hashbands.hashband

This is logically equivalent, and the reason it could be faster is that the IN clause is effectively syntactic sugar for a series of OR comparisons, which can be less efficient than the single direct equality the INNER JOIN performs above.
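To sanity-check that the two forms return the same rows, here's a minimal sketch using Python's bundled sqlite3 module. The sample rows are illustrative, not the asker's real data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hashbands (hashband TEXT, file_id INTEGER, window_id INTEGER);
    INSERT INTO hashbands VALUES
        ('potato', 0, 0), ('potato', 1, 0), ('carrot', 0, 1);
""")

in_query = """
    WITH file_id_counts AS (
        SELECT hashband, COUNT(DISTINCT file_id) AS count
        FROM hashbands GROUP BY hashband HAVING count > 1
    )
    SELECT hashband, file_id, window_id FROM hashbands
    WHERE hashband IN (SELECT hashband FROM file_id_counts)
    ORDER BY hashband
"""

join_query = """
    WITH file_id_counts AS (
        SELECT hashband, COUNT(DISTINCT file_id) AS count
        FROM hashbands GROUP BY hashband HAVING count > 1
    )
    SELECT h.hashband, h.file_id, h.window_id
    FROM hashbands h
    INNER JOIN file_id_counts c ON h.hashband = c.hashband
    ORDER BY h.hashband
"""

# Both forms should return only the 'potato' rows (two distinct file_ids).
print(sorted(conn.execute(in_query).fetchall()) ==
      sorted(conn.execute(join_query).fetchall()))  # True
```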

Additionally, ensuring your hashbands table has an index on at least (hashband) or possibly on (hashband, file_id) should help (if one doesn't exist already).
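You can confirm whether SQLite actually picks up the index by running EXPLAIN QUERY PLAN from any client; here's a small sketch in Python (the index name is made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hashbands (hashband TEXT, file_id INTEGER, window_id INTEGER)")
# Hypothetical index name; a composite index on (hashband, file_id)
# covers both the grouping key and the DISTINCT-counted column.
conn.execute("CREATE INDEX idx_hashband_file ON hashbands (hashband, file_id)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT hashband, COUNT(DISTINCT file_id)
    FROM hashbands GROUP BY hashband
""").fetchall()
for row in plan:
    # Recent SQLite versions typically report a covering-index scan here,
    # meaning the base table never needs to be touched.
    print(row)
```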

Finally, if possible, removing your ORDER BY clause and doing the sorting in your consuming application instead would probably help a small amount as well. This mostly just moves the onus of the sort to a different part of the call stack, but sorting in the database adds extra work that is sometimes cheaper to do in the consuming application. Plus, sorting is really presentation logic in my opinion (at least when it isn't functionally required by the query).

Where does this SQLite database live? A mobile application, or on its own server?

answered Mar 29, 2021 at 15:20
  • Thank you for these comments! This DB is for identifying text reuse and will live on the personal machines of those who use my lab's little software package for identifying text reuse. Ultimately we'll make the DB configurable so that users who are capable of installing & managing a more robust db will be able to do so, but those who are new to programming will be able to use the hassle-free sqlite db. Should I remove the default ROWID and use hashband as the primary key on the hashband table? Seems like that should help... Commented Mar 29, 2021 at 15:38
  • @duhaime Don't think you'll be able to create a primary key on the hashband field since it's not unique. But you'll want to create an index on the hashband field regardless if you use the default ROWID or not, which will be dependent on if you cluster on hashband. You can read more on that in the SQLite docs but my recommendation is start small and just add a nonclustered index on (hashband) or (hashband, file_id) (you should test your use case both ways to see which one is more performant). Then you can see if clustering on it helps. Commented Mar 29, 2021 at 15:54
  • lol ah yes of course Commented Mar 29, 2021 at 20:56
  • @duhaime Best of luck! Btw I'd be curious what your before and after looks like, in terms of average runtime if you care to share. (I'm a nerd when it comes to optimization metrics, especially for larger sized data. :) Also, with the proper optimizations, SQLite should be able to handle 100 million rows OK enough on actual computers (as opposed to a mobile device) so you should be able to get this to work sufficiently. Commented Mar 29, 2021 at 22:13
  • 1
    Ah thanks for the heads up! I think we'll quickly bump our heads against that 10 GB limit in this project, but this is good to know about for future projects! Commented Mar 29, 2021 at 23:43

Using a window aggregate would presumably be substantially faster than the self-join you currently have. Window functions are available from SQLite version 3.25.

Unfortunately, in SQLite, DISTINCT cannot be used in a window aggregate, but we can simulate it with MAX of DENSE_RANK (if file_id is nullable then it gets a little more complex).

The calculation is quite simple: DENSE_RANK will give us a ranking, but with tied results, e.g. 1,1,2,3,3,3,4. Then we use MAX to get the highest of that. PARTITION BY works like GROUP BY but for window aggregates, in other words the function is calculated separately over each partition.
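To make the DENSE_RANK behavior concrete, here's a toy run via Python's bundled sqlite3 (requires SQLite 3.25+; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # window functions require SQLite 3.25+
conn.execute("CREATE TABLE hashbands (hashband TEXT, file_id INTEGER, window_id INTEGER)")
conn.executemany("INSERT INTO hashbands VALUES (?, ?, ?)",
                 [("potato", 0, 0), ("potato", 0, 1), ("potato", 1, 0)])

# file_id values 0, 0, 1 get dense ranks 1, 1, 2 -- so the max rank (2)
# equals the number of distinct file_ids in the partition.
rows = conn.execute("""
    SELECT file_id,
           DENSE_RANK() OVER (PARTITION BY hashband ORDER BY file_id) AS rnk
    FROM hashbands
""").fetchall()
print(sorted(rows))  # [(0, 1), (0, 1), (1, 2)]
```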

WITH file_id_ranked AS (
  SELECT *,
         DENSE_RANK() OVER (PARTITION BY hashband ORDER BY file_id) AS rnk
  FROM hashbands
),
file_id_counts AS (
  SELECT *,
         MAX(rnk) OVER (PARTITION BY hashband) AS count
  FROM file_id_ranked
)
SELECT hashband, file_id, window_id
FROM file_id_counts
WHERE count > 1
ORDER BY hashband;

I strongly recommend an index on hashband, file_id, window_id in that order.
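Putting the whole query together on toy data (again via Python's sqlite3, with an illustrative index name matching the recommendation above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # window functions require SQLite 3.25+
conn.executescript("""
    CREATE TABLE hashbands (hashband TEXT, file_id INTEGER, window_id INTEGER);
    CREATE INDEX idx_hb ON hashbands (hashband, file_id, window_id);
    INSERT INTO hashbands VALUES
        ('potato', 0, 0), ('potato', 1, 0), ('carrot', 0, 1), ('carrot', 0, 2);
""")

rows = conn.execute("""
    WITH file_id_ranked AS (
        SELECT *, DENSE_RANK() OVER (PARTITION BY hashband ORDER BY file_id) AS rnk
        FROM hashbands
    ),
    file_id_counts AS (
        SELECT *, MAX(rnk) OVER (PARTITION BY hashband) AS count
        FROM file_id_ranked
    )
    SELECT hashband, file_id, window_id
    FROM file_id_counts
    WHERE count > 1
    ORDER BY hashband
""").fetchall()

# 'carrot' repeats a single file_id, so only the 'potato' rows survive.
print(sorted(rows))  # [('potato', 0, 0), ('potato', 1, 0)]
```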

answered Mar 30, 2021 at 1:23
  • This looks great! I've never used DENSE_RANK or PARTITION so will need to research them. But file_id is not nullable :) Commented Mar 30, 2021 at 12:04
  • 1
    I've elaborated a little on how the calculation works Commented Mar 30, 2021 at 12:11
