I explained a similar situation with plain text files in Grep huge number of patterns from huge file. Many people there said I should use a database, so now I'm migrating my data to a sqlite database. I have a file from which I extract about 10,000 patterns. For each pattern, I check whether the database contains it; if it doesn't, I save the pattern to a file for further processing:
for id in $(grep '^[0-9]' keys); do
  if [[ -z $(sqlite3 db.sqlite "select id from main where id = $id") ]]; then
    echo "$id" >>file
  fi
done
Since I'm new to SQL, I couldn't find a simple way to do this. Also, this loop is unusable: it's about 20 times slower than what I achieved with awk in the question linked above. Since the database is huge and keeps growing, and I run this loop very frequently, is it possible to make it faster?
3 Answers
For each pattern, you're invoking a new instance of the sqlite3 program, which connects to the database anew. That's a waste. You should build a single query that looks for all of the keys at once, then execute that one query. Database engines are good at executing large queries.
If the matching lines in the keys file contain only digits, then you can build the query as follows:
{
  echo 'select id from main where id in (';
  <keys grep -x '[0-9][0-9]*' |   # retain only lines containing only digits
  sed -e '1! s/^/, /';            # add ", " at the beginning of every line except the first
  echo ');';
} | sqlite3 db.sqlite
For more general input data, you get the idea: use text transformations to build a single large query. Be careful to validate your input; here we make sure that what gets injected into the query is syntactically valid. There's actually a corner case in the example above: if there is no match in the file, then the SQL syntax is invalid; if that might happen, you'll need to treat this case specially. Here's more complex code that takes care of the empty case:
<keys grep -x '[0-9][0-9]*' |
if read first; then
  {
    echo 'select id from main where id in (' "$first"
    sed -e 's/^/, /'   # each remaining line gets a leading ", "
    echo ');'
  } | sqlite3 db.sqlite
fi
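Note that this prints the ids that are present in the database, whereas the question asks for the ones that are absent. One way to flip it, as a sketch: SQLite names the columns of a bare VALUES clause column1, column2, and so on, so you can turn the key list into an inline table and anti-join it against main (the same empty-input caveat applies):
{
  echo 'select column1 from (values'
  <keys grep -x '[0-9][0-9]*' | sed -e 's/.*/(&)/' -e '1! s/^/, /'
  echo ') where column1 not in (select id from main);'
} | sqlite3 db.sqlite >file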
First things first, you really should replace the if with a list. Actually, I would even replace the [[ ]]s with [ ]s, and then run it in dash or some other lighter sh. This even seems simple enough to ditch the entire for loop and run with xargs (always my preference; better performance). So, for example, maybe something like this ...
grep '^[0-9]' keys | xargs -P0 -n1 \
  sh -c '[ -z "$(sqlite3 db.sqlite "select id from main where id = $1")" ] && echo "$1" >>file' sh
Quoting inside sh -c is easy to get wrong, which is why the id is passed to the inner shell as a positional parameter ($1) instead of being spliced into the command string; either way, this should point you in the right direction.
I would suspect this would perform MUCH faster, not least because you would be running in parallel via -P.
If for some reason even this is crawling, you could always look into something that dumps the sqlite db and does the comparison outside the database. You would likely end up writing a script if you took that approach, so I would only consider it if it were necessary.
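For instance, a rough sketch of that idea, assuming a full dump of the id column is acceptable:
# dump every id currently in the database, sorted for comm
sqlite3 db.sqlite 'select id from main' | sort -u >db_ids
# keep only the candidate ids that do not appear in the dump
grep '^[0-9]' keys | sort -u | comm -23 - db_ids >file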
Import your IDs into a new table, then use that table to query your main table. Drop the table when you're done with it.
{ echo id; grep '^[0-9]' keys; } |
sqlite3 database.db \
  'CREATE TABLE ids ( id INTEGER UNIQUE )' \
  '.import /dev/stdin ids' \
  'SELECT * FROM main NATURAL JOIN ids' \
  'DROP TABLE ids'
Testing:
sqlite> .mode box
sqlite> SELECT * FROM main;
┌────┬─────────────────┐
│ id │ word │
├────┼─────────────────┤
│ 1 │ concessionaire │
│ 2 │ goniometrically │
│ 3 │ meshed │
│ 4 │ Celtic │
│ 5 │ guiltless │
│ 6 │ sclerodactylia │
│ 7 │ spiritism │
│ 8 │ ratchel │
│ 9 │ Bajau │
│ 10 │ semimineral │
└────┴─────────────────┘
$ cat keys
3
7
78
190
Running the given command would output the following (use select id instead of * to output only the IDs):
3|meshed
7|spiritism
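Since the question ultimately wants the ids that are missing from the database, the same temporary-table trick covers that too. A sketch, swapping the join for an EXCEPT; the echo id header is dropped here because, when the target table already exists, current sqlite3 versions treat every imported row as data, and a stray "id" row would show up in the result:
grep '^[0-9]' keys |
sqlite3 database.db \
  'CREATE TABLE ids ( id INTEGER UNIQUE )' \
  '.import /dev/stdin ids' \
  'SELECT id FROM ids EXCEPT SELECT id FROM main' \
  'DROP TABLE ids' >file
With the sample data above, file should end up containing 78 and 190.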