I am retrieving information from an SQLite database that gives me back around 20 million rows that I need to process. This information is then transformed into a dict of lists which I need to use. I am trying to use generators wherever possible.
What optimizations can be done?
I am either getting a "Killed" message or it takes a really long time to run. The SQL result set part is working fine. I tested the generator code in the Python interpreter, and it doesn’t have any problems. I am guessing the problem is with the dict generation.
Update for clarity:
I have 20 million rows in my result set from my SQLite database. Each row is of the form:
(2786972, 486255.0, 4125992.0, 'AACAGA', '2005')
I now need to create a dict keyed on the fourth element of the row ('AACAGA'). The value the dict holds is the third element, but it has to hold the values for all occurrences in the result set.
So, in our case here, 'AACAGA' will hold a list containing multiple values from the SQL result set. The problem here is to find tandem repeats in a genome sequence. A tandem repeat is a genome read ('AACAGA') that is repeated at least three times in succession.
For me to calculate this, I need all the values in the third index as a list, keyed by the genome read ('AACAGA' in our case). Once I have the list, I can subtract successive values in the list to see whether there are three consecutive gaps matching the length of the read. This is what I aim to accomplish with the dictionary and lists as values.
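To make that concrete, here is a minimal sketch of the gap check on made-up numbers (the read and offsets below are hypothetical, chosen only to illustrate):

read = 'AACAGA'                                 # hypothetical 6-base read
offsets = sorted([100.0, 106.0, 112.0, 118.0])  # hypothetical occurrence offsets

# Three successive gaps equal to the read length mean adjacent copies
# of the read, which is the tandem-repeat condition checked below.
is_tandem = any(
    offsets[i+1] - offsets[i]
    == offsets[i+2] - offsets[i+1]
    == offsets[i+3] - offsets[i+2]
    == len(read)
    for i in range(len(offsets) - 3)
)
print(is_tandem)  # True: the gaps are 6, 6, 6 and len('AACAGA') == 6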
#!/usr/bin/python3.3

import sqlite3 as sql

sequence_dict = {}
tandem_repeat = {}

def dict_generator(large_dict):
    dkeys = large_dict.keys()
    for k in dkeys:
        yield (k, large_dict[k])

def create_result_generator():
    conn = sql.connect('sequences_mt_test.sqlite', timeout=20)
    c = conn.cursor()
    try:
        conn.row_factory = sql.Row
        sql_string = "select * from sequence_info where kmer_length > 2"
        c.execute(sql_string)
    except sql.Error as error:
        print("Error retrieving information from the database : ", error.args[0])
    result_set = c.fetchall()
    if result_set:
        conn.close()
        return (row for row in result_set)

def find_longest_tandem_repeat():
    sortList = []
    for entry in create_result_generator():
        sequence_dict.setdefault(entry[3], []).append(entry[2])
    for key, value in dict_generator(sequence_dict):
        sortList = sorted(value)
        for i in range (0, (len(sortList)-1)):
            if((sortList[i+1]-sortList[i]) == (sortList[i+2]-sortList[i+1])
                    == (sortList[i+3]-sortList[i+2]) == (len(key))):
                tandem_repeat[key] = True
                break
    print(max(k for k, v in tandem_repeat.items() if v))

if __name__ == "__main__":
    find_longest_tandem_repeat()
3 Answers
Your SQL is not working as well as you think it is.
Any time you do a significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data you are interested in. Your code treats the database as a passive storage format; you could just as well have stored everything in a CSV file.
You didn't provide any details about your database schema, so I can only guess that your third column is named position and the fourth column is named genome. (Had you explicitly specified which columns you were selecting, instead of just SELECT *, your code would be more self-documenting.) A query such as the following should extract the relevant rows:
SELECT *
  FROM sequence_info AS middle
  JOIN sequence_info AS preceding
    ON preceding.sequence_info = middle.sequence_info
   AND preceding.sequence_offset = middle.sequence_offset - length(middle.sequence_info)
  JOIN sequence_info AS following
    ON following.sequence_info = middle.sequence_info
   AND following.sequence_offset = middle.sequence_offset + length(middle.sequence_info)
 WHERE middle.kmer_length > 2
 ORDER BY length(middle.sequence_info) DESC, middle.sequence_info, middle.sequence_offset;
For performance, be sure to create indexes on the genome and position columns of your table (sequence_info and sequence_offset in the schema used above).
Addendum: Suggestion for performance
The following query should run faster than my original suggestion, since it only performs joins by equality on indexed column values.
SELECT first.*
  FROM sequence_info AS first
  JOIN sequence_info AS second
    ON second.sequence_info = first.sequence_info
   AND second.sequence_offset = first.next_offset
  JOIN sequence_info AS third
    ON third.sequence_info = second.sequence_info
   AND third.sequence_offset = second.next_offset
 WHERE first.kmer_length > 2
 ORDER BY (first.next_offset - first.sequence_offset) DESC
        , first.sequence_info
        , first.sequence_offset;
The implementation of such joins should be extremely well optimized in any relational database, since they are very common operations. You probably wouldn't be able to implement anything faster yourself in any language, much less in Python. You might be able to squeeze out better performance by using an INTEGER type for sequence_offset instead of a REAL.
To be able to run that query, you'll have to augment the sequence_info table with a next_offset column...
ALTER TABLE sequence_info ADD COLUMN next_offset REAL;
UPDATE sequence_info SET next_offset = sequence_offset + length(sequence_info);
CREATE UNIQUE INDEX sequence_index ON sequence_info (sequence_offset, sequence_info);
CREATE UNIQUE INDEX next_index ON sequence_info (next_offset, sequence_info);
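For reference, a minimal sketch of applying that migration and streaming the query from Python; the file name comes from the posted code, and everything else assumes the schema above:

import sqlite3

conn = sqlite3.connect('sequences_mt_test.sqlite')

# One-time migration (ALTER TABLE will fail if next_offset already exists).
conn.executescript("""
    ALTER TABLE sequence_info ADD COLUMN next_offset REAL;
    UPDATE sequence_info SET next_offset = sequence_offset + length(sequence_info);
    CREATE UNIQUE INDEX IF NOT EXISTS sequence_index
        ON sequence_info (sequence_offset, sequence_info);
    CREATE UNIQUE INDEX IF NOT EXISTS next_index
        ON sequence_info (next_offset, sequence_info);
""")

# Iterating the cursor streams rows from SQLite instead of building
# a 20-million-element list the way fetchall() would.
query = """
    SELECT first.*
      FROM sequence_info AS first
      JOIN sequence_info AS second
        ON second.sequence_info = first.sequence_info
       AND second.sequence_offset = first.next_offset
      JOIN sequence_info AS third
        ON third.sequence_info = second.sequence_info
       AND third.sequence_offset = second.next_offset
     WHERE first.kmer_length > 2
     ORDER BY (first.next_offset - first.sequence_offset) DESC,
              first.sequence_info, first.sequence_offset
"""
for row in conn.execute(query):
    print(row)   # rows arrive longest-read-first because of the ORDER BY
    break        # only the longest repeat is needed here
conn.close()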
If you still aren't satisfied with the performance after that change, you would probably have to consider trying another RDBMS (such as PostgreSQL), tuning the database, or throwing more RAM/CPU at the problem; in other words, factors beyond the scope of Code Review.
- @200_success: I apologize for that. Yes, I missed out on providing the schema. The third column is sequence_offset (the position you mentioned) and the fourth column is sequence_info (I made a mistake when naming the column). I am just getting into complex SQL and don't yet have the knowledge to write queries such as these, which is why I was relying on post-processing to do my job. I will test this query and get back on that. Thanks a ton for the help. It was really insightful. – adwaraki, Feb 13, 2014 at 15:47
- I made the changes you suggested and the query works fine on small data sets; I was just testing whether the logic worked, of course. I then ran it on my original result set and it is taking a long time to complete, even after I built the indexes on the columns as you suggested. The query has been running for 1.5 hours as of now. Does it usually take that long for a query such as this to run? Are there any other SQL optimizations I can make? Or is an SQLite database not suitable for such things? Would a CSV file parsed into a dict work better in this case? – adwaraki, Feb 13, 2014 at 20:18
- SQL is perfectly suited to these kinds of queries, though SQLite might not be your best bet for performance. See the addendum to my answer. – 200_success, Feb 15, 2014 at 9:50
- @200_success Postgres will be slower than SQLite no matter what you do, given it has TCP overhead. – PirateApp, Jun 26, 2018 at 0:57
I may have missed the whole point of your program, but these comments might be useful to you anyway:
dict_generator

I might be wrong, but dict_generator(large_dict) looks like large_dict.iteritems() (Python 2) / large_dict.items() (Python 3).
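A quick way to convince yourself of that equivalence (a toy dict, just for the demonstration):

def dict_generator(large_dict):
    dkeys = large_dict.keys()
    for k in dkeys:
        yield (k, large_dict[k])

d = {'AACAGA': [486255.0], 'TTGCAT': [1.0, 7.0]}
# Both produce the same (key, value) pairs in the same order.
assert list(dict_generator(d)) == list(d.items())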
create_result_generator
while True:
    result_set = c.fetchall()
    if not result_set:
        break
    else:
        return (row for row in result_set)
This loop does not seem very loopy to me. What about:
result_set = c.fetchall()
if result_set:
    return (row for row in result_set)
Also, it seems to highlight a bigger issue: the connection might never be closed.
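One way to address both points at once is to make the function a true generator that streams rows and closes the connection when iteration finishes (a sketch, reusing the file name and query from the posted code):

import sqlite3 as sql

def create_result_generator():
    conn = sql.connect('sequences_mt_test.sqlite', timeout=20)
    try:
        # Iterating the cursor fetches rows lazily, so 20 million rows
        # are never held in memory at once, unlike with fetchall().
        yield from conn.execute(
            "select * from sequence_info where kmer_length > 2")
    finally:
        conn.close()  # runs when the generator is exhausted or closed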
find_longest_tandem_repeat
In:

try:
    sequence_dict[row[3]].append(row[2])
except KeyError:
    sequence_dict[row[3]] = []
    sequence_dict[row[3]].append(row[2])

seems to be equivalent to

sequence_dict.setdefault(row[3], []).append(row[2])
Also, I don't know where row is coming from at this stage.
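On the same point, collections.defaultdict is a common alternative to setdefault that avoids building a throwaway empty list on every call (a sketch, reusing the entry loop from the posted code):

from collections import defaultdict

sequence_dict = defaultdict(list)
for entry in create_result_generator():
    # Missing keys get a fresh list automatically on first access.
    sequence_dict[entry[3]].append(entry[2])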
In:

for key,value in dict_generator(sequence_dict):
    sortList = sorted(value)
    for i in range (0, (len(sortList)-1)):
        if((sorList[i+1]-sorList[i]) == (sorList[i+2]-sorList[i+1])
                == (sorList[i+3]-sorList[i+2]) == (len(key))):
            tandem_repeat[key] = True
    sortList = []
The naming (sortList / sorList) is either wrong or very confusing. There is no point in doing sortList = [].
You can probably break once you've done tandem_repeat[key] = True, because new iterations won't do much anyway.
print(max(k for k in tandem_repeat.keys() if tandem_repeat[k] is True))

might as well be:

print(max(k for k, v in tandem_repeat.items() if v))
- I have edited the post to provide more clarification and use case scenarios. I have made most of the changes that you have suggested. Sorry about the typos. – adwaraki, Feb 13, 2014 at 14:09
- Seems better indeed :-) – SylvainD, Feb 13, 2014 at 14:11
I would expect this code to raise an IndexError when i equals len(sortList)-2.
for i in range (0, (len(sortList)-1)):
    if((sortList[i+1]-sortList[i]) == (sortList[i+2]-sortList[i+1])
            == (sortList[i+3]-sortList[i+2]) == (len(key))):
        tandem_repeat[key] = True
        break
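Bounding the range so that the furthest lookahead, i + 3, stays inside the list would avoid that (a sketch against the same variables):

for i in range(len(sortList) - 3):
    # i + 3 is at most len(sortList) - 1, so no index is out of range.
    if (sortList[i+1] - sortList[i] == sortList[i+2] - sortList[i+1]
            == sortList[i+3] - sortList[i+2] == len(key)):
        tandem_repeat[key] = True
        break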
Your function is named find_longest_tandem_repeat(), but I think it actually prints the tandem sequence that would occur last when sorted alphabetically:

print(max(k for k, v in tandem_repeat.items() if v))
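If "longest" is meant literally, selecting by read length rather than alphabetical order would match the function name (a sketch):

# key=len makes max() compare read lengths instead of comparing strings.
print(max((k for k, v in tandem_repeat.items() if v), key=len))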