I am retrieving information from an SQLite database that gives me back around 20 million rows that I need to process. This information is then transformed into a dict of lists which I need to use. I am trying to use generators wherever possible.
What optimizations can be done?
I am either getting a "Killed" message or it takes a really long time to run. The SQL result set part is working fine. I tested the generator code in the Python interpreter, and it doesn’t have any problems. I am guessing the problem is with the dict generation.
Update for clarity:
I have 20 million rows in my result set from my SQLite database. Each row is of the form:
(2786972, 486255.0, 4125992.0, 'AACAGA', '2005')
I now need to create a dict keyed on the fourth element of the row ('AACAGA'). The value the dict holds is the third element, but it has to hold the values for all occurrences in the result set.
So, in our case here, 'AACAGA' will hold a list containing multiple values from the SQL result set. The problem here is to find tandem repeats in a genome sequence. A tandem repeat is a genome read ('AACAGA') that is repeated at least three times in succession.
For me to calculate this, I need all the values in the third index as a list, keyed by the genome read ('AACAGA' in our case). Once I have the list, I can subtract successive values in the list to see whether there are three consecutive gaps matching the length of the read. This is what I aim to accomplish with the dictionary and lists as values.
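To make that concrete, here is a minimal sketch of the gap check on made-up numbers (the read and offsets below are hypothetical, chosen only to illustrate):

read = 'AACAGA'                                 # hypothetical 6-base read
offsets = sorted([100.0, 106.0, 112.0, 118.0])  # hypothetical occurrence offsets

# Three successive gaps equal to the read length mean adjacent copies
# of the read, which is the tandem-repeat condition checked below.
is_tandem = any(
    offsets[i+1] - offsets[i]
    == offsets[i+2] - offsets[i+1]
    == offsets[i+3] - offsets[i+2]
    == len(read)
    for i in range(len(offsets) - 3)
)
print(is_tandem)  # True: the gaps are 6, 6, 6 and len('AACAGA') == 6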
#!/usr/bin/python3.3

import sqlite3 as sql

sequence_dict = {}
tandem_repeat = {}

def dict_generator(large_dict):
    dkeys = large_dict.keys()
    for k in dkeys:
        yield (k, large_dict[k])

def create_result_generator():
    conn = sql.connect('sequences_mt_test.sqlite', timeout=20)
    c = conn.cursor()
    try:
        conn.row_factory = sql.Row
        sql_string = "select * from sequence_info where kmer_length > 2"
        c.execute(sql_string)
    except sql.Error as error:
        print("Error retrieving information from the database : ", error.args[0])
    result_set = c.fetchall()
    if result_set:
        conn.close()
        return (row for row in result_set)

def find_longest_tandem_repeat():
    sortList = []
    for entry in create_result_generator():
        sequence_dict.setdefault(entry[3], []).append(entry[2])
    for key, value in dict_generator(sequence_dict):
        sortList = sorted(value)
        for i in range (0, (len(sortList)-1)):
            if((sortList[i+1]-sortList[i]) == (sortList[i+2]-sortList[i+1])
                    == (sortList[i+3]-sortList[i+2]) == (len(key))):
                tandem_repeat[key] = True
                break
    print(max(k for k, v in tandem_repeat.items() if v))

if __name__ == "__main__":
    find_longest_tandem_repeat()
3 Answers
Your SQL is not working as well as you think it is.
Any time you do a significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data you are interested in. Your code treats the database as a passive storage format; you could just as well have stored everything in a CSV file.
You didn't provide any details about your database schema, so I can only guess that your third column is named position and the fourth column is named genome. (Had you explicitly specified which columns you were selecting, instead of just SELECT *, your code would be more self-documenting.) A query such as the following should extract the relevant rows:
SELECT *
  FROM sequence_info AS middle
  JOIN sequence_info AS preceding
    ON preceding.sequence_info = middle.sequence_info
   AND preceding.sequence_offset = middle.sequence_offset - length(middle.sequence_info)
  JOIN sequence_info AS following
    ON following.sequence_info = middle.sequence_info
   AND following.sequence_offset = middle.sequence_offset + length(middle.sequence_info)
 WHERE middle.kmer_length > 2
 ORDER BY length(middle.sequence_info) DESC, middle.sequence_info, middle.sequence_offset;
For performance, be sure to create indexes on the genome and position columns of your table (sequence_info and sequence_offset in the schema used above).
Addendum: Suggestion for performance
The following query should run faster than my original suggestion, since it only performs joins by equality on indexed column values.
SELECT first.*
  FROM sequence_info AS first
  JOIN sequence_info AS second
    ON second.sequence_info = first.sequence_info
   AND second.sequence_offset = first.next_offset
  JOIN sequence_info AS third
    ON third.sequence_info = second.sequence_info
   AND third.sequence_offset = second.next_offset
 WHERE first.kmer_length > 2
 ORDER BY (first.next_offset - first.sequence_offset) DESC
        , first.sequence_info
        , first.sequence_offset;
The implementation of such joins should be extremely well optimized in any relational database, since they are very common operations. You probably wouldn't be able to implement anything faster yourself in any language, much less in Python. You might be able to squeeze out better performance by using an INTEGER type for sequence_offset instead of a REAL.
To be able to run that query, you'll have to augment the sequence_info table with a next_offset column...
ALTER TABLE sequence_info ADD COLUMN next_offset REAL;
UPDATE sequence_info SET next_offset = sequence_offset + length(sequence_info);
CREATE UNIQUE INDEX sequence_index ON sequence_info (sequence_offset, sequence_info);
CREATE UNIQUE INDEX next_index ON sequence_info (next_offset, sequence_info);
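For reference, a minimal sketch of applying that migration and streaming the query from Python; the file name comes from the posted code, and everything else assumes the schema above:

import sqlite3

conn = sqlite3.connect('sequences_mt_test.sqlite')

# One-time migration (ALTER TABLE will fail if next_offset already exists).
conn.executescript("""
    ALTER TABLE sequence_info ADD COLUMN next_offset REAL;
    UPDATE sequence_info SET next_offset = sequence_offset + length(sequence_info);
    CREATE UNIQUE INDEX IF NOT EXISTS sequence_index
        ON sequence_info (sequence_offset, sequence_info);
    CREATE UNIQUE INDEX IF NOT EXISTS next_index
        ON sequence_info (next_offset, sequence_info);
""")

# Iterating the cursor streams rows from SQLite instead of building
# a 20-million-element list the way fetchall() would.
query = """
    SELECT first.*
      FROM sequence_info AS first
      JOIN sequence_info AS second
        ON second.sequence_info = first.sequence_info
       AND second.sequence_offset = first.next_offset
      JOIN sequence_info AS third
        ON third.sequence_info = second.sequence_info
       AND third.sequence_offset = second.next_offset
     WHERE first.kmer_length > 2
     ORDER BY (first.next_offset - first.sequence_offset) DESC,
              first.sequence_info, first.sequence_offset
"""
for row in conn.execute(query):
    print(row)   # rows arrive longest-read-first because of the ORDER BY
    break        # only the longest repeat is needed here
conn.close()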
If you still aren't satisfied with the performance after that change, you would probably have to consider trying another RDBMS (such as PostgreSQL), tuning the database, or throwing more RAM/CPU at the problem; in other words, factors beyond the scope of Code Review.
- @200_success: I apologize for that. Yes, I missed out on providing the schema. The third column is sequence_offset (the position you mentioned) and the fourth column is sequence_info (I made a mistake when naming the column). I am just getting into complex SQL and don't yet have the knowledge to write queries such as these, which is why I was relying on post-processing to do my job. I will test this query and get back on that. Thanks a ton for the help. It was really insightful. – adwaraki, Feb 13, 2014 at 15:47
- I made the changes you suggested and the query works fine on small data sets; I was just testing whether the logic worked, of course. I then ran it on my original result set and it is taking a long time to complete, even after I built the indexes on the columns as you suggested. The query has been running for 1.5 hours as of now. Does it usually take that long for a query such as this to run? Are there any other SQL optimizations I can make? Or is an SQLite database not suitable for such things? Would a CSV file parsed into a dict work better in this case? – adwaraki, Feb 13, 2014 at 20:18
- SQL is perfectly suited to these kinds of queries, though SQLite might not be your best bet for performance. See the addendum to my answer. – 200_success, Feb 15, 2014 at 9:50
- @200_success Postgres will be slower than SQLite no matter what you do, given it has TCP overhead. – PirateApp, Jun 26, 2018 at 0:57
I may have missed the whole point of your program, but these comments might be useful to you anyway:
dict_generator

I might be wrong, but dict_generator(large_dict) looks like large_dict.iteritems() (Python 2) / large_dict.items() (Python 3).
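A quick way to convince yourself of that equivalence (a toy dict, just for the demonstration):

def dict_generator(large_dict):
    dkeys = large_dict.keys()
    for k in dkeys:
        yield (k, large_dict[k])

d = {'AACAGA': [486255.0], 'TTGCAT': [1.0, 7.0]}
# Both produce the same (key, value) pairs in the same order.
assert list(dict_generator(d)) == list(d.items())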
create_result_generator
while True:
    result_set = c.fetchall()
    if not result_set:
        break
    else:
        return (row for row in result_set)
This loop does not seem very loopy to me. What about:
result_set = c.fetchall()
if result_set:
    return (row for row in result_set)
Also, it seems to highlight a bigger issue: the connection might never be closed.
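One way to address both points at once is to make the function a true generator that streams rows and closes the connection when iteration finishes (a sketch, reusing the file name and query from the posted code):

import sqlite3 as sql

def create_result_generator():
    conn = sql.connect('sequences_mt_test.sqlite', timeout=20)
    try:
        # Iterating the cursor fetches rows lazily, so 20 million rows
        # are never held in memory at once, unlike with fetchall().
        yield from conn.execute(
            "select * from sequence_info where kmer_length > 2")
    finally:
        conn.close()  # runs when the generator is exhausted or closed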
find_longest_tandem_repeat
In:

try:
    sequence_dict[row[3]].append(row[2])
except KeyError:
    sequence_dict[row[3]] = []
    sequence_dict[row[3]].append(row[2])

seems to be equivalent to

sequence_dict.setdefault(row[3], []).append(row[2])
Also, I don't know where row is coming from at this stage.
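On the same point, collections.defaultdict is a common alternative to setdefault that avoids building a throwaway empty list on every call (a sketch, reusing the entry loop from the posted code):

from collections import defaultdict

sequence_dict = defaultdict(list)
for entry in create_result_generator():
    # Missing keys get a fresh list automatically on first access.
    sequence_dict[entry[3]].append(entry[2])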
In:

for key,value in dict_generator(sequence_dict):
    sortList = sorted(value)
    for i in range (0, (len(sortList)-1)):
        if((sorList[i+1]-sorList[i]) == (sorList[i+2]-sorList[i+1])
                == (sorList[i+3]-sorList[i+2]) == (len(key))):
            tandem_repeat[key] = True
    sortList = []
The naming (sortList / sorList) is either wrong or very confusing. There is no point in doing sortList = [].
You can probably break once you've done tandem_repeat[key] = True, because new iterations won't do much anyway.
print(max(k for k in tandem_repeat.keys() if tandem_repeat[k] is True))

might as well be:

print(max(k for k, v in tandem_repeat.items() if v))
- I have edited the post to provide more clarification and use case scenarios. I have made most of the changes that you have suggested. Sorry about the typos. – adwaraki, Feb 13, 2014 at 14:09
- Seems better indeed :-) – SylvainD, Feb 13, 2014 at 14:11
I would expect this code to raise an IndexError when i equals len(sortList)-2.
for i in range (0, (len(sortList)-1)):
    if((sortList[i+1]-sortList[i]) == (sortList[i+2]-sortList[i+1])
            == (sortList[i+3]-sortList[i+2]) == (len(key))):
        tandem_repeat[key] = True
        break
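Bounding the range so that the furthest lookahead, i + 3, stays inside the list would avoid that (a sketch against the same variables):

for i in range(len(sortList) - 3):
    # i + 3 is at most len(sortList) - 1, so no index is out of range.
    if (sortList[i+1] - sortList[i] == sortList[i+2] - sortList[i+1]
            == sortList[i+3] - sortList[i+2] == len(key)):
        tandem_repeat[key] = True
        break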
Your function is named find_longest_tandem_repeat(), but I think it actually prints the tandem sequence that would occur last when sorted alphabetically:

print(max(k for k, v in tandem_repeat.items() if v))
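If "longest" is meant literally, selecting by read length rather than alphabetical order would match the function name (a sketch):

# key=len makes max() compare read lengths instead of comparing strings.
print(max((k for k, v in tandem_repeat.items() if v), key=len))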