Addendum: Suggestion for performance
The following query should run faster than my original suggestion, since it only performs joins by equality on indexed column values.
SELECT first.*
FROM sequence_info AS first
JOIN sequence_info AS second
ON second.sequence_info = first.sequence_info
AND second.sequence_offset = first.next_offset
JOIN sequence_info AS third
ON third.sequence_info = second.sequence_info
AND third.sequence_offset = second.next_offset
WHERE first.kmer_length > 2
ORDER BY (first.next_offset - first.sequence_offset) DESC
, first.sequence_info
, first.sequence_offset;
The implementation of such joins should be extremely well optimized in any relational database, since they are very common operations. You probably wouldn't be able to implement anything faster yourself in any language, much less in Python. You might be able to squeeze out better performance by using an INTEGER
type for sequence_offset
instead of a REAL
.
To be able to run that query, you'll have to augment the sequence_info
table with a next_offset
column...
ALTER TABLE sequence_info ADD COLUMN next_offset REAL;
UPDATE sequence_info SET next_offset = sequence_offset + length(sequence_info);
CREATE UNIQUE INDEX sequence_index ON sequence_info (sequence_offset, sequence_info);
CREATE UNIQUE INDEX next_index ON sequence_info (next_offset, sequence_info);
If you still aren't satisfied with the performance after that change, you would probably have to consider trying another RDMS (such as PostgreSQL), tuning the database, or throwing more RAM/CPU at the problem — in other words, factors beyond the scope of Code Review.
Addendum: Suggestion for performance
The following query should run faster than my original suggestion, since it only performs joins by equality on indexed column values.
SELECT first.*
FROM sequence_info AS first
JOIN sequence_info AS second
ON second.sequence_info = first.sequence_info
AND second.sequence_offset = first.next_offset
JOIN sequence_info AS third
ON third.sequence_info = second.sequence_info
AND third.sequence_offset = second.next_offset
WHERE first.kmer_length > 2
ORDER BY (first.next_offset - first.sequence_offset) DESC
, first.sequence_info
, first.sequence_offset;
The implementation of such joins should be extremely well optimized in any relational database, since they are very common operations. You probably wouldn't be able to implement anything faster yourself in any language, much less in Python. You might be able to squeeze out better performance by using an INTEGER
type for sequence_offset
instead of a REAL
.
To be able to run that query, you'll have to augment the sequence_info
table with a next_offset
column...
ALTER TABLE sequence_info ADD COLUMN next_offset REAL;
UPDATE sequence_info SET next_offset = sequence_offset + length(sequence_info);
CREATE UNIQUE INDEX sequence_index ON sequence_info (sequence_offset, sequence_info);
CREATE UNIQUE INDEX next_index ON sequence_info (next_offset, sequence_info);
If you still aren't satisfied with the performance after that change, you would probably have to consider trying another RDMS (such as PostgreSQL), tuning the database, or throwing more RAM/CPU at the problem — in other words, factors beyond the scope of Code Review.
Your SQL is not working as well as you think it is.
Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.
You didn't provide any details about your database schema, so I can only guess that your third column is named position
and the fourth column is named genome
.(削除) You didn't provide any details about your database schema, so I can only guess that your third column is named (Had you explicitly specified which columns you were selecting, instead of just position
and the fourth column is named genome
. (削除ここまで)SELECT *
, your code would be more self-documenting.) A query such as the following should extract the relevant rows:
SELECT *
FROM sequence_info AS middle
JOIN sequence_info AS preceding
ON preceding.genomesequence_info = middle.genomesequence_info
AND preceding.positionsequence_offset = middle.positionsequence_offset - length(middle.genomesequence_info)
JOIN sequence_info AS following
ON following.genomesequence_info = middle.genomesequence_info
AND following.positionsequence_offset = middle.positionsequence_offset + length(middle.genomesequence_info)
WHERE middle.kmer_length > 2
ORDER BY length(middle.genomesequence_info) DESC, middle.genomesequence_info, middle.position;sequence_offset;
For performance, be sure to create indexes on the genome
and position
columns of your table.
Your SQL is not working as well as you think it is.
Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.
You didn't provide any details about your database schema, so I can only guess that your third column is named position
and the fourth column is named genome
. (Had you explicitly specified which columns you were selecting, instead of just SELECT *
, your code would be more self-documenting.) A query such as the following should extract the relevant rows:
SELECT *
FROM sequence_info AS middle
JOIN sequence_info AS preceding
ON preceding.genome = middle.genome
AND preceding.position = middle.position - length(middle.genome)
JOIN sequence_info AS following
ON following.genome = middle.genome
AND following.position = middle.position + length(middle.genome)
WHERE kmer_length > 2
ORDER BY length(middle.genome) DESC, middle.genome, middle.position;
For performance, be sure to create indexes on the genome
and position
columns of your table.
Your SQL is not working as well as you think it is.
Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.
(削除) You didn't provide any details about your database schema, so I can only guess that your third column is named (Had you explicitly specified which columns you were selecting, instead of just position
and the fourth column is named genome
. (削除ここまで)SELECT *
, your code would be more self-documenting.) A query such as the following should extract the relevant rows:
SELECT *
FROM sequence_info AS middle
JOIN sequence_info AS preceding
ON preceding.sequence_info = middle.sequence_info
AND preceding.sequence_offset = middle.sequence_offset - length(middle.sequence_info)
JOIN sequence_info AS following
ON following.sequence_info = middle.sequence_info
AND following.sequence_offset = middle.sequence_offset + length(middle.sequence_info)
WHERE middle.kmer_length > 2
ORDER BY length(middle.sequence_info) DESC, middle.sequence_info, middle.sequence_offset;
For performance, be sure to create indexes on the genome
and position
columns of your table.
Your SQL is not working as well as you think it is.
Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.
You didn't provide any details about your database schema, so I can only guess that your third column is named position
and the fourth column is named genome
. (Had you explicitly specified which columns you were selecting, instead of just SELECT *
, your code would be more self-documenting.) A query such as the following should extract the relevant rows:
SELECT *
FROM sequence_info AS bmiddle
JOIN sequence_info AS apreceding
ON apreceding.genome = bmiddle.genome
AND apreceding.position = bmiddle.position - length(bmiddle.genome)
JOIN sequence_info AS cfollowing
ON cfollowing.genome = bmiddle.genome
AND cfollowing.position = bmiddle.position + length(bmiddle.genome)
WHERE kmer_length > 2
ORDER BY length(bmiddle.genome) DESC, bmiddle.genome;genome, middle.position;
For performance, be sure to create indexes on the genome
and position
columns of your table.
Your SQL is not working as well as you think it is.
Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.
You didn't provide any details about your database schema, so I can only guess that your third column is named position
and the fourth column is named genome
. (Had you explicitly specified which columns you were selecting, instead of just SELECT *
, your code would be more self-documenting.) A query such as the following should extract the relevant rows:
SELECT *
FROM sequence_info AS b
JOIN sequence_info AS a
ON a.genome = b.genome
AND a.position = b.position - length(b.genome)
JOIN sequence_info AS c
ON c.genome = b.genome
AND c.position = b.position + length(b.genome)
WHERE kmer_length > 2
ORDER BY length(b.genome) DESC, b.genome;
For performance, be sure to create indexes on the genome
and position
columns of your table.
Your SQL is not working as well as you think it is.
Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.
You didn't provide any details about your database schema, so I can only guess that your third column is named position
and the fourth column is named genome
. (Had you explicitly specified which columns you were selecting, instead of just SELECT *
, your code would be more self-documenting.) A query such as the following should extract the relevant rows:
SELECT *
FROM sequence_info AS middle
JOIN sequence_info AS preceding
ON preceding.genome = middle.genome
AND preceding.position = middle.position - length(middle.genome)
JOIN sequence_info AS following
ON following.genome = middle.genome
AND following.position = middle.position + length(middle.genome)
WHERE kmer_length > 2
ORDER BY length(middle.genome) DESC, middle.genome, middle.position;
For performance, be sure to create indexes on the genome
and position
columns of your table.