Skip to main content
Code Review

Return to Answer

Addendum: Suggestion for performance
Source Link
200_success
  • 145.5k
  • 22
  • 190
  • 479

Addendum: Suggestion for performance

The following query should run faster than my original suggestion, since it only performs joins by equality on indexed column values.

SELECT first.*
 FROM sequence_info AS first
 JOIN sequence_info AS second
 ON second.sequence_info = first.sequence_info
 AND second.sequence_offset = first.next_offset
 JOIN sequence_info AS third
 ON third.sequence_info = second.sequence_info
 AND third.sequence_offset = second.next_offset
 WHERE first.kmer_length > 2
 ORDER BY (first.next_offset - first.sequence_offset) DESC
 , first.sequence_info
 , first.sequence_offset;

The implementation of such joins should be extremely well optimized in any relational database, since they are very common operations. You probably wouldn't be able to implement anything faster yourself in any language, much less in Python. You might be able to squeeze out better performance by using an INTEGER type for sequence_offset instead of a REAL.

To be able to run that query, you'll have to augment the sequence_info table with a next_offset column...

ALTER TABLE sequence_info ADD COLUMN next_offset REAL;
UPDATE sequence_info SET next_offset = sequence_offset + length(sequence_info);
CREATE UNIQUE INDEX sequence_index ON sequence_info (sequence_offset, sequence_info);
CREATE UNIQUE INDEX next_index ON sequence_info (next_offset, sequence_info);

If you still aren't satisfied with the performance after that change, you would probably have to consider trying another RDMS (such as PostgreSQL), tuning the database, or throwing more RAM/CPU at the problem — in other words, factors beyond the scope of Code Review.

Addendum: Suggestion for performance

The following query should run faster than my original suggestion, since it only performs joins by equality on indexed column values.

SELECT first.*
 FROM sequence_info AS first
 JOIN sequence_info AS second
 ON second.sequence_info = first.sequence_info
 AND second.sequence_offset = first.next_offset
 JOIN sequence_info AS third
 ON third.sequence_info = second.sequence_info
 AND third.sequence_offset = second.next_offset
 WHERE first.kmer_length > 2
 ORDER BY (first.next_offset - first.sequence_offset) DESC
 , first.sequence_info
 , first.sequence_offset;

The implementation of such joins should be extremely well optimized in any relational database, since they are very common operations. You probably wouldn't be able to implement anything faster yourself in any language, much less in Python. You might be able to squeeze out better performance by using an INTEGER type for sequence_offset instead of a REAL.

To be able to run that query, you'll have to augment the sequence_info table with a next_offset column...

ALTER TABLE sequence_info ADD COLUMN next_offset REAL;
UPDATE sequence_info SET next_offset = sequence_offset + length(sequence_info);
CREATE UNIQUE INDEX sequence_index ON sequence_info (sequence_offset, sequence_info);
CREATE UNIQUE INDEX next_index ON sequence_info (next_offset, sequence_info);

If you still aren't satisfied with the performance after that change, you would probably have to consider trying another RDMS (such as PostgreSQL), tuning the database, or throwing more RAM/CPU at the problem — in other words, factors beyond the scope of Code Review.

Renamed columns based on subsequent comment
Source Link
200_success
  • 145.5k
  • 22
  • 190
  • 479

Your SQL is not working as well as you think it is.

Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.

You didn't provide any details about your database schema, so I can only guess that your third column is named position and the fourth column is named genome.(削除) You didn't provide any details about your database schema, so I can only guess that your third column is named position and the fourth column is named genome. (削除ここまで) (Had you explicitly specified which columns you were selecting, instead of just SELECT *, your code would be more self-documenting.) A query such as the following should extract the relevant rows:

SELECT *
 FROM sequence_info AS middle
 JOIN sequence_info AS preceding
 ON preceding.genomesequence_info = middle.genomesequence_info
 AND preceding.positionsequence_offset = middle.positionsequence_offset - length(middle.genomesequence_info)
 JOIN sequence_info AS following
 ON following.genomesequence_info = middle.genomesequence_info
 AND following.positionsequence_offset = middle.positionsequence_offset + length(middle.genomesequence_info)
 WHERE middle.kmer_length > 2
 ORDER BY length(middle.genomesequence_info) DESC, middle.genomesequence_info, middle.position;sequence_offset;

For performance, be sure to create indexes on the genome and position columns of your table.

Your SQL is not working as well as you think it is.

Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.

You didn't provide any details about your database schema, so I can only guess that your third column is named position and the fourth column is named genome. (Had you explicitly specified which columns you were selecting, instead of just SELECT *, your code would be more self-documenting.) A query such as the following should extract the relevant rows:

SELECT *
 FROM sequence_info AS middle
 JOIN sequence_info AS preceding
 ON preceding.genome = middle.genome
 AND preceding.position = middle.position - length(middle.genome)
 JOIN sequence_info AS following
 ON following.genome = middle.genome
 AND following.position = middle.position + length(middle.genome)
 WHERE kmer_length > 2
 ORDER BY length(middle.genome) DESC, middle.genome, middle.position;

For performance, be sure to create indexes on the genome and position columns of your table.

Your SQL is not working as well as you think it is.

Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.

(削除) You didn't provide any details about your database schema, so I can only guess that your third column is named position and the fourth column is named genome. (削除ここまで) (Had you explicitly specified which columns you were selecting, instead of just SELECT *, your code would be more self-documenting.) A query such as the following should extract the relevant rows:

SELECT *
 FROM sequence_info AS middle
 JOIN sequence_info AS preceding
 ON preceding.sequence_info = middle.sequence_info
 AND preceding.sequence_offset = middle.sequence_offset - length(middle.sequence_info)
 JOIN sequence_info AS following
 ON following.sequence_info = middle.sequence_info
 AND following.sequence_offset = middle.sequence_offset + length(middle.sequence_info)
 WHERE middle.kmer_length > 2
 ORDER BY length(middle.sequence_info) DESC, middle.sequence_info, middle.sequence_offset;

For performance, be sure to create indexes on the genome and position columns of your table.

Renamed table aliases
Source Link
200_success
  • 145.5k
  • 22
  • 190
  • 479

Your SQL is not working as well as you think it is.

Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.

You didn't provide any details about your database schema, so I can only guess that your third column is named position and the fourth column is named genome. (Had you explicitly specified which columns you were selecting, instead of just SELECT *, your code would be more self-documenting.) A query such as the following should extract the relevant rows:

SELECT *
 FROM sequence_info AS bmiddle
 JOIN sequence_info AS apreceding
 ON apreceding.genome = bmiddle.genome
 AND apreceding.position = bmiddle.position - length(bmiddle.genome)
 JOIN sequence_info AS cfollowing
 ON cfollowing.genome = bmiddle.genome
 AND cfollowing.position = bmiddle.position + length(bmiddle.genome)
 WHERE kmer_length > 2
 ORDER BY length(bmiddle.genome) DESC, bmiddle.genome;genome, middle.position;

For performance, be sure to create indexes on the genome and position columns of your table.

Your SQL is not working as well as you think it is.

Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.

You didn't provide any details about your database schema, so I can only guess that your third column is named position and the fourth column is named genome. (Had you explicitly specified which columns you were selecting, instead of just SELECT *, your code would be more self-documenting.) A query such as the following should extract the relevant rows:

SELECT *
 FROM sequence_info AS b
 JOIN sequence_info AS a
 ON a.genome = b.genome
 AND a.position = b.position - length(b.genome)
 JOIN sequence_info AS c
 ON c.genome = b.genome
 AND c.position = b.position + length(b.genome)
 WHERE kmer_length > 2
 ORDER BY length(b.genome) DESC, b.genome;

For performance, be sure to create indexes on the genome and position columns of your table.

Your SQL is not working as well as you think it is.

Any time you do any significant amount of post-processing on an SQL result set, that is an indicator that your query is weakly formulated. The point of the database is that it lets you query it for exactly the data that you are interested in. The way your code treats the database as a passive storage format, you could just as well have stored everything in a CSV file.

You didn't provide any details about your database schema, so I can only guess that your third column is named position and the fourth column is named genome. (Had you explicitly specified which columns you were selecting, instead of just SELECT *, your code would be more self-documenting.) A query such as the following should extract the relevant rows:

SELECT *
 FROM sequence_info AS middle
 JOIN sequence_info AS preceding
 ON preceding.genome = middle.genome
 AND preceding.position = middle.position - length(middle.genome)
 JOIN sequence_info AS following
 ON following.genome = middle.genome
 AND following.position = middle.position + length(middle.genome)
 WHERE kmer_length > 2
 ORDER BY length(middle.genome) DESC, middle.genome, middle.position;

For performance, be sure to create indexes on the genome and position columns of your table.

Source Link
200_success
  • 145.5k
  • 22
  • 190
  • 479
Loading
lang-py

AltStyle によって変換されたページ (->オリジナル) /