I have a table 'data' stored in an SQLite database ships.db. The data are hourly position reports of tracked ships. The table looks like this:
time     | shipId | longitude | latitude
---------|--------|-----------|---------
00:00:00 |      1 |   xx.xxxx |  yy.yyyy
00:00:00 |      2 |   xx.xxxx |  yy.yyyy
00:00:00 |      3 |   xx.xxxx |  yy.yyyy
00:00:00 |      4 |   xx.xxxx |  yy.yyyy
01:00:00 |      2 |   xx.xxxx |  yy.yyyy
01:00:00 |      4 |   xx.xxxx |  yy.yyyy
...      |    ... |       ... |      ...
23:00:00 |      4 |   xx.xxxx |  yy.yyyy
Splitting the whole earth into a grid with cells 5 degrees wide and 5 degrees tall, I want the number of fetched records per grid cell, for each hour. Note that the number of records is not the same every hour, because some ships are no longer live and therefore are not fetched.
I wrote this code in Python. It works, but it takes a long time because the database has roughly 250,000 records. Is there another method or approach to make it better and faster in Python?
My script:
import sqlite3

def writeToFile(string, file):
    with open(file, "a") as ouf:
        ouf.write(string + "\n")

output = "report.txt"

with sqlite3.connect("ships.db") as con:
    cur = con.cursor()
    # iterate over times from 0 to 23 (hours)
    for hour in range(0, 24):
        # format each hour as "hh:00:00"
        time = str(hour).zfill(2) + ":00:00"
        # scan longitudes from -180 (180 W) to +180 (180 E) in 5-degree steps
        for longitude in range(-180, 180, 5):
            # scan latitudes from -90 (90 S) to +90 (90 N) in 5-degree steps
            for latitude in range(-90, 90, 5):
                sql = f'''SELECT time, count(*) AS occurence FROM 'data'
                          WHERE time = "{time}"
                          AND latitude BETWEEN {latitude} AND {latitude + 5}
                          AND longitude BETWEEN {longitude} AND {longitude + 5}
                          GROUP BY time'''
                data = cur.execute(sql).fetchone()  # fetchone because of GROUP BY time
                if data is not None:
                    time, occurence = data
                else:  # a grid cell may have no ship at a given hour, hence this else
                    occurence = None
                result = [time, occurence, longitude, latitude]
                # write the result to the output file (str() because of the ints/None)
                writeToFile("\t".join(map(str, result)), output)
- What is your table structure? Do you have indexes? Your three nested loops produce a total of 62,208 DB calls. I am thinking that maybe a more sophisticated SELECT could retrieve the information you want, or at least half-baked results that would make the job easier. – Kate, Apr 2, 2021 at 22:30
- No indexes, and this is the structure: CREATE TABLE IF NOT EXISTS 'data' (time TEXT, shipId INTEGER, longitude REAL, latitude REAL); – Khaled, Apr 2, 2021 at 22:37
- Well then, without any indexes, your script does a full table scan to retrieve results, and it does that at every iteration. So yes, it's going to be slow. It seems to me that you could add latitude + longitude to your GROUP BY, thus getting rid of two loops and simplifying things. – Kate, Apr 3, 2021 at 0:00
- @Anonymous, I think I got it: I take the floor of latitude and longitude, then divide by 5 and multiply by 5 to get the range start of the cell in which each ship lies, then group by longitude, latitude and time (sketched below, after these comments). It worked perfectly, so much better than before. But I would test it on edge and exceptional cases before taking it as a generalization. – Khaled, Apr 3, 2021 at 10:43
- It's difficult to say how the database query could be improved without seeing the definition for that table. – Reinderien, Sep 14, 2021 at 14:47
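A minimal sketch of the single-query approach those comments converge on, assuming the CREATE TABLE statement Khaled posted. The +180 / +90 shifts are my own way of making CAST's truncation behave like floor() for negative coordinates; note that unlike the original script it only emits non-empty cells, and a ship at exactly 180 E or 90 N would land in an extra edge cell (one of the edge cases Khaled mentions):

import sqlite3

# One query buckets every record into its 5-degree cell and counts
# per (hour, cell), replacing all 62,208 per-cell queries.
SQL = """
SELECT time,
       CAST((longitude + 180) / 5 AS INTEGER) * 5 - 180 AS cell_lon,
       CAST((latitude  +  90) / 5 AS INTEGER) * 5 -  90 AS cell_lat,
       COUNT(*)                                          AS occurence
FROM 'data'
GROUP BY time, cell_lon, cell_lat
ORDER BY time, cell_lon, cell_lat
"""

with sqlite3.connect("ships.db") as con:
    rows = con.execute(SQL).fetchall()

with open("report.txt", "a") as ouf:
    for time, cell_lon, cell_lat, occurence in rows:
        # same column order as the original report: time, count, lon, lat
        ouf.write(f"{time}\t{occurence}\t{cell_lon}\t{cell_lat}\n")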
1 Answer
There are multiple things I would like to point out, just to help you improve your coding style. Currently you do not notice them as performance issues because they are hidden behind your DB's performance.
Fixed values: You recompute the same fixed sequences on every pass through 'range'. Store them in lists once, as rebuilding them is unnecessary compute overhead; see the sketch below.
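A sketch of that point, hoisting the fixed sequences out of the loops (the names here are illustrative, not from the question):

hours      = [str(h).zfill(2) + ":00:00" for h in range(24)]
longitudes = list(range(-180, 180, 5))
latitudes  = list(range(-90, 90, 5))

for time in hours:
    for longitude in longitudes:
        for latitude in latitudes:
            ...  # same per-cell query as in the question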
File writing: You open and close the file once per record. This can cause huge disk-performance issues once the volume grows. Collect the records in a list and write them to the file once per 'hour', as sketched below.
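One way to batch the writes as suggested, buffering a full hour of lines and appending them with a single open/write (this assumes the hoisted lists from the previous sketch):

for time in hours:
    lines = []
    for longitude in longitudes:
        for latitude in latitudes:
            occurence = ...  # per-cell query as in the question
            lines.append(f"{time}\t{occurence}\t{longitude}\t{latitude}")
    # one file operation per hour instead of one per record
    with open(output, "a") as ouf:
        ouf.write("\n".join(lines) + "\n")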
Indexes: Always try to add indexes on the DB; they exist for the purpose of speeding up searches.
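For this table that could look like the sketch below; the index name and the column order (the equality column first, then the two range columns) are my own choices, not something from the thread:

import sqlite3

with sqlite3.connect("ships.db") as con:
    # matches the WHERE clause: time equality, then latitude/longitude ranges
    con.execute(
        "CREATE INDEX IF NOT EXISTS idx_data_time_lat_lon "
        "ON 'data' (time, latitude, longitude)"
    )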
- Carrying over a connection/cursor between tasks without leaking a connection or staying open unnecessarily long is indeed a good idea, but not simple. Would you have an appropriate approach in mind? – Mast, Jan 12, 2022 at 14:02
- @Mast, can you please explain your question? I think the OP was only doing things as a single task. – mangupt, Jan 14, 2022 at 6:57
- If you like my idea, then I would recommend creating the list of all the SQL queries outside the "with sqlite3" block, and later connecting to the DB and executing them. – mangupt, Jan 14, 2022 at 6:58
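A sketch of how that last comment reads to me, assuming the intent is to build all the query strings up front so the connection is open only while they execute:

import sqlite3

# build every per-cell query before touching the database
queries = [
    f'''SELECT time, count(*) AS occurence FROM 'data'
        WHERE time = "{str(h).zfill(2)}:00:00"
        AND latitude BETWEEN {lat} AND {lat + 5}
        AND longitude BETWEEN {lon} AND {lon + 5}
        GROUP BY time'''
    for h in range(24)
    for lon in range(-180, 180, 5)
    for lat in range(-90, 90, 5)
]

with sqlite3.connect("ships.db") as con:
    results = [con.execute(q).fetchone() for q in queries]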