I have a table 'data' stored in an SQLite database ships.db. The data are hourly position reports of tracked ships. The table looks like this:
time     | shipId | longitude | latitude
---------|--------|-----------|---------
00:00:00 |      1 |   xx.xxxx |  yy.yyyy
00:00:00 |      2 |   xx.xxxx |  yy.yyyy
00:00:00 |      3 |   xx.xxxx |  yy.yyyy
00:00:00 |      4 |   xx.xxxx |  yy.yyyy
01:00:00 |      2 |   xx.xxxx |  yy.yyyy
01:00:00 |      4 |   xx.xxxx |  yy.yyyy
...      |    ... |       ... |      ...
23:00:00 |      4 |   xx.xxxx |  yy.yyyy
Splitting the whole earth into a grid with cells 5 degrees wide and 5 degrees tall, I want the number of fetched records per grid cell, for each hour. Note that the number of records is not the same every hour, because some ships are no longer live and therefore are not fetched.
I wrote this code in Python. It works, but it takes a long time because the database has roughly 250,000 records. Is there another method or approach to make it better and faster in Python?
My script:
import sqlite3

def writeToFile(string, file):
    with open(file, "a") as ouf:
        ouf.write(string + "\n")

output = "report.txt"

with sqlite3.connect("ships.db") as con:
    cur = con.cursor()
    # iterate over times from 0 to 23 (hours)
    for hour in range(0, 24):
        # format each hour as "hh:00:00"
        time = str(hour).zfill(2) + ":00:00"
        # scan longitudes from -180 (180 W) to +180 (180 E) in 5-degree steps
        for longitude in range(-180, 180, 5):
            # scan latitudes from -90 (90 S) to +90 (90 N) in 5-degree steps
            for latitude in range(-90, 90, 5):
                sql = f'''SELECT time, count(*) AS occurence FROM 'data'
                          WHERE time = "{time}"
                          AND latitude BETWEEN {latitude} AND {latitude + 5}
                          AND longitude BETWEEN {longitude} AND {longitude + 5}
                          GROUP BY time'''
                data = cur.execute(sql).fetchone()  # fetchone because of GROUP BY time
                if data is not None:
                    time, occurence = data
                else:  # a grid cell may have no ship at a given hour, hence this else
                    occurence = None
                result = [time, occurence, longitude, latitude]
                # write the result to the output file (str() because of the ints/None)
                writeToFile("\t".join(map(str, result)), output)
- What is your table structure? Do you have indexes? Your three nested loops produce a total of 62,208 DB calls. I am thinking that maybe a more sophisticated SELECT could retrieve the information you want, or at least half-baked results that would make the job easier. – Kate, Apr 2, 2021 at 22:30
- No indexes, and this is the structure: CREATE TABLE IF NOT EXISTS 'data' (time TEXT, shipId INTEGER, longitude REAL, latitude REAL); – Khaled, Apr 2, 2021 at 22:37
- Well then, without any indexes, your script does a full table scan to retrieve results, and it does that at every iteration. So yes, it's going to be slow. It seems to me that you could add latitude + longitude to your GROUP BY, thus getting rid of two loops and simplifying things. – Kate, Apr 3, 2021 at 0:00
- @Anonymous, I think I got it: I take the floor of latitude and longitude, then divide by 5 and multiply by 5 to get the range start of the cell in which each ship lies, then group by longitude, latitude and time (sketched below, after these comments). It worked perfectly, so much better than before. But I would test it on edge and exceptional cases before taking it as a generalization. – Khaled, Apr 3, 2021 at 10:43
- It's difficult to say how the database query could be improved without seeing the definition for that table. – Reinderien, Sep 14, 2021 at 14:47
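A minimal sketch of the single-query approach those comments converge on, assuming the CREATE TABLE statement Khaled posted. The +180 / +90 shifts are my own way of making CAST's truncation behave like floor() for negative coordinates; note that unlike the original script it only emits non-empty cells, and a ship at exactly 180 E or 90 N would land in an extra edge cell (one of the edge cases Khaled mentions):

import sqlite3

# One query buckets every record into its 5-degree cell and counts
# per (hour, cell), replacing all 62,208 per-cell queries.
SQL = """
SELECT time,
       CAST((longitude + 180) / 5 AS INTEGER) * 5 - 180 AS cell_lon,
       CAST((latitude  +  90) / 5 AS INTEGER) * 5 -  90 AS cell_lat,
       COUNT(*)                                          AS occurence
FROM 'data'
GROUP BY time, cell_lon, cell_lat
ORDER BY time, cell_lon, cell_lat
"""

with sqlite3.connect("ships.db") as con:
    rows = con.execute(SQL).fetchall()

with open("report.txt", "a") as ouf:
    for time, cell_lon, cell_lat, occurence in rows:
        # same column order as the original report: time, count, lon, lat
        ouf.write(f"{time}\t{occurence}\t{cell_lon}\t{cell_lat}\n")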
1 Answer
There are multiple things I would like to point out, just to help you improve your coding style. Currently you do not notice them as performance issues because they are hidden behind your DB's performance.
Fixed values: You recompute the same fixed sequences on every pass through 'range'. Store them in lists once, as rebuilding them is unnecessary compute overhead; see the sketch below.
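A sketch of that point, hoisting the fixed sequences out of the loops (the names here are illustrative, not from the question):

hours      = [str(h).zfill(2) + ":00:00" for h in range(24)]
longitudes = list(range(-180, 180, 5))
latitudes  = list(range(-90, 90, 5))

for time in hours:
    for longitude in longitudes:
        for latitude in latitudes:
            ...  # same per-cell query as in the question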
File writing: You open and close the file once per record. This can cause huge disk-performance issues once the volume grows. Collect the records in a list and write them to the file once per 'hour', as sketched below.
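One way to batch the writes as suggested, buffering a full hour of lines and appending them with a single open/write (this assumes the hoisted lists from the previous sketch):

for time in hours:
    lines = []
    for longitude in longitudes:
        for latitude in latitudes:
            occurence = ...  # per-cell query as in the question
            lines.append(f"{time}\t{occurence}\t{longitude}\t{latitude}")
    # one file operation per hour instead of one per record
    with open(output, "a") as ouf:
        ouf.write("\n".join(lines) + "\n")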
Indexes: Always try to add indexes on the DB; they exist for the purpose of speeding up searches.
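For this table that could look like the sketch below; the index name and the column order (the equality column first, then the two range columns) are my own choices, not something from the thread:

import sqlite3

with sqlite3.connect("ships.db") as con:
    # matches the WHERE clause: time equality, then latitude/longitude ranges
    con.execute(
        "CREATE INDEX IF NOT EXISTS idx_data_time_lat_lon "
        "ON 'data' (time, latitude, longitude)"
    )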
- Carrying over a connection/cursor between tasks without leaking a connection or staying open unnecessarily long is indeed a good idea, but not simple. Would you have an appropriate approach in mind? – Mast, Jan 12, 2022 at 14:02
- @Mast, can you please explain your question? I think the OP was only doing things as a single task. – mangupt, Jan 14, 2022 at 6:57
- If you like my idea, then I would recommend creating the list of all the SQL queries outside the "with sqlite3" block, and later connecting to the DB and executing them. – mangupt, Jan 14, 2022 at 6:58
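A sketch of how that last comment reads to me, assuming the intent is to build all the query strings up front so the connection is open only while they execute:

import sqlite3

# build every per-cell query before touching the database
queries = [
    f'''SELECT time, count(*) AS occurence FROM 'data'
        WHERE time = "{str(h).zfill(2)}:00:00"
        AND latitude BETWEEN {lat} AND {lat + 5}
        AND longitude BETWEEN {lon} AND {lon + 5}
        GROUP BY time'''
    for h in range(24)
    for lon in range(-180, 180, 5)
    for lat in range(-90, 90, 5)
]

with sqlite3.connect("ships.db") as con:
    results = [con.execute(q).fetchone() for q in queries]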