\$\begingroup\$

I have a table named data stored in a database ships.db; the rows are hourly position records of tracked ships. The table looks like this:

time     | shipId | longitude | latitude
---------|--------|-----------|---------
00:00:00 | 1      | xx.xxxx   | yy.yyyy
00:00:00 | 2      | xx.xxxx   | yy.yyyy
00:00:00 | 3      | xx.xxxx   | yy.yyyy
00:00:00 | 4      | xx.xxxx   | yy.yyyy
01:00:00 | 2      | xx.xxxx   | yy.yyyy
01:00:00 | 4      | xx.xxxx   | yy.yyyy
...      | ...    | ...       | ...
23:00:00 | 4      | xx.xxxx   | yy.yyyy

Splitting the whole earth into a grid of cells 5 degrees wide and 5 degrees tall, I want to get the number of fetched records per hour for each cell of that grid.

Note that the number of records is not the same each hour, because some ships are no longer live and are therefore not fetched.

I wrote this code in Python. It works, but it takes a long time because the database has roughly 250,000 records. Is there another method or approach that would make it better and faster in Python?

My script:

import sqlite3

def writeToFile(string, file):
    with open(file, "a") as ouf:
        ouf.write(string + "\n")

output = "report.txt"

with sqlite3.connect("ships.db") as con:
    cur = con.cursor()
    # iterate over the hours from 0 to 23
    for hour in range(24):
        # format each hour as "hh:00:00"
        time = str(hour).zfill(2) + ":00:00"
        # scan from longitude -180 (180 W) to +180 (180 E) in 5-degree steps
        for longitude in range(-180, 180, 5):
            # scan from latitude -90 (90 S) to +90 (90 N) in 5-degree steps
            for latitude in range(-90, 90, 5):
                sql = f'''SELECT time, count(*) AS occurrence FROM 'data'
                          WHERE time = "{time}"
                          AND latitude BETWEEN {latitude} AND {latitude + 5}
                          AND longitude BETWEEN {longitude} AND {longitude + 5}
                          GROUP BY time'''
                data = cur.execute(sql).fetchone()  # fetchone is enough because we GROUP BY time
                if data is not None:
                    time, occurrence = data
                else:  # a grid cell may have no ship at a given hour
                    occurrence = None
                result = [time, occurrence, longitude, latitude]

                # write the result to the output file
                writeToFile("\t".join(str(x) for x in result), output)
asked Apr 2, 2021 at 19:45
\$\endgroup\$
  • \$\begingroup\$ What is your table structure? Do you have indexes? Your three nested loops produce a total of 62,208 DB calls. I am thinking that maybe a more sophisticated SELECT could retrieve the information you want, or at least half-baked results that would make the job easier. \$\endgroup\$ Commented Apr 2, 2021 at 22:30
  • \$\begingroup\$ No indexes, and this is the structure: CREATE TABLE IF NOT EXISTS 'data' (time TEXT, shipId INTEGER, longitude REAL, latitude REAL); \$\endgroup\$ Commented Apr 2, 2021 at 22:37
  • \$\begingroup\$ Well then, without any indexes, your script does a full table scan to retrieve results, and it does that at every iteration. So yes, it is going to be slow. It seems to me that you could add latitude and longitude to your GROUP BY, thus getting rid of two loops and simplifying things. \$\endgroup\$ Commented Apr 3, 2021 at 0:00
  • \$\begingroup\$ @Anonymous, I think I got it: I take the floor of the latitude and longitude divided by 5, then multiply by 5 to get the start of the cell range in which each ship lies, and group by longitude, latitude, and time. It works perfectly and is much faster than before, but I will test edge and exceptional cases before generalizing. \$\endgroup\$ Commented Apr 3, 2021 at 10:43
  • \$\begingroup\$ It's difficult to say how the database query could be improved without seeing the definition for that table. \$\endgroup\$ Commented Sep 14, 2021 at 14:47
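As suggested in the comments, the grid lookup can be inverted: instead of issuing one query per cell, snap each ship's coordinates to the start of its 5-degree cell and let a single GROUP BY produce all counts at once. A minimal sketch, assuming the data schema from the question; the cell5 helper name and the in-memory demonstration data are illustrative:

```python
import math
import sqlite3

def hourly_grid_counts(con):
    # Register a helper that snaps a coordinate to the start of its
    # 5-degree cell, e.g. 12.3 -> 10 and -12.3 -> -15 (floor, not truncation,
    # so negative coordinates land in the correct cell).
    con.create_function("cell5", 1, lambda x: math.floor(x / 5) * 5)
    sql = """SELECT time, cell5(longitude) AS lon_cell,
                    cell5(latitude) AS lat_cell, COUNT(*) AS occurrence
             FROM 'data'
             GROUP BY time, lon_cell, lat_cell
             ORDER BY time, lon_cell, lat_cell"""
    return con.execute(sql).fetchall()

# Quick demonstration on an in-memory database with made-up positions:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE 'data' "
            "(time TEXT, shipId INTEGER, longitude REAL, latitude REAL)")
con.executemany("INSERT INTO 'data' VALUES (?, ?, ?, ?)", [
    ("00:00:00", 1, 12.3, 51.7),
    ("00:00:00", 2, 13.9, 52.2),   # same 5-degree cell as ship 1
    ("00:00:00", 3, -12.3, -3.4),  # negative coordinates floor to -15 / -5
    ("01:00:00", 2, 13.9, 52.2),
])
for row in hourly_grid_counts(con):
    print(row)
```

One query replaces the roughly 62,000 per-cell queries, and empty cells simply produce no row instead of needing a None placeholder.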

1 Answer

\$\begingroup\$

There are multiple things I would like to point out to help you improve your coding style. At the moment you do not notice them as performance issues because they are hidden behind the cost of your DB queries.

  1. Fixed values: you are rebuilding the same range sequences on every pass through the loops, which is unnecessary compute overhead. Build them once and store them in lists.

  2. File writing: you are opening and closing the file once per record. This can cause serious disk-performance issues as the output grows. Collect the records in a list and write them to the file once per hour, for example.

  3. Indexes: always consider adding indexes to the DB. They exist precisely to speed up searches like yours.

answered Apr 17, 2021 at 6:50
\$\endgroup\$
  • \$\begingroup\$ Carrying over a connection/cursor between tasks without leaking a connection or keeping it open unnecessarily long is indeed a good idea, but not simple. Do you have an appropriate approach in mind? \$\endgroup\$ Commented Jan 12, 2022 at 14:02
  • \$\begingroup\$ @Mast, can you please explain your question? The OP was, I think, only doing things as a single task. \$\endgroup\$ Commented Jan 14, 2022 at 6:57
  • \$\begingroup\$ If you like my idea, I would recommend creating a list of all the SQL queries outside the with sqlite3 block, then connecting to the DB and executing them. \$\endgroup\$ Commented Jan 14, 2022 at 6:58
