
I'm trying to speed up my PostGIS queries using multiprocessing. My current setup uses Python and psycopg2, as shown below. This has given a speed increase, but there still seem to be bottlenecks preventing a more substantial improvement, and I'm not sure where to look next.

I've increased a lot of the Postgres parameters as suggested in 'Performance Tuning Postgres', but when I run this on AWS I never come anywhere near maxing out RAM or I/O, which are supposedly the limiting factors for database activity. Can anyone suggest other ways to speed this up?

    import psycopg2, multiprocessing, time

    start = time.time()

    # Connection used by the parent process for the set-up queries
    conn = psycopg2.connect("dbname=template_postgis_20 user=postgres")
    cur = conn.cursor()

    cur.execute("""SELECT count(*) FROM sites""")
    count = cur.fetchall()

    def getOidRanges(rownums, count):
        # One single-gid "range" per feature; each range becomes one job below
        oidranges = []
        for row in rownums:
            minoid = int(row[0])
            oidranges.append([minoid])
        return oidranges

    def mp(rownums, whereclause):
        # Each worker opens its own connection - a psycopg2 connection cannot
        # be shared safely between processes
        conn = psycopg2.connect("dbname=template_postgis_20 user=postgres")
        cur = conn.cursor()
        for row in rownums:
            if row[0] == whereclause:
                gid1 = int(row[0])
                # Parameterised query rather than building SQL with .format()
                cur.execute("""UPDATE sites SET postcode =
                    (SELECT field62 FROM
                        (SELECT field62, COUNT(field62) FROM addressbaseplusbh1_2
                         WHERE ST_Within(addressbaseplusbh1_2.geom, (SELECT geom FROM sites WHERE gid = %s))
                         GROUP BY field62 ORDER BY count DESC) AS postcode LIMIT 1)
                    WHERE gid = %s;""", (gid1, gid1))
                conn.commit()
        cur.close()
        conn.close()
        return

    if __name__ == "__main__":

        nbrquery = """SELECT gid FROM sites ORDER BY gid;"""
        cur.execute(nbrquery)
        rownums = cur.fetchall()

        cores = multiprocessing.cpu_count() - 1
        procfeaturelimit = 1

        oidranges = getOidRanges(rownums, count)

        if len(oidranges) > 0:
            pool = multiprocessing.Pool(cores)

            jobs = []
            for oidrange in oidranges:
                whereclause = oidrange[0]
                jobs.append(pool.apply_async(mp, (rownums, whereclause)))

            pool.close()
            pool.join()

            # Surface any exceptions raised in the workers
            for job in jobs:
                job.get()

        cur.close()
        conn.close()

        end = time.time()
        print end - start

EDIT:

@Craig, would it work then just to have this as the executed block?

    cur.execute("""UPDATE sites SET postcode =
        (SELECT field62 FROM
            (SELECT field62, COUNT(field62) FROM addressbaseplusbh1_2
             WHERE ST_Within(addressbaseplusbh1_2.geom, (SELECT geom FROM sites WHERE gid = %s))
             GROUP BY field62 ORDER BY count DESC) AS postcode LIMIT 1)
        WHERE gid = %s;""", (whereclause, whereclause))
    return
asked Feb 20, 2014 at 16:05
  • Isn't Postgres single-threaded? So where you may be making requests from multiple threads in Python, the DB is only processing them on one thread, which would just move your bottleneck to the server's CPU. Commented Feb 20, 2014 at 16:19
  • Ah, that might explain it. So is there no way round it with Postgres? Any other ways to increase speed? Commented Feb 20, 2014 at 16:21
  • You'll have to tune your optimizations for Postgres. postgresql.1045698.n5.nabble.com/… seems to indicate that Postgres will use one process per connection, which may mean you'd need to open a different connection for each thread you want to use. PostGIS calls can also be CPU-heavy, as PostGIS compresses large geometries and burns CPU cycles to unpack them. Commented Feb 20, 2014 at 16:36
  • Ugh, can't you do each UPDATE as a set operation rather than a for loop? Looping over queries is grossly inefficient; try to phrase them as an UPDATE ... FROM over a VALUES clause or temp table (a sketch of this follows the comments). Also, you don't max out RAM or I/O - are you maxing out CPU? With queries like that I expect you probably are. Commented Feb 20, 2014 at 17:36
  • @DPierce Pg can use at most one CPU core per connection, that is correct. "Single-threaded" is misleading since it's a multiprocess server; the key point is that it doesn't support concurrency within a query, only between different queries. Commented Feb 20, 2014 at 17:37
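
To illustrate the set-based UPDATE suggested in the comments: a rough, untested sketch using the table and column names from the question. A single UPDATE ... FROM over a derived table replaces the entire per-gid loop, so Postgres makes one pass over the data instead of running one query per site. Note that, unlike the original loop, it only touches sites that contain at least one address.

    UPDATE sites s
    SET postcode = m.field62
    FROM (
        -- most common field62 among the addresses inside each site
        SELECT DISTINCT ON (s2.gid) s2.gid, a.field62
        FROM sites s2
        JOIN addressbaseplusbh1_2 a ON ST_Within(a.geom, s2.geom)
        GROUP BY s2.gid, a.field62
        ORDER BY s2.gid, COUNT(*) DESC
    ) m
    WHERE s.gid = m.gid;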

1 Answer


You can use dblink in a native Postgres query to split the query up into separate database connections and execute them simultaneously. This is effectively parallelism in Postgres on a single server. It could be mimicked in Python, but I haven't tried it.

There are some limitations:

1. The operation needs to be an INSERT, not an UPDATE. Inserts are generally faster anyway, as you're not altering an existing table (this depends on your disk, as far as I understand).
2. You'll need an integer ID field to split the query into chunks. Adding a serial field is best, as it gives a sequential integer that breaks the work up as evenly as possible.
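
For illustration, a minimal hand-rolled sketch of the idea (untested): it assumes a results table sites_postcode(gid integer, postcode text) has been created beforehand, uses a simplified stand-in for the real most-common-postcode query, and hard-codes two gid chunks.

    CREATE EXTENSION IF NOT EXISTS dblink;

    -- one extra connection per chunk of work
    SELECT dblink_connect('w1', 'dbname=template_postgis_20 user=postgres');
    SELECT dblink_connect('w2', 'dbname=template_postgis_20 user=postgres');

    -- send each chunk asynchronously; both INSERTs run at the same time
    -- (min() stands in for the real postcode logic; sites_postcode is assumed to exist)
    SELECT dblink_send_query('w1',
        'INSERT INTO sites_postcode (gid, postcode)
         SELECT s.gid, min(a.field62)
         FROM sites s JOIN addressbaseplusbh1_2 a ON ST_Within(a.geom, s.geom)
         WHERE s.gid BETWEEN 1 AND 5000 GROUP BY s.gid');
    SELECT dblink_send_query('w2',
        'INSERT INTO sites_postcode (gid, postcode)
         SELECT s.gid, min(a.field62)
         FROM sites s JOIN addressbaseplusbh1_2 a ON ST_Within(a.geom, s.geom)
         WHERE s.gid BETWEEN 5001 AND 10000 GROUP BY s.gid');

    -- wait for both workers to finish, then tidy up
    SELECT * FROM dblink_get_result('w1') AS t(status text);
    SELECT * FROM dblink_get_result('w2') AS t(status text);
    SELECT dblink_disconnect('w1');
    SELECT dblink_disconnect('w2');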

See Mike Gleason's parallel processing function for the details.

Key performance tip: use the boundary table as the table to split, not the points.

Using this method, we can boundary-tag ~10 million points against ~15,000 polygons in about a minute on a 16-core Windows Server 2012 machine with 128 GB of RAM and SSD storage. It might run faster on Linux, but I haven't tested it.

answered Apr 12, 2014 at 11:32
