
I'm trying to speed up my PostGIS queries using multiprocessing. My current setup uses Python and psycopg2, as shown below. This has given a speed increase, but there still seem to be bottlenecks preventing a more substantial improvement, and I'm not sure where to look next.

I've increased a lot of the Postgres parameters as suggested in 'Performance Tuning Postgres', but when I run this on AWS I never come anywhere near maxing out RAM or I/O, which are supposedly the limiting factors for database activity. Can anyone suggest other ways to speed this up?

    import psycopg2, multiprocessing, time

    start = time.time()

    # Connection used by the parent process for the set-up queries
    conn = psycopg2.connect("dbname=template_postgis_20 user=postgres")
    cur = conn.cursor()

    cur.execute("""SELECT count(*) FROM sites""")
    count = cur.fetchall()

    def getOidRanges(rownums, count):
        # One single-gid "range" per feature; each range becomes one job below
        oidranges = []
        for row in rownums:
            minoid = int(row[0])
            oidranges.append([minoid])
        return oidranges

    def mp(rownums, whereclause):
        # Each worker opens its own connection - a psycopg2 connection cannot
        # be shared safely between processes
        conn = psycopg2.connect("dbname=template_postgis_20 user=postgres")
        cur = conn.cursor()
        for row in rownums:
            if row[0] == whereclause:
                gid1 = int(row[0])
                # Parameterised query rather than building SQL with .format()
                cur.execute("""UPDATE sites SET postcode =
                    (SELECT field62 FROM
                        (SELECT field62, COUNT(field62) FROM addressbaseplusbh1_2
                         WHERE ST_Within(addressbaseplusbh1_2.geom, (SELECT geom FROM sites WHERE gid = %s))
                         GROUP BY field62 ORDER BY count DESC) AS postcode LIMIT 1)
                    WHERE gid = %s;""", (gid1, gid1))
                conn.commit()
        cur.close()
        conn.close()
        return

    if __name__ == "__main__":

        nbrquery = """SELECT gid FROM sites ORDER BY gid;"""
        cur.execute(nbrquery)
        rownums = cur.fetchall()

        cores = multiprocessing.cpu_count() - 1
        procfeaturelimit = 1

        oidranges = getOidRanges(rownums, count)

        if len(oidranges) > 0:
            pool = multiprocessing.Pool(cores)

            jobs = []
            for oidrange in oidranges:
                whereclause = oidrange[0]
                jobs.append(pool.apply_async(mp, (rownums, whereclause)))

            pool.close()
            pool.join()

            # Surface any exceptions raised in the workers
            for job in jobs:
                job.get()

        cur.close()
        conn.close()

        end = time.time()
        print end - start

EDIT:

@Craig, would it work then just to have this as the executed block?

    cur.execute("""UPDATE sites SET postcode =
        (SELECT field62 FROM
            (SELECT field62, COUNT(field62) FROM addressbaseplusbh1_2
             WHERE ST_Within(addressbaseplusbh1_2.geom, (SELECT geom FROM sites WHERE gid = %s))
             GROUP BY field62 ORDER BY count DESC) AS postcode LIMIT 1)
        WHERE gid = %s;""", (whereclause, whereclause))
    return
asked Feb 20, 2014 at 16:05
  • Isn't Postgres single-threaded? So where you may be making requests from multiple threads in Python, the DB is only processing them on one thread, which would just move your bottleneck to the server's CPU. Commented Feb 20, 2014 at 16:19
  • Ah, that might explain it. So is there no way round it with Postgres? Any other ways to increase speed? Commented Feb 20, 2014 at 16:21
  • You'll have to tune your optimizations for Postgres. postgresql.1045698.n5.nabble.com/… seems to indicate that Postgres will use one process per connection, which may mean you'd need to open a different connection for each thread you want to use. PostGIS calls can also be CPU-heavy, as PostGIS compresses large geometries and burns CPU cycles to unpack them. Commented Feb 20, 2014 at 16:36
  • Ugh, can't you do each UPDATE as a set operation rather than a for loop? Looping over queries is grossly inefficient; try to phrase them as an UPDATE ... FROM over a VALUES clause or temp table (a sketch of this follows the comments). Also, you don't max out RAM or I/O - are you maxing out CPU? With queries like that I expect you probably are. Commented Feb 20, 2014 at 17:36
  • @DPierce Pg can use at most one CPU core per connection, that is correct. "Single-threaded" is misleading since it's a multiprocess server; the key point is that it doesn't support concurrency within a query, only between different queries. Commented Feb 20, 2014 at 17:37
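
To illustrate the set-based UPDATE suggested in the comments: a rough, untested sketch using the table and column names from the question. A single UPDATE ... FROM over a derived table replaces the entire per-gid loop, so Postgres makes one pass over the data instead of running one query per site. Note that, unlike the original loop, it only touches sites that contain at least one address.

    UPDATE sites s
    SET postcode = m.field62
    FROM (
        -- most common field62 among the addresses inside each site
        SELECT DISTINCT ON (s2.gid) s2.gid, a.field62
        FROM sites s2
        JOIN addressbaseplusbh1_2 a ON ST_Within(a.geom, s2.geom)
        GROUP BY s2.gid, a.field62
        ORDER BY s2.gid, COUNT(*) DESC
    ) m
    WHERE s.gid = m.gid;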

1 Answer


You can use dblink in a native Postgres query to split the query up into separate database connections and execute them simultaneously. This is effectively parallelism in Postgres on a single server. It could be mimicked in Python, but I haven't tried it.

There are some limitations:

1. The operation needs to be an INSERT, not an UPDATE. Inserts are generally faster anyway, as you're not altering an existing table (this depends on your disk, as far as I understand).
2. You'll need an integer ID field to split the query into chunks. Adding a serial field is best, as it gives a sequential integer that breaks the work up as evenly as possible.
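
For illustration, a minimal hand-rolled sketch of the idea (untested): it assumes a results table sites_postcode(gid integer, postcode text) has been created beforehand, uses a simplified stand-in for the real most-common-postcode query, and hard-codes two gid chunks.

    CREATE EXTENSION IF NOT EXISTS dblink;

    -- one extra connection per chunk of work
    SELECT dblink_connect('w1', 'dbname=template_postgis_20 user=postgres');
    SELECT dblink_connect('w2', 'dbname=template_postgis_20 user=postgres');

    -- send each chunk asynchronously; both INSERTs run at the same time
    -- (min() stands in for the real postcode logic; sites_postcode is assumed to exist)
    SELECT dblink_send_query('w1',
        'INSERT INTO sites_postcode (gid, postcode)
         SELECT s.gid, min(a.field62)
         FROM sites s JOIN addressbaseplusbh1_2 a ON ST_Within(a.geom, s.geom)
         WHERE s.gid BETWEEN 1 AND 5000 GROUP BY s.gid');
    SELECT dblink_send_query('w2',
        'INSERT INTO sites_postcode (gid, postcode)
         SELECT s.gid, min(a.field62)
         FROM sites s JOIN addressbaseplusbh1_2 a ON ST_Within(a.geom, s.geom)
         WHERE s.gid BETWEEN 5001 AND 10000 GROUP BY s.gid');

    -- wait for both workers to finish, then tidy up
    SELECT * FROM dblink_get_result('w1') AS t(status text);
    SELECT * FROM dblink_get_result('w2') AS t(status text);
    SELECT dblink_disconnect('w1');
    SELECT dblink_disconnect('w2');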

See Mike Gleason's parallel processing function for the details.

Key performance tip: use the boundary table as the table to split, not the points.

Using this method, we can boundary-tag ~10 million points against ~15,000 polygons in about a minute on a 16-core Windows Server 2012 machine with 128 GB of RAM and SSD storage. It might run faster on Linux, but I haven't tested it.

answered Apr 12, 2014 at 11:32
