While iterating over a list of table names,
I used arcpy.Exists() to check whether each table/feature class exists.
This call is very time consuming.
I ended up using a database approach instead, SELECT COUNT(*) FROM tablename,
and checking the query result to determine whether the table exists, which works very fast.
I even tried using a read-only user and an administrative user for arcpy.Exists(), but still hit the same performance problem.
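The workaround looks roughly like this (the connection path and table name are illustrative, and I went through arcpy.ArcSDESQLExecute, though any DB API would do):

import arcpy

cursor = arcpy.ArcSDESQLExecute(r"C:\connections\prod.sde")
try:
    # the query errors out when the table is absent, so the error is the signal
    cursor.execute("SELECT COUNT(*) FROM dbo.sometable")
    exists = True
except Exception:
    exists = False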
My questions:
1. Does anyone know why arcpy.Exists() is so slow against SQL Server (it is much faster against SDE)?
2. Any other ideas for workarounds? My list contains 400 names, and I'd rather use arcpy than create multiple database queries.
-
SDE no longer exists, and you're already using all that remains of what used to be ArcSDE. You might try using arcpy.ListFeatureClasses() and then list operators (though you'll need to handle database and schema). Also possible is a dictionary query in SQL, though this is RDBMS-dependent. – Vince, Jun 3, 2019 at 13:47
-
The list approach might be more efficient than multiple queries - thanks :) [you can add an answer]. About SDE: we have an old server that still runs the old ArcSDE (not SQL Server), and arcpy.Exists() works much faster on it... – NettaB, Jun 3, 2019 at 14:36
-
You seem to be using "SDE" as a database reference, but "old ArcSDE" supported Oracle, Informix, DB2, SQL Server, and PostgreSQL, and before that Sybase, so it's not clear which one was faster. – Vince, Jun 3, 2019 at 15:41
-
@Vince ArcSDE SQL-Server/MS-SQL 10/ArcGIS 9 – NettaB, Jun 3, 2019 at 15:48
1 Answer
One of the interesting quirks of many SQL client APIs is the lack of a "table exists" function. Many implementations of an "exists" are really a low-level "describe", which returns a "does not exist" error if the table doesn't exist, and populates metadata if it does. The side effect of this is that an "exists" test is often quite slow, because if it succeeds, then it has to clean up something that cost effort to create.
The function in ArcPy which tests for table existence (arcpy.Exists()) has the overhead of verifying the connection string, then describing the table and cleaning up. By comparison, the arcpy.ListFeatureClasses() function has connection overhead, then requests the list of tables, and then needs to either consult the geodatabase metadata or describe each table in turn to see if it has a geometry column. It potentially needs to describe each table anyway, to see if anything has changed or if unsupported columns are present, so this isn't an intrinsically fast process either.
So then the issue is which is faster, compiling a list of valid tables or describing all the tables up front, then sifting through the result for potential matches (the answer for which could vary by database implementation, number of tables, and number of columns).
There is another mechanism, which is to execute an RDBMS-dependent query against the system catalog via the arcpy.ArcSDESQLExecute() function, and assume that the table is valid if the database reports that it exists.
Now, I don't have a MS SQL Server instance handy (new laptop, no need), but I can perform some timing tests with PostgreSQL. The experiment I propose is as follows:
- Create 65 random-named tables in the form 'tableNN' with six columns, one of which is an identity (ID) column and another geometry
- Measure the performance of using arcpy.Exists on tables 00 through 99 (tracking the timing for both existing and non-existing tables)
- Measure the performance of using arcpy.ListFeatureClasses to retrieve all tables as the owner, then test against this list to determine existence
- Measure the performance of using arcpy.ArcSDESQLExecute to retrieve a table list, then test existence with this list as well
- Repeat the above with tables of just two columns, and tables with forty columns
And the results for six, two, and forty columns are:
Tot (all) = 6.1800
Mean (all) = 0.0618
Mean (exi) = 0.0920
Mean (dne) = 0.0057
List elapsed = 3.4900
List total = 3.4900 (65 found)
SQL elapsed = 0.0200
SQL total = 0.0200 (65 found)
----------
Tot (all) = 5.6300
Mean (all) = 0.0563
Mean (exi) = 0.0832
Mean (dne) = 0.0063
List elapsed = 2.6800
List total = 2.6800 (65 found)
SQL elapsed = 0.0100
SQL total = 0.0100 (65 found)
----------
Tot (all) = 6.5800
Mean (all) = 0.0658
Mean (exi) = 0.0986
Mean (dne) = 0.0049
List elapsed = 3.3600
List total = 3.3600 (65 found)
SQL elapsed = 0.0100
SQL total = 0.0100 (65 found)
So it seems as if, anecdotally, my above assertions are borne out:
- The Exists cascade takes 5.63-6.58 seconds
- Testing for non-existent tables is significantly (order of magnitude) faster than existing ones
- Wider tables take longer to test than skinnier ones
- Using ListFeatureClasses takes 2.68-3.49 seconds (roughly half of Exists)
- Using the information_schema query in PostgreSQL takes 10-20 milliseconds
Also somewhat interesting: converting the list to a dictionary and then searching the dictionary 100 times took no measurable time (Windows box, so sub-millisecond).
Now obviously, you'd need to tweak the actual SQL for SQL Server, but I'll provide the Python here for reference*:
import math
import os
import random
import datetime
print("Importing arcpy...")
import arcpy
# -----------------------------------
def findTables(names, schema, fmt, count):
    # Build a dict keyed by the unqualified table name, then probe it once per
    # candidate; dictionary lookups are effectively free compared to the scans
    found = []
    d = {}
    for name in names:
        parts = name.split('.')
        d[parts[-1]] = name   # last component of db.schema.table (robust if fewer parts)
    for i in range(count):
        table = fmt.format(i)
        if (table in d):
            found.append(table)
    return found
# -----------------------------------
verbose = False
keep = 65              # number of tables to actually create
fudge = keep - 12      # percent chance of creating each candidate index
total = 100            # candidate table indices 00..99
schema = ""
geomcol = "geom"
tableFmt = "xxtmp_table{:02d}"
random.seed(42)
dropStatement = "DROP TABLE IF EXISTS {:s}{:s}"
createStatement6 = """CREATE TABLE {:s}{:s} (
objectid serial NOT NULL,
col1 smallint NULL,
col2 integer NULL,
col3 real NULL,
col4 varchar(20) NULL,
{:<12s} geometry NULL,
CONSTRAINT {:s}_pkey PRIMARY KEY (objectid),
CONSTRAINT enforce_srid_{:s} CHECK (st_srid({:s}) = 4326)
)
WITH (
OIDS=FALSE
)"""
createStatement2 = """CREATE TABLE {:s}{:s} (
objectid serial NOT NULL,
{:<12s} geometry NULL,
CONSTRAINT {:s}_pkey PRIMARY KEY (objectid),
CONSTRAINT enforce_srid_{:s} CHECK (st_srid({:s}) = 4326)
)
WITH (
OIDS=FALSE
)"""
createStatement40 = """CREATE TABLE {:s}{:s} (
objectid serial NOT NULL,
col1 smallint NULL,
col2 integer NULL,
col3 real NULL,
col4 varchar(20) NULL,
col5 varchar(20) NULL,
col6 smallint NULL,
col7 integer NULL,
col8 real NULL,
col9 varchar(20) NULL,
col10 varchar(20) NULL,
col11 smallint NULL,
col12 integer NULL,
col13 real NULL,
col14 varchar(20) NULL,
col15 varchar(20) NULL,
col16 smallint NULL,
col17 integer NULL,
col18 real NULL,
col19 varchar(20) NULL,
col20 varchar(20) NULL,
col21 smallint NULL,
col22 integer NULL,
col23 real NULL,
col24 varchar(20) NULL,
col25 varchar(20) NULL,
col26 smallint NULL,
col27 integer NULL,
col28 real NULL,
col29 varchar(20) NULL,
col30 varchar(20) NULL,
col31 smallint NULL,
col32 integer NULL,
col33 real NULL,
col34 varchar(20) NULL,
col35 varchar(20) NULL,
col36 smallint NULL,
col37 integer NULL,
col38 real NULL,
{:<12s} geometry NULL,
CONSTRAINT {:s}_pkey PRIMARY KEY (objectid),
CONSTRAINT enforce_srid_{:s} CHECK (st_srid({:s}) = 4326)
)
WITH (
OIDS=FALSE
)"""
listStatement="""SELECT (table_catalog ||'.' || table_schema || '.' || table_name)::varchar(100) as name
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
AND table_schema NOT IN ('pg_catalog', 'information_schema')"""
conn = os.path.join(os.getcwd(),"bench.sde")
cursor = arcpy.ArcSDESQLExecute(conn)
for createStatement in [createStatement6, createStatement2, createStatement40]:
    count = 0
    print('----------')
    # .. (re)Create tables
    start = datetime.datetime.now()
    created = []
    for i in range(total):
        # randomly create roughly 'keep' of the candidate names; the second
        # clause forces creation of later indices if the random draws run short
        if (math.floor(random.random() * 100) < fudge or (total - i) < (total - keep)):
            table = tableFmt.format(i)
            sql = dropStatement.format(schema, table)
            if (verbose): print("{:s};".format(sql))
            cursor.execute(sql)
            sql = createStatement.format(schema, table, geomcol, table, geomcol, geomcol)
            if (verbose): print("{:s};\n".format(sql))
            cursor.execute(sql)
            created.append(table)
            count += 1
            if (count >= keep): break
    elapsed = (datetime.datetime.now() - start).total_seconds()
    #print("{:>16s} = {:.4f} secs".format("Prep",elapsed))
    # .. Measure Exists
    allSum = 0.0
    exiSum = 0.0
    dneSum = 0.0
    for i in range(total):
        table = tableFmt.format(i)
        start = datetime.datetime.now()
        doesExist = arcpy.Exists(os.path.join(conn, "{:s}{:s}".format(schema, table)))
        elapsed = (datetime.datetime.now() - start).total_seconds()
        allSum += elapsed
        if (doesExist):
            exiSum += elapsed
        else:
            dneSum += elapsed
        if (verbose): print("{:s} ({:s})".format(table, "Y" if doesExist else "N"))
    print("{:>16s} = {:.4f}".format("Tot (all)", allSum))
    print("{:>16s} = {:.4f}".format("Mean (all)", allSum / float(total)))
    print("{:>16s} = {:.4f}".format("Mean (exi)", exiSum / float(keep)))
    print("{:>16s} = {:.4f}".format("Mean (dne)", dneSum / float(total - keep)))
    # .. Measure ListFC
    arcpy.env.workspace = conn
    start = datetime.datetime.now()
    allTables = arcpy.ListFeatureClasses()
    elapsed = (datetime.datetime.now() - start).total_seconds()
    print("{:>16s} = {:.4f}".format("List elapsed", elapsed))
    start = datetime.datetime.now()
    found = findTables(allTables, schema, tableFmt, total)
    elapsed += (datetime.datetime.now() - start).total_seconds()
    print("{:>16s} = {:.4f} ({:d} found)".format("List total", elapsed, len(found)))
    # .. Measure SQL
    start = datetime.datetime.now()
    sql = listStatement
    allTables = [row[0] for row in cursor.execute(sql)]
    elapsed = (datetime.datetime.now() - start).total_seconds()
    print("{:>16s} = {:.4f}".format("SQL elapsed", elapsed))
    start = datetime.datetime.now()
    found = findTables(allTables, schema, tableFmt, total)
    elapsed += (datetime.datetime.now() - start).total_seconds()
    print("{:>16s} = {:.4f} ({:d} found)".format("SQL total", elapsed, len(found)))
    # .. Cleanup
    start = datetime.datetime.now()
    for table in created:
        sql = dropStatement.format(schema, table)
        if (verbose): print("{:s};".format(sql))
        cursor.execute(sql)
    elapsed = (datetime.datetime.now() - start).total_seconds()
    #print("{:>16s} = {:.4f} secs".format("Cleanup",elapsed))
cursor = None
#EOF
*Note: a hack job; it took way longer to make than I wanted, so it's provided as-is.
Follow-up: I used this technique in an application against a PostgreSQL database and ran into an issue with failure to detect views and materialized views. Adding view support was as simple as changing the SQL that compiles the list to UNION ALL a query against information_schema.views:
SELECT (table_schema || '.' || table_name)::varchar(255) as name
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
AND table_schema NOT IN ('pg_catalog', 'information_schema')
UNION ALL
SELECT (table_schema || '.' || table_name)::varchar(255) as name
FROM information_schema.views
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
But for PostgreSQL, at least, materialized views are not published through information_schema (because MATERIALIZED VIEW is an extension), so for them I needed to append a PG-specific catalog query:
UNION ALL
SELECT (schemaname || '.' || matviewname)::varchar(255) as name
FROM pg_matviews
-
Wow! I know comments are not meant to be a 'thank you' place, but I still wanted to point out that your answer is really thorough and interesting :) – NettaB, Jun 4, 2019 at 9:37
-
I just added a test based on "SELECT count(*) FROM {:s}{:s} WHERE 1 = 0".format(schema,table), just for grins, and it came in at 120-260 milliseconds, which is way better than ListFeatureClasses but still an order of magnitude off the information_schema query. If your tables have rows, adding the "WHERE 1 = 0" is the canonical way to avoid the overhead of processing them. – Vince, Jun 5, 2019 at 1:58