While iterating over a list of table names,
I used arcpy.Exists() to check whether each table/feature class exists.
This call is very time consuming.
I ended up using a database approach instead, SELECT COUNT(*) FROM tablename,
and checking the query result to determine whether the table exists, which works very fast.
I even tried using a read-only user and an administrative user for arcpy.Exists(), but still hit the same performance problem.
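The workaround looks roughly like this (the connection path and table name are illustrative, and I went through arcpy.ArcSDESQLExecute, though any DB API would do):

import arcpy

cursor = arcpy.ArcSDESQLExecute(r"C:\connections\prod.sde")
try:
    # the query errors out when the table is absent, so the error is the signal
    cursor.execute("SELECT COUNT(*) FROM dbo.sometable")
    exists = True
except Exception:
    exists = False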
My questions:
1. Does anyone know why arcpy.Exists() is so slow against SQL Server (it is much faster against SDE)?
2. Any other ideas for workarounds? My list contains 400 names, and I'd rather use arcpy than create multiple database queries.
-
SDE no longer exists, and you're already using all that remains of what used to be ArcSDE. You might try using arcpy.ListFeatureClasses() and then list operators (though you'll need to handle database and schema). Also possible is a dictionary query in SQL, though this is RDBMS-dependent. – Vince, Jun 3, 2019 at 13:47
-
The list approach might be more efficient than multiple queries - thanks :) [you can add an answer]. About SDE: we have an old server that still runs the old ArcSDE (not SQL Server), and arcpy.Exists() works much faster on it... – NettaB, Jun 3, 2019 at 14:36
-
You seem to be using "SDE" as a database reference, but "old ArcSDE" supported Oracle, Informix, DB2, SQL Server, and PostgreSQL, and before that Sybase, so it's not clear which one was faster. – Vince, Jun 3, 2019 at 15:41
-
@Vince ArcSDE SQL-Server/MS-SQL 10/ArcGIS 9 – NettaB, Jun 3, 2019 at 15:48
1 Answer
One of the interesting quirks of many SQL client APIs is the lack of a "table exists" function. Many implementations of an "exists" are really a low-level "describe", which returns a "does not exist" error if the table doesn't exist, and populates metadata if it does. The side effect of this is that an "exists" test is often quite slow, because if it succeeds, then it has to clean up something that cost effort to create.
The function in ArcPy which tests for table existence (arcpy.Exists()) has the overhead of verifying the connection string, then describing the table and cleaning up. By comparison, the arcpy.ListFeatureClasses() function has connection overhead, then requests the list of tables, and then needs to either consult the geodatabase metadata or describe each table in turn to see if it has a geometry column. It potentially needs to describe each table anyway, to see if anything has changed or if unsupported columns are present, so this isn't an intrinsically fast process either.
So then the issue is which is faster, compiling a list of valid tables or describing all the tables up front, then sifting through the result for potential matches (the answer for which could vary by database implementation, number of tables, and number of columns).
There is another mechanism, which is to execute an RDBMS-dependent query against the system catalog via the arcpy.ArcSDESQLExecute() function, and assume that the table is valid if the database reports that it exists.
Now, I don't have a MS SQL Server instance handy (new laptop, no need), but I can perform some timing tests with PostgreSQL. The experiment I propose is as follows:
- Create 65 random-named tables in the form 'tableNN' with six columns, one of which is an identity (ID) column and another geometry
- Measure the performance of using arcpy.Exists on tables 00 through 99 (tracking the timing for both existing and non-existing tables)
- Measure the performance of using arcpy.ListFeatureClasses to retrieve all tables as the owner, then test against this list to determine existence
- Measure the performance of using arcpy.ArcSDESQLExecute to retrieve a table list, then test existence with this list as well
- Repeat the above with tables of just two columns, and tables with forty columns
And the results for six, two, and forty columns are:
Tot (all) = 6.1800
Mean (all) = 0.0618
Mean (exi) = 0.0920
Mean (dne) = 0.0057
List elapsed = 3.4900
List total = 3.4900 (65 found)
SQL elapsed = 0.0200
SQL total = 0.0200 (65 found)
----------
Tot (all) = 5.6300
Mean (all) = 0.0563
Mean (exi) = 0.0832
Mean (dne) = 0.0063
List elapsed = 2.6800
List total = 2.6800 (65 found)
SQL elapsed = 0.0100
SQL total = 0.0100 (65 found)
----------
Tot (all) = 6.5800
Mean (all) = 0.0658
Mean (exi) = 0.0986
Mean (dne) = 0.0049
List elapsed = 3.3600
List total = 3.3600 (65 found)
SQL elapsed = 0.0100
SQL total = 0.0100 (65 found)
So it seems as if, anecdotally, my above assertions are borne out:
- The Exists cascade takes 5.63-6.58 seconds
- Testing for non-existent tables is significantly (order of magnitude) faster than existing ones
- Wider tables take longer to test than skinnier ones
- Using ListFeatureClasses takes 2.68-3.49 seconds (roughly half of Exists)
- Using the information_schema query in PostgreSQL takes 10-20 milliseconds
Also somewhat interesting: converting the list to a dictionary and then searching the dictionary 100 times took no measurable time (Windows box, so sub-millisecond).
Now obviously, you'd need to tweak the actual SQL for SQL Server, but I'll provide the Python here for reference*:
import math
import os
import random
import datetime
print("Importing arcpy...")
import arcpy
# -----------------------------------
def findTables(names, schema, fmt, count):
    # Build a dict keyed by the unqualified table name, then probe it once per
    # candidate; dictionary lookups are effectively free compared to the scans
    found = []
    d = {}
    for name in names:
        parts = name.split('.')
        d[parts[-1]] = name   # last component of db.schema.table (robust if fewer parts)
    for i in range(count):
        table = fmt.format(i)
        if (table in d):
            found.append(table)
    return found
# -----------------------------------
verbose = False
keep = 65              # number of tables to actually create
fudge = keep - 12      # percent chance of creating each candidate index
total = 100            # candidate table indices 00..99
schema = ""
geomcol = "geom"
tableFmt = "xxtmp_table{:02d}"
random.seed(42)
dropStatement = "DROP TABLE IF EXISTS {:s}{:s}"
createStatement6 = """CREATE TABLE {:s}{:s} (
objectid serial NOT NULL,
col1 smallint NULL,
col2 integer NULL,
col3 real NULL,
col4 varchar(20) NULL,
{:<12s} geometry NULL,
CONSTRAINT {:s}_pkey PRIMARY KEY (objectid),
CONSTRAINT enforce_srid_{:s} CHECK (st_srid({:s}) = 4326)
)
WITH (
OIDS=FALSE
)"""
createStatement2 = """CREATE TABLE {:s}{:s} (
objectid serial NOT NULL,
{:<12s} geometry NULL,
CONSTRAINT {:s}_pkey PRIMARY KEY (objectid),
CONSTRAINT enforce_srid_{:s} CHECK (st_srid({:s}) = 4326)
)
WITH (
OIDS=FALSE
)"""
createStatement40 = """CREATE TABLE {:s}{:s} (
objectid serial NOT NULL,
col1 smallint NULL,
col2 integer NULL,
col3 real NULL,
col4 varchar(20) NULL,
col5 varchar(20) NULL,
col6 smallint NULL,
col7 integer NULL,
col8 real NULL,
col9 varchar(20) NULL,
col10 varchar(20) NULL,
col11 smallint NULL,
col12 integer NULL,
col13 real NULL,
col14 varchar(20) NULL,
col15 varchar(20) NULL,
col16 smallint NULL,
col17 integer NULL,
col18 real NULL,
col19 varchar(20) NULL,
col20 varchar(20) NULL,
col21 smallint NULL,
col22 integer NULL,
col23 real NULL,
col24 varchar(20) NULL,
col25 varchar(20) NULL,
col26 smallint NULL,
col27 integer NULL,
col28 real NULL,
col29 varchar(20) NULL,
col30 varchar(20) NULL,
col31 smallint NULL,
col32 integer NULL,
col33 real NULL,
col34 varchar(20) NULL,
col35 varchar(20) NULL,
col36 smallint NULL,
col37 integer NULL,
col38 real NULL,
{:<12s} geometry NULL,
CONSTRAINT {:s}_pkey PRIMARY KEY (objectid),
CONSTRAINT enforce_srid_{:s} CHECK (st_srid({:s}) = 4326)
)
WITH (
OIDS=FALSE
)"""
listStatement="""SELECT (table_catalog ||'.' || table_schema || '.' || table_name)::varchar(100) as name
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
AND table_schema NOT IN ('pg_catalog', 'information_schema')"""
conn = os.path.join(os.getcwd(),"bench.sde")
cursor = arcpy.ArcSDESQLExecute(conn)
for createStatement in [createStatement6, createStatement2, createStatement40]:
    count = 0
    print('----------')
    # .. (re)Create tables
    start = datetime.datetime.now()
    created = []
    for i in range(total):
        # randomly create roughly 'keep' of the candidate names; the second
        # clause forces creation of later indices if the random draws run short
        if (math.floor(random.random() * 100) < fudge or (total - i) < (total - keep)):
            table = tableFmt.format(i)
            sql = dropStatement.format(schema, table)
            if (verbose): print("{:s};".format(sql))
            cursor.execute(sql)
            sql = createStatement.format(schema, table, geomcol, table, geomcol, geomcol)
            if (verbose): print("{:s};\n".format(sql))
            cursor.execute(sql)
            created.append(table)
            count += 1
            if (count >= keep): break
    elapsed = (datetime.datetime.now() - start).total_seconds()
    #print("{:>16s} = {:.4f} secs".format("Prep",elapsed))
    # .. Measure Exists
    allSum = 0.0
    exiSum = 0.0
    dneSum = 0.0
    for i in range(total):
        table = tableFmt.format(i)
        start = datetime.datetime.now()
        doesExist = arcpy.Exists(os.path.join(conn, "{:s}{:s}".format(schema, table)))
        elapsed = (datetime.datetime.now() - start).total_seconds()
        allSum += elapsed
        if (doesExist):
            exiSum += elapsed
        else:
            dneSum += elapsed
        if (verbose): print("{:s} ({:s})".format(table, "Y" if doesExist else "N"))
    print("{:>16s} = {:.4f}".format("Tot (all)", allSum))
    print("{:>16s} = {:.4f}".format("Mean (all)", allSum / float(total)))
    print("{:>16s} = {:.4f}".format("Mean (exi)", exiSum / float(keep)))
    print("{:>16s} = {:.4f}".format("Mean (dne)", dneSum / float(total - keep)))
    # .. Measure ListFC
    arcpy.env.workspace = conn
    start = datetime.datetime.now()
    allTables = arcpy.ListFeatureClasses()
    elapsed = (datetime.datetime.now() - start).total_seconds()
    print("{:>16s} = {:.4f}".format("List elapsed", elapsed))
    start = datetime.datetime.now()
    found = findTables(allTables, schema, tableFmt, total)
    elapsed += (datetime.datetime.now() - start).total_seconds()
    print("{:>16s} = {:.4f} ({:d} found)".format("List total", elapsed, len(found)))
    # .. Measure SQL
    start = datetime.datetime.now()
    sql = listStatement
    allTables = [row[0] for row in cursor.execute(sql)]
    elapsed = (datetime.datetime.now() - start).total_seconds()
    print("{:>16s} = {:.4f}".format("SQL elapsed", elapsed))
    start = datetime.datetime.now()
    found = findTables(allTables, schema, tableFmt, total)
    elapsed += (datetime.datetime.now() - start).total_seconds()
    print("{:>16s} = {:.4f} ({:d} found)".format("SQL total", elapsed, len(found)))
    # .. Cleanup
    start = datetime.datetime.now()
    for table in created:
        sql = dropStatement.format(schema, table)
        if (verbose): print("{:s};".format(sql))
        cursor.execute(sql)
    elapsed = (datetime.datetime.now() - start).total_seconds()
    #print("{:>16s} = {:.4f} secs".format("Cleanup",elapsed))
cursor = None
#EOF
*Note: a hack job; it took way longer to make than I wanted, so it's provided as-is.
Follow-up: I used this technique in an application against a PostgreSQL database and ran into an issue with failure to detect views and materialized views. Adding view support was as simple as changing the SQL that compiles the list to UNION ALL a query against information_schema.views:
SELECT (table_schema || '.' || table_name)::varchar(255) as name
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
AND table_schema NOT IN ('pg_catalog', 'information_schema')
UNION ALL
SELECT (table_schema || '.' || table_name)::varchar(255) as name
FROM information_schema.views
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
But for PostgreSQL, at least, materialized views are not published through information_schema (because MATERIALIZED VIEW is an extension), so for them I needed to append a PG-specific catalog query:
UNION ALL
SELECT (schemaname || '.' || matviewname)::varchar(255) as name
FROM pg_matviews
-
Wow! I know comments are not meant to be a 'thank you' place, but I still wanted to point out that your answer is really thorough and interesting :) – NettaB, Jun 4, 2019 at 9:37
-
I just added a test based on "SELECT count(*) FROM {:s}{:s} WHERE 1 = 0".format(schema,table), just for grins, and it came in at 120-260 milliseconds, which is way better than ListFeatureClasses but still an order of magnitude off the information_schema query. If your tables have rows, adding the "WHERE 1 = 0" is the canonical way to avoid the overhead of processing them. – Vince, Jun 5, 2019 at 1:58