I have a SQL query which works for single checks:
SELECT trans_id from schema.table where trans_id like '%<trans_id>%'
There might be better approaches for this query, but that's not the point.
The database has approx. 150k entries, and I need to check for 30k of them whether the trans_id exists. The problem I face is that I don't know if the usual approach with a join works, because the trans_ids that have to be checked are not in the database (they live in Excel, unfortunately :/).
I'm not allowed to add them to the database to join them.
My idea was to create some kind of script which I trigger via psql (researched):

psql -U postgres -d database -o /absolute_path/textfile.txt << EOF
Query1;
Query2;
...
EOF
But as far as I can tell, that would mean writing 30k SELECT statements into the heredoc section. I doubt this works, not even talking about the effort.
Also, the output should be routed to a local file, which shows:
- trans_id exists
- trans_id doesn't exist
Maybe some loop with an array? But I don't know how.
Performance is not my goal in the first place.
2 Answers
You may pass batches of values, or even all 30k values at once, in a query through a VALUES clause. This is a common practice when querying a read-only server.
The query could look like this:
WITH list(pattern) AS (
  VALUES ('abc'), ('def'), ('ghi')
)
SELECT pattern,
       EXISTS (SELECT 1 FROM tablename WHERE trans_id LIKE '%'||pattern||'%')
FROM list;
Be aware that the patterns should not contain % or _, or they should be escaped before being used with LIKE. The same goes for ' in the values being injected into the VALUES clause.
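For instance, the escaping could be done client-side before the VALUES list is assembled. A minimal sketch in Python (the helper name is made up; it assumes PostgreSQL's default backslash escape character for LIKE):

def escape_pattern(value):
    # double single quotes for the SQL string literal
    value = value.replace("'", "''")
    # backslash is LIKE's default escape character in PostgreSQL:
    # escape it first, then neutralize the wildcards
    return value.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")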
EXISTS (subquery) returns a boolean, which will be displayed as t or f.
Depending on the string lengths and the kind of contents in this column, it might be faster to use strpos(trans_id, pattern) > 0 instead of trans_id LIKE '%'||pattern||'%'. Both produce the same result if pattern does not contain wildcards, but through different algorithms: recursive pattern matching with backtracking for LIKE versus the Boyer-Moore-Horspool algorithm for strpos.
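The strpos variant only changes the subquery; with the same placeholder values:

WITH list(pattern) AS (
  VALUES ('abc'), ('def'), ('ghi')
)
SELECT pattern,
       EXISTS (SELECT 1 FROM tablename WHERE strpos(trans_id, pattern) > 0)
FROM list;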
- Thanks for this query. I have applied it and it worked partly: 30k entries is still too large to handle -> the error message says "Request entity too large". I have to split it into smaller sections unfortunately, but your query is a big help! Thanks a lot! – dbalucas, Oct 24, 2022 at 6:28
The query from Daniel Verite was really helpful. Unfortunately, after approx. 20 seconds of query runtime I ran into timeout issues. Since I'm not allowed to adjust the config, I wrote a looped script to work around the problem.
This fitted my purposes best:
Solution as follows:
- create a conf file to provide all parameters, of course
- use psycopg2 for database connections via Postgres and Python
- use numpy for handling the I/O files
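A minimal db.conf matching the keys read below could look like this (section and key names are taken from the code; the values are placeholders):

[PROD ENV]
dbusername = postgres
dbpassword = secret
dbhost = localhost
dbname = database
input_file_name = /absolute_path/ecu_list.txt
output_result_file = /absolute_path/results.txt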
import psycopg2
from psycopg2 import Error
import numpy as np
from configparser import ConfigParser  # SafeConfigParser is deprecated

def get_config_parameter(conf_file, section, parameter):
    config_object = ConfigParser()
    config_object.read(conf_file)
    return str(config_object.get(section, parameter))

dbpassword = get_config_parameter('db.conf', 'PROD ENV', 'dbpassword')
dbusername = get_config_parameter('db.conf', 'PROD ENV', 'dbusername')
dbhost = get_config_parameter('db.conf', 'PROD ENV', 'dbhost')
dbname = get_config_parameter('db.conf', 'PROD ENV', 'dbname')
input_file_name = get_config_parameter('db.conf', 'PROD ENV', 'input_file_name')
output_result_file = get_config_parameter('db.conf', 'PROD ENV', 'output_result_file')

chunksize = 500

# one trans_id per line in the input file
eculist = np.loadtxt(input_file_name, dtype="str")

def query(dbusername, dbpassword, dbhost, dbname, query_input):
    connection = None
    try:
        connection = psycopg2.connect(user=dbusername, password=dbpassword,
                                      host=dbhost, database=dbname)
        # create a cursor to perform database operations
        cursor = connection.cursor()
        print("CONNECTED:", connection.get_dsn_parameters(), "\n")
        try:
            sql = (f"WITH list(pattern) AS (VALUES {query_input}) "
                   f"SELECT pattern, EXISTS (SELECT 1 FROM mule.transactions "
                   f"WHERE transaction_id LIKE '%'||pattern||'%') FROM list;")
            print('query in progress... please wait')
            cursor.execute(sql)
            return str(cursor.fetchall())
        except (Exception, Error) as error:
            print("SQL: Error while executing query:", error)
    except (Exception, Error) as error:
        print("NOT CONNECTED: Error while connecting to database:", error)
    finally:
        if connection:
            cursor.close()
            connection.close()
            print("Database connection closed!")
    return ""  # empty result on failure, so the caller can keep concatenating

out_text = ""
for i in range(0, len(eculist), chunksize):
    # build one VALUES list per chunk: ('id1'), ('id2'), ...
    chunk = ", ".join(f"('{ecu}')" for ecu in eculist[i:i + chunksize])
    out_text += query(dbusername, dbpassword, dbhost, dbname, chunk)

with open(output_result_file, 'w+') as save_file:
    save_file.write(out_text)
Of course you can get rid of all the fail-safes, but I think they're useful.
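As a side note, if the ids are not fully trusted, the string interpolation could be replaced with a bound array parameter. A sketch of just the query call, assuming the same mule.transactions table (ids is a plain Python list of id strings; note the doubled %% for a literal % once psycopg2 parameters are in play):

sql = """
    WITH list(pattern) AS (SELECT unnest(%s::text[]))
    SELECT pattern,
           EXISTS (SELECT 1 FROM mule.transactions
                   WHERE transaction_id LIKE '%%'||pattern||'%%')
    FROM list;
"""
cursor.execute(sql, (ids,))  # psycopg2 adapts the list to a PostgreSQL array
result = str(cursor.fetchall())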
- SELECT trans_id FROM schema.table WHERE trans_id IN (<list of ids>) - if you're just checking for existence, I'm not sure you'd want to use LIKE or any other fuzzy operator.
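An exact-match version of the earlier query along those lines might look like this (a sketch; unlike a plain IN, it still reports the ids that are missing):

SELECT t.pattern,
       EXISTS (SELECT 1 FROM schema.table WHERE trans_id = t.pattern)
FROM unnest(ARRAY['abc', 'def', 'ghi']) AS t(pattern);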