
I have a SQL query which works for single checks:

SELECT trans_id FROM schema.table WHERE trans_id LIKE '%<trans_id>%'

There might be better approaches to this query, but that's not the point.

The database has approx. 150k entries, and I have to check for 30k of them whether the trans_id exists. The problem I face is that I don't know if the normal approach with joining works, because the trans_ids which have to be queried are not in the database (they're in an Excel file, unfortunately).

I'm not allowed to add them to the database to join them.

My idea was to create some kind of script which I trigger via psql (researched):

psql -U postgres -d database -o /absolute_path/textfile.txt << EOF
Query1;
Query2;
...
EOF

But as far as I can tell, that would mean writing 30k SELECT statements into the heredoc section. I doubt this works, not to mention the effort.

Also, the output should be routed to a local file, which shows:

  • trans_id exists
  • trans_id doesn't exist

Maybe some loop with an array? But I don't know how.

Performance is not my primary goal.

asked Oct 20, 2022 at 6:42
  • Have you tried modifying your query to be along the lines of SELECT trans_id FROM schema.table WHERE trans_id IN (<list of ids>)? If you're just checking for existence, I'm not sure you'd want to use LIKE or any other fuzzy operator. Commented Oct 20, 2022 at 14:59
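A minimal sketch of what that comment suggests, with made-up IDs standing in for the real list (schema.table as in the question):

SELECT trans_id
FROM schema.table
WHERE trans_id IN ('abc', 'def', 'ghi');  -- returns only the IDs that are present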

2 Answers


You may pass batches of values, or even all 30k values in a single query, through a VALUES clause. This is a common practice when querying a read-only server.

The query could look like this:

WITH list(pattern) AS (
  VALUES ('abc'), ('def'), ('ghi')
)
SELECT pattern,
       EXISTS (SELECT 1 FROM tablename WHERE trans_id LIKE '%'||pattern||'%')
FROM list;

Be aware that the patterns should not contain % or _, or they should be escaped before being used with LIKE. The same goes for ' in the values injected into the VALUES clause.
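As a sketch, the wildcard escaping could also be done inside the query itself, assuming the default backslash escape character for LIKE (tablename and the values are placeholders as above); the single quotes, by contrast, must still be escaped client-side before injection:

WITH list(pattern) AS (
  VALUES ('abc'), ('50%_off')   -- raw values that may contain LIKE wildcards
)
SELECT pattern,
       EXISTS (
         SELECT 1 FROM tablename
         WHERE trans_id LIKE '%' ||
               replace(replace(replace(pattern, '\', '\\'), '%', '\%'), '_', '\_')
               || '%'
       )
FROM list;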

EXISTS (subquery) returns a boolean, which will be displayed as t or f.
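For illustration, the psql output of the query above might look like this (made-up values):

 pattern | exists
---------+--------
 abc     | t
 def     | f
 ghi     | t
(3 rows)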

Depending on the string lengths and the kind of contents in this column, it might be faster to use strpos(trans_id, pattern) > 0 instead of trans_id LIKE '%'||pattern||'%'. Both produce the same result if pattern does not contain wildcards, but through different algorithms: recursive pattern matching with backtracking for LIKE versus the Boyer-Moore-Horspool algorithm for strpos.
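For reference, the strpos variant of the same query (placeholder names as above):

WITH list(pattern) AS (
  VALUES ('abc'), ('def'), ('ghi')
)
SELECT pattern,
       EXISTS (SELECT 1 FROM tablename WHERE strpos(trans_id, pattern) > 0)
FROM list;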

answered Oct 21, 2022 at 12:15
  • Thanks for this query. I have applied it and it worked partly: 30k entries is still too large to handle -> the error message says "Request entity too large". Unfortunately I have to split it into smaller sections, but your query is a big help! Thanks a lot. Commented Oct 24, 2022 at 6:28

The query from Daniel Verite was really helpful. Unfortunately, after approx. 20 seconds of query runtime I ran into timeout issues. Since I'm not allowed to adjust the config, I wrote a looped script to work around the problem.

This fit my purposes best:

The solution is as follows:

  • create a conf file to provide all parameters
  • use psycopg2 for the database connection between Postgres and Python
  • use numpy for handling the I/O files

import psycopg2
from psycopg2 import Error
import numpy as np
from configparser import ConfigParser

def get_config_parameter(conf_file, section, parameter):
    config_object = ConfigParser()
    config_object.read(conf_file)
    return str(config_object.get(section, parameter))

dbpassword = get_config_parameter('db.conf', 'PROD ENV', 'dbpassword')
dbusername = get_config_parameter('db.conf', 'PROD ENV', 'dbusername')
dbhost = get_config_parameter('db.conf', 'PROD ENV', 'dbhost')
dbname = get_config_parameter('db.conf', 'PROD ENV', 'dbname')
input_file_name = get_config_parameter('db.conf', 'PROD ENV', 'input_file_name')
output_result_file = get_config_parameter('db.conf', 'PROD ENV', 'output_result_file')

chunksize = 500

# one trans_id per line in the input file
eculist = np.loadtxt(input_file_name, dtype="str")

def query(dbusername, dbpassword, dbhost, dbname, query_input):
    connection = None
    cursor = None
    try:
        connection = psycopg2.connect(user=dbusername, password=dbpassword,
                                      host=dbhost, database=dbname)
        # create a cursor to perform database operations
        cursor = connection.cursor()
        print("CONNECTED:", connection.get_dsn_parameters(), ": \n")
        try:
            # note: the IDs are interpolated directly into the SQL; quotes or
            # wildcards in them would need escaping (see the other answer)
            query = (f"WITH list(pattern) as (values {query_input}) "
                     f"SELECT pattern, EXISTS (select 1 FROM mule.transactions "
                     f"WHERE transaction_id LIKE '%'||pattern||'%') FROM list;")
            print('query in progress... please wait')
            cursor.execute(query)
            return str(cursor.fetchall())
        except (Exception, Error) as error:
            print("SQL: Error while executing query:", error)
    except (Exception, Error) as error:
        print("NOT CONNECTED: Error while connecting to the database:", error)
    finally:
        if cursor:
            cursor.close()
        if connection:
            connection.close()
            print("Database connection closed!")

out_text = ""
for i in range(0, len(eculist), chunksize):
    # build a VALUES list like ('id1'), ('id2'), ... for the current chunk
    values_list = ", ".join(f"('{ecu}')" for ecu in eculist[i:i + chunksize])
    result = query(dbusername, dbpassword, dbhost, dbname, values_list)
    out_text = out_text + (result or "")

with open(output_result_file, 'w') as save_file:
    save_file.write(out_text)

Of course you can get rid of all the fail-safes, but I think they're useful.

answered Oct 28, 2022 at 10:14
