
I have a SQL query which works for single checks:

SELECT trans_id FROM schema.table WHERE trans_id LIKE '%<trans_id>%'

There might be better approaches to this query, but that's not the point.

The database has approx. 150k entries, and I have to check for 30k of them whether the trans_id exists. The problem I face is that I don't know if the normal approach with joining works, because the trans_ids which have to be queried are not in the database (they're in an Excel file, unfortunately).

I'm not allowed to add them to the database to join them.

My idea was to create some kind of script which I trigger via psql (researched):

psql -U postgres -d database -o /absolute_path/textfile.txt << EOF
Query1;
Query2;
...
EOF

But as far as I can tell, that would mean writing 30k SELECT statements into the heredoc section. I doubt this works, not to mention the effort.

Also, the output should be routed to a local file, which shows:

  • trans_id exists
  • trans_id doesn't exist

Maybe some loop with an array? But I don't know how.

Performance is not my primary goal.

asked Oct 20, 2022 at 6:42
  • Have you tried modifying your query to be along the lines of SELECT trans_id FROM schema.table WHERE trans_id IN (<list of ids>)? If you're just checking for existence, I'm not sure you'd want to use LIKE or any other fuzzy operator. Commented Oct 20, 2022 at 14:59
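A minimal sketch of what that comment suggests, with made-up IDs standing in for the real list (schema.table as in the question):

SELECT trans_id
FROM schema.table
WHERE trans_id IN ('abc', 'def', 'ghi');  -- returns only the IDs that are present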

2 Answers


You may pass batches of values, or even all 30k values in a single query, through a VALUES clause. This is a common practice when querying a read-only server.

The query could look like this:

WITH list(pattern) AS (
  VALUES ('abc'), ('def'), ('ghi')
)
SELECT pattern,
       EXISTS (SELECT 1 FROM tablename WHERE trans_id LIKE '%'||pattern||'%')
FROM list;

Be aware that the patterns should not contain % or _, or they should be escaped before being used with LIKE. The same goes for ' in the values injected into the VALUES clause.
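As a sketch, the wildcard escaping could also be done inside the query itself, assuming the default backslash escape character for LIKE (tablename and the values are placeholders as above); the single quotes, by contrast, must still be escaped client-side before injection:

WITH list(pattern) AS (
  VALUES ('abc'), ('50%_off')   -- raw values that may contain LIKE wildcards
)
SELECT pattern,
       EXISTS (
         SELECT 1 FROM tablename
         WHERE trans_id LIKE '%' ||
               replace(replace(replace(pattern, '\', '\\'), '%', '\%'), '_', '\_')
               || '%'
       )
FROM list;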

EXISTS (subquery) returns a boolean, which will be displayed as t or f.
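For illustration, the psql output of the query above might look like this (made-up values):

 pattern | exists
---------+--------
 abc     | t
 def     | f
 ghi     | t
(3 rows)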

Depending on the string lengths and the kind of contents in this column, it might be faster to use strpos(trans_id, pattern) > 0 instead of trans_id LIKE '%'||pattern||'%'. Both produce the same result if pattern does not contain wildcards, but through different algorithms: recursive pattern matching with backtracking for LIKE versus the Boyer-Moore-Horspool algorithm for strpos.
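For reference, the strpos variant of the same query (placeholder names as above):

WITH list(pattern) AS (
  VALUES ('abc'), ('def'), ('ghi')
)
SELECT pattern,
       EXISTS (SELECT 1 FROM tablename WHERE strpos(trans_id, pattern) > 0)
FROM list;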

answered Oct 21, 2022 at 12:15
  • Thanks for this query. I have applied it and it worked partly: 30k entries is still too large to handle -> the error message says "Request entity too large". Unfortunately I have to split it into smaller sections, but your query is a big help! Thanks a lot. Commented Oct 24, 2022 at 6:28

The query from Daniel Verite was really helpful. Unfortunately, after approx. 20 seconds of query runtime I ran into timeout issues. Since I'm not allowed to adjust the config, I wrote a looped script to work around the problem.

This fit my purposes best:

The solution is as follows:

  • create a conf file to provide all parameters
  • use psycopg2 for the database connection between Postgres and Python
  • use numpy for handling the I/O files

import psycopg2
from psycopg2 import Error
import numpy as np
from configparser import ConfigParser

def get_config_parameter(conf_file, section, parameter):
    config_object = ConfigParser()
    config_object.read(conf_file)
    return str(config_object.get(section, parameter))

dbpassword = get_config_parameter('db.conf', 'PROD ENV', 'dbpassword')
dbusername = get_config_parameter('db.conf', 'PROD ENV', 'dbusername')
dbhost = get_config_parameter('db.conf', 'PROD ENV', 'dbhost')
dbname = get_config_parameter('db.conf', 'PROD ENV', 'dbname')
input_file_name = get_config_parameter('db.conf', 'PROD ENV', 'input_file_name')
output_result_file = get_config_parameter('db.conf', 'PROD ENV', 'output_result_file')

chunksize = 500

# one trans_id per line in the input file
eculist = np.loadtxt(input_file_name, dtype="str")

def query(dbusername, dbpassword, dbhost, dbname, query_input):
    connection = None
    cursor = None
    try:
        connection = psycopg2.connect(user=dbusername, password=dbpassword,
                                      host=dbhost, database=dbname)
        # create a cursor to perform database operations
        cursor = connection.cursor()
        print("CONNECTED:", connection.get_dsn_parameters(), ": \n")
        try:
            # note: the IDs are interpolated directly into the SQL; quotes or
            # wildcards in them would need escaping (see the other answer)
            query = (f"WITH list(pattern) as (values {query_input}) "
                     f"SELECT pattern, EXISTS (select 1 FROM mule.transactions "
                     f"WHERE transaction_id LIKE '%'||pattern||'%') FROM list;")
            print('query in progress... please wait')
            cursor.execute(query)
            return str(cursor.fetchall())
        except (Exception, Error) as error:
            print("SQL: Error while executing query:", error)
    except (Exception, Error) as error:
        print("NOT CONNECTED: Error while connecting to the database:", error)
    finally:
        if cursor:
            cursor.close()
        if connection:
            connection.close()
            print("Database connection closed!")

out_text = ""
for i in range(0, len(eculist), chunksize):
    # build a VALUES list like ('id1'), ('id2'), ... for the current chunk
    values_list = ", ".join(f"('{ecu}')" for ecu in eculist[i:i + chunksize])
    result = query(dbusername, dbpassword, dbhost, dbname, values_list)
    out_text = out_text + (result or "")

with open(output_result_file, 'w') as save_file:
    save_file.write(out_text)

Of course you can get rid of all the fail-safes, but I think they're useful.

answered Oct 28, 2022 at 10:14
