I have a case study where I need to take data from a REST API, do some analysis on it (aggregate functions, joins, etc.), and use the resulting JSON response to plot some retail graphs.
Approach followed so far:
Read the data from JSON, store it in a Python variable, and run an INSERT query for each record. This is obviously a costly operation, because every JSON record read triggers a separate insert into the database; for 33k rows it takes more than 20 minutes, which is inefficient.
This could be handled in Elasticsearch for faster processing, but complex operations like joins are not available there.
If anybody can suggest the best approach (such as pre-processing or post-processing in Python) for handling such scenarios, it would be helpful.
Python script
import json

import MySQLdb


def store_data(AccountNo):
    db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
    cursor = db.cursor()
    insert_query = "INSERT INTO cstore (AccountNo) VALUES (%s)"
    cursor.execute(insert_query, (AccountNo,))
    db.commit()
    cursor.close()
    db.close()


def on_data(file_path):
    # Parse the JSON file and store every AccountNo in MySQL (one insert per record)
    try:
        with open(file_path) as testFile:
            datajson = json.load(testFile)
        # grab the wanted data from each record
        for i in range(len(datajson)):
            for cosponsor in datajson[i]:
                AccountNo = cosponsor['AccountNo']
                store_data(AccountNo)
    except Exception as e:
        print(e)
Comment (hc_dev, Apr 17, 2020): Opening a DB connection is always costly. It could be done once, before the loop.
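A minimal sketch of that suggestion, keeping the table and variable names from the question but opening one connection (and issuing one commit) for the whole file; the helper name store_all is illustrative:

import json

import MySQLdb


def store_all(file_path):
    # One connection and one commit for the whole load, instead of one per row
    db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
    cursor = db.cursor()
    insert_query = "INSERT INTO cstore (AccountNo) VALUES (%s)"
    try:
        with open(file_path) as testFile:
            datajson = json.load(testFile)
        for group in datajson:
            for cosponsor in group:
                cursor.execute(insert_query, (cosponsor['AccountNo'],))
        db.commit()
    finally:
        cursor.close()
        db.close()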
1 Answer
Here are a few things you can do to improve the performance of the data loading:
- convert the JSON file to CSV and use LOAD DATA to load from a file (sample). It is probably the fastest way to do what you are trying to do; a sketch of this route follows the list.
- use .executemany() instead of .execute():

      datajson = json.load(testFile)

      insert_query = """
          INSERT INTO cstore (AccountNo)
          VALUES (%(AccountNo)s)
      """
      cursor.executemany(insert_query, datajson)
- look into disabling/removing existing indexes during the insertion and then re-creating them after the insertion is done (see the index sketch after this list)
- make sure you are not doing the data load "over the internet", so that network latency and bandwidth do not become a factor; be "closer" to the database server
- loading the JSON file is probably not a bottleneck, but you may also look into faster JSON parsers like ujson (a drop-in example follows the list)
- I remember we also had some data-loading-performance-related success with the ultra-fast umysql Python database driver, but it looks like the package has not been maintained for quite a while
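A minimal sketch of the LOAD DATA route, assuming the server and client both allow LOAD DATA LOCAL INFILE and that the JSON has the nested list-of-lists structure from the question; the CSV path and helper name are illustrative:

import csv
import json

import MySQLdb


def load_via_csv(json_path, csv_path="accounts.csv"):
    # Flatten the JSON into a one-column CSV file
    with open(json_path) as f:
        datajson = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for group in datajson:
            for cosponsor in group:
                writer.writerow([cosponsor["AccountNo"]])

    # Bulk-load the CSV in a single statement (local_infile must be enabled)
    db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE,
                         charset="utf8", local_infile=1)
    cursor = db.cursor()
    cursor.execute(
        "LOAD DATA LOCAL INFILE %s INTO TABLE cstore"
        " FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' (AccountNo)",
        (csv_path,),
    )
    db.commit()
    cursor.close()
    db.close()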
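For the index suggestion, a sketch of dropping a secondary index before the bulk insert and rebuilding it afterwards; the index name idx_accountno is hypothetical, so substitute the real one from the schema (MyISAM tables could use ALTER TABLE ... DISABLE KEYS / ENABLE KEYS instead):

def bulk_insert_without_index(cursor, rows):
    # rows is expected to be a list of one-element tuples, e.g. [("123",), ("456",)]
    cursor.execute("DROP INDEX idx_accountno ON cstore")  # hypothetical index name
    cursor.executemany("INSERT INTO cstore (AccountNo) VALUES (%s)", rows)
    cursor.execute("CREATE INDEX idx_accountno ON cstore (AccountNo)")  # rebuild once at the end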
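And the ujson swap is essentially a drop-in, since it mirrors the standard json load/loads interface:

import ujson  # pip install ujson

with open(file_path) as testFile:
    datajson = ujson.load(testFile)  # same call shape as json.load, typically faster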