I have a case study where I need to take data from a REST API, do some analysis on it (aggregate functions, joins, etc.), and use the resulting JSON response to plot some retail graphs.
Approach followed so far:
Read the data from JSON, store it in a Python variable, and run an INSERT query for each record. This is obviously a costly operation, because every JSON record read triggers a separate insert into the database; for 33k rows it takes more than 20 minutes, which is inefficient.
This could be handled in Elasticsearch for faster processing, but complex operations like joins are not available there.
If anybody can suggest the best approach (such as pre-processing or post-processing in Python) for handling such scenarios, it would be helpful.
Python script
import json

import MySQLdb


def store_data(AccountNo):
    db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
    cursor = db.cursor()
    insert_query = "INSERT INTO cstore (AccountNo) VALUES (%s)"
    cursor.execute(insert_query, (AccountNo,))
    db.commit()
    cursor.close()
    db.close()


def on_data(file_path):
    # Parse the JSON file and store every AccountNo in MySQL (one insert per record)
    try:
        with open(file_path) as testFile:
            datajson = json.load(testFile)
        # grab the wanted data from each record
        for i in range(len(datajson)):
            for cosponsor in datajson[i]:
                AccountNo = cosponsor['AccountNo']
                store_data(AccountNo)
    except Exception as e:
        print(e)
Comment (hc_dev, Apr 17, 2020): Opening a DB connection is always costly. It could be done once, before the loop.
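A minimal sketch of that suggestion, keeping the table and variable names from the question but opening one connection (and issuing one commit) for the whole file; the helper name store_all is illustrative:

import json

import MySQLdb


def store_all(file_path):
    # One connection and one commit for the whole load, instead of one per row
    db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
    cursor = db.cursor()
    insert_query = "INSERT INTO cstore (AccountNo) VALUES (%s)"
    try:
        with open(file_path) as testFile:
            datajson = json.load(testFile)
        for group in datajson:
            for cosponsor in group:
                cursor.execute(insert_query, (cosponsor['AccountNo'],))
        db.commit()
    finally:
        cursor.close()
        db.close()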
1 Answer
Here are a few things you can do to improve the performance of the data loading:
- convert the JSON file to CSV and use LOAD DATA to load from a file (sample). It is probably the fastest way to do what you are trying to do; a sketch of this route follows the list.
- use .executemany() instead of .execute():

      datajson = json.load(testFile)

      insert_query = """
          INSERT INTO cstore (AccountNo)
          VALUES (%(AccountNo)s)
      """
      cursor.executemany(insert_query, datajson)
- look into disabling/removing existing indexes during the insertion and then re-creating them after the insertion is done (see the index sketch after this list)
- make sure you are not doing the data load "over the internet", so that network latency and bandwidth do not become a factor; be "closer" to the database server
- loading the JSON file is probably not a bottleneck, but you may also look into faster JSON parsers like ujson (a drop-in example follows the list)
- I remember we also had some data-loading-performance-related success with the ultra-fast umysql Python database driver, but it looks like the package has not been maintained for quite a while
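A minimal sketch of the LOAD DATA route, assuming the server and client both allow LOAD DATA LOCAL INFILE and that the JSON has the nested list-of-lists structure from the question; the CSV path and helper name are illustrative:

import csv
import json

import MySQLdb


def load_via_csv(json_path, csv_path="accounts.csv"):
    # Flatten the JSON into a one-column CSV file
    with open(json_path) as f:
        datajson = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for group in datajson:
            for cosponsor in group:
                writer.writerow([cosponsor["AccountNo"]])

    # Bulk-load the CSV in a single statement (local_infile must be enabled)
    db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE,
                         charset="utf8", local_infile=1)
    cursor = db.cursor()
    cursor.execute(
        "LOAD DATA LOCAL INFILE %s INTO TABLE cstore"
        " FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' (AccountNo)",
        (csv_path,),
    )
    db.commit()
    cursor.close()
    db.close()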
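For the index suggestion, a sketch of dropping a secondary index before the bulk insert and rebuilding it afterwards; the index name idx_accountno is hypothetical, so substitute the real one from the schema (MyISAM tables could use ALTER TABLE ... DISABLE KEYS / ENABLE KEYS instead):

def bulk_insert_without_index(cursor, rows):
    # rows is expected to be a list of one-element tuples, e.g. [("123",), ("456",)]
    cursor.execute("DROP INDEX idx_accountno ON cstore")  # hypothetical index name
    cursor.executemany("INSERT INTO cstore (AccountNo) VALUES (%s)", rows)
    cursor.execute("CREATE INDEX idx_accountno ON cstore (AccountNo)")  # rebuild once at the end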
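And the ujson swap is essentially a drop-in, since it mirrors the standard json load/loads interface:

import ujson  # pip install ujson

with open(file_path) as testFile:
    datajson = ujson.load(testFile)  # same call shape as json.load, typically faster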