
I am trying to create a small application that will help me extract data from Hadoop via hiveserver2 by simply writing it to CSV. At the moment, it takes about 40 seconds to pull 10,000 records from a hs2 table consisting of 500k rows with 203 columns. I would like to think that there is a faster, more efficient way of doing this than the way I am doing it now. Ideally I would like to eventually be able to pull and write 1 million rows per minute. It doesn't necessarily have to be written to CSV. It probably isn't feasible with my current skill level, but I like to set goals for myself.

import pyhs2
import time
import csv

csv_out = open('data.csv', 'wb')
mywriter = csv.writer(csv_out)

def generic_user():
    gu = 'xxxxxx'
    return gu

def password():
    pw = 'xxxxxxx'
    return pw

with pyhs2.connect(host='xxxxxxxx',
                   port=10000,
                   authMechanism='PLAIN',
                   user=generic_user(),
                   password=password(),
                   database='xxxxxxxxxx') as conn:
    with conn.cursor() as cur:
        q = raw_input('Enter query: ').replace('csc', 'CSC')
        print q
        # timer start
        start_time = time.time()
        # Execute query
        cur.execute(q)
        col = []
        for key in cur.getSchema():
            col.append(key['columnName'])
        header = []
        header.append(col)
        for rows in zip(header):
            mywriter.writerows(rows)
        records = cur.fetch()
        # print records
        for rows in zip(records):
            mywriter.writerows(rows)

pull_time = time.time() - start_time
print pull_time
asked May 24, 2016 at 15:36

2 Answers


Speed improvements

Building a list with a for loop and append is quite slow when Python has 'list comprehensions': for-loop-like expressions that build lists directly. So your col could be changed from this:

for key in cur.getSchema():
    col.append(key['columnName'])

to this:

col = [key['columnName'] for key in cur.getSchema()]

As far as I can tell, your header is actually pointless. You're making just a single-item list that then gets iterated over to write to your file. This seems to be the same as just mywriter.writerows(col).

Your use of zip is confusing. zip is used to join two or more lists when iterating. It's not normally used for a single list like you have, and doesn't really do much for you unless I'm missing something. It's a redundant drain on your time.
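To illustrate what zip is actually doing here (a quick sketch outside the question's code): zipping a single list just wraps each element in a 1-tuple, and zipping the single-item header list wraps the whole column list in one tuple, which is why writerows then emits it as one row.

```python
col = ['a', 'b', 'c']

# zip on a single list wraps each element in a 1-tuple:
print(list(zip(col)))    # [('a',), ('b',), ('c',)]

# zip on [col] yields one tuple containing the whole list,
# so writerows sees exactly one row:
print(list(zip([col])))  # [(['a', 'b', 'c'],)]
```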

Other notes

You use context managers (with x as y) for the connection which is great, but you should do the same for your csv_out.

with open('data.csv', 'wb') as csv_out:

It gives your file the same safety you already have for the connection.

I'd also recommend refactoring your code into discrete functions rather than just one long script. It's better for readability and debugging purposes. Plus it makes it easier to change or reuse this code later.
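As a rough sketch of what that refactor might look like (the function names here are my own invention, not from the original code, and get_column_names assumes the pyhs2 cursor interface used in the question):

```python
import csv

def get_column_names(cursor):
    # Pull the column names out of the cursor's schema.
    # Assumes a pyhs2-style cursor exposing getSchema().
    return [key['columnName'] for key in cursor.getSchema()]

def write_results(csv_out, header, rows):
    # Write one header row followed by all data rows
    # to an already-open file object.
    writer = csv.writer(csv_out)
    writer.writerow(header)
    writer.writerows(rows)
```

The main script then becomes a short sequence of calls, each of which can be timed and tested on its own.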

answered May 26, 2016 at 11:10
  • Yes, the header part is strange. The reason for it is that using col iterates each character, s,o,,i,t,l,o,o,k,s,,l,i,k,e,,t,h,i,s, for some reason, and the easiest way I knew to fix it was to just make another list appending it. The plan is to eventually break it into discrete functions; I wanted to get some better performance first. When I ran cProfile it appeared that the majority of the time (72 of the 75 seconds) was spent fetching data. Commented May 26, 2016 at 14:00

Currently your code loads all results into a list in memory and then writes them to the csv file, rather than writing them to the csv file as it receives them. You might be able to speed up your code by changing this

records = cur.fetch()  ## <--- loads all results at once
# print records
for rows in zip(records):
    mywriter.writerows(rows)

to this:

from itertools import izip

for rows in izip(cur):  ## <--- loads results in chunks as needed
    mywriter.writerows(rows)

izip returns an iterator rather than a list, so it generates values as needed rather than generating all of them upfront and storing them in memory in a list. (The reason zip and izip are needed at all is to convert each item in the iterator into a single-element tuple).
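As a rough illustration of the streaming half of this (using a plain generator in place of the real cursor; the names here are invented for the example): each row is handed to the writer as it arrives, so no full result list is ever built.

```python
def stream_rows(row_iter, writer):
    # Consume rows one at a time and write each immediately;
    # nothing is accumulated in a list along the way.
    count = 0
    for row in row_iter:
        writer.writerow(row)
        count += 1
    return count
```

With the real cursor you would pass the (wrapped) cursor as row_iter and mywriter as writer.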

answered May 26, 2016 at 12:05
  • So would this be similar to fetchmany()? Commented May 26, 2016 at 14:00
  • From having a quick look at the source code (github.com/BradRuderman/pyhs2/blob/master/pyhs2/cursor.py#L143), yes, I believe so. The difference is that fetchMany() only returns a limited number of results at once, so you would have to call it repeatedly to get all of the results, whereas if you just iterate over the cursor these repeated calls still occur but are "hidden" from you. Commented May 26, 2016 at 14:13
  • Ah, thanks. izip improved memory use significantly, and the speed very marginally :) Commented May 26, 2016 at 14:41
