
I am trying to process a relatively large (about 100k lines) CSV file in Python. This is what my code looks like:

#!/usr/bin/env python
import sys
reload(sys)
sys.setdefaultencoding("utf8")
import csv
import os
csvFileName = sys.argv[1]
with open(csvFileName, 'r') as inputFile:
    parsedFile = csv.DictReader(inputFile, delimiter=',')
    totalCount = 0
    for row in parsedFile:
        target = row['new']
        source = row['old']
        systemLine = "some_curl_command {source}, {target}".format(source = source, target = target)
        os.system(systemLine)
        totalCount += 1
        print "\nProcessed number: " + str(totalCount)

I'm not sure how to optimize this script. Should I use something besides DictReader?

I have to use Python 2.7, and cannot upgrade to Python 3.

asked Jul 25, 2017 at 19:08
  • The problem does not lie in how you're reading the CSV, but rather that you're shelling out to curl for each row of the file. Instead: 1. use native Python code to retrieve the URL and 2. use multithreading to make multiple requests at once. Commented Jul 25, 2017 at 19:10
  • Is there anything else I can do? I'm new to Python, and I don't want to start messing about with multithreading. Commented Jul 25, 2017 at 19:31
  • No. 99% of the script's runtime is spent waiting on the Web request, because you wait for each one to complete before starting the next. To avoid this, you must run more than one at a time; see the sketch below. Commented Jul 26, 2017 at 0:23
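A minimal sketch of the multithreaded approach the comments describe, assuming some_curl_command simply downloads the URL in the old column to the file named in the new column; the names fetch and POOL_SIZE are illustrative, and the requests package must be installed:

    # Thread-pooled downloads on Python 2.7 (sketch, under the assumptions above)
    import csv
    import sys
    from multiprocessing.dummy import Pool  # thread-based pool, available on Python 2.7

    import requests

    POOL_SIZE = 10  # number of simultaneous downloads; tune to taste


    def fetch(row):
        # download row['old'] and save it under the name in row['new']
        response = requests.get(row['old'])
        with open(row['new'], 'wb') as out:
            out.write(response.content)


    with open(sys.argv[1], 'r') as inputFile:
        rows = list(csv.DictReader(inputFile, delimiter=','))

    pool = Pool(POOL_SIZE)
    pool.map(fetch, rows)  # runs up to POOL_SIZE requests at once
    pool.close()
    pool.join()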

2 Answers

  1. If you want to avoid multiprocessing, you can split your long CSV file into a few smaller CSVs and run them simultaneously. For example:

    $ python your_script.py 1.csv &
    $ python your_script.py 2.csv & 
    

The ampersand runs a command in the background on Linux. More details here. I don't know of an exact equivalent on Windows, but it's possible to open a few cmd windows. A sketch of one way to produce the smaller files follows below.

Anyway, it's much better to stick with multiprocessing, of course.
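A rough sketch of the splitting step, in case it helps; the chunk size and the 1.csv / 2.csv naming are arbitrary, and big.csv stands for the original file. Each piece keeps the header row so DictReader still works on it:

    import csv

    CHUNK_SIZE = 10000  # rows per output file


    def write_chunk(index, header, rows):
        # 'wb' because the csv module on Python 2 expects binary mode for writing
        with open('%d.csv' % index, 'wb') as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)


    with open('big.csv', 'rb') as inputFile:
        reader = csv.reader(inputFile)
        header = next(reader)
        chunk, index = [], 1
        for row in reader:
            chunk.append(row)
            if len(chunk) == CHUNK_SIZE:
                write_chunk(index, header, chunk)
                chunk, index = [], index + 1
        if chunk:
            write_chunk(index, header, chunk)  # whatever is left over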

  1. What about using requests instead of curl?

    import requests
    response = requests.get(source_url)
    html = response.content
    with open(target, "w") as file:
        file.write(html)
    

Here's the doc.
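Dropped into the question's loop, that would replace the os.system() call roughly like this (a sketch, assuming the curl command was a plain download):

    for row in parsedFile:
        response = requests.get(row['old'])   # row['old'] is the source URL
        with open(row['new'], 'wb') as out:   # row['new'] is the target file
            out.write(response.content)
        totalCount += 1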

  1. Avoid print statements; over a long run they're slow as hell. They're fine for development and debugging, but for the final run you can remove them and check the count of processed files directly in the target folder (or print only occasionally, as sketched below).
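A small compromise if some progress output is still wanted; the interval of 1000 is arbitrary:

    if totalCount % 1000 == 0:                 # report progress only every 1000 rows
        print "Processed %d rows" % totalCount  # Python 2 print statement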
answered Jul 25, 2017 at 20:38


Running

subprocess.Popen(systemLine)

instead of

os.system(systemLine)

should speed things up, because Popen does not wait for the command to finish before returning. Please note that systemLine has to be a list of strings, e.g. ['some_curl_command', 'source', 'target'], in order to work. If you want to limit the number of concurrent commands have a look at that; a rough sketch of one approach is below.
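A rough sketch of one way to cap the number of simultaneous commands with nothing but the standard library; MAX_CONCURRENT is an arbitrary choice, and some_curl_command is the placeholder from the question:

    import csv
    import subprocess
    import sys

    MAX_CONCURRENT = 10

    with open(sys.argv[1], 'r') as inputFile:
        rows = list(csv.DictReader(inputFile, delimiter=','))

    running = []
    for row in rows:
        cmd = ['some_curl_command', row['old'], row['new']]
        running.append(subprocess.Popen(cmd))
        if len(running) >= MAX_CONCURRENT:
            for proc in running:   # wait for the whole batch to finish
                proc.wait()
            running = []

    for proc in running:           # wait for any leftover processes
        proc.wait()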

answered Jul 25, 2017 at 20:58

