
I am trying to process a relatively large (about 100k lines) CSV file in Python. This is what my code looks like:

#!/usr/bin/env python
import sys
reload(sys)
sys.setdefaultencoding("utf8")
import csv
import os
csvFileName = sys.argv[1]
with open(csvFileName, 'r') as inputFile:
    parsedFile = csv.DictReader(inputFile, delimiter=',')
    totalCount = 0
    for row in parsedFile:
        target = row['new']
        source = row['old']
        systemLine = "some_curl_command {source}, {target}".format(source = source, target = target)
        os.system(systemLine)
        totalCount += 1
        print "\nProcessed number: " + str(totalCount)

I'm not sure how to optimize this script. Should I use something besides DictReader?

I have to use Python 2.7, and cannot upgrade to Python 3.

asked Jul 25, 2017 at 19:08
  • The problem does not lie in how you're reading the CSV, but rather that you're shelling out to curl for each row of the file. Instead: 1. use native Python code to retrieve the URL and 2. use multithreading to make multiple requests at once. Commented Jul 25, 2017 at 19:10
  • Is there anything else I can do? I'm new to Python, and I don't want to start messing about with multithreading. Commented Jul 25, 2017 at 19:31
  • No. 99% of the script's runtime is spent waiting on the Web request, because you wait for each one to complete before starting the next. To avoid this, you must run more than one at a time; see the sketch below. Commented Jul 26, 2017 at 0:23
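A minimal sketch of the multithreaded approach the comments describe, assuming some_curl_command simply downloads the URL in the old column to the file named in the new column; the names fetch and POOL_SIZE are illustrative, and the requests package must be installed:

    # Thread-pooled downloads on Python 2.7 (sketch, under the assumptions above)
    import csv
    import sys
    from multiprocessing.dummy import Pool  # thread-based pool, available on Python 2.7

    import requests

    POOL_SIZE = 10  # number of simultaneous downloads; tune to taste


    def fetch(row):
        # download row['old'] and save it under the name in row['new']
        response = requests.get(row['old'])
        with open(row['new'], 'wb') as out:
            out.write(response.content)


    with open(sys.argv[1], 'r') as inputFile:
        rows = list(csv.DictReader(inputFile, delimiter=','))

    pool = Pool(POOL_SIZE)
    pool.map(fetch, rows)  # runs up to POOL_SIZE requests at once
    pool.close()
    pool.join()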

2 Answers

  1. If you want to avoid multiprocessing, you can split your long CSV file into a few smaller CSVs and run them simultaneously. For example:

    $ python your_script.py 1.csv &
    $ python your_script.py 2.csv & 
    

The ampersand runs a command in the background on Linux. More details here. I don't know of an exact equivalent on Windows, but it's possible to open a few cmd windows. A sketch of one way to produce the smaller files follows below.

Anyway, it's much better to stick with multiprocessing, of course.
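A rough sketch of the splitting step, in case it helps; the chunk size and the 1.csv / 2.csv naming are arbitrary, and big.csv stands for the original file. Each piece keeps the header row so DictReader still works on it:

    import csv

    CHUNK_SIZE = 10000  # rows per output file


    def write_chunk(index, header, rows):
        # 'wb' because the csv module on Python 2 expects binary mode for writing
        with open('%d.csv' % index, 'wb') as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)


    with open('big.csv', 'rb') as inputFile:
        reader = csv.reader(inputFile)
        header = next(reader)
        chunk, index = [], 1
        for row in reader:
            chunk.append(row)
            if len(chunk) == CHUNK_SIZE:
                write_chunk(index, header, chunk)
                chunk, index = [], index + 1
        if chunk:
            write_chunk(index, header, chunk)  # whatever is left over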

  1. What about using requests instead of curl?

    import requests
    response = requests.get(source_url)
    html = response.content
    with open(target, "w") as file:
        file.write(html)
    

Here's the doc.
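Dropped into the question's loop, that would replace the os.system() call roughly like this (a sketch, assuming the curl command was a plain download):

    for row in parsedFile:
        response = requests.get(row['old'])   # row['old'] is the source URL
        with open(row['new'], 'wb') as out:   # row['new'] is the target file
            out.write(response.content)
        totalCount += 1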

  1. Avoid print statements; over a long run they're slow as hell. They're fine for development and debugging, but for the final run you can remove them and check the count of processed files directly in the target folder (or print only occasionally, as sketched below).
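A small compromise if some progress output is still wanted; the interval of 1000 is arbitrary:

    if totalCount % 1000 == 0:                 # report progress only every 1000 rows
        print "Processed %d rows" % totalCount  # Python 2 print statement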
answered Jul 25, 2017 at 20:38


Running

subprocess.Popen(systemLine)

instead of

os.system(systemLine)

should speed things up, because Popen does not wait for the command to finish before returning. Please note that systemLine has to be a list of strings, e.g. ['some_curl_command', 'source', 'target'], in order to work. If you want to limit the number of concurrent commands have a look at that; a rough sketch of one approach is below.
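A rough sketch of one way to cap the number of simultaneous commands with nothing but the standard library; MAX_CONCURRENT is an arbitrary choice, and some_curl_command is the placeholder from the question:

    import csv
    import subprocess
    import sys

    MAX_CONCURRENT = 10

    with open(sys.argv[1], 'r') as inputFile:
        rows = list(csv.DictReader(inputFile, delimiter=','))

    running = []
    for row in rows:
        cmd = ['some_curl_command', row['old'], row['new']]
        running.append(subprocess.Popen(cmd))
        if len(running) >= MAX_CONCURRENT:
            for proc in running:   # wait for the whole batch to finish
                proc.wait()
            running = []

    for proc in running:           # wait for any leftover processes
        proc.wait()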

answered Jul 25, 2017 at 20:58

