I have a Twitter data set. I have extracted all the expanded URLs from the JSON and am now trying to resolve the shortened ones. I also need to check which URLs are still working and keep only those.
I am resolving over 5 million URLs. The problem is that the code below is slow. Can anyone suggest how to make it faster? Is there a better way to do this?
import csv
import pandas as pd
from urllib2 import urlopen
import urllib2
import threading
import time


def urlResolution(url, tweetId, w):
    try:
        print "Entered Function"
        print "Original Url:", url
        # header has been added since some sites give an error otherwise
        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive'}
        req = urllib2.Request(url, headers=hdr)
        temp = urlopen(req)
        newUrl = temp.geturl()
        print "Resolved Url:", newUrl
        if newUrl != 'None':
            print "in if condition"
            w.writerow([tweetId, newUrl])
    except Exception as e:
        print "Throwing exception"
        print str(e)
        return None


def urlResolver(urlFile):
    df = pd.read_csv(urlFile, delimiter="\t")
    df2 = df[["Tweet ID", "Url"]].copy()
    start = time.time()
    df3 = df2[df2.Url != "None"]
    outFile = open("OUTPUT_FILE.tsv", "w")
    w = csv.writer(outFile, delimiter='\t')
    w.writerow(["Tweet ID", "Url"])
    maxC = 0
    while maxC < df3.shape[0]:
        # creates threads
        # only 40 threads are created at a time, since for a large number of
        # threads it gives a <too many open files> error
        end = min(maxC + 40, df3.shape[0])
        threads = [threading.Thread(target=urlResolution,
                                    args=(df3.iloc[n]['Url'], df3.iloc[n]['Tweet ID'], w))
                   for n in range(maxC, end)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        maxC = end
    print "threads complete"
    print "Elapsed Time: %s" % (time.time() - start)
    outFile.close()


if __name__ == '__main__':
    urlResolver("INPUT_FILE.tsv")
1 Answer
A couple of things I'd try:

- switch to the requests module, reusing the same requests.Session() to let it reuse the same TCP connection: "if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase"
- use the "HEAD" HTTP method (in case of requests you may need allow_redirects=True); a minimal sketch of these first two points follows this list
- try out the Scrapy web-scraping framework, which is of an asynchronous nature and is based on the twisted network library. You would also move the CSV output part to an output pipeline.
- another thing to try is the grequests library (requests on gevent); see the sketch after the micro-optimization notes
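To make the first two points concrete, here is a minimal sketch assuming the requests package is available; the resolve_url name, the 10-second timeout, and the status check are illustrative choices rather than anything from the original post:

    import requests

    session = requests.Session()  # reuses the underlying TCP connection for requests to the same host

    def resolve_url(url):
        try:
            # HEAD asks for headers only; allow_redirects=True makes requests follow
            # the whole redirect chain, so response.url holds the final resolved URL
            response = session.head(url, allow_redirects=True, timeout=10)
            if response.ok:
                return response.url
        except requests.RequestException:
            pass
        return None

The existing 40-threads-per-batch loop could call resolve_url() in place of urlResolution(). Note that sharing one Session across many threads is common practice but not officially documented as thread-safe, so one Session per worker is the cautious option.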
Some micro-optimization ideas:

- move the hdr dictionary definition to the module level to avoid redefining it every time urlResolution() is called (and, since it is a constant, use upper case and pick a more readable variable name, HEADERS perhaps?)
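Roughly what the grequests suggestion could look like together with the module-level header constant; the resolve_batch helper, the batch size of 40, and the timeout are assumptions on my part, so check them against the grequests documentation:

    import grequests  # requests running on gevent greenlets

    HEADERS = {
        # same header dict as in the question (abridged here), moved to module level
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive',
    }

    def resolve_batch(urls, size=40):
        # build the requests lazily, then let grequests.map run them concurrently
        # on a gevent pool of `size` greenlets
        pending = (grequests.head(u, headers=HEADERS, allow_redirects=True, timeout=10)
                   for u in urls)
        responses = grequests.map(pending, size=size)
        # failed requests come back as None in the result list
        return [r.url if r is not None else None for r in responses]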
For plain http connections, you could just connect to port 80 and read the Location header value in the 301 response, without actually loading the whole page (which is what is actually slow). But I doubt it's supported, and even if it is, it may mean a double redirection (http -> https -> real URL). You probably don't want to implement an https call from scratch, so I'd say what you've done is the best you can get.
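For completeness, reading just the Location header of the 301 over a plain HTTP connection would look roughly like this with Python 2's httplib; the first_redirect name and the timeout are my own choices, and as the comment says, this only covers the first hop of plain http URLs:

    import httplib
    from urlparse import urlparse

    def first_redirect(url, timeout=5):
        # connect to port 80 and ask for the headers only, without downloading the body
        parts = urlparse(url)
        conn = httplib.HTTPConnection(parts.netloc, timeout=timeout)
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        location = response.getheader("Location")  # set on 301/302 redirect responses
        conn.close()
        return location  # may itself be an https URL that needs another hop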