I have a Twitter data set. I have extracted all the expanded URLs from the JSON and am now trying to resolve the shortened ones. I also need to check which URLs are still reachable and keep only those.

I am processing over 5 million URLs. The problem is that the code below is slow. Can anyone suggest how to make it faster? Is there a better way to do this?

import csv
import threading
import time
import urllib2

import pandas as pd

def urlResolution(url, tweetId, w):
    """Follow redirects for one URL and write the resolved address."""
    try:
        print "Original Url:", url
        # The header is needed because some sites return an error for
        # requests that do not look like they come from a browser.
        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive'}
        req = urllib2.Request(url, headers=hdr)
        resp = urllib2.urlopen(req)
        newUrl = resp.geturl()
        print "Resolved Url:", newUrl
        if newUrl != 'None':
            w.writerow([tweetId, newUrl])
    except Exception as e:
        print str(e)
        return None

def urlResolver(urlFile):
    df = pd.read_csv(urlFile, delimiter="\t")
    df2 = df[["Tweet ID", "Url"]].copy()
    df3 = df2[df2.Url != "None"]
    outFile = open("OUTPUT_FILE.tsv", "w")
    w = csv.writer(outFile, delimiter='\t')
    w.writerow(["Tweet ID", "Url"])
    start = time.time()
    maxC = 0
    while maxC < df3.shape[0]:
        # Only 40 threads run at a time; with many more threads the OS
        # raises a <too many open files> error. (A pool-based
        # alternative is sketched after this listing.)
        end = min(maxC + 40, df3.shape[0])
        threads = [threading.Thread(target=urlResolution,
                                    args=(df3.iloc[n]['Url'],
                                          df3.iloc[n]['Tweet ID'], w))
                   for n in range(maxC, end)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        maxC = end
    print "Elapsed Time: %s" % (time.time() - start)
    outFile.close()  # csv.writer has no close(); close the file itself

if __name__ == '__main__':
    urlResolver("INPUT_FILE.tsv")
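The 40-at-a-time batching above can also be expressed with the standard library's thread-backed pool (multiprocessing.dummy). A minimal sketch, assuming the same urlResolution() and csv writer w as in the code above:

from multiprocessing.dummy import Pool  # thread-backed Pool from the stdlib

def resolve_row(row):
    tweetId, url = row
    urlResolution(url, tweetId, w)  # same function and writer as above

pool = Pool(40)  # caps concurrency at 40, like the batched loop
pool.map(resolve_row, zip(df3['Tweet ID'], df3['Url']))
pool.close()
pool.join()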
asked Mar 1, 2017 at 5:19
  • If Twitter supported plain HTTP connections, you could just connect to port 80 and read the Location header value in the 301 response, without actually loading the whole page (loading the page body is what makes this slow). But I doubt plain HTTP is supported, and even if it is, it may mean a double redirection (http -> https -> real URL). You probably don't want to implement an HTTPS call from scratch, so I'd say what you've done is about the best you can get. Commented Mar 1, 2017 at 10:14
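For illustration, a minimal sketch of the comment's idea using the standard library's httplib: a HEAD request transfers only the response headers, so the Location value can be read without downloading the page body. first_redirect_target is a hypothetical helper name, and it resolves only one hop, so redirect chains would need a loop:

import httplib
from urlparse import urlparse

def first_redirect_target(url, timeout=10):
    # Hypothetical helper: issue a HEAD request and read the Location
    # header instead of downloading the redirect target's body.
    # (Query strings are ignored for brevity.)
    parts = urlparse(url)
    conn_cls = (httplib.HTTPSConnection if parts.scheme == 'https'
                else httplib.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        if resp.status in (301, 302, 303, 307, 308):
            return resp.getheader("Location")  # may itself be shortened
        return url
    finally:
        conn.close()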

1 Answer


A couple of things I'd try.

Some micro-optimization ideas:

  • Move the hdr dictionary definition to the module level, to avoid redefining it every time urlResolution() is called. Since it is a constant, use upper case, and pick a more readable variable name, e.g. HEADERS (see the sketch below).
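A minimal sketch of that change: the constant is built once at import time, and only the request construction inside the function changes.

import urllib2

# Defined once at module level instead of on every call.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
                  '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

def urlResolution(url, tweetId, w):
    req = urllib2.Request(url, headers=HEADERS)
    newUrl = urllib2.urlopen(req).geturl()
    if newUrl != 'None':
        w.writerow([tweetId, newUrl])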
answered Mar 1, 2017 at 14:21
