I have a Twitter data set. I have extracted all the expanded URLs from the JSON and am now trying to resolve the shortened ones. I also need to check which URLs are still working and keep only those.
I am resolving over 5 million URLs. The problem is that the code below is slow. Can anyone suggest how to make it faster? Is there a better way to do this?
import csv
import pandas as pd
from urllib2 import urlopen
import urllib2
import threading
import time


def urlResolution(url, tweetId, w):
    try:
        print "Entered Function"
        print "Original Url:", url
        # header has been added since some sites give an error otherwise
        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive'}
        req = urllib2.Request(url, headers=hdr)
        temp = urlopen(req)
        newUrl = temp.geturl()
        print "Resolved Url:", newUrl
        if newUrl != 'None':
            print "in if condition"
            w.writerow([tweetId, newUrl])
    except Exception as e:
        print "Throwing exception"
        print str(e)
        return None


def urlResolver(urlFile):
    df = pd.read_csv(urlFile, delimiter="\t")
    df2 = df[["Tweet ID", "Url"]].copy()
    start = time.time()
    df3 = df2[df2.Url != "None"]
    outFile = open("OUTPUT_FILE.tsv", "w")
    w = csv.writer(outFile, delimiter='\t')
    w.writerow(["Tweet ID", "Url"])
    maxC = 0
    while maxC < df3.shape[0]:
        # creates threads
        # only 40 threads are created at a time, since for a large number of
        # threads it gives a <too many open files> error
        end = min(maxC + 40, df3.shape[0])
        threads = [threading.Thread(target=urlResolution,
                                    args=(df3.iloc[n]['Url'], df3.iloc[n]['Tweet ID'], w))
                   for n in range(maxC, end)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        maxC = end
    print "threads complete"
    print "Elapsed Time: %s" % (time.time() - start)
    outFile.close()


if __name__ == '__main__':
    urlResolver("INPUT_FILE.tsv")
1 Answer
A couple of things I'd try:

- switch to the requests module, reusing the same requests.Session() to let it reuse the same TCP connection: "if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase"
- use the "HEAD" HTTP method (in case of requests you may need allow_redirects=True); a minimal sketch of these first two points follows this list
- try out the Scrapy web-scraping framework, which is of an asynchronous nature and is based on the twisted network library. You would also move the CSV output part to an output pipeline.
- another thing to try is the grequests library (requests on gevent); see the sketch after the micro-optimization notes
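To make the first two points concrete, here is a minimal sketch assuming the requests package is available; the resolve_url name, the 10-second timeout, and the status check are illustrative choices rather than anything from the original post:

    import requests

    session = requests.Session()  # reuses the underlying TCP connection for requests to the same host

    def resolve_url(url):
        try:
            # HEAD asks for headers only; allow_redirects=True makes requests follow
            # the whole redirect chain, so response.url holds the final resolved URL
            response = session.head(url, allow_redirects=True, timeout=10)
            if response.ok:
                return response.url
        except requests.RequestException:
            pass
        return None

The existing 40-threads-per-batch loop could call resolve_url() in place of urlResolution(). Note that sharing one Session across many threads is common practice but not officially documented as thread-safe, so one Session per worker is the cautious option.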
Some micro-optimization ideas:

- move the hdr dictionary definition to the module level to avoid redefining it every time urlResolution() is called (and, since it is a constant, use upper case and pick a more readable variable name, HEADERS perhaps?)
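Roughly what the grequests suggestion could look like together with the module-level header constant; the resolve_batch helper, the batch size of 40, and the timeout are assumptions on my part, so check them against the grequests documentation:

    import grequests  # requests running on gevent greenlets

    HEADERS = {
        # same header dict as in the question (abridged here), moved to module level
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive',
    }

    def resolve_batch(urls, size=40):
        # build the requests lazily, then let grequests.map run them concurrently
        # on a gevent pool of `size` greenlets
        pending = (grequests.head(u, headers=HEADERS, allow_redirects=True, timeout=10)
                   for u in urls)
        responses = grequests.map(pending, size=size)
        # failed requests come back as None in the result list
        return [r.url if r is not None else None for r in responses]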
For plain http connections, you could just connect to port 80 and read the Location header value in the 301 response, without actually loading the whole page (which is what is actually slow). But I doubt it's supported, and even if it is, it may mean a double redirection (http -> https -> real URL). You probably don't want to implement an https call from scratch, so I'd say what you've done is the best you can get.
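For completeness, reading just the Location header of the 301 over a plain HTTP connection would look roughly like this with Python 2's httplib; the first_redirect name and the timeout are my own choices, and as the comment says, this only covers the first hop of plain http URLs:

    import httplib
    from urlparse import urlparse

    def first_redirect(url, timeout=5):
        # connect to port 80 and ask for the headers only, without downloading the body
        parts = urlparse(url)
        conn = httplib.HTTPConnection(parts.netloc, timeout=timeout)
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        location = response.getheader("Location")  # set on 301/302 redirect responses
        conn.close()
        return location  # may itself be an https URL that needs another hop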