
The whole process mainly depends on the ProcessFile() function:

I get all the credentials and details from a JSON file.

  1. Initially I get data from the db: two IDs and a URL.

  2. There is a text file of address states; I read it and join its lines with | to build the matching regex.

  3. The same is done for the ZIP patterns and ZIP keywords.

  4. Iterate over the rows.

  5. Get the content of the URL and strip out the unwanted markup.

  6. Match the address regex and store the result to a variable.

  7. Match the ZIP regex, check whether the match appears in the ZIP keyword list, and store the result in a variable; append the combined result to the outputs list. Write to disk only when the list holds 1000 or more entries.

  8. After completing all the URLs, check whether outputs still contains data. If so, write it to the file; otherwise leave it.

  9. Move the result to another FTP location.
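To make steps 2-3 and 7 concrete, here is a minimal Python 3 sketch of the same two ideas: building an alternation regex from a keyword list and flushing buffered rows in batches. The keyword list, file name, batch size, and the `flush` helper are all placeholders for illustration, not the script's real inputs:

```python
import re

# Steps 2-3: build one alternation pattern from a keyword list,
# escaping each term so regex metacharacters are matched literally.
states = ["New York", "Texas", "Ohio"]  # stand-in for the ADDRESS file's lines
pattern = re.compile("|".join(re.escape(s) for s in states), re.IGNORECASE)

# Step 7: buffer results and flush them to disk in batches.
def flush(buffer, path):
    """Append buffered rows to the output file and clear the buffer."""
    if buffer:
        with open(path, "a") as out:
            out.write("\n".join(buffer) + "\n")
        del buffer[:]

outputs = []
pages = ["Visit our Texas office", "no match here", "ohio branch"]
for text in pages:
    m = pattern.search(text)
    outputs.append(m.group(0) if m else "None")
    if len(outputs) >= 2:  # placeholder batch size (the post says 1000)
        flush(outputs, "results.txt")
flush(outputs, "results.txt")  # write any final partial batch
```

Batching like this trades a little memory for far fewer file opens, which is the point of the size check in step 7.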

#Importing modules
import requests
import re
import sys
import pyodbc
import json
import datetime
import random
import time
import ftplib
import traceback
#loading regex files
start_time_split=datetime.datetime.now()
json_data=open('server_details.json')
server_details = json.load(json_data)
json_data.close() 
'''SQL SERVER CREDENTIALS AND DB NAME'''
SERVER=server_details["SERVER"]
DATABASE=server_details["DATABASE"]
UID=server_details["UID"]
PWD=server_details["PWD"]
ADDRESS=server_details["ADDRESS"]
ZIP=server_details["ZIP"]
ZIP_WORD=server_details["ZIP_WORD"]
FTP_IP=server_details["FTP_IP"]
FTP_USER=server_details["FTP_USER"]
FTP_PWD=server_details["FTP_PWD"]
FTP_LOCATION=server_details["FTP_LOCATION"]
def ProcessFile(fileReadStr):
    try:
        header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0'}
        with open(ADDRESS,"r") as l:
            address="|".join([a.strip() for a in l])
            #address=address
        with open(ZIP,"r") as ll:
            regex="|".join([a.strip() for a in ll])
            #regex=regex
        with open(ZIP_WORD) as inp:
            zip_word=[a.strip() for a in inp.readlines()]
        print "Started "+str(start_time_split)
        filename=str(random.randint(1,100000))+"_"+str(int(time.time()))+"_us_link_classifier.txt"
        with open(filename,"w") as output:
            output.write("ID\tClientID\tURL\tAddress\tword\tzip\tzipmatch\n")
        outputs=[]
        for line in fileReadStr:
            id1=line[0]
            web=line[1] if line[1].strip().startswith("http") else "http://"+line[1].strip()
            id2=line[2]
            ad="0"
            word="None"
            try:
                objReq = requests.get(web,timeout=30,verify=False,allow_redirects=True,headers=header)
                htmlStr = objReq.content
                htmlStr=re.sub("\s+"," ",htmlStr)
                htmlStr=re.sub("<!--.*?-->", " ", htmlStr)
                htmlStr=re.sub("<head.*?>.*?</head>"," ",htmlStr)
                htmlStr=re.sub("<.*?>", " ", htmlStr)
                htmlStr=re.sub("\s+", " ", htmlStr)
                address_match=re.search(address,htmlStr,re.IGNORECASE)
                if address_match :
                    ad="1"
                    word=address_match.group(0).strip()
                address_zip=re.findall(regex,htmlStr)
                zip_match =[el for sub in address_zip for el in sub if el and el.split()[0] in zip_word]
                if zip_match:
                    zp="1"
                    zip_match=zip_match[0]
                else:
                    zip_match="None"
                outputs.append(str(id1)+"\t"+str(id2)+"\t"+web+"\t"+ad+"\t"+word+"\t"+zp+"\t"+zip_match)
                if len(outputs)>=150:
                    with open(filename,"a") as output:
                        output.write("\n".join(outputs)+"\n")
                    del outputs[:]
            except Exception,e:
                print e
                print traceback.print_exc()
                with open("us_error.txt","a") as ded:
                    ded.write(str(id1)+"\t"+web+"\t"+str(e)+"\n")
        if outputs:
            with open(filename,"a") as output:
                output.write("\n".join(outputs)+"\n")
            del outputs[:]
        infile = open(filename,'r')
        ftp = ftplib.FTP(FTP_IP,FTP_USER ,FTP_PWD )
        ftp.storbinary('STOR '+FTP_LOCATION+str(filename), infile)
        infile.close()
        ftp.quit()
    except Exception,e:
        print e
        print traceback.print_exc()
        with open("us_error_process_error.txt","a") as ded:
            ded.write(str(e))

if __name__ == '__main__':
    try:
        access_string='DRIVER={SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+UID+';PWD='+PWD
        cnxn = pyodbc.connect(access_string,autocommit=True)
        cursor = cnxn.cursor()
        print sys.argv[1]+'\n'+sys.argv[2]
        exec1=server_details["QUERY"]
        cursor.execute(exec1,(sys.argv[1],sys.argv[2]))
        rows = cursor.fetchall()
        cnxn.close()
        ProcessFile(rows)
        end_time_split=datetime.datetime.now()
        elasped_time1=end_time_split-start_time_split
        print "Total time:"+str(elasped_time1)
        print "done"
    except Exception,e:
        print e
        print traceback.print_exc()
        with open("us_error_starting_process_error.txt","a") as ded:
            ded.write(str(e))
asked Aug 20, 2015 at 7:20

2 Answers


I'm going to focus on stylistic and formatting points; you should definitely read through PEP 8.

You can remove the comment at the start; it's pretty clear that you're importing modules. Comments should add information that wouldn't otherwise be clear. It's also recommended that you group imports rather than have one long list, for slightly easier readability. Since you're importing a lot and most modules only see one or two uses, you could use from _ import _ to cut down on some importing. I wouldn't recommend from json import load since load is a generic name, but from ftplib import FTP and from random import randint are unlikely to clash.

import datetime
import json
import re
import sys
import time
import traceback
import requests
import pyodbc
from ftplib import FTP
from random import randint

Even for one line, you should use with, especially because json.load can easily raise errors on invalid syntax. with is almost always better; I personally haven't found a reason not to use it.

#loading regex files
start_time_split=datetime.datetime.now()
with open('server_details.json') as json_data:
 server_details = json.load(json_data)

Why is '''SQL SERVER CREDENTIALS AND DB NAME''' not a comment? It functions essentially as a one-line comment but is formatted differently.

You should separate distinct blocks with whitespace; it's more readable. Like this (using your original code):

import traceback

#loading regex files
start_time_split=datetime.datetime.now()
json_data=open('server_details.json')
server_details = json.load(json_data)
json_data.close()

'''SQL SERVER CREDENTIALS AND DB NAME'''
SERVER=server_details["SERVER"]
DATABASE=server_details["DATABASE"]
UID=server_details["UID"]
PWD=server_details["PWD"]
ADDRESS=server_details["ADDRESS"]
ZIP=server_details["ZIP"]
ZIP_WORD=server_details["ZIP_WORD"]
FTP_IP=server_details["FTP_IP"]
FTP_USER=server_details["FTP_USER"]
FTP_PWD=server_details["FTP_PWD"]
FTP_LOCATION=server_details["FTP_LOCATION"]

def ProcessFile(fileReadStr):

You have double spaces in a few places, which are unnecessary. Also, open defaults to read-only mode, so 'r' isn't needed.

with open(ADDRESS,"r") as l:
with open(ZIP,"r") as ll:

Also, you can reuse the same name in this case, though I'd personally recommend using f for a briefly opened file. At the very least avoid a lone l, which can be mistaken for the number one or a capital I.

Also these comments are either removed pieces of old code or redundant notes. The variable names are clear enough that they don't need to be noted.

#address=address 
#regex=regex

It's good practice to use parentheses with print to make code more Python 3 compatible, and you could use str.format instead of str() and concatenation. Again, I'd put an empty line after this one to separate this block of code from the next.

print ("Started {}".format(start_time_split))

Setting the filename is a long line; try breaking it up over multiple lines for readability. I'd recommend format again here. I'd also recommend the {0:06d} syntax, which pads your random int with zeroes so that filenames have an equal number of characters at the start and display together more neatly. Don't forget that if you did change to from random import randint, you just use randint() here.

filename= "{:06d}_{}_us_link_classifier.txt".format(
 randint(1,100000), int(time.time()))

Your output.write would be neater using a join, like this:

output.write("\t".join(["ID", "ClientID", "URL", "Address", "word", "zip", "zipmatch"]))
output.write("\n")

That allows you to add new values easier and makes the list of headers more readable.

When setting your web variable, you forget to strip line[1] when it does start with 'http' (the stripped value used in the ternary's condition isn't kept in any way). It'd be easier to strip line[1] first and then use a shorter ternary expression.

line[1] = line[1].strip()
web = line[1] if line[1].startswith("http") else "http://"+line[1]

You should also add some whitespace to these variable declarations:

id1 = line[0]
web = line[1] if line[1].strip().startswith("http") else "http://"+line[1].strip()
id2 = line[2] 
ad = "0"
word = "None" 

And add some whitespace in your parameter lists:

objReq = requests.get(web, timeout=30, verify=False, allow_redirects=True, headers=header)
htmlStr = objReq.content
htmlStr=re.sub("\s+", " ", htmlStr)

I'm not sure if this is in your code or just the transcript here, but you forgot to indent after with here:

if outputs:
    with open(filename,"a") as output:
    output.write("\n".join(outputs)+"\n")
    del outputs[:]

Here's the overall formatting changes I'd make to your script:

import datetime
import json
import re
import sys
import time
import traceback

import requests
import pyodbc
from ftplib import FTP
from random import randint

# Loading regex files
start_time_split = datetime.datetime.now()
with open('server_details.json') as json_data:
    server_details = json.load(json_data)

# SQL SERVER CREDENTIALS AND DB NAME
SERVER = server_details["SERVER"]
DATABASE = server_details["DATABASE"]
UID = server_details["UID"]
PWD = server_details["PWD"]
ADDRESS = server_details["ADDRESS"]
ZIP = server_details["ZIP"]
ZIP_WORD = server_details["ZIP_WORD"]
FTP_IP = server_details["FTP_IP"]
FTP_USER = server_details["FTP_USER"]
FTP_PWD = server_details["FTP_PWD"]
FTP_LOCATION = server_details["FTP_LOCATION"]


def ProcessFile(fileReadStr):
    try:
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0'}
        with open(ADDRESS) as f:
            address = "|".join([a.strip() for a in f])
        with open(ZIP) as f:
            regex = "|".join([a.strip() for a in f])
        with open(ZIP_WORD) as f:
            zip_word = [a.strip() for a in f.readlines()]

        print("Started {}".format(start_time_split))
        filename = "{:06d}_{}_us_link_classifier.txt".format(
            randint(1, 100000), int(time.time()))
        with open(filename, "w") as output:
            output.write("ID\tClientID\tURL\tAddress\tword\tzip\tzipmatch\n")

        outputs = []
        for line in fileReadStr:
            id1 = line[0]
            line[1] = line[1].strip()
            web = line[1] if line[1].startswith("http") else "http://" + line[1]
            id2 = line[2]
            ad = "0"
            word = "None"
            try:
                objReq = requests.get(web, timeout=30, verify=False,
                                      allow_redirects=True, headers=header)
                htmlStr = objReq.content
                htmlStr = re.sub("\s+", " ", htmlStr)
                htmlStr = re.sub("<!--.*?-->", " ", htmlStr)
                htmlStr = re.sub("<head.*?>.*?</head>", " ", htmlStr)
                htmlStr = re.sub("<.*?>", " ", htmlStr)
                htmlStr = re.sub("\s+", " ", htmlStr)
                address_match = re.search(address, htmlStr, re.IGNORECASE)
                if address_match:
                    ad = "1"
                    word = address_match.group(0).strip()
                address_zip = re.findall(regex, htmlStr)
                zip_match = [el for sub in address_zip for el in sub
                             if el and el.split()[0] in zip_word]
                if zip_match:
                    zp = "1"
                    zip_match = zip_match[0]
                else:
                    zip_match = "None"
                outputs.append("\t".join([str(id1), str(id2), web, ad, word, zp, zip_match]))
                if len(outputs) >= 150:
                    with open(filename, "a") as output:
                        output.write("\n".join(outputs) + "\n")
                    del outputs[:]
            except Exception, e:
                print e
                print traceback.print_exc()
                with open("us_error.txt", "a") as ded:
                    ded.write("\t".join([str(id1), web, str(e)]) + "\n")

        if outputs:
            with open(filename, "a") as output:
                output.write("\n".join(outputs) + "\n")
            del outputs[:]

        with open(filename) as infile:
            ftp = FTP(FTP_IP, FTP_USER, FTP_PWD)
            ftp.storbinary('STOR ' + FTP_LOCATION + str(filename), infile)
            ftp.quit()
    except Exception, e:
        print e
        print traceback.print_exc()
        with open("us_error_process_error.txt", "a") as ded:
            ded.write(str(e))


if __name__ == '__main__':
    try:
        access_string = 'DRIVER={SQL Server};SERVER=' + SERVER + ';DATABASE=' + DATABASE + ';UID=' + UID + ';PWD=' + PWD
        cnxn = pyodbc.connect(access_string, autocommit=True)
        cursor = cnxn.cursor()
        print(sys.argv[1] + '\n' + sys.argv[2])
        exec1 = server_details["QUERY"]
        cursor.execute(exec1, (sys.argv[1], sys.argv[2]))
        rows = cursor.fetchall()
        cnxn.close()
        ProcessFile(rows)
        end_time_split = datetime.datetime.now()
        elasped_time1 = end_time_split - start_time_split
        print("Total time: {}".format(elasped_time1))
        print("Done")
    except Exception, e:
        print e
        print traceback.print_exc()
        with open("us_error_starting_process_error.txt", "a") as ded:
            ded.write(str(e))
answered Aug 20, 2015 at 9:33
  • Thanks for your time and your points, but will this affect the running time of the script? Commented Aug 20, 2015 at 9:42
  • @VigneshKalai It won't affect the performance; I'm not experienced enough with regex to help there. But it does make the code easier to read and refactor, so you can spot parts for improvement. Commented Aug 20, 2015 at 10:45

I strongly agree with @SuperBiasedMan's comments about formatting. His reformatting makes your code much more readable. Here are a few more comments:

Production vs development

If you are using this code in production, you probably don't want to print the traceback to the user. Printing the error message should be enough. I'd cut the error handling down to this:

except Exception,e:
    print e
    with open("us_error_starting_process_error.txt","a") as ded:
        ded.write(str(e))

You might also consider a flag variable that controls whether to print, depending on whether the code is running in production or not. For example, something like:

if __name__ == '__main__':
    TO_PRINT = True
else:
    TO_PRINT = False

And then:

if TO_PRINT:
    print ...
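As a runnable sketch of that pattern (Python 3 syntax; the log helper is a placeholder name, and in a larger project the standard logging module would be the more conventional route):

```python
# True when the file is run as a script, False when imported as a module.
TO_PRINT = (__name__ == "__main__")

def log(message):
    """Print progress messages only when the flag is on."""
    if TO_PRINT:
        print(message)

log("Processing started")  # silent on import, printed when run directly
```

The flag is computed once at import time, so callers never need to pass a verbosity argument around.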

Formatting

It's good that you wrote this module so that the script can be run or imported, but you should go one step further and declare a main() function (read more about it here). You can move all your try...except code into that.
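A minimal skeleton of that structure (Python 3 syntax; the body is a stand-in for the existing try/except logic, and main/argv are just conventional placeholder names):

```python
import sys

def main(argv):
    """Entry point: the top-level try/except logic moves in here."""
    if len(argv) < 2:
        return 1  # usage error: the script expects two query arguments
    # ... connect to the database, fetch rows, call ProcessFile(rows) ...
    return 0  # success

if __name__ == '__main__':
    exit_code = main(sys.argv[1:])  # pass to sys.exit(exit_code) in real use
```

With this layout, importing the module runs nothing, and tests can call main() directly with fake arguments.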

Similarly, I spotted at least one typo:

elasped_time1

You want to avoid typos because they make it harder for users to remember how to type your variables, as well as being annoying to replicate.

Use 'with' to control resource use/closing

It's better to structure your Python code so that you don't need to remember to release resources; then forgetting is no longer a problem. I'd rewrite this part of your new main() function like this:

with pyodbc.connect(access_string, autocommit=True) as cnxn:
    cursor = cnxn.cursor()
    print sys.argv[1]+'\n'+sys.argv[2]
    exec1 = server_details["QUERY"]
    cursor.execute(exec1, (sys.argv[1], sys.argv[2]))
    rows = cursor.fetchall()
answered Aug 20, 2015 at 12:48