This is my first program in Python. The following code is taken from different posts (mostly on this site) and combined so that I can automate a routine task.
It's working properly, but I'd like opinions from experts on how to enhance it further.
import mechanize
from bs4 import BeautifulSoup
import numpy as np
import csv
import os
# reading file
np.genfromtxt('EAN.txt', delimiter='\n', dtype=str)
with open('EAN.txt') as f:
    lines = f.read().splitlines()
new_dictionary = {}
count = 0
for line in lines:
    count += 1
    new_dictionary['sequence_{}'.format(count)] = line
# searching items
print "Running..."
for i in new_dictionary:
    myean = new_dictionary[i]
    url = "https://mysite"
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(url)
    br.select_form(id="searchForm")
    br["q"] = myean
    res = br.submit()
    content = res.read()
    with open("mechanize_results.html", "a") as f:
        f.write(content)
    soup = BeautifulSoup(open("mechanize_results.html"), 'html.parser')
    for div in soup.findAll('div', {'class': 'small-12 columns product-title'}):
        a = div.findAll('a')[1]
    # writing file
    if a is None:
        with open('Results.csv', 'ab') as csvfile:
            spamwriter = csv.writer(csvfile, delimiter='|',
                                    quotechar='|', quoting=csv.QUOTE_MINIMAL)
            spamwriter.writerow([myean, "Rejected"]) #, title.string])
    else:
        #print myean, '|', a.text.strip()
        with open('Results.csv', 'ab') as csvfile:
            spamwriter = csv.writer(csvfile, delimiter='|',
                                    quotechar='|', quoting=csv.QUOTE_MINIMAL)
            spamwriter.writerow([myean, "Approved", a.text.strip()]) #, title.string])
    # deleting file
    os.remove("mechanize_results.html")
    a=None
    i =+ 1
print "%d items Searched" %count
new_dictionary.clear()
raw_input("Press any key to continue...")
Although this code is running perfectly, I was wondering why, when I used PyInstaller to create an exe file, the size became around 400+ MB. Is it because I am importing the complete packages of 'mechanize', 'numpy', 'csv' and 'os'?
Don't know about mechanize, but yes, the size is mostly due to numpy. – 301_Moved_Permanently, Aug 29, 2017 at 7:52
1 Answer
There are multiple things to tackle and improve; let's go through them.
Code Style
In general, try to follow the PEP 8 style guide; your code has multiple violations, such as:
- spacing around operators and use of blank lines
- import organization
There are tools like flake8 or pylint which can analyze your code statically and report existing violations; make sure to give them a try.
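For instance, here is a small before/after sketch of the kind of spacing issue these linters flag (compute() is a hypothetical helper, just for illustration):

def compute(a, b):  # hypothetical helper, just for illustration
    return a + b

# flagged by flake8/pylint: missing whitespace around '=' and after ','
total=compute(1,2)

# PEP 8 compliant version of the same line
total = compute(1, 2)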
Variable naming is also a huge issue in the presented code - variables like i or new_dictionary are meaningless. Make sure to give your variables meaningful names to improve readability. Remember: code is read much more often than it is written.
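A hedged before/after sketch with a dummy entry (the replacement names are only suggestions, not taken from your domain):

new_dictionary = {'sequence_1': '4006381333931'}  # dummy EAN for illustration

# hard to follow: 'i' and 'myean' say nothing about the data
for i in new_dictionary:
    myean = new_dictionary[i]

# the same loop renamed: intent is clear at a glance
ean_codes = new_dictionary
for sequence_key, ean_code in ean_codes.items():
    print(sequence_key, ean_code)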
Code Organization
Currently, you are putting everything into a single program without any organization - think about logically splitting it into functions. You should also put your main execution logic under an if __name__ == '__main__': guard to avoid the code being executed on import.
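A minimal sketch of that layout (read_search_queries() is a hypothetical function name):

def read_search_queries(path):
    # one search query per line
    with open(path) as f:
        return [line.strip() for line in f]

def main():
    queries = read_search_queries('EAN.txt')
    print("{} queries loaded".format(len(queries)))

if __name__ == '__main__':
    # runs only when executed as a script, not when imported
    main()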
Making the code Pythonic
- you can use a dictionary comprehension with enumerate() to define your new_dictionary dictionary:

  with open('EAN.txt') as f:
      new_dictionary = {'sequence_{}'.format(line_number): line
                        for line_number, line in enumerate(f)}
  On second thought, you don't actually even need a dictionary here - the keys are unused in the main execution logic of the program. A plain list of search queries should be sufficient (see the sketch after this list).
- you don't need read().splitlines() and can iterate over the file object directly - this avoids building an extra list of lines in memory and yields each line in a "lazy" fashion
- use print() as a function for Python 3.x compatibility
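A minimal sketch of these two points (plus the plain-list remark above), assuming EAN.txt holds one query per line:

# lazy: each line is fetched on demand, no intermediate list in memory
with open('EAN.txt') as f:
    for line in f:
        print(line.strip())  # print() as a function also runs on Python 3.x

# if you need all queries up front, a plain list is enough - no dict keys
with open('EAN.txt') as f:
    search_queries = [line.strip() for line in f]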
Performance and Web-Scraping related Improvements
- First of all, you don't need to save the page source into a file and then read that file back to parse it with BeautifulSoup - you can avoid that step altogether by feeding the mechanize Browser's response into BeautifulSoup directly.
- You should also be able to initialize the Browser once and re-use it for all the subsequent requests.
- Also, you may improve HTML parsing speed by using lxml as the underlying parser for BeautifulSoup - this requires lxml to be installed, though.
- I would also use a .product-title CSS selector to match the product titles.
- I would just collect the results and write them out at the end instead of re-opening the resulting CSV file inside the loop.
- Performance-wise, this is also a good use case for a SoupStrainer, which will make BeautifulSoup parse only the desired parts of a document - in your case, the product links.
Here is the code with some of the improvements applied:
import csv

from bs4 import BeautifulSoup, SoupStrainer
import mechanize

URL = "https://mysite"

if __name__ == '__main__':
    print("Running...")

    results = []
    with open('EAN.txt') as f, mechanize.Browser() as browser:
        browser.set_handle_robots(False)

        for search_query in f:
            search_query = search_query.strip()  # drop the trailing newline

            browser.open(URL)
            browser.select_form(id="searchForm")
            browser["q"] = search_query
            response = browser.submit()

            soup = BeautifulSoup(response, 'lxml',
                                 parse_only=SoupStrainer(class_='product-title'))
            product_title = soup.select_one('.product-title')
            product_title_text = product_title.get_text(strip=True) if product_title else None

            results.append([search_query, product_title_text])

    # dump results
    with open('Results.csv', 'ab') as csvfile:
        writer = csv.writer(csvfile, delimiter='|', quotechar='|',
                            quoting=csv.QUOTE_MINIMAL)
        for search_query, product_title in results:
            result_label = "Approved" if product_title else "Rejected"
            writer.writerow([search_query, result_label, product_title])
It might not work as is, since I don't have a way to test it. There are also some assumptions in there - like the presence of only one "product title" on a search-results page per form query. But, at least, I hope it gives you something to start with.
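If a search-results page can actually contain several product titles, a hedged variant of the parsing step could collect them all instead:

# assumption: 'soup' is the BeautifulSoup object built inside the loop above
product_titles = [tag.get_text(strip=True)
                  for tag in soup.select('.product-title')]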
And I am not sure you need that np.genfromtxt() call, since you never use its return value - hence I removed that part in the code posted above.