
This is my first program in Python. The following code was taken from different posts (mostly from this site) and combined so that I can automate a routine task.

It works properly, but I would like opinions from experts on how to enhance it further.

import mechanize
from bs4 import BeautifulSoup
import numpy as np
import csv
import os

# reading file
np.genfromtxt('EAN.txt', delimiter='\n')
dtype = str
with open('EAN.txt') as f:
    lines = f.read().splitlines()

new_dictionary = {}
count = 0
for line in lines:
    count += 1
    new_dictionary['sequence_{}'.format(count)] = line

# searching items
print "Running..."
for i in new_dictionary:
    myean = new_dictionary[i]
    url = "https://mysite"
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(url)
    br.select_form(id="searchForm")
    br["q"] = myean
    res = br.submit()
    content = res.read()
    with open("mechanize_results.html", "a") as f:
        f.write(content)
    soup = BeautifulSoup(open("mechanize_results.html"), 'html.parser')
    for div in soup.findAll('div', {'class': 'small-12 columns product-title'}):
        a = div.findAll('a')[1]

    # writing file
    if a is None:
        with open('Results.csv', 'ab') as csvfile:
            spamwriter = csv.writer(csvfile, delimiter='|',
                                    quotechar='|', quoting=csv.QUOTE_MINIMAL)
            spamwriter.writerow([myean, "Rejected"])
            #, title.string])
    else:
        #print myean, '|', a.text.strip()
        with open('Results.csv', 'ab') as csvfile:
            spamwriter = csv.writer(csvfile, delimiter='|',
                                    quotechar='|', quoting=csv.QUOTE_MINIMAL)
            spamwriter.writerow([myean, "Approved", a.text.strip()])
            #, title.string])

    # deleting file
    os.remove("mechanize_results.html")
    a = None
    i =+ 1

print "%d items Searched" % count
new_dictionary.clear()
raw_input("Press any key to continue...")

Although this code runs perfectly, I noticed that when I used PyInstaller to create an exe file, the size came out at around 400+ MB. Is that because I am importing the complete packages 'mechanize', 'numpy', 'csv', and 'os'?

asked Aug 29, 2017 at 7:22
1 Comment

  • Don't know about mechanize, but yes, the size is mostly due to numpy. – Commented Aug 29, 2017 at 7:52

1 Answer


There are multiple things to tackle and improve; let's summarize.

Code Style

In general, try to follow the PEP8 style guide; the code has multiple violations, such as:

  • inconsistent use of spaces around operators and of blank lines
  • unorganized imports

Tools like flake8 or pylint can analyze your code statically and report existing violations; make sure to give them a try.

Variable naming is also a huge issue in the presented code - names like i or new_dictionary are meaningless. Make sure to give your variables meaningful names to improve readability. Remember: code is read much more often than it is written.
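Compare, for example (the replacement names here are just one suggestion, and the EAN value is a made-up sample):

new_dictionary = {'sequence_1': '4006381333931'}  # sample data for illustration

# hard to tell what is being iterated:
for i in new_dictionary:
    myean = new_dictionary[i]
    print(myean)

# the intent is obvious at a glance:
ean_codes = ['4006381333931']
for ean_code in ean_codes:
    print(ean_code)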

Code Organization

Currently, you are putting everything into a single program without any organization - think about logically splitting it into functions. You should also put your main execution logic under an if __name__ == '__main__': guard to avoid the code being executed on import.
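A rough skeleton of such a split might look like this (the function names are just illustrative, not taken from your code):

def read_search_queries(path):
    # read one search query per line from the given file
    with open(path) as f:
        return [line.strip() for line in f]

def search_and_collect(queries):
    # run every query against the site and collect the results
    pass

def write_results(results):
    # dump the collected results into Results.csv
    pass

if __name__ == '__main__':
    write_results(search_and_collect(read_search_queries('EAN.txt')))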

Making the code Pythonic

  • you can use a dictionary comprehension with enumerate() to define your new_dictionary dictionary:

        with open('EAN.txt') as f:
            new_dictionary = {
                'sequence_{}'.format(line_number): line
                for line_number, line in enumerate(f)
            }

    On second thought, you don't actually need a dictionary here at all - the keys are never used in the main execution logic of the program. A plain list of search queries would be sufficient (see the sketch after this list).

  • you don't need read().splitlines() - you can iterate over the file object directly. This avoids building an extra list of lines in memory and fetches each next line in a "lazy" fashion

  • use print() as a function for Python-3.x compatibility
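Putting these points together, reading the queries could look like this (a minimal sketch; the strip() call drops the trailing newline from each line):

print("Running...")

with open('EAN.txt') as f:
    for search_query in f:  # iterates over the file lazily, line by line
        search_query = search_query.strip()
        # ... run the search for this query here ...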

Performance and Web-Scraping related Improvements

  • First of all, you don't need to save the page source into a file and then read that file back for parsing - you can skip that step altogether by feeding the mechanize Browser's response into BeautifulSoup directly.
  • You should also be able to initialize the Browser once and re-use it for all subsequent requests.
  • Also, you may improve on HTML parsing speed by using lxml as an underlying parser for BeautifulSoup - this requires lxml to be installed though.
  • I would also use a .product-title CSS selector to match product titles.
  • I would collect the results first and write them all at the end, instead of re-opening the resulting CSV file on every iteration of the loop.
  • Performance-wise, this is also a good use case for a SoupStrainer, which makes BeautifulSoup parse only the desired parts of a document - in your case, the product links.

Here is the code with some of the improvements applied:

import csv

from bs4 import BeautifulSoup, SoupStrainer
import mechanize

URL = "https://mysite"

if __name__ == '__main__':
    print("Running...")

    results = []
    with open('EAN.txt') as f, mechanize.Browser() as browser:
        browser.set_handle_robots(False)

        for search_query in f:
            search_query = search_query.strip()  # drop the trailing newline

            browser.open(URL)
            browser.select_form(id="searchForm")
            browser["q"] = search_query
            response = browser.submit()

            # parse only the product-title elements instead of the whole page
            soup = BeautifulSoup(response, 'lxml',
                                 parse_only=SoupStrainer(class_='product-title'))
            product_title = soup.select_one('.product-title')
            product_title_text = product_title.get_text(strip=True) if product_title else None

            results.append([search_query, product_title_text])

    # dump the collected results
    with open('Results.csv', 'ab') as csvfile:
        writer = csv.writer(csvfile, delimiter='|', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for search_query, product_title in results:
            result_label = "Approved" if product_title else "Rejected"
            writer.writerow([search_query, result_label, product_title])

It might not work as is, since I don't have a way to test it. There are also some assumptions in there - like the presence of only one "product title" on a search-results page per form query. But, at least, I hope that will give you something to start with.

Also, I am not sure you need that np.genfromtxt() call at all, since you never read its return value - hence I removed that part in the code above.

answered Aug 29, 2017 at 15:09