
This is my first program in Python. The following code was taken from different posts (mostly from this site) and combined so that I can automate a routine task.

It works properly, but I would like opinions from experts on how to enhance it further.

import mechanize
from bs4 import BeautifulSoup
import numpy as np
import csv
import os

# reading file
np.genfromtxt('EAN.txt', delimiter='\n')
dtype = str
with open('EAN.txt') as f:
    lines = f.read().splitlines()

new_dictionary = {}
count = 0
for line in lines:
    count += 1
    new_dictionary['sequence_{}'.format(count)] = line

# searching items
print "Running..."
for i in new_dictionary:
    myean = new_dictionary[i]
    url = "https://mysite"
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(url)
    br.select_form(id="searchForm")
    br["q"] = myean
    res = br.submit()
    content = res.read()
    with open("mechanize_results.html", "a") as f:
        f.write(content)
    soup = BeautifulSoup(open("mechanize_results.html"), 'html.parser')
    for div in soup.findAll('div', {'class': 'small-12 columns product-title'}):
        a = div.findAll('a')[1]

    # writing file
    if a is None:
        with open('Results.csv', 'ab') as csvfile:
            spamwriter = csv.writer(csvfile, delimiter='|',
                                    quotechar='|', quoting=csv.QUOTE_MINIMAL)
            spamwriter.writerow([myean, "Rejected"])
            #, title.string])
    else:
        #print myean, '|', a.text.strip()
        with open('Results.csv', 'ab') as csvfile:
            spamwriter = csv.writer(csvfile, delimiter='|',
                                    quotechar='|', quoting=csv.QUOTE_MINIMAL)
            spamwriter.writerow([myean, "Approved", a.text.strip()])
            #, title.string])

    # deleting file
    os.remove("mechanize_results.html")
    a = None
    i =+ 1

print "%d items Searched" % count
new_dictionary.clear()
raw_input("Press any key to continue...")

Although this code runs perfectly, I noticed that when I used PyInstaller to create an exe file, the size came out at around 400+ MB. Is that because I am importing the complete packages 'mechanize', 'numpy', 'csv', and 'os'?

asked Aug 29, 2017 at 7:22
1 Comment

  • Don't know about mechanize, but yes, the size is mostly due to numpy. – Commented Aug 29, 2017 at 7:52

1 Answer


There are multiple things to tackle and improve; let's summarize.

Code Style

In general, try to follow the PEP8 style guide; the code has multiple violations, such as:

  • inconsistent use of spaces around operators and of blank lines
  • unorganized imports

Tools like flake8 or pylint can analyze your code statically and report existing violations; make sure to give them a try.

Variable naming is also a huge issue in the presented code - names like i or new_dictionary are meaningless. Make sure to give your variables meaningful names to improve readability. Remember: code is read much more often than it is written.
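Compare, for example (the replacement names here are just one suggestion, and the EAN value is a made-up sample):

new_dictionary = {'sequence_1': '4006381333931'}  # sample data for illustration

# hard to tell what is being iterated:
for i in new_dictionary:
    myean = new_dictionary[i]
    print(myean)

# the intent is obvious at a glance:
ean_codes = ['4006381333931']
for ean_code in ean_codes:
    print(ean_code)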

Code Organization

Currently, you are putting everything into a single program without any organization - think about logically splitting it into functions. You should also put your main execution logic under an if __name__ == '__main__': guard to avoid the code being executed on import.
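A rough skeleton of such a split might look like this (the function names are just illustrative, not taken from your code):

def read_search_queries(path):
    # read one search query per line from the given file
    with open(path) as f:
        return [line.strip() for line in f]

def search_and_collect(queries):
    # run every query against the site and collect the results
    pass

def write_results(results):
    # dump the collected results into Results.csv
    pass

if __name__ == '__main__':
    write_results(search_and_collect(read_search_queries('EAN.txt')))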

Making the code Pythonic

  • you can use a dictionary comprehension with enumerate() to define your new_dictionary dictionary:

        with open('EAN.txt') as f:
            new_dictionary = {
                'sequence_{}'.format(line_number): line
                for line_number, line in enumerate(f)
            }

    On second thought, you don't actually need a dictionary here at all - the keys are never used in the main execution logic of the program. A plain list of search queries would be sufficient (see the sketch after this list).

  • you don't need read().splitlines() - you can iterate over the file object directly. This avoids building an extra list of lines in memory and fetches each next line in a "lazy" fashion

  • use print() as a function for Python-3.x compatibility
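Putting these points together, reading the queries could look like this (a minimal sketch; the strip() call drops the trailing newline from each line):

print("Running...")

with open('EAN.txt') as f:
    for search_query in f:  # iterates over the file lazily, line by line
        search_query = search_query.strip()
        # ... run the search for this query here ...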

Performance and Web-Scraping related Improvements

  • First of all, you don't need to save the page source into a file and then read that file back for parsing - you can skip that step altogether by feeding the mechanize Browser's response into BeautifulSoup directly.
  • You should also be able to initialize the Browser once and re-use it for all subsequent requests.
  • Also, you may improve on HTML parsing speed by using lxml as an underlying parser for BeautifulSoup - this requires lxml to be installed though.
  • I would also use a .product-title CSS selector to match product titles.
  • I would collect the results first and write them all at the end, instead of re-opening the resulting CSV file on every iteration of the loop.
  • Performance-wise, this is also a good use case for a SoupStrainer, which makes BeautifulSoup parse only the desired parts of a document - in your case, the product links.

Here is the code with some of the improvements applied:

import csv

from bs4 import BeautifulSoup, SoupStrainer
import mechanize

URL = "https://mysite"

if __name__ == '__main__':
    print("Running...")

    results = []
    with open('EAN.txt') as f, mechanize.Browser() as browser:
        browser.set_handle_robots(False)

        for search_query in f:
            search_query = search_query.strip()  # drop the trailing newline

            browser.open(URL)
            browser.select_form(id="searchForm")
            browser["q"] = search_query
            response = browser.submit()

            # parse only the product-title elements instead of the whole page
            soup = BeautifulSoup(response, 'lxml',
                                 parse_only=SoupStrainer(class_='product-title'))
            product_title = soup.select_one('.product-title')
            product_title_text = product_title.get_text(strip=True) if product_title else None

            results.append([search_query, product_title_text])

    # dump the collected results
    with open('Results.csv', 'ab') as csvfile:
        writer = csv.writer(csvfile, delimiter='|', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for search_query, product_title in results:
            result_label = "Approved" if product_title else "Rejected"
            writer.writerow([search_query, result_label, product_title])

It might not work as is, since I don't have a way to test it. There are also some assumptions in there - like the presence of only one "product title" on a search-results page per form query. But, at least, I hope that will give you something to start with.

Also, I am not sure you need that np.genfromtxt() call at all, since you never read its return value - hence I removed that part in the code above.

answered Aug 29, 2017 at 15:09