2
\$\begingroup\$

I have a web scraper that I use in a part of a larger program. However, I feel like I semi-repeat my code a lot and take up a lot of room. Is there any way I can condense this code?

import re
import urllib

def read_mail(mail):
    url = [mail] # Ignore this line, please.
    i = 0 # Ignore this line, please.
    droppedSource = '<td class="item_dropped">(.+?)</td>' # Gets whatever is in between the tags
    destroyedSource = '<td class="item_destroyed">(.+?)</td>'
    totalSource = '<strong class="item_dropped">(.+?)</strong>'
    droppedText = re.compile(droppedSource) # Compiles the regex string into a pattern object usable by the re library
    destroyedText = re.compile(destroyedSource)
    totalText = re.compile(totalSource)
    html = urllib.urlopen(url[i]).read() # ignore the url[i] part of this line, please.
    dropped = re.findall(droppedText,html)
    destroyed = re.findall(destroyedText,html)
    total = re.findall(totalText,html)
    return("Info: " + str(dropped[0])+str(destroyed[0])+str(total[0]))
200_success
asked Feb 27, 2015 at 2:48
\$\endgroup\$

2 Answers

4
\$\begingroup\$
  • First of all, I would recommend not using regex to handle HTML. You can use a library like BeautifulSoup for this.
  • Since all we're doing is finding the first match by tag name and class name, we can define a function that uses BeautifulSoup to find such matches based on the tag and class name. BeautifulSoup provides two functions, find and findAll: find returns the first match and findAll returns all the matches.

  • Note that with regex you should not use re.findall just to find the first match; use re.search instead, which returns only the first match found, or None if there is none.

  • On the last return line we can use string formatting.
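
The re.search and string-formatting points can be sketched as follows. This is only a minimal illustration, not part of the original code: `first_match` is a hypothetical helper and the sample HTML is made up. It relies on the flat, single-line markup shown here, which is exactly why a real parser is still the better choice.

```python
import re

# Hypothetical sample input, made up for illustration.
html = '<td class="item_dropped">Foo</td><td class="item_dropped">Bar</td>'

DROPPED = re.compile(r'<td class="item_dropped">(.+?)</td>')


def first_match(pattern, text, default=None):
    """Return group 1 of the first match, or `default` if nothing matches."""
    match = pattern.search(text)  # stops scanning at the first match
    return match.group(1) if match else default


dropped = first_match(DROPPED, html)
missing = first_match(DROPPED, "<p>no table here</p>", default="n/a")

# String formatting instead of concatenating str(...) pieces.
message = "Info: {} {}".format(dropped, missing)
print(message)
```

Unlike indexing `re.findall(...)[0]`, this does not raise an IndexError when the pattern is absent; the caller decides what the fallback value should be.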

from BeautifulSoup import BeautifulSoup
from functools import partial


def find_by_tag_name_class(soup, tag, cls_name, many=False):
    if many:
        # Return the text of every matching element.
        matches = soup.findAll(tag, {"class": cls_name})
        return [match.text for match in matches]
    else:
        # Return the text of the first matching element only.
        match = soup.find(tag, {"class": cls_name})
        return match.text


def read_mail(html):
    soup = BeautifulSoup(html)
    # Instead of passing the same `soup` multiple times to
    # `find_by_tag_name_class` we can create a partial function
    # with `soup` already applied to it.
    find_by_tag_name_class_soup = partial(find_by_tag_name_class, soup)
    dropped = find_by_tag_name_class_soup('td', 'item_dropped')
    destroyed = find_by_tag_name_class_soup('td', 'item_destroyed')
    total = find_by_tag_name_class_soup('strong', 'item_dropped')
    return "Info: {} {} {} ".format(dropped, destroyed, total)


html = '''<td class="item_dropped">Foo bar</td><td class="item_dropped">spam eggs</td>
<td class="item_destroyed">Hakuna</td><td class="item_destroyed">Matatat</td>
<strong class="item_dropped">Some strong text</strong><strong class="item_dropped">Even more strong text</strong>'''
print(read_mail(html))
# Info: Foo bar Hakuna Some strong text
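
For readers unfamiliar with `functools.partial`, here is a minimal standalone sketch of what it does in `read_mail` above. The `lookup` function and the dictionary standing in for `soup` are hypothetical, chosen only so the example runs without BeautifulSoup installed.

```python
from functools import partial


def lookup(table, tag, cls_name):
    """Stand-in for find_by_tag_name_class: fetch a value keyed by (tag, class)."""
    return table[(tag, cls_name)]


# Hypothetical data standing in for the parsed document.
table = {
    ("td", "item_dropped"): "Foo bar",
    ("td", "item_destroyed"): "Hakuna",
}

# Bind the first positional argument once...
lookup_in_table = partial(lookup, table)

# ...then call with only the remaining arguments.
dropped = lookup_in_table("td", "item_dropped")
destroyed = lookup_in_table("td", "item_destroyed")
print(dropped, destroyed)
```

The bound argument is stored on the partial object, so each later call only has to supply the tag and class name.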

Note that in the latest version of BeautifulSoup (version 4, the bs4 package), findAll has been renamed to find_all.

answered Feb 27, 2015 at 3:55
\$\endgroup\$
1
\$\begingroup\$

Ashwini provided a good answer, mostly by reminding me that, for some reason, I wasn't using BeautifulSoup as I almost always do. I deleted my program and rewrote it, vastly better in my opinion, with the following code:

import urllib

from bs4 import BeautifulSoup  # `class_` and `get_text()` are bs4 features


def read_mail(mail):
    urls = [mail]
    for url in urls:
        soup = BeautifulSoup(urllib.urlopen(url).read())
        dropped = soup.find("td", class_="item_dropped").get_text()
        destroyed = soup.find("td", class_="item_destroyed").get_text()
        total = soup.find("strong", class_="item_dropped").get_text()
        # Format inside the call so this also works where print is a function.
        print("Info : %s, %s, %s" % (dropped, destroyed, total))
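
One caveat if this is ever run under Python 3: `print` is a function there and returns None, so the `%` formatting has to be applied to the string before it is passed to `print`. A minimal sketch, where `format_info` is a hypothetical helper and the values are made up:

```python
def format_info(dropped, destroyed, total):
    """Build the message first, so it can be printed or returned."""
    return "Info : %s, %s, %s" % (dropped, destroyed, total)


# Made-up values standing in for the scraped text.
message = format_info("Foo bar", "Hakuna", "Some strong text")
print(message)
```

Returning the string, as the function in the question did, also leaves it to the caller to decide whether to print it.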
community wiki

\$\endgroup\$
