I've written a program to get the names and titles of some practitioners out of a webpage. The content is stored within disorganized HTML elements (at least it seemed that way to me), which made the output messy at first. However, I've managed to get the output into the shape it should be.
The scraper serves its purpose just fine, but the code itself looks ugly. Is there any way to make the scraper look cleaner while still doing the same job? For example, is there a way to eliminate the exception handlers?
This is how I've written it:
import requests
from bs4 import BeautifulSoup

url = "https://www.otla.com/index.cfm?&m_firstname=a&seed=566485&fs_match=s&pg=publicdirectory&diraction=SearchResults&memPageNum=1"

def get_names(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'lxml')
    for content in soup.select(".row-fluid"):
        try:
            name = content.select_one(".span5 b").get_text(strip=True)
        except: continue  # let go of blank elements
        else: name

        try:
            title = content.select_one(".span5 b").next_sibling.next_sibling.strip()
        except: continue  # let go of blank elements
        else: title

        print(name, title)

if __name__ == '__main__':
    get_names(url)
3 Answers
Since @scnerd has already discussed avoiding the use of a bare try/except clause, I'll focus on removing it entirely.
After inspecting the website's source, you will notice that elements of class row-fluid are containers for each lawyer. In each of those containers there are two .span5 clearfix elements: one contains the lawyer's name and firm, the other their specialization. Since you don't seem to be interested in the latter, we can skip that element entirely and move on to the next one.
for element in soup.select(".span5 b"):
    name = element.text.strip()
    if name.startswith('Area'):
        continue
    title = element.next_sibling.next_sibling.strip()
    print('{}: {}'.format(name, title))
You will notice I left the row-fluid container out of the selector, as we are iterating only over the span5 elements contained within those containers. However, if you wanted to keep it, you can chain the CSS selectors like so: soup.select(".row-fluid .span5 b"). If there were any elements of the span5 class outside of the row-fluid containers, it would be better to chain the selectors, making the intent more explicit.
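For what it's worth, here is a tiny self-contained sketch of what the chained selector buys you. The HTML below is made up for illustration and is not the actual page markup:

from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one proper container plus one stray .span5 element.
html = """
<div class="row-fluid">
  <div class="span5 clearfix"><b>Jane Doe</b><br/>Doe &amp; Associates</div>
</div>
<div class="span5"><b>Stray bold text outside any container</b></div>
"""
soup = BeautifulSoup(html, 'lxml')

print(len(soup.select(".span5 b")))             # 2 -- also matches the stray element
print(len(soup.select(".row-fluid .span5 b")))  # 1 -- only elements inside a row-fluid container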
get_names is a fairly ambiguous function name, and it also suggests the function will return an iterable of names. What this function actually does is print the names of lawyers together with the firm they work for. I'd suggest renaming it to print_lawyer_information, or better yet get_lawyer_information and returning a dictionary with the name as key and the title as value.
lawyer_information = {}
for element in soup.select(".span5 b"):
    name = element.text.strip()
    if name.startswith('Area'):
        continue
    title = element.next_sibling.next_sibling.strip()
    lawyer_information[name] = title
return lawyer_information
As you can see above, we create an empty dictionary and then, as we iterate over the elements, add records to it. Once that is done, you just return the dictionary, which you can then print or process further. The same thing can be written as a much neater dictionary comprehension.
return {row.text.strip(): row.next_sibling.next_sibling.strip()
        for row in soup.select('.span5 b')
        if not row.text.strip().startswith('Area')}
A couple of other nitpicks involve PEP 8 (the Python style guide) violations, such as the missing two blank lines before a function definition and multiple statements on one line. Run your code through something like http://pep8online.com and you'll get a better idea of how to make it easier to read and follow. You'll thank yourself in a year's time when you look at your code again.
import requests
from bs4 import BeautifulSoup


def get_lawyer_information(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    return {element.text.strip(): element.next_sibling.next_sibling.strip()
            for element in soup.select('.span5 b')
            if not element.text.strip().startswith('Area')}


if __name__ == '__main__':
    url = "https://www.otla.com/index.cfm?&m_firstname=a&seed=566485&fs_match=s&pg=publicdirectory&diraction=SearchResults&memPageNum=1"
    lawyer_information = get_lawyer_information(url)
    for name, title in lawyer_information.items():
        print('{} | {}'.format(name, title))
- SIM (Jun 1, 2018 at 6:08): This is undoubtedly the approach I wished to follow.
- Luke (Jun 1, 2018 at 11:25): @Topto Do you intend to iterate over the next pages of the directory? If so, look into creating a spider with something like scrapy. Otherwise, requests.Session is worth looking into for subsequent requests to the same host.
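To expand on the requests.Session suggestion above: a Session reuses the underlying connection (and any cookies) across requests to the same host. Below is a rough sketch of how the final function could be driven over several directory pages. It assumes the memPageNum query parameter controls pagination and that three pages exist, neither of which I have verified:

import requests
from bs4 import BeautifulSoup

BASE_URL = ("https://www.otla.com/index.cfm?&m_firstname=a&seed=566485&fs_match=s"
            "&pg=publicdirectory&diraction=SearchResults&memPageNum={page}")


def get_lawyer_information(session, url):
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    return {element.text.strip(): element.next_sibling.next_sibling.strip()
            for element in soup.select('.span5 b')
            if not element.text.strip().startswith('Area')}


with requests.Session() as session:      # one connection pool reused for every page
    for page in range(1, 4):             # assumed page range, adjust as needed
        lawyers = get_lawyer_information(session, BASE_URL.format(page=page))
        for name, title in lawyers.items():
            print('{} | {}'.format(name, title))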
Don't use a bare except
statement if it's not absolutely necessary. Catch the errors you anticipate and explicitly want to handle.
Also, there's no reason for the duplicated try/except/else
statements, and no use, in this case, of the else
statements at all. You can collapse everything into a single try/except
, since you do the same thing regardless of which line throws the error.
You also have a little code duplication. Don't Repeat Yourself (DRY).
def get_names(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'lxml')
    for content in soup.select(".row-fluid"):
        try:
            element = content.select_one(".span5 b")
            name = element.get_text(strip=True)
            title = element.next_sibling.next_sibling.strip()
            print(name, title)
        except AttributeError:
            continue
Why don't you put the entire body of the for loop inside the try block? And you can just leave off the else clauses.
for content in soup.select(".row-fluid"):
    try:
        name = content.select_one(".span5 b").get_text(strip=True)
        title = content.select_one(".span5 b").next_sibling.next_sibling.strip()
        print(name, title)
    except: continue  # let go of blank elements
- SIM (May 31, 2018 at 21:22): A slight problem with this approach is that if the address in a container is None, the script will not print the name either. I mean, either both get printed or both get skipped if any item is None.
- Segfault (May 31, 2018 at 21:23): Yeah, but that's the behavior of your original script as well.
- SIM (May 31, 2018 at 21:33): Right you are. I didn't know that. I just checked.
- scnerd (May 31, 2018 at 21:44): Maybe instead of continue you meant pass? That would allow the code to keep progressing through the loop body even if you can't find the value you're looking for. E.g., try: x = ...; except AttributeError: x = default_value
- SIM (May 31, 2018 at 21:51): I tried that in the first place, but it gave lots of blank lines between the results; that's the reason I used continue.
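To make the pass / default-value suggestion from this thread concrete without producing the blank lines mentioned above, one option is to skip a container only when no name element exists at all, and fall back to a placeholder when just the title is missing. A rough sketch of that idea (the "N/A" placeholder is my own choice, not something from the original code):

for content in soup.select(".row-fluid"):
    element = content.select_one(".span5 b")
    if element is None:
        continue                  # no name element in this container, skip it entirely
    name = element.get_text(strip=True)
    try:
        title = element.next_sibling.next_sibling.strip()
    except AttributeError:
        title = "N/A"             # assumed placeholder when only the title is missing
    print(name, title)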
- The try:except:continue block is what I meant by that. However, I didn't know that a working script could also get downvoted here.