I've written a program to get the names and titles of some practitioners out of a webpage. The content is stored within disorganized HTML elements (at least it seemed that way to me), which made the output messy at first. However, I've managed to get the output into the shape it should be.
The scraper serves its purpose just fine, but the code itself looks ugly. Is there any way to make the scraper look cleaner while still doing the same job? For example, is there a way to eliminate the exception handlers?
This is how I've written it:
import requests
from bs4 import BeautifulSoup

url = "https://www.otla.com/index.cfm?&m_firstname=a&seed=566485&fs_match=s&pg=publicdirectory&diraction=SearchResults&memPageNum=1"

def get_names(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'lxml')
    for content in soup.select(".row-fluid"):
        try:
            name = content.select_one(".span5 b").get_text(strip=True)
        except: continue  # let go of blank elements
        else: name

        try:
            title = content.select_one(".span5 b").next_sibling.next_sibling.strip()
        except: continue  # let go of blank elements
        else: title

        print(name, title)

if __name__ == '__main__':
    get_names(url)
3 Answers
Since @scnerd has already discussed avoiding the use of a bare try/except clause, I'll focus on removing it entirely.
After inspecting the website's source, you will notice that elements of class row-fluid are containers for each lawyer. In each of those containers there are two .span5 clearfix elements: one contains the lawyer's name and firm, the other their specialization. Since you don't seem to be interested in the latter, we can skip that element entirely and move on to the next one.
for element in soup.select(".span5 b"):
    name = element.text.strip()
    if name.startswith('Area'):
        continue
    title = element.next_sibling.next_sibling.strip()
    print('{}: {}'.format(name, title))
You will notice I left the row-fluid container out of the selector, as we are iterating only over the span5 elements contained within those containers. However, if you wanted to keep it, you can chain the CSS selectors like so: soup.select(".row-fluid .span5 b"). If there were any elements of the span5 class outside of the row-fluid containers, it would be better to chain the selectors, making the intent more explicit.
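For what it's worth, here is a tiny self-contained sketch of what the chained selector buys you. The HTML below is made up for illustration and is not the actual page markup:

from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one proper container plus one stray .span5 element.
html = """
<div class="row-fluid">
  <div class="span5 clearfix"><b>Jane Doe</b><br/>Doe &amp; Associates</div>
</div>
<div class="span5"><b>Stray bold text outside any container</b></div>
"""
soup = BeautifulSoup(html, 'lxml')

print(len(soup.select(".span5 b")))             # 2 -- also matches the stray element
print(len(soup.select(".row-fluid .span5 b")))  # 1 -- only elements inside a row-fluid container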
get_names is a fairly ambiguous function name, and it also suggests the function will return an iterable of names. What this function actually does is print the names of lawyers together with the firm they work for. I'd suggest renaming it to print_lawyer_information, or better yet get_lawyer_information and returning a dictionary with the name as key and the title as value.
lawyer_information = {}
for element in soup.select(".span5 b"):
    name = element.text.strip()
    if name.startswith('Area'):
        continue
    title = element.next_sibling.next_sibling.strip()
    lawyer_information[name] = title
return lawyer_information
As you can see above, we create an empty dictionary and then, as we iterate over the elements, add records to it. Once that is done, you just return the dictionary, which you can then print or process further. The same thing can be written as a much neater dictionary comprehension.
return {row.text.strip(): row.next_sibling.next_sibling.strip()
        for row in soup.select('.span5 b')
        if not row.text.strip().startswith('Area')}
A couple of other nitpicks involve PEP 8 (the Python style guide) violations, such as the missing two blank lines before a function definition and multiple statements on one line. Run your code through something like http://pep8online.com and you'll get a better idea of how to make it easier to read and follow. You'll thank yourself in a year's time when you look at your code again.
import requests
from bs4 import BeautifulSoup


def get_lawyer_information(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    return {element.text.strip(): element.next_sibling.next_sibling.strip()
            for element in soup.select('.span5 b')
            if not element.text.strip().startswith('Area')}


if __name__ == '__main__':
    url = "https://www.otla.com/index.cfm?&m_firstname=a&seed=566485&fs_match=s&pg=publicdirectory&diraction=SearchResults&memPageNum=1"
    lawyer_information = get_lawyer_information(url)
    for name, title in lawyer_information.items():
        print('{} | {}'.format(name, title))
- SIM (Jun 1, 2018 at 6:08): This is undoubtedly the approach I wished to follow.
- Luke (Jun 1, 2018 at 11:25): @Topto Do you intend to iterate over the next pages of the directory? If so, look into creating a spider with something like scrapy. Otherwise, requests.Session is worth looking into for subsequent requests to the same host.
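To expand on the requests.Session suggestion above: a Session reuses the underlying connection (and any cookies) across requests to the same host. Below is a rough sketch of how the final function could be driven over several directory pages. It assumes the memPageNum query parameter controls pagination and that three pages exist, neither of which I have verified:

import requests
from bs4 import BeautifulSoup

BASE_URL = ("https://www.otla.com/index.cfm?&m_firstname=a&seed=566485&fs_match=s"
            "&pg=publicdirectory&diraction=SearchResults&memPageNum={page}")


def get_lawyer_information(session, url):
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    return {element.text.strip(): element.next_sibling.next_sibling.strip()
            for element in soup.select('.span5 b')
            if not element.text.strip().startswith('Area')}


with requests.Session() as session:      # one connection pool reused for every page
    for page in range(1, 4):             # assumed page range, adjust as needed
        lawyers = get_lawyer_information(session, BASE_URL.format(page=page))
        for name, title in lawyers.items():
            print('{} | {}'.format(name, title))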
Don't use a bare except
statement if it's not absolutely necessary. Catch the errors you anticipate and explicitly want to handle.
Also, there's no reason for the duplicated try/except/else
statements, and no use, in this case, of the else
statements at all. You can collapse everything into a single try/except
, since you do the same thing regardless of which line throws the error.
You also have a little code duplication. Don't Repeat Yourself (DRY).
def get_names(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'lxml')
    for content in soup.select(".row-fluid"):
        try:
            element = content.select_one(".span5 b")
            name = element.get_text(strip=True)
            title = element.next_sibling.next_sibling.strip()
            print(name, title)
        except AttributeError:
            continue
Why don't you put the entire body of the for loop inside the try block? And you can just leave off the else clauses.
for content in soup.select(".row-fluid"):
    try:
        name = content.select_one(".span5 b").get_text(strip=True)
        title = content.select_one(".span5 b").next_sibling.next_sibling.strip()
        print(name, title)
    except: continue  # let go of blank elements
- SIM (May 31, 2018 at 21:22): A slight problem with this approach is that if the address in a container is None, the script will not print the name either. I mean, either both get printed or both get skipped if any item is None.
- Segfault (May 31, 2018 at 21:23): Yeah, but that's the behavior of your original script as well.
- SIM (May 31, 2018 at 21:33): Right you are. I didn't know that. I just checked.
- scnerd (May 31, 2018 at 21:44): Maybe instead of continue you meant pass? That would allow the code to keep progressing through the loop body even if you can't find the value you're looking for. E.g., try: x = ...; except AttributeError: x = default_value
- SIM (May 31, 2018 at 21:51): I tried that in the first place, but it gave lots of blank lines between the results; that's the reason I used continue.
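To make the pass / default-value suggestion from this thread concrete without producing the blank lines mentioned above, one option is to skip a container only when no name element exists at all, and fall back to a placeholder when just the title is missing. A rough sketch of that idea (the "N/A" placeholder is my own choice, not something from the original code):

for content in soup.select(".row-fluid"):
    element = content.select_one(".span5 b")
    if element is None:
        continue                  # no name element in this container, skip it entirely
    name = element.get_text(strip=True)
    try:
        title = element.next_sibling.next_sibling.strip()
    except AttributeError:
        title = "N/A"             # assumed placeholder when only the title is missing
    print(name, title)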
- The try:except:continue block is what I meant by that. However, I didn't know that a working script could also get downvoted here.