Recursive Web Scraping with Python Beautiful Soup

Question 1

I wrote a short program that lets a user specify a starting page in the Discogs Wiki Style Guide, scrapes the other styles listed on that page, and then outputs a graph (represented here as a dictionary of sets) of the relationships between subgenres.

I'm looking for guidance/critique on: (1) How to clean up the request_page function. I think there is a more elegant way of both getting the href attrs and filtering to only those with "/style/". (2) The general structure of the program. I'm self-taught and a relative beginner, so I'd highly appreciate it if anyone could point out general irregularities.

import re
import requests
from bs4 import BeautifulSoup


def get_related_styles(start):

    def request_page(start):
        response = requests.get('{0}{1}'.format(base_style_url, start))
        soup = BeautifulSoup(response.content,'lxml')
        ## these lines feel inelegant. considered solutions with
        ## soup.findAll('a', attrs = {'href': pattern.match})
        urls = [anchor.get('href') for anchor in soup.findAll('a')]
        pattern = re.compile('/style/[a-zA-Z0-9\-]*[^/]') # can use lookback regex w/ escape chars?
        style_urls = {pattern.match(url).group().replace('/style/','') for url in urls if pattern.match(url)}
        return style_urls

    def connect_styles(start , style_2):
        ## Nodes should not connect to self
        ## Note that styles are directed - e.g. (A ==> B) =/=> (B ==> A)
        if start != style_2:
            if start not in all_styles.keys():
                all_styles[start] = {style_2}
            else:
                all_styles[start].add(style_2)
        if style_2 not in do_not_visit:
            do_not_visit.add(style_2)
            get_related_styles(style_2)

    style_urls = request_page(start)
    for new_style in style_urls:
        connect_styles(start,new_style)

Example Use:

start = 'Avant-garde-Jazz'
base_style_url = 'https://reference.discogslabs.com/style/'
all_styles = {}
do_not_visit = {start}
get_related_styles(start)
print(all_styles)

Output:

{'Free-Jazz': {'Free-Improvisation', 'Free-Funk'}, 'Free-Improvisation': {'Free-Jazz', 'Avant-garde-Jazz'}, 'Avant-garde-Jazz': {'Free-Jazz'}, 'Free-Funk': {'Free-Jazz'}}

Answer

There is a simpler way to select only the "style" links - using a CSS selector with a partial match on the href attribute:

style_urls = {anchor['href'].replace('/style/', '')
              for anchor in soup.select('a[href^="/style/"]')}

where ^= means "starts with".
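
To see the selector in isolation, here is a minimal sketch run against a hard-coded snippet rather than a live page. The markup and the html.parser choice are assumptions made only so the example needs no network access or lxml install; the real Discogs pages may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched style page.
html = """
<a href="/style/Free-Jazz">Free Jazz</a>
<a href="/style/Free-Funk">Free Funk</a>
<a href="/genre/Jazz">Jazz</a>
<a name="top">anchor without an href</a>
"""

soup = BeautifulSoup(html, 'html.parser')

# a[href^="/style/"] keeps only anchors whose href starts with "/style/".
# Anchors with no href attribute never match, so anchor['href'] is safe.
style_urls = {anchor['href'].replace('/style/', '')
              for anchor in soup.select('a[href^="/style/"]')}

print(sorted(style_urls))  # ['Free-Funk', 'Free-Jazz']
```

Because the selector already discards anchors without an href, the separate "collect all hrefs, then filter" step from request_page is no longer needed.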

Here we, of course, lose the check we had on the style name part of the href. If this check is really needed, we can also use a regular expression to match the desired style links directly:

pattern = re.compile(r'/style/([a-zA-Z0-9\-]+)')  # "+" requires a non-empty style name
style_urls = {pattern.search(anchor['href']).group(1)
              for anchor in soup('a', href=pattern)}

soup() here is a short way of doing soup.find_all().
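
The regex variant can be tried the same way. This is a sketch under assumptions: the markup is made up, html.parser is used only to avoid the lxml dependency, and the pattern uses a "+" quantifier so that the captured group holds the whole style name and a bare "/style/" link is rejected.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched style page.
html = """
<a href="/style/Free-Jazz">Free Jazz</a>
<a href="/style/">Style index</a>
<a href="/genre/Jazz">Jazz</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# Capture one or more name characters after "/style/".
pattern = re.compile(r'/style/([a-zA-Z0-9-]+)')

# Calling the soup object is shorthand for find_all(); both return the same tags.
by_call = soup('a', href=pattern)
by_find_all = soup.find_all('a', href=pattern)
assert by_call == by_find_all

style_urls = {pattern.search(anchor['href']).group(1) for anchor in by_call}
print(style_urls)  # {'Free-Jazz'}
```

When a compiled pattern is passed as the href filter, Beautiful Soup keeps the tags whose attribute value the pattern finds a match in, so the comprehension can call pattern.search() again without a None check.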

alecxe · Accepted Answer · 2018-01-04 04:30:53Z
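
On point (2) of the question, the general structure, one possible arrangement is sketched below. It is not taken from the answer above. It assumes the page fetching is wrapped in a callable (the hypothetical fetch_links parameter, which could wrap the question's request_page) so the crawl keeps its own state instead of relying on the module-level all_styles and do_not_visit, and it uses a queue instead of recursion so a long chain of styles cannot hit Python's recursion limit.

```python
from collections import defaultdict, deque


def build_style_graph(start, fetch_links):
    """Breadth-first crawl from `start`; returns {style: set of linked styles}.

    `fetch_links` is a hypothetical callable that maps a style name to the
    set of style names linked from its page.
    """
    graph = defaultdict(set)
    seen = {start}
    queue = deque([start])
    while queue:
        style = queue.popleft()
        for linked in fetch_links(style):
            if linked == style:
                continue  # nodes should not connect to themselves
            graph[style].add(linked)  # directed edge: style -> linked
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return dict(graph)


# Stand-in for real HTTP requests, mirroring the links in the example output.
fake_pages = {
    'Avant-garde-Jazz': {'Free-Jazz', 'Avant-garde-Jazz'},
    'Free-Jazz': {'Free-Improvisation', 'Free-Funk'},
    'Free-Improvisation': {'Free-Jazz', 'Avant-garde-Jazz'},
    'Free-Funk': {'Free-Jazz'},
}

graph = build_style_graph('Avant-garde-Jazz',
                          lambda style: fake_pages.get(style, set()))
print(graph)
```

Passing the fetcher in as an argument also makes the traversal testable without network access, as the fake_pages dictionary shows.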
