I wrote a short program that lets a user specify a starting page in the Discogs Wiki Style Guide, scrapes the other styles listed on that page, and then outputs a graph (represented here as a dictionary of sets) of the relationships between subgenres.
I'm looking for guidance/critique on: (1) How to clean up the request_page function. I think there is a more elegant way of both getting the href attrs and filtering to only those with "/style/". (2) The general structure of the program. I'm self-taught and a relative beginner, so it would be highly appreciated if anyone could point out general irregularities.
import re

import requests
from bs4 import BeautifulSoup


def get_related_styles(start):

    def request_page(start):
        response = requests.get('{0}{1}'.format(base_style_url, start))
        soup = BeautifulSoup(response.content,'lxml')

        ## these lines feel inelegant. considered solutions with
        ## soup.findAll('a', attrs = {'href': pattern.match})
        urls = [anchor.get('href') for anchor in soup.findAll('a')]
        pattern = re.compile('/style/[a-zA-Z0-9\-]*[^/]') # can use lookback regex w/ escape chars?
        style_urls = {pattern.match(url).group().replace('/style/','') for url in urls if pattern.match(url)}
        return style_urls

    def connect_styles(start , style_2):
        ## Nodes should not connect to self
        ## Note that styles are directed - e.g. (A ==> B) =/=> (B ==> A)
        if start != style_2:
            if start not in all_styles.keys():
                all_styles[start] = {style_2}
            else:
                all_styles[start].add(style_2)

        if style_2 not in do_not_visit:
            do_not_visit.add(style_2)
            get_related_styles(style_2)

    style_urls = request_page(start)
    for new_style in style_urls:
        connect_styles(start,new_style)
Example Use:
start = 'Avant-garde-Jazz'
base_style_url = 'https://reference.discogslabs.com/style/'
all_styles = {}
do_not_visit = {start}
get_related_styles(start)
print(all_styles)
{'Free-Jazz': {'Free-Improvisation', 'Free-Funk'}, 'Free-Improvisation': {'Free-Jazz', 'Avant-garde-Jazz'}, 'Avant-garde-Jazz': {'Free-Jazz'}, 'Free-Funk': {'Free-Jazz'}}
1 Answer
There is a simpler way to pick out the "style" links: use a CSS selector with a partial match on the href attribute:
style_urls = {anchor['href'].replace('/style/', '')
              for anchor in soup.select('a[href^="/style/"]')}
where ^= means "starts with".
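As a quick check of the selector, here is a minimal sketch run against a made-up HTML fragment (the markup below is invented for illustration and is not taken from the Discogs pages). It uses the built-in html.parser instead of lxml only so that nothing beyond bs4 needs to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, made up for illustration only.
html = """
<a href="/style/Free-Jazz">Free Jazz</a>
<a href="/style/Free-Funk">Free Funk</a>
<a href="/genre/Jazz">Jazz</a>
<a href="https://example.com/style/Other">external link</a>
<a name="top">anchor without an href</a>
"""

# html.parser ships with Python; the question's code uses lxml instead.
soup = BeautifulSoup(html, 'html.parser')

# a[href^="/style/"] selects only <a> tags whose href starts with "/style/",
# so anchors without an href and links to other sections are skipped.
style_names = {anchor['href'].replace('/style/', '')
               for anchor in soup.select('a[href^="/style/"]')}

print(sorted(style_names))  # ['Free-Funk', 'Free-Jazz']
```

Because the selector only returns anchors that have a matching href, there is no need for a separate "if pattern.match(url)" filter or for handling anchors whose href is None.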
Here we, of course, lose the check we had on the style name part of the href. If this check is really needed, we can also use a regular expression to match the desired style links directly:
# the final [^/] sits inside the group so group(1) holds the whole style name
pattern = re.compile(r'/style/([a-zA-Z0-9\-]*[^/])')
style_urls = {pattern.search(anchor['href']).group(1)
              for anchor in soup('a', href=pattern)}
soup() here is a short way of doing soup.find_all().
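One detail worth checking when using a capturing group is where the final [^/] sits: a character matched outside the parentheses is not part of group(1). The sketch below uses only the standard re module on a few made-up href values (the list and the helper name captured are hypothetical) to compare the two placements:

```python
import re

# Made-up href values for illustration only.
hrefs = ['/style/Free-Jazz', '/style/Free-Funk', '/genre/Jazz', '/style/']

# Final [^/] inside the group: it is part of what group(1) returns.
inside = re.compile(r'/style/([a-zA-Z0-9\-]*[^/])')
# Final [^/] outside the group: it still consumes a character,
# but that character is left out of group(1).
outside = re.compile(r'/style/([a-zA-Z0-9\-]*)[^/]')


def captured(pattern, values):
    """Return group(1) for every value the pattern matches."""
    return [m.group(1) for m in map(pattern.search, values) if m]


print(captured(inside, hrefs))   # ['Free-Jazz', 'Free-Funk']
print(captured(outside, hrefs))  # ['Free-Jaz', 'Free-Fun']
```

With the [^/] outside the group, the regex engine backtracks one character to satisfy it, so the last letter of each style name is dropped from the captured text.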