What's the best way to parse this HTML tag?

John Salerno johnjsal at gmail.com
Sun Mar 11 18:53:47 EDT 2012


I'm using Beautiful Soup to extract some song information from a radio
station's website that lists the songs it plays as it plays them.
Getting the time that the song is played is easy, because the time is
wrapped in a <div> tag all by itself with a class attribute that has a
specific value I can search for. But the actual song title and artist
information is harder, because the HTML isn't quite as precise. Here's
a sample:
<div class="cmPlaylistContent">
 <strong>
 <a href="/lsp/t2995/">
 Love Without End, Amen
 </a>
 </strong>
 <br/>
 <a href="/lsp/a436/">
 George Strait
 </a>
 <br/>
 <span class="sprite iconDownload">
 </span>
 Download Song:
 <a href="http://itunes.apple.com/us/album/love-without-end-amen/
id71416?i=71404&uo=4">
 iTunes
 </a>
 |
 <a href="http://www.amazon.com/Love-Without-End-Amen/dp/B000V638BQ?
SubscriptionId=1NXYFBZST44V8CCDK182&tag=coxradiointer-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B000V638BQ">
 Amazon MP3
 </a>
 <br/>
 <span class="sprite iconComments">
 Comments  (1)
 </span>
 <span class="sprite iconVoteUp">
 Votes  (1)
 </span>
</div>
This is about as far as I can drill down without getting TOO specific.
I simply find the <div> tags with the "cmPlaylistContent" class. This
tag contains both the song title and the artist name, and sometimes
miscellaneous other information as well, like a way to vote for the
song or links to purchase it from iTunes or Amazon.
So my question is, given the above HTML, how can I best extract the
song title and artist name? It SEEMS like they are always the first
two pieces of information in the tag, such that:
for item in div.stripped_strings: print(item)
Love Without End, Amen
George Strait
Download Song:
iTunes
|
Amazon MP3
Comments  (1)
Votes  (1)
and I could simply get the first two items returned by that generator.
It's not quite as clean as I'd like, because I have no idea if
anything could ever be inserted before either of these items, thus
messing it all up.
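For reference, here is a minimal sketch of the "first two items" idea. The helper name title_and_artist is made up for illustration, and the plain list stands in for div.stripped_strings so the snippet runs without Beautiful Soup. It assumes the title and artist really are always the first two strings, which is exactly the part I'm unsure about.

```python
from itertools import islice


def title_and_artist(strings):
    """Return (title, artist) taken from the first two strings.

    `strings` can be any iterable of text, e.g. div.stripped_strings.
    Returns None when fewer than two strings are available, so a
    malformed entry is skipped instead of raising an exception.
    """
    first_two = list(islice(strings, 2))
    if len(first_two) < 2:
        return None
    return tuple(first_two)


# Stand-in for div.stripped_strings, copied from the output above.
sample = iter([
    "Love Without End, Amen",
    "George Strait",
    "Download Song:",
    "iTunes",
])

print(title_and_artist(sample))
# prints ('Love Without End, Amen', 'George Strait')
```

With the real page, the call would presumably be title_and_artist(div.stripped_strings). Using islice means the generator is only consumed as far as needed, but it does nothing to protect against new content being inserted ahead of the title.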
I also don't want to rely on the <strong> tag, which makes me shudder,
or the <a> tag, because I don't know if they will always have an href.
Ideally, the <a> tag would also have had an attribute that labeled the
title as the title and the artist as the artist, but alas.
Therefore, I appeal to your greater wisdom in these matters. Given
this HTML, is there a "best practice" for how to refer to the song
title and artist?
Thanks!

