Problem statement:
Make a country-wise player list from the following HTML code
HTML CODE:
<ul>
<li>
Australia
<ol>
<li>Steven Smith</li>
<li>David Warner</li>
</ol>
</li>
<li>
Bangladesh
<ol>
<li>Mashrafe Mortaza</li>
<li>Tamim Iqbal</li>
</ol>
</li>
<li>
England
<ol>
<li>Eoin Morgan</li>
<li>Jos Buttler</li>
</ol>
</li>
</ul>
Expected Output:
Australia- Steven Smith, David Warner
Bangladesh- Mashrafe Mortaza, Tamim Iqbal
England- Eoin Morgan, Jos Buttler
My Code:
import re
with open('playerlist.html', 'r') as f:
    text = f.read()
mytext = re.sub(r'[\n\t]', '', text)
pat = r'<li>(\w+?)<ol><li>(\w+\s?\w+)</li><li>(\w+\s?\w+)</li>'
cpat = re.compile(pat)
result = cpat.findall(mytext)
for a,b,c in result:
    print('{0}- {1}, {2}'.format(a,b,c))
Comment (morbusg, Oct 27, 2017 at 8:37): Regexps and HTML don't go so well together. Better to parse the DOM.
2 Answers
Regular expressions are not the right tool for parsing HTML. Specialized HTML parsers do a better job and lead to a more robust, less fragile solution.
To name a few problems that exist in your current approach:
- what if there are more than two players for a country
- what if there are 0 players for a country
- what if a country name contains a space or a single quote
- what if a player's name consists of more than two words or contains a single quote
- what if there are newlines after the opening or before the closing li tag
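To make the first and third points concrete, here is a small sketch that runs the pattern from the question against two made-up inputs. The sample HTML strings and the `extract` helper are my own assumptions for illustration; only the pattern itself comes from the question.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r'<li>(\w+?)<ol><li>(\w+\s?\w+)</li><li>(\w+\s?\w+)</li>')


def extract(html):
    """Apply the question's approach: strip newlines and tabs, then findall."""
    return PATTERN.findall(re.sub(r'[\n\t]', '', html))


# Hypothetical inputs, not taken from the question.
three_players = ('<ul><li>Australia<ol><li>Steven Smith</li>'
                 '<li>David Warner</li><li>Aaron Finch</li></ol></li></ul>')
spaced_country = ('<ul><li>New Zealand<ol><li>Kane Williamson</li>'
                  '<li>Ross Taylor</li></ol></li></ul>')

# The third player is silently dropped, because the pattern hard-codes two.
print(extract(three_players))

# Nothing matches at all, because \w+? cannot cross the space in the name.
print(extract(spaced_country))
```

Neither case raises an error, so the missing data would be easy to overlook.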
Instead, you may, for instance, use the BeautifulSoup library:
from bs4 import BeautifulSoup
with open('playerlist.html', 'r') as input_file:
    soup = BeautifulSoup(input_file, "html.parser")
# Each matching <li> holds a country name followed by an <ol> of players.
for country in soup.select("ul > li"):
    # First direct text node of the <li>, i.e. the country name.
    country_name = country.find(text=True, recursive=False).strip()
    players = [player.get_text(strip=True) for player in country.select("ol > li")]
    print('{country} - {players}'.format(country=country_name,
                                         players=', '.join(players)))
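If installing a third-party package is not an option, the same idea can be sketched with the standard library's `html.parser` module. This is a swapped-in technique, not BeautifulSoup, and not something the question asked for. The class name `PlayerListParser`, the helper `parse_players_stdlib`, and the sample markup are assumptions made for illustration, and the sketch assumes every country item contains an `<ol>`.

```python
from html.parser import HTMLParser


class PlayerListParser(HTMLParser):
    """Groups player names by country for a <ul> whose items each hold an <ol>."""

    def __init__(self):
        super().__init__()
        self.players_by_country = {}
        self._depth = 0        # 1 = inside a country <li>, 2 = inside a player <li>
        self._country = None   # country whose <ol> has been opened
        self._buffer = []      # text collected since the last relevant tag

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self._depth += 1
            self._buffer = []
        elif tag == 'ol' and self._depth == 1:
            # Text seen so far inside the outer <li> is the country name.
            self._country = ''.join(self._buffer).strip()
            self.players_by_country.setdefault(self._country, [])
            self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != 'li' or self._depth == 0:
            return
        if self._depth == 2 and self._country is not None:
            player = ''.join(self._buffer).strip()
            if player:
                self.players_by_country[self._country].append(player)
        elif self._depth == 1:
            self._country = None
        self._depth -= 1
        self._buffer = []


def parse_players_stdlib(html):
    """Hypothetical helper: feed the HTML text and return {country: [players]}."""
    parser = PlayerListParser()
    parser.feed(html)
    parser.close()
    return parser.players_by_country


# Made-up sample covering a spaced country name, three players and zero players.
sample = """
<ul>
  <li>
    New Zealand
    <ol>
      <li>Kane Williamson</li>
      <li>Ross Taylor</li>
      <li>Trent Boult</li>
    </ol>
  </li>
  <li>Ireland<ol></ol></li>
</ul>
"""

for country, players in parse_players_stdlib(sample).items():
    print('{}- {}'.format(country, ', '.join(players)))
```

It is noticeably more code than the BeautifulSoup version, since the nesting has to be tracked by hand, which is a fair argument for using the library when you can.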
In addition to the hint of using a DOM parser given by others, you may also want to separate your concerns by splitting the parsing/grouping and processing/printing of the items.
from collections import defaultdict
from bs4 import BeautifulSoup
from bs4.element import NavigableString

def parse_players(html):
    """Parses players from the HTML text and groups them by country."""
    players_by_country = defaultdict(list)
    dom = BeautifulSoup(html, 'html5lib')
    ul = dom.find('ul')
    for li in ul.find_all('li', recursive=False):
        # The first bare text node of the outer <li> is the country name.
        for item in li.contents:
            if isinstance(item, NavigableString):
                country = item.strip()
                break
        ol = li.find('ol', recursive=False)
        for li_ in ol.find_all('li', recursive=False):
            players_by_country[country].append(''.join(li_.contents).strip())
    return players_by_country

def print_players(players_by_country):
    """Prints the players of each country."""
    for country, players in players_by_country.items():
        print('{}- {}'.format(country, ', '.join(players)))

if __name__ == '__main__':
    # HTML_TEXT was not defined in the original snippet; reading it from the
    # question's file is an assumption.
    with open('playerlist.html', 'r') as input_file:
        HTML_TEXT = input_file.read()
    print_players(parse_players(HTML_TEXT))
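One payoff of the split is that the grouped data can feed more than one consumer without touching the parsing. The sketch below is only an illustration: `format_players` and `players_to_json` are hypothetical helpers, and `parsed` is a hand-written stand-in for the mapping that `parse_players` returns.

```python
import json


def format_players(players_by_country):
    """Return one 'Country- player, player' line per country, sorted by country."""
    return ['{}- {}'.format(country, ', '.join(players))
            for country, players in sorted(players_by_country.items())]


def players_to_json(players_by_country):
    """Serialize the same mapping as JSON, e.g. for another program to consume."""
    return json.dumps(players_by_country, sort_keys=True)


# Hand-written stand-in for the result of parse_players (an assumption).
parsed = {
    'England': ['Eoin Morgan', 'Jos Buttler'],
    'Australia': ['Steven Smith', 'David Warner'],
}

print('\n'.join(format_players(parsed)))
print(players_to_json(parsed))
```

Returning the lines instead of printing them also makes the formatting step easy to check in a unit test.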