5
\$\begingroup\$

Problem statement:

Build a country-wise player list from the following HTML.

HTML CODE:

<ul>
    <li>
        Australia
        <ol>
            <li>Steven Smith</li>
            <li>David Warner</li>
        </ol>
    </li>
    <li>
        Bangladesh
        <ol>
            <li>Mashrafe Mortaza</li>
            <li>Tamim Iqbal</li>
        </ol>
    </li>
    <li>
        England
        <ol>
            <li>Eoin Morgan</li>
            <li>Jos Buttler</li>
        </ol>
    </li>
</ul>

Expected Output:

Australia- Steven Smith, David Warner

Bangladesh- Mashrafe Mortaza, Tamim Iqbal

England- Eoin Morgan, Jos Buttler

My Code:

import re

with open('playerlist.html', 'r') as f:
    text = f.read()

mytext = re.sub(r'[\n\t]', '', text)
pat = r'<li>(\w+?)<ol><li>(\w+\s?\w+)</li><li>(\w+\s?\w+)</li>'
cpat = re.compile(pat)
result = cpat.findall(mytext)
for a, b, c in result:
    print('{0}- {1}, {2}'.format(a, b, c))
asked Oct 27, 2017 at 8:26
\$\endgroup\$
  • Regexps and HTML don’t go so well together. Better to parse the DOM. Commented Oct 27, 2017 at 8:37

2 Answers

4
\$\begingroup\$

Regular expressions are not the right tool for parsing HTML. Specialized HTML parsers do a better job and lead to a more robust, less fragile solution.

To name a few problems with your current approach:

  • what if there are more than two players for a country
  • what if there are 0 players for a country
  • what if a country name contains a space or a single quote
  • what if a player's name consists of more than two words or contains a single quote
  • what if there are newlines after the opening or before the closing li tag
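Two of these failure modes can be reproduced in a short sketch. It reuses the pattern from the question; the helper name `extract` and the sample inputs (a third player, a two-word country) are illustrative additions, not part of the original data.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r'<li>(\w+?)<ol><li>(\w+\s?\w+)</li><li>(\w+\s?\w+)</li>')


def extract(html):
    """Apply the question's approach: drop newlines/tabs, then findall."""
    return PATTERN.findall(re.sub(r'[\n\t]', '', html))


# Illustrative input with three players for one country.
three_players = ('<ul><li>Australia<ol><li>Steven Smith</li>'
                 '<li>David Warner</li><li>Aaron Finch</li></ol></li></ul>')

# Illustrative input with a space in the country name.
spaced_country = ('<ul><li>New Zealand<ol><li>Kane Williamson</li>'
                  '<li>Ross Taylor</li></ol></li></ul>')

# The third player is silently dropped: the pattern captures exactly two.
print(extract(three_players))

# The country is skipped entirely: \w+? cannot match the space.
print(extract(spaced_country))
```

Neither case raises an error, so the missing data would be easy to overlook.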

Instead, you could use, for instance, the BeautifulSoup library:

from bs4 import BeautifulSoup

with open('playerlist.html', 'r') as input_file:
    soup = BeautifulSoup(input_file, "html.parser")

for country in soup.select("ul > li"):
    # The first text node directly inside the <li> is the country name.
    # (`string=` is the current name of the older `text=` argument.)
    country_name = country.find(string=True, recursive=False).strip()
    players = [player.get_text(strip=True) for player in country.select("ol > li")]
    # '{country}- {players}' matches the expected output format.
    print('{country}- {players}'.format(country=country_name,
                                        players=', '.join(players)))
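If installing a third-party package is not an option, the same idea (parse the markup instead of pattern-matching it) can be roughly sketched with the standard library's `html.parser`. This is a swapped-in technique rather than the BeautifulSoup approach above; the names `PlayerListParser` and `extract_players` and the sample data are hypothetical, and the sketch assumes every country `<li>` contains an `<ol>`.

```python
from html.parser import HTMLParser


class PlayerListParser(HTMLParser):
    """Collect (country, [players]) pairs from a <ul> whose items hold an <ol>."""

    def __init__(self):
        super().__init__()
        self.countries = []     # list of (country_name, [player, ...])
        self.in_ol = False      # True while inside a country's <ol>
        self.in_player = False  # True while inside a player's <li>
        self.text = []          # text fragments of the current item

    def handle_starttag(self, tag, attrs):
        if tag == 'ol':
            # Text seen since the enclosing <li> opened is the country name.
            self.countries.append((''.join(self.text).strip(), []))
            self.in_ol = True
            self.text = []
        elif tag == 'li':
            self.text = []
            self.in_player = self.in_ol

    def handle_endtag(self, tag):
        if tag == 'ol':
            self.in_ol = False
        elif tag == 'li' and self.in_player:
            self.countries[-1][1].append(''.join(self.text).strip())
            self.in_player = False
            self.text = []

    def handle_data(self, data):
        self.text.append(data)


def extract_players(html):
    """Return a list of (country, players) tuples in document order."""
    parser = PlayerListParser()
    parser.feed(html)
    parser.close()
    return parser.countries


# Illustrative input: a two-word country, three players, and an empty list.
SAMPLE = """
<ul>
  <li>
    New Zealand
    <ol>
      <li>Kane Williamson</li>
      <li>Ross Taylor</li>
      <li>Trent Boult</li>
    </ol>
  </li>
  <li>
    Ireland
    <ol></ol>
  </li>
</ul>
"""

for country, players in extract_players(SAMPLE):
    print('{}- {}'.format(country, ', '.join(players)))
```

A dedicated parser such as BeautifulSoup remains more forgiving of malformed markup; this sketch only shows that the cases listed above stop being special once the structure is parsed.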
answered Oct 27, 2017 at 12:35
\$\endgroup\$
2
\$\begingroup\$

In addition to the advice from others to use a DOM parser, you may also want to separate your concerns by splitting the parsing/grouping from the processing/printing of the items.

from collections import defaultdict

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def parse_players(html):
    """Parses players from the HTML text and groups them by country."""
    players_by_country = defaultdict(list)
    dom = BeautifulSoup(html, 'html5lib')
    ul = dom.find('ul')

    for li in ul.find_all('li', recursive=False):
        # The first text node directly inside the <li> is the country name.
        for item in li.contents:
            if isinstance(item, NavigableString):
                country = item.strip()
                break

        ol = li.find('ol', recursive=False)

        for li_ in ol.find_all('li', recursive=False):
            players_by_country[country].append(''.join(li_.contents).strip())

    return players_by_country


def print_players(players_by_country):
    """Prints the players of each country."""
    for country, players in players_by_country.items():
        print('{}- {}'.format(country, ', '.join(players)))


if __name__ == '__main__':
    # HTML_TEXT was left undefined in the snippet; reading the file from
    # the question is an assumption.
    with open('playerlist.html', 'r') as html_file:
        HTML_TEXT = html_file.read()

    print_players(parse_players(HTML_TEXT))
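One payoff of this split is that the output step can be exercised without any HTML at all. As a minimal sketch, the hypothetical `format_players` below returns the lines instead of printing them, and a hand-written dict stands in for the result of `parse_players()`.

```python
def format_players(players_by_country):
    """Return one 'Country- player, player' line per country.

    Hypothetical variant of print_players() that returns strings so the
    formatting can be checked directly.
    """
    return ['{}- {}'.format(country, ', '.join(players))
            for country, players in players_by_country.items()]


# Hand-written stand-in for the dict that parse_players() would return.
sample = {
    'Australia': ['Steven Smith', 'David Warner'],
    'England': ['Eoin Morgan', 'Jos Buttler'],
}

lines = format_players(sample)
print('\n'.join(lines))
```

The same dict could just as well be fed to a different consumer (sorting, JSON export) without touching the parsing code.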
answered Oct 27, 2017 at 12:40
\$\endgroup\$
