Better way to extract a country-wise player list from an HTML file using regex

Question 1

Problem statement:

Make a country-wise player list from the following HTML code.

HTML CODE:

<ul>
    <li>
        Australia
        <ol>
            <li>Steven Smith</li>
            <li>David Warner</li>
        </ol>
    </li>
    <li>
        Bangladesh
        <ol>
            <li>Mashrafe Mortaza</li>
            <li>Tamim Iqbal</li>
        </ol>
    </li>
    <li>
        England
        <ol>
            <li>Eoin Morgan</li>
            <li>Jos Buttler</li>
        </ol>
    </li>
</ul>

Expected Output:

Australia- Steven Smith, David Warner

Bangladesh- Mashrafe Mortaza, Tamim Iqbal

England- Eoin Morgan, Jos Buttler

My Code:

import re

with open('playerlist.html', 'r') as f:
    text = f.read()
mytext = re.sub(r'[\n\t]', '', text)
pat = r'<li>(\w+?)<ol><li>(\w+\s?\w+)</li><li>(\w+\s?\w+)</li>'
cpat = re.compile(pat)
result = cpat.findall(mytext)
for a,b,c in result:
    print('{0}- {1}, {2}'.format(a,b,c))

Comment

Regexps and HTML don’t go so well together. Better to parse the DOM.

Answer 1

Regular expressions are not the right tool for parsing HTML. Specialized HTML parsers do a better job and give a more robust, less fragile solution.

Just to name a few problems that exist in your current approach:

What if there are more than two players for a country?
What if there are 0 players for a country?
What if a country name contains a space or a single quote?
What if a player's name consists of more than two words or contains a single quote?
What if there are newlines after the opening or before the closing li tag?
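To make a few of these failure modes concrete, here is a minimal sketch that runs the pattern from the question against some small inputs. The `extract` helper and the sample strings (including the names "Third Player", "Player One" and "Player Two") are hypothetical and only there for illustration.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r'<li>(\w+?)<ol><li>(\w+\s?\w+)</li><li>(\w+\s?\w+)</li>')


def extract(html):
    """Apply the question's approach: drop newlines/tabs, then findall."""
    flattened = re.sub(r'[\n\t]', '', html)
    return PATTERN.findall(flattened)


# Hypothetical inputs, written on one line so that stripping \n and \t
# is enough to flatten them.
two_players = ("<ul><li>Australia<ol><li>Steven Smith</li>"
               "<li>David Warner</li></ol></li></ul>")
three_players = ("<ul><li>Australia<ol><li>Steven Smith</li>"
                 "<li>David Warner</li><li>Third Player</li></ol></li></ul>")
spaced_country = ("<ul><li>New Zealand<ol><li>Player One</li>"
                  "<li>Player Two</li></ol></li></ul>")

print(extract(two_players))     # the happy path: one complete tuple
print(extract(three_players))   # the third player is silently dropped
print(extract(spaced_country))  # the whole country is missed
```

The second case is the more worrying one: nothing fails, the output is simply incomplete.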

Instead, you may, for instance, use the BeautifulSoup library:

from bs4 import BeautifulSoup

with open('playerlist.html', 'r') as input_file:
    soup = BeautifulSoup(input_file, "html.parser")

for country in soup.select("ul > li"):
    # The country name is the first text node directly inside the outer <li>.
    country_name = country.find(text=True, recursive=False).strip()
    # Each player is an <li> directly inside the nested <ol>.
    players = [player.get_text(strip=True) for player in country.select("ol > li")]
    print('{country} - {players}'.format(country=country_name,
                                         players=', '.join(players)))
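If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's `html.parser` module. This is a different technique from the BeautifulSoup code above (an event-driven parser rather than a tree), and the names `PlayerListParser` and `parse_player_list` are made up for this example. It assumes the markup has the shape shown in the question: an outer list of country `li` items, each with one nested `ol` of players.

```python
from html.parser import HTMLParser


class PlayerListParser(HTMLParser):
    """Collects (country, [players]) pairs from <li> items with a nested <ol>."""

    def __init__(self):
        super().__init__()
        self.countries = []        # finished (country, players) pairs
        self._country_text = None  # text pieces of the open country <li>
        self._players = []         # players of the open country
        self._player_text = None   # text pieces of the open player <li>
        self._in_ol = False        # True while inside the nested <ol>

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self._in_ol = True
        elif tag == "li" and self._in_ol:
            self._player_text = []
        elif tag == "li":
            self._country_text, self._players = [], []

    def handle_data(self, data):
        if self._player_text is not None:
            self._player_text.append(data)
        elif self._country_text is not None and not self._in_ol:
            self._country_text.append(data)

    def handle_endtag(self, tag):
        if tag == "ol":
            self._in_ol = False
        elif tag == "li" and self._player_text is not None:
            self._players.append("".join(self._player_text).strip())
            self._player_text = None
        elif tag == "li" and self._country_text is not None:
            name = "".join(self._country_text).strip()
            self.countries.append((name, self._players))
            self._country_text = None


def parse_player_list(html):
    """Return a list of (country, players) tuples in document order."""
    parser = PlayerListParser()
    parser.feed(html)
    parser.close()
    return parser.countries


SAMPLE = """
<ul>
 <li>
 Australia
 <ol>
 <li>Steven Smith</li>
 <li>David Warner</li>
 </ol>
 </li>
</ul>
"""

for country, players in parse_player_list(SAMPLE):
    print('{}- {}'.format(country, ', '.join(players)))
# prints: Australia- Steven Smith, David Warner
```

Because the parser reacts to tags rather than to a fixed pattern, it copes with any number of players per country and with spaces in names, at the cost of more bookkeeping than the BeautifulSoup version.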

Answer 2

In addition to the advice from others to use a DOM parser, you may also want to separate concerns by splitting the parsing/grouping from the processing/printing of the items.

from collections import defaultdict

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def parse_players(html):
    """Parses players from the HTML text and groups them by country."""
    players_by_country = defaultdict(list)
    dom = BeautifulSoup(html, 'html5lib')
    ul = dom.find('ul')
    for li in ul.find_all('li', recursive=False):
        # The country name is the first text node inside the outer <li>.
        for item in li.contents:
            if isinstance(item, NavigableString):
                country = item.strip()
                break
        ol = li.find('ol', recursive=False)
        for li_ in ol.find_all('li', recursive=False):
            players_by_country[country].append(''.join(li_.contents).strip())
    return players_by_country


def print_players(players_by_country):
    """Prints the players of each country, one line per country."""
    for country, players in players_by_country.items():
        print('{}- {}'.format(country, ', '.join(players)))


if __name__ == '__main__':
    # HTML_TEXT is assumed to hold the HTML document as a string.
    print_players(parse_players(HTML_TEXT))
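One practical benefit of this split is that the output step can be exercised without any HTML at all. The sketch below takes the idea one step further by returning the formatted lines instead of printing them. `format_players` is a hypothetical helper, not part of the answer's code, and the dictionary is hand-built data standing in for the parser's return value.

```python
def format_players(players_by_country):
    """Return one 'Country- player, player' line per country."""
    return ['{}- {}'.format(country, ', '.join(players))
            for country, players in players_by_country.items()]


# Hypothetical, hand-built input in the shape the parsing step returns.
sample = {
    'Australia': ['Steven Smith', 'David Warner'],
    'England': ['Eoin Morgan', 'Jos Buttler'],
}

lines = format_players(sample)
print('\n'.join(lines))
```

With the formatting isolated like this, a change to the output format does not touch the parsing code, and either half can be checked on its own.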

alecxe · Answer 1 · 2017-10-27 12:35:11Z


score 2 · Answer 2 · 2017-10-27 12:40:34Z

