This is a follow-up to a previous question.
As part of learning the object-oriented approach and web scraping in Python, I've set out to write a program that gives me the match results of professional Counter-Strike games, in the order they appear on hltv.org. At first I just wanted a simple script that would download the website and print out the results, but I decided I didn't have to stop there.
The program goes through the source code to find today's match results. Then, pieces of information like the winning team and their score are pulled out of that source code so they can be represented in certain ways.
I'd greatly appreciate feedback, so I can know what's good and what isn't about this code. If there are any improvements that could be implemented, I'd be eager to learn about them.
Changes
First I'd like to thank user alecxe for his useful feedback and directing me on the right track.
Class
The most obvious change from the code in the previous question is that I have moved everything into a class. This is so I can eventually make this into a handy enough module.
get_results
The for loop, which pulls the pieces of information from the source code, is now part of a get_results method.
In the first versions, I completely overlooked the possibility of a match ending in a tie. This can only happen if the match has a best-of-two format. The format is rather uncommon and is usually adopted in group stages of smaller tournaments.
It came to me when I was trying to run the code and I got an unexpected AttributeError. It took me a while to realise the code wasn't suddenly broken; the tags in the source code simply change, from team team-won and team to team and team. As the for loop was looking for team team-won specifically, the search would return None and the error would be raised.
I'm not really comfortable with catching that particular error, but for now it works the way I want it to. If anyone knows a better way, I'd appreciate some feedback on it.
OrderedDict & Serializing
To keep the order of the games, I've implemented an OrderedDict, as regular dictionaries don't preserve the key order. Then, the match_results OrderedDict is dumped into JSON text. The data can be easily represented, as seen in the print_results method.
I'm not really sure if this is the most efficient way, but I know it works just fine for this purpose. I haven't really done much with JSON text before.
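A minimal sketch of the round trip described above, using made-up sample data (the timestamps and team names are placeholders, not real results). It shows that passing object_pairs_hook=OrderedDict keeps the keys in the order they appear in the JSON text when it is loaded back.

```python
import json
from collections import OrderedDict

# Hypothetical sample data keyed by timestamp; the values are placeholders.
match_results = OrderedDict()
match_results['1500040000000'] = {'winning_team': 'Team A', 'losing_team': 'Team B'}
match_results['1500030000000'] = {'winning_team': 'Team C', 'losing_team': 'Team D'}

# Serialize to JSON text, the same way get_results does.
text = json.dumps(match_results, indent=4, separators=(',', ':'))

# Load it back; object_pairs_hook preserves the key order found in the text.
restored = json.loads(text, object_pairs_hook=OrderedDict)

for timestamp, match in restored.items():
    print(timestamp, match['winning_team'], 'beat', match['losing_team'])
```

On Python 3.7 and later, plain dict also preserves insertion order, so the OrderedDict is mainly needed on older interpreters.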
Code
#!/usr/bin/env python3
import json
from collections import OrderedDict
from time import localtime, strftime

import requests
from bs4 import BeautifulSoup


class ResultScraper:
    MAPS = {
        'mrg': 'Mirage',
        'trn': 'Train',
        'ovp': 'Overpass',
        'inf': 'Inferno',
        'cch': 'Cache',
        'cbl': 'Cobblestone',
        'nuke': 'Nuke',
        'bo2': 'Best-of-two',
        'bo3': 'Best-of-three',
        'bo5': 'Best-of-five',
        '-': 'Default win'
    }

    def __init__(self, stars=0):
        self.url = 'https://www.hltv.org/results'
        self.date = strftime('%d %B %Y')
        if isinstance(stars, int) and 1 <= stars <= 5:
            self.stars = stars
            self.url += '?stars={}'.format(self.stars)

    def scrape(self):
        source = requests.get(self.url).text
        return BeautifulSoup(source, 'lxml')

    def check_match_dates(self, tag):
        result_tag = tag.name == 'div' and 'result-con' in tag.get('class', [])
        if not result_tag:
            return False
        timestamp = int(tag['data-zonedgrouping-entry-unix']) / 1000
        return strftime('%d %B %Y', localtime(timestamp)) == self.date

    def get_results(self):
        match_results = OrderedDict()
        soup = self.scrape()
        for result in soup(self.check_match_dates):
            timestamp = result['data-zonedgrouping-entry-unix']
            event = result.select_one('.event-name').get_text()
            map_played = result.select_one('.map-text').get_text()
            try:
                winning_team = result.select_one('.team.team-won').get_text()
                winning_team_score = result.select_one('.score-won').get_text()
                losing_team = result.select_one('.team.').get_text()
                losing_team_score = result.select_one('.score-lost').get_text()
            except AttributeError:
                winning_team = result.select_one('.team1').get_text(strip=True)
                losing_team = result.select_one('.team2').get_text(strip=True)
                winning_team_score = result.select_one('.score-tie').get_text()
                losing_team_score = winning_team_score
            match_results[timestamp] = {
                'winning_team': winning_team,
                'winning_team_score': winning_team_score,
                'losing_team': losing_team,
                'losing_team_score': losing_team_score,
                'event': event,
                'map': self.MAPS[map_played]
            }
        return json.dumps(match_results, indent=4, separators=(',', ':'))

    def print_results(self):
        results = json.loads(self.get_results(), object_pairs_hook=OrderedDict)
        if not results:
            print('No match results for {}'.format(self.date))
        else:
            for match in results.values():
                print('{winning_team:>20} {winning_team_score:<2} - '
                      '{losing_team_score:>2} {losing_team:<20}'
                      ' {map:<13}'.format(**match))
            print('\nCS:GO match results for {}'.format(self.date))
            print('Powered by HLTV.org')


if __name__ == '__main__':
    rs = ResultScraper()
    rs.print_results()
This is a follow-up question. Good work, btw. It looks great! :) – Grajdeanu Alex, Jul 14, 2017 at 13:42
1 Answer
The code is really clean, great job!
I'm not really comfortable with catching that particular error, but for now it works the way I want it to. If anyone knows a better way, I'd appreciate some feedback on it.
This is perfectly fine - it is even, generally speaking, much better than catching a broad Exception class. By catching a specific exception, you are not going to miss a different exception if one is raised in the future. More information at "Should I always specify an exception type in except statements?"
I would probably just add a clarifying comment in the exception handling logic explaining why we need it and what case it handles.
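To illustrate the point about narrow handlers, here is a small sketch that is not part of the scraper: parse_score is a hypothetical helper, and None stands in for a failed select_one lookup. Only the anticipated AttributeError is handled, and the handler carries a comment saying why; a different problem, such as malformed text, still surfaces.

```python
def parse_score(raw):
    """Convert a score string to int; return None when the value is missing."""
    try:
        return int(raw.strip())
    except AttributeError:
        # raw is None when the lookup found nothing (for example, a tied
        # match has no "winner" element), so there is no score to parse.
        return None


print(parse_score(' 16 '))   # a normal score string
print(parse_score(None))     # the anticipated "missing element" case

try:
    parse_score('sixteen')   # unexpected input
except ValueError as error:
    # Not swallowed by the AttributeError handler, so the bug stays visible.
    print('unexpected input:', error)
```

Had the handler been a bare except or except Exception, the malformed input would have been silently treated as a missing value.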
Some minor points:
- you should probably throw a ValueError in case stars is not valid
- you can work on making the code more modular - extracting things like dealing with timestamps into separate "library" functions - beware of God objects (it's not really an issue now, just something that may happen if the class is gonna grow this way)
- if performance matters, look into using faster third-party JSON libraries like ujson or simplejson
- improve on documentation - add documentation strings to your class methods
- the trailing dot inside the .team. CSS selector can be removed
- we can further improve performance by parsing only the results list with the help of a SoupStrainer class:

      parse_only = SoupStrainer(class_='results-all')
      return BeautifulSoup(source, 'lxml', parse_only=parse_only)

  Don't forget to import SoupStrainer from bs4.
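The suggestions about ValueError, separate "library" functions, and docstrings could look roughly like this. It is only a sketch under stated assumptions: build_results_url and timestamp_to_date are hypothetical names that do not appear in the original code, None is used here to mean "no star filter", and the valid range of 1 to 5 is taken from the check in __init__.

```python
from time import localtime, strftime

BASE_URL = 'https://www.hltv.org/results'


def build_results_url(stars=None):
    """Return the results URL, optionally filtered by star rating.

    Raises ValueError if stars is given but is not an int from 1 to 5.
    """
    if stars is None:
        return BASE_URL
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(stars, bool) or not isinstance(stars, int) or not 1 <= stars <= 5:
        raise ValueError('stars must be an int from 1 to 5, got {!r}'.format(stars))
    return '{}?stars={}'.format(BASE_URL, stars)


def timestamp_to_date(timestamp_ms):
    """Format a Unix timestamp in milliseconds as a local 'DD Month YYYY' date."""
    return strftime('%d %B %Y', localtime(int(timestamp_ms) / 1000))


print(build_results_url())
print(build_results_url(3))
print(timestamp_to_date('1500040000000'))
```

With helpers like these, the class could call build_results_url(stars) in __init__ and use timestamp_to_date in check_match_dates, keeping the class itself focused on scraping.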
No need to raise ValueError at all! Anything that isn't an int and outside of the specified range will just be ignored. Also, the leading dots are there to stay, removing them raises an AttributeError which isn't desirable. Using SoupStrainer did improve performance slightly though! Working on extracting functions and docstrings now. Thanks so much, appreciate your effort. – Luke, Jul 14, 2017 at 17:14
@LukaszSalitra oops, did I say leading? Sorry, meant trailing. Fixed. Thank you! – alecxe, Jul 14, 2017 at 17:15