Serializing output of a match result web scraper

Question 1

This is a follow-up to a previous question.

As part of learning the object-oriented approach and web scraping in Python, I've set out to write a program that gives me the match results of professional Counter-Strike games, in the order they appear on hltv.org. At first I just wanted a simple script that would download the page and print the results, but I decided I didn't have to stop there.

The program goes through the source code to find today's match results. Then, pieces of information like the winning team and their score are pulled out of that source code so they can be represented in certain ways.

I'd greatly appreciate feedback, so I can know what's good and what isn't about this code. If there are any improvements that could be implemented, I'd be eager to learn about them.

Changes

First, I'd like to thank user alecxe for his useful feedback and for putting me on the right track.

Class

The most obvious change from the previous question's code is that I have moved everything into a class. This is so I can eventually turn it into a handy module.

`get_results`

The for loop that pulls the pieces of information from the source code is now part of a `get_results` method.

In the first version, I completely omitted the possibility of a match ending in a tie. This can only happen if the match has a best-of-two format. The format is rather uncommon and is usually adopted in the group stages of smaller tournaments.

It came to me when I ran the code and got an unexpected AttributeError. It took me a while to realise the code wasn't suddenly broken; the classes on the tags in the source code simply change, from `team team-won` and `team` to `team` and `team`. As the for loop was looking for `team team-won` specifically, the search would return None and the error would be raised.

I'm not really comfortable with catching that particular error, but for now it works the way I want it to. If anyone knows a better way, I'd appreciate some feedback on it.
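For reference, one alternative to catching AttributeError is to test the result of `select_one` for None explicitly. This is only a minimal sketch, assuming simplified HTML fragments: the class names are taken from the code in this question, while the fragments themselves and the `winner_name` helper are hypothetical and not the real hltv.org markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments; only the class names come from the question's code.
WIN_HTML = ('<div class="result-con">'
            '<div class="team team-won">Alpha</div>'
            '<div class="team">Bravo</div></div>')
TIE_HTML = ('<div class="result-con">'
            '<div class="team">Alpha</div>'
            '<div class="team">Bravo</div></div>')


def winner_name(fragment):
    """Return the winning team's name, or None if no team is marked as winner."""
    result = BeautifulSoup(fragment, 'html.parser')
    tag = result.select_one('.team.team-won')
    # select_one returns None when nothing matches; calling get_text() on
    # that None is what raises AttributeError for tied matches.
    if tag is None:
        return None
    return tag.get_text(strip=True)


decided = winner_name(WIN_HTML)
tied = winner_name(TIE_HTML)
```

Whether this reads better than the try/except is a matter of taste; it mainly makes the tie case visible at the point where it is detected.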

OrderedDict & Serializing

To keep the order of the games, I've used an OrderedDict, as regular dictionaries don't preserve the key order. Then, the `match_results` OrderedDict is dumped into JSON text. The data can be easily represented, as seen in the `print_results` method.

I'm not really sure if this is the most efficient way, but I know it works just fine for this purpose. I haven't really done much with JSON text before.
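The round trip described above can be illustrated in isolation. This is a small sketch with made-up sample data (the team names and timestamps are hypothetical); it only shows that dumping an OrderedDict to JSON and loading it back with `object_pairs_hook=OrderedDict` keeps the keys in their original order.

```python
import json
from collections import OrderedDict

# Hypothetical sample data; timestamps as keys mirror the layout used here.
match_results = OrderedDict()
match_results['1500040800000'] = {'winning_team': 'Alpha', 'losing_team': 'Bravo'}
match_results['1500033600000'] = {'winning_team': 'Charlie', 'losing_team': 'Delta'}

text = json.dumps(match_results, indent=4)

# object_pairs_hook rebuilds every JSON object as an OrderedDict, so the
# keys come back in the order in which they appear in the JSON text.
restored = json.loads(text, object_pairs_hook=OrderedDict)
order = list(restored)
```

On Python 3.7 and later, plain dicts also preserve insertion order, so the OrderedDict mostly matters on older versions or when the ordering should be explicit to the reader.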

Code

```
#!/usr/bin/env python3
import json
from collections import OrderedDict
from time import localtime, strftime

import requests
from bs4 import BeautifulSoup


class ResultScraper:
    MAPS = {
        'mrg': 'Mirage',
        'trn': 'Train',
        'ovp': 'Overpass',
        'inf': 'Inferno',
        'cch': 'Cache',
        'cbl': 'Cobblestone',
        'nuke': 'Nuke',
        'bo2': 'Best-of-two',
        'bo3': 'Best-of-three',
        'bo5': 'Best-of-five',
        '-': 'Default win'
    }

    def __init__(self, stars=0):
        self.url = 'https://www.hltv.org/results'
        self.date = strftime('%d %B %Y')
        if isinstance(stars, int) and 1 <= stars <= 5:
            self.stars = stars
            self.url += '?stars={}'.format(self.stars)

    def scrape(self):
        source = requests.get(self.url).text
        return BeautifulSoup(source, 'lxml')

    def check_match_dates(self, tag):
        result_tag = tag.name == 'div' and 'result-con' in tag.get('class', [])
        if not result_tag:
            return False
        timestamp = int(tag['data-zonedgrouping-entry-unix']) / 1000
        return strftime('%d %B %Y', localtime(timestamp)) == self.date

    def get_results(self):
        match_results = OrderedDict()
        soup = self.scrape()
        for result in soup(self.check_match_dates):
            timestamp = result['data-zonedgrouping-entry-unix']
            event = result.select_one('.event-name').get_text()
            map_played = result.select_one('.map-text').get_text()
            try:
                winning_team = result.select_one('.team.team-won').get_text()
                winning_team_score = result.select_one('.score-won').get_text()
                losing_team = result.select_one('.team.').get_text()
                losing_team_score = result.select_one('.score-lost').get_text()
            except AttributeError:
                winning_team = result.select_one('.team1').get_text(strip=True)
                losing_team = result.select_one('.team2').get_text(strip=True)
                winning_team_score = result.select_one('.score-tie').get_text()
                losing_team_score = winning_team_score
            match_results[timestamp] = {
                'winning_team': winning_team,
                'winning_team_score': winning_team_score,
                'losing_team': losing_team,
                'losing_team_score': losing_team_score,
                'event': event,
                'map': self.MAPS[map_played]
            }
        return json.dumps(match_results, indent=4, separators=(',', ':'))

    def print_results(self):
        results = json.loads(self.get_results(), object_pairs_hook=OrderedDict)
        if not results:
            print('No match results for {}'.format(self.date))
        else:
            for match in results.values():
                print('{winning_team:>20} {winning_team_score:<2} - '
                      '{losing_team_score:>2} {losing_team:<20}'
                      ' {map:<13}'.format(**match))
            print('\nCS:GO match results for {}'.format(self.date))
            print('Powered by HLTV.org')


if __name__ == '__main__':
    rs = ResultScraper()
    rs.print_results()
```

Comment

This is a follow-up question. Good work, btw. It looks great! :)

Answer

The code is really clean, great job!

> I'm not really comfortable with catching that particular error, but for now it works the way I want it to. If anyone knows a better way, I'd appreciate some feedback on it.

This is perfectly fine - it is even, generally speaking, much better than catching a broad Exception class. By catching a specific exception, you won't silently swallow a different exception if one is raised in the future. More information at "Should I always specify an exception type in except statements?"

I would probably just add a clarifying comment in the exception handling logic explaining why we need it and what case it handles.

Some minor points:

- you should probably throw a ValueError in case `stars` is not valid
- you can work on making the code more modular - extracting things like dealing with timestamps into separate "library" functions - beware of God objects (it's not really an issue now, just something that may happen if the class is gonna grow this way)
- if performance matters, look into using faster third-party JSON libraries like ujson or simplejson
- improve on documentation - add documentation strings to your class methods
- the trailing dot inside the `.team.` CSS selector can be removed
- we can further improve performance by parsing only the results list with the help of a SoupStrainer class:
```
parse_only = SoupStrainer(class_='results-all') 
return BeautifulSoup(source, 'lxml', parse_only=parse_only)
```
Don't forget to import SoupStrainer from bs4.
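
Two of the minor points above, raising a ValueError for an invalid `stars` value and pulling the timestamp handling into a standalone documented function, might look roughly like this. The names `validate_stars`, `format_match_date` and `DATE_FORMAT` are made up for illustration, and treating 0 as "no filter" is an assumption based on the constructor's default.

```python
from time import localtime, strftime

DATE_FORMAT = '%d %B %Y'


def validate_stars(stars):
    """Return stars unchanged if it is an int from 0 to 5.

    0 is assumed to mean "no star filter". Anything else raises ValueError.
    """
    if not isinstance(stars, int) or not 0 <= stars <= 5:
        raise ValueError(
            'stars must be an integer between 0 and 5, got {!r}'.format(stars))
    return stars


def format_match_date(unix_ms):
    """Convert a millisecond Unix timestamp (int or str) to a local date string."""
    return strftime(DATE_FORMAT, localtime(int(unix_ms) / 1000))
```

With helpers like these, the class would only need to call `validate_stars(stars)` in `__init__` and compare `format_match_date(...)` against `self.date` in `check_match_dates`.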

Comment

No need to raise ValueError at all! Anything that isn't an int or is outside of the specified range will just be ignored. Also, the leading dots are there to stay; removing them raises an AttributeError, which isn't desirable. Using SoupStrainer did improve performance slightly though! Working on extracting functions and docstrings now. Thanks so much, I appreciate your effort.

Comment

@LukaszSalitra oops, did I say leading? Sorry, meant trailing. Fixed. Thank you!

alecxe · Accepted Answer · 2017-07-14 14:17:59Z
