This Python module contains helper functions to download the result page of SEDE queries and extract columns from it, most prominently:
- fetch_sede_soup(label, url): download the page of SEDE results, cache it or else use previously cached content, and return as a BeautifulSoup instance
- fetch_table(label, url): fetch a SEDE page as a table, returned as a tuple of:
  - column meta data as a dictionary of name -> data
  - rows as a list of lists
- other helper functions used by the above main functions
Here's the code:
import json
import logging
import os
import requests
from bs4 import BeautifulSoup
BASE_DIR = os.path.dirname(__file__)
CACHE_DIR = os.path.join(BASE_DIR, '.cache')
def fetch_sede_soup(label, url):
"""
Download the result page of a SEDE query and create a BeautifulSoup from it.
If the page contains results, cache it in a file.
If the page doesn't contain results, use the cache instead.
Note: this happens when the SEDE query is not executed in the browser
for a few days.
:param label: a simple name to represent the URL, it will be used as the cache filename
:param url: the URL to download
:return: a BeautifulSoup instance from the URL
"""
cache_path = os.path.join(CACHE_DIR, '{}.html'.format(label))
debug_cache_path = os.path.join(CACHE_DIR, '{}-debug.html'.format(label))
if not os.path.isdir(CACHE_DIR):
os.mkdir(CACHE_DIR)
logging.info('fetching {} as {}'.format(label, url))
html = requests.get(url).text
soup = BeautifulSoup(html)
def is_valid(soup):
for script in soup.findAll('script'):
if 'resultSets' in script.text:
return True
return False
def write_cache(path, html):
with open(path, 'w') as fh:
fh.write(html)
if is_valid(soup):
logging.info('updating cache')
write_cache(cache_path, html)
return soup
else:
logging.warning('result not valid')
write_cache(debug_cache_path, html)
if os.path.exists(cache_path):
logging.info('using previous cache')
with open(cache_path) as fh:
return BeautifulSoup(fh)
else:
logging.error('no previous cache: you must download the page manually')
return BeautifulSoup()
def fetch_table(label, url):
"""
Fetch a URL using `fetch_sede_soup` and extract a table as a tuple of {cols} and [rows].
- {cols} is a mapping of column names to column meta data
- [rows] is a list of rows in the table
:param label: a simple name to represent the URL, it will be used as the cache filename
:param url: the URL to download
:return: a tuple of ({cols}, [rows])
"""
soup = fetch_sede_soup(label, url)
if not soup:
return {}, []
return extract_table(soup)
def transform_columns_meta(se_columns_meta):
"""
Transform SE column meta data, for example,
from:
[
{'name': 'User Link', 'type': 'User'},
{'name': 'Post Link', 'type': 'Post'}
]
to:
{
'User Link': {'name': 'User Link', 'type': 'User', 'index': 0},
'Post Link': {'name': 'Post Link', 'type': 'Post', 'index': 1}
}
:param se_columns_meta: list of dictionaries describing the fields
:return: dictionary of dictionaries, with index added
"""
columns_meta = {}
for index, se_col_meta in enumerate(se_columns_meta):
col_meta = {'index': index}
col_meta.update(se_col_meta)
columns_meta[se_col_meta['name']] = col_meta
return columns_meta
def extract_table(soup):
"""
Return a tuple of:
- meta data of columns as a dictionary
- list of rows
:param soup: a bs4 (BeautifulSoup) object
:return: a tuple of ({cols}, [rows])
"""
for script in soup.findAll('script'):
result_sets_col = 'resultSets'
if result_sets_col in script.text:
start = script.text.rindex('{', 0, script.text.index(result_sets_col))
end = script.text.index('}', script.text.index('querySetId')) + 1
data = json.loads(script.text[start:end])
results = data[result_sets_col][0]
columns = transform_columns_meta(results['columns'])
rows = results['rows']
return columns, rows
return {}, []
def extract_column(soup, colname):
"""
Return a generator of cell values in selected column.
For simple columns like timestamp, a cell value can be simple,
for example: 1414433013197
For more complex columns like Post Link, a cell value can be an object,
for example:
{
"id": 68102,
"title": "Bash Script - File Comment out & Notate"
}
:param soup: a bs4 (BeautifulSoup) object
:param colname: name of the SEDE column to extract
:return: generator of cell values in selected column
"""
cols, rows = extract_table(soup)
if colname not in cols:
return
index = cols[colname]['index']
for row in rows:
yield row[index]
An example usage of the module, getting the post ids:
cols, rows = sede.fetch_table('naruto', 'http://data.stackexchange.com/codereview/query/264586/naruto-accepted-answer-with-zero-score')
answer_id_index = cols['Post Link']['index']
post_ids = [row[answer_id_index]['id'] for row in rows]
I'm looking for feedback and criticism of any aspect of the code.
1 Answer
Helper functions to extract SEDE query results into a more user-friendly format
It's good that all methods include a doc string. But notice that they are long and complicated. Especially the explanation of the returned non-trivial tuples is awkward. It's no wonder: tuples with complex values are awkward. These are signs that probably the design could be cleaner.
Unfortunately, the functions in this module don't exactly scream "user-friendly":
fetch_sede_soup(label, url):
- Summary (what it does): download a URL, turn it into a BeautifulSoup instance
- In terms of usability, this "soup" doesn't help much. Of course there are the other methods that do something with it. But then why expose this method at all? It looks more like a low-level API method that would be better hidden. Rename it to _fetch_sede_soup to mark it private.
- The label parameter doesn't seem to serve much purpose. The implementation uses this as the filename for caching the content. As such, it looks like an internal detail. The method should derive a suitable label by itself, without forcing callers to specify it (see the sketch after this list).
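For example, the label could be derived from the last path segment of the query URL. A minimal sketch, using the hypothetical name _derive_label (a fuller version with sanitization appears as _url_to_slug in the alternative implementation below):

import re

def _derive_label(url):
    # Use the last path segment of the URL as the cache label,
    # falling back to a fixed name when the URL has no usable segment.
    match = re.search(r'([^/]+)/*$', url)
    return match.group(1) if match else 'cache'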
fetch_table(label, url):
- Summary (what it does): turn a URL into a soup and convert that to a "table"
- The problem is that the returned value is rather cryptic, and not at all easy to use. The returned value is a tuple with non-trivial values, and a far cry from a "table".
- In fact, in order to understand the returned tuple, users would probably have to dig deep into the implementation of other functions called by this one, extract_table and transform_columns_meta too.
transform_columns_meta(se_columns_meta):
- Clearly intended for internal use. This should be renamed to _transform_columns_meta.
extract_table(soup):
- Used by fetch_table, and seems to be the source of all the usability issues. As mentioned above, the problem is the representation of the tuple. You need an intimate understanding of the tuple to be able to use it. It's not intuitive at all, you have to read the documentation (and it probably won't even help).
extract_column(soup, colname):
- The single function that's actually easy to use: pass in a soup and a column name and it gives back a generator of the values.
- The generator seems pointless. The soup parser has probably already read the entire HTML content during parsing, and SEDE results will never get excessively big. A simple list would be just as fine (see the sketch after this list).
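A list-returning variant would be a drop-in replacement. A minimal sketch, reusing extract_table from the question as-is:

def extract_column(soup, colname):
    """
    Return a list of cell values in the selected column,
    or an empty list when the column doesn't exist.
    """
    cols, rows = extract_table(soup)
    if colname not in cols:
        return []
    index = cols[colname]['index']
    return [row[index] for row in rows]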
In short, the module is not user-friendly:
- The interface is not clear: it's really hard to figure out how to use the returned values.
- It exposes too many functions, and it's not clear how to use them.
- It spills out too many implementation details, such as the label parameter and the notion of a "soup".
It would be better to set clearer goals, and hide more details, for example:
- Provide a function that takes a URL and returns a Table
- A Table could have methods like:
  - Get the available column names
  - Get the values of a column as a list
  - Get rows as a list of dictionaries
- Notice that there's no mention of "soup" and caching. These are implementation details that don't need to be exposed to users.
Alternative implementation
Taking into account the above points, consider this alternative implementation:
import json
import logging
import os
import re
import requests
from bs4 import BeautifulSoup
BASE_DIR = os.path.dirname(__file__)
CACHE_DIR = os.path.join(BASE_DIR, '.cache')
class Table:
"""
Represent the results of a SEDE query.
For simple columns like timestamp, a cell value can be simple,
for example: 1414433013197
For more complex columns like Post Link, a cell value can be an object,
for example:
{
"id": 68102,
"title": "Bash Script - File Comment out & Notate"
}
"""
def __init__(self, columns=None, rows=None):
"""
Create a Table from columns meta data and rows.
:param columns: meta data of columns as a dict
:param rows: rows of the table as list of dict
:return: new Table instance
"""
if not columns:
columns = {}
if not rows:
rows = []
self._columns = columns
self._rows = rows
self._colnames = set(columns.keys())
@property
def colnames(self):
"""
Get list of column names
:return: list of column names
"""
return self._colnames
def column(self, name):
"""
Get column, by iterating over rows and extracting specified column.
:param name: name of the column to extract
:return: content of the column as a list
"""
index = self._columns[name]['index']
return [row[index] for row in self._rows]
def post_ids(self):
"""
Convenience method to extract the ids from a Post Link column.
:return: list of post ids
"""
return [post_link['id'] for post_link in self.column('Post Link')]
def fetch_table(url):
"""
Fetch a URL using `_fetch_sede_soup` and extract to a Table.
:param url: the URL to download
:return: the Table representing the SEDE results, or None if fetch failed
"""
soup = _fetch_sede_soup(url)
if not soup:
return None
return _extract_table(soup)
def _fetch_sede_soup(url):
"""
Download the result page of a SEDE query and create a BeautifulSoup from it.
If the page contains results, cache it.
If the page doesn't contain results, use the cache instead.
Note: this happens when the SEDE query is not executed in the browser
for a few days.
:param url: the URL to download
:return: a BeautifulSoup instance from the URL
"""
label = _url_to_slug(url)
cache_path = os.path.join(CACHE_DIR, '{}.html'.format(label))
debug_cache_path = os.path.join(CACHE_DIR, '{}-debug.html'.format(label))
if not os.path.isdir(CACHE_DIR):
os.mkdir(CACHE_DIR)
logging.info('fetching {}'.format(url))
html = requests.get(url).text
soup = BeautifulSoup(html)
def is_valid(soup):
for script in soup.findAll('script'):
if 'resultSets' in script.text:
return True
return False
def write_cache(path, html):
with open(path, 'w') as fh:
fh.write(html)
if is_valid(soup):
logging.info('updating cache')
write_cache(cache_path, html)
return soup
else:
logging.warning('result not valid')
write_cache(debug_cache_path, html)
if os.path.exists(cache_path):
logging.info('using previous cache')
with open(cache_path) as fh:
return BeautifulSoup(fh)
else:
logging.error('no previous cache: you must download the page manually')
return None
def _url_to_slug(url):
"""
Convert a URL to a slug, if possible, the last readable path segment.
>>> _url_to_slug('http://data.stackexchange.com/codereview/query/412155/ripe-zombies')
'ripe-zombies'
>>> _url_to_slug('http://data.stackexchange.com/codereview/query/412155/ripe-zombies/')
'ripe-zombies'
>>> _url_to_slug('http://data.stackexchange.com/codereview/query/412155/ripe-zo?>m^&bies/')
'ripe-zombies'
>>> _url_to_slug('')
'cache'
>>> _url_to_slug('#$%^')
'cache'
:param url: the URL of a SEDE query
:return: a slug extracted from the URL
"""
try:
slug = re.sub(r'[^a-z-]', '', re.search(r'([^/]+)/*$', url).group(1))
if slug:
return slug
except AttributeError:
pass
return 'cache'
def _transform_columns_meta(se_columns_meta):
"""
Transform SE column meta data, for example,
from:
[
{'name': 'User Link', 'type': 'User'},
{'name': 'Post Link', 'type': 'Post'}
]
to:
{
'User Link': {'name': 'User Link', 'type': 'User', 'index': 0},
'Post Link': {'name': 'Post Link', 'type': 'Post', 'index': 1}
}
:param se_columns_meta: list of dictionaries describing the fields
:return: dictionary of dictionaries, with index added
"""
columns_meta = {}
for index, se_col_meta in enumerate(se_columns_meta):
col_meta = {'index': index}
col_meta.update(se_col_meta)
columns_meta[se_col_meta['name']] = col_meta
return columns_meta
def _extract_table(soup):
"""
Return a Table representing the SEDE results
:param soup: a bs4 (BeautifulSoup) object
:return: a Table object
"""
for script in soup.findAll('script'):
result_sets_col = 'resultSets'
if result_sets_col in script.text:
start = script.text.rindex('{', 0, script.text.index(result_sets_col))
end = script.text.index('}', script.text.index('querySetId')) + 1
data = json.loads(script.text[start:end])
results = data[result_sets_col][0]
columns = _transform_columns_meta(results['columns'])
rows = results['rows']
return Table(columns, rows)
return None
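One method from the design sketch above, getting rows as a list of dictionaries, is not in this listing. If needed, it could be added to Table along these lines (a minimal sketch; the name rows_as_dicts is hypothetical):

    def rows_as_dicts(self):
        """
        Get rows as a list of dictionaries, keyed by column name.

        :return: list of dicts mapping column name -> cell value
        """
        return [
            {name: row[meta['index']] for name, meta in self._columns.items()}
            for row in self._rows
        ]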
What is exposed:
- a Table class, with simple, intuitive functions
- a fetch_table function, that takes a URL and returns a Table instance
- the notion of a "soup" and a "label" are now (correctly) hidden, encapsulated in the implementation
The usage becomes simpler and more intuitive too:
table = sede.fetch_table('http://data.stackexchange.com/codereview/query/264586/naruto')
post_ids = table.post_ids()
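Other columns are just as easy to reach through the generic accessors (assuming the query exposes a 'User Link' column, as in the meta-data example earlier):

table = sede.fetch_table('http://data.stackexchange.com/codereview/query/264586/naruto')
if table:
    print(table.colnames)                  # the set of available column names
    user_links = table.column('User Link')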
The doc strings became simpler and clearer too, especially the ones on the public functions.