This Python module contains helper functions to download the result page of SEDE queries and extract columns from it, most prominently:
- fetch_sede_soup(label, url): download the page of SEDE results, cache it or else use previously cached content, and return as a BeautifulSoup instance
- fetch_table(label, url): fetch a SEDE page as a table, returned as a tuple of:
  - column meta data as a dictionary of name -> data
  - rows as a list of lists
- other helper functions used by the above main functions
Here's the code:
import json
import logging
import os
import requests
from bs4 import BeautifulSoup
BASE_DIR = os.path.dirname(__file__)
CACHE_DIR = os.path.join(BASE_DIR, '.cache')
def fetch_sede_soup(label, url):
"""
Download the result page of a SEDE query and create a BeautifulSoup from it.
If the page contains results, cache it in a file.
If the page doesn't contain results, use the cache instead.
Note: this happens when the SEDE query is not executed in the browser
for a few days.
:param label: a simple name to represent the URL, it will be used as the cache filename
:param url: the URL to download
:return: a BeautifulSoup instance from the URL
"""
cache_path = os.path.join(CACHE_DIR, '{}.html'.format(label))
debug_cache_path = os.path.join(CACHE_DIR, '{}-debug.html'.format(label))
if not os.path.isdir(CACHE_DIR):
os.mkdir(CACHE_DIR)
logging.info('fetching {} as {}'.format(label, url))
html = requests.get(url).text
soup = BeautifulSoup(html)
def is_valid(soup):
for script in soup.findAll('script'):
if 'resultSets' in script.text:
return True
return False
def write_cache(path, html):
with open(path, 'w') as fh:
fh.write(html)
if is_valid(soup):
logging.info('updating cache')
write_cache(cache_path, html)
return soup
else:
logging.warning('result not valid')
write_cache(debug_cache_path, html)
if os.path.exists(cache_path):
logging.info('using previous cache')
with open(cache_path) as fh:
return BeautifulSoup(fh)
else:
logging.error('no previous cache: you must download the page manually')
return BeautifulSoup()
def fetch_table(label, url):
"""
Fetch a URL using `fetch_sede_soup` and extract a table as a tuple of {cols} and [rows].
- {cols} is a mapping of column names to column meta data
- [rows] is a list of rows in the table
:param label: a simple name to represent the URL, it will be used as the cache filename
:param url: the URL to download
:return: a tuple of ({cols}, [rows])
"""
soup = fetch_sede_soup(label, url)
if not soup:
return {}, []
return extract_table(soup)
def transform_columns_meta(se_columns_meta):
"""
Transform SE column meta data, for example,
from:
[
{'name': 'User Link', 'type': 'User'},
{'name': 'Post Link', 'type': 'Post'}
]
to:
{
'User Link': {'name': 'User Link', 'type': 'User', 'index': 0},
'Post Link': {'name': 'Post Link', 'type': 'Post', 'index': 1}
}
:param se_columns_meta: list of dictionaries describing the fields
:return: dictionary of dictionaries, with index added
"""
columns_meta = {}
for index, se_col_meta in enumerate(se_columns_meta):
col_meta = {'index': index}
col_meta.update(se_col_meta)
columns_meta[se_col_meta['name']] = col_meta
return columns_meta
def extract_table(soup):
"""
Return a tuple of:
- meta data of columns as a dictionary
- list of rows
:param soup: a bs4 (BeautifulSoup) object
:return: a tuple of ({cols}, [rows])
"""
for script in soup.findAll('script'):
result_sets_col = 'resultSets'
if result_sets_col in script.text:
start = script.text.rindex('{', 0, script.text.index(result_sets_col))
end = script.text.index('}', script.text.index('querySetId')) + 1
data = json.loads(script.text[start:end])
results = data[result_sets_col][0]
columns = transform_columns_meta(results['columns'])
rows = results['rows']
return columns, rows
return {}, []
def extract_column(soup, colname):
"""
Return a generator of cell values in selected column.
For simple columns like timestamp, a cell value can be simple,
for example: 1414433013197
For more complex columns like Post Link, a cell value can be an object,
for example:
{
"id": 68102,
"title": "Bash Script - File Comment out & Notate"
}
:param soup: a bs4 (BeautifulSoup) object
:param colname: name of the SEDE column to extract
:return: generator of cell values in selected column
"""
cols, rows = extract_table(soup)
if colname not in cols:
return
index = cols[colname]['index']
for row in rows:
yield row[index]
An example usage of the module, getting the post ids:
cols, rows = sede.fetch_table('naruto', 'http://data.stackexchange.com/codereview/query/264586/naruto-accepted-answer-with-zero-score')
answer_id_index = cols['Post Link']['index']
post_ids = [row[answer_id_index]['id'] for row in rows]
I'm looking for feedback and criticism of any aspect of the code.
1 Answer
Helper functions to extract SEDE query results into a more user-friendly format
It's good that all methods include a doc string. But notice that they are long and complicated. Especially the explanation of the returned non-trivial tuples is awkward. It's no wonder: tuples with complex values are awkward. These are signs that probably the design could be cleaner.
Unfortunately, the functions in this module don't exactly scream "user-friendly":
fetch_sede_soup(label, url):
- Summary (what it does): download a URL, turn it into a BeautifulSoup instance
- In terms of usability, this "soup" doesn't help much. Of course there are the other methods that do something with it. But then why expose this method at all? It looks more like a low-level API method that would be better hidden. Rename it to _fetch_sede_soup to mark it private.
- The label parameter doesn't seem to serve much purpose. The implementation uses this as the filename for caching the content. As such, it looks like an internal detail. The method should derive a suitable label by itself, without forcing callers to specify it (see the sketch after this list).
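For example, the label could be derived from the last path segment of the query URL. A minimal sketch, using the hypothetical name _derive_label (a fuller version with sanitization appears as _url_to_slug in the alternative implementation below):

import re

def _derive_label(url):
    # Use the last path segment of the URL as the cache label,
    # falling back to a fixed name when the URL has no usable segment.
    match = re.search(r'([^/]+)/*$', url)
    return match.group(1) if match else 'cache'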
fetch_table(label, url):
- Summary (what it does): turn a URL into a soup and convert that to a "table"
- The problem is that the returned value is rather cryptic, and not at all easy to use. The returned value is a tuple with non-trivial values, and a far cry from a "table".
- In fact, in order to understand the returned tuple, users would probably have to dig deep into the implementation of other functions called by this one, extract_table and transform_columns_meta too.
transform_columns_meta(se_columns_meta):
- Clearly intended for internal use. This should be renamed to _transform_columns_meta.
extract_table(soup):
- Used by fetch_table, and seems to be the source of all the usability issues. As mentioned above, the problem is the representation of the tuple. You need an intimate understanding of the tuple to be able to use it. It's not intuitive at all, you have to read the documentation (and it probably won't even help).
extract_column(soup, colname):
- The single function that's actually easy to use: pass in a soup and a column name and it gives back a generator of the values.
- The generator seems pointless. The soup parser has probably already read the entire HTML content during parsing, and SEDE results will never get excessively big. A simple list would be just as fine (see the sketch after this list).
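A list-returning variant would be a drop-in replacement. A minimal sketch, reusing extract_table from the question as-is:

def extract_column(soup, colname):
    """
    Return a list of cell values in the selected column,
    or an empty list when the column doesn't exist.
    """
    cols, rows = extract_table(soup)
    if colname not in cols:
        return []
    index = cols[colname]['index']
    return [row[index] for row in rows]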
In short, the module is not user-friendly:
- The interface is not clear: it's really hard to figure out how to use the returned values.
- It exposes too many functions, and it's not clear how to use them.
- It spills out too many implementation details, such as the label parameter and the notion of a "soup".
It would be better to set clearer goals, and hide more details, for example:
- Provide a function that takes a URL and returns a Table
- A Table could have methods like:
  - Get the available column names
  - Get the values of a column as a list
  - Get rows as a list of dictionaries
- Notice that there's no mention of "soup" and caching. These are implementation details that don't need to be exposed to users.
Alternative implementation
Taking into account the above points, consider this alternative implementation:
import json
import logging
import os
import re
import requests
from bs4 import BeautifulSoup
BASE_DIR = os.path.dirname(__file__)
CACHE_DIR = os.path.join(BASE_DIR, '.cache')
class Table:
"""
Represent the results of a SEDE query.
For simple columns like timestamp, a cell value can be simple,
for example: 1414433013197
For more complex columns like Post Link, a cell value can be an object,
for example:
{
"id": 68102,
"title": "Bash Script - File Comment out & Notate"
}
"""
def __init__(self, columns=None, rows=None):
"""
Create a Table from columns meta data and rows.
:param columns: meta data of columns as a dict
:param rows: rows of the table as list of dict
:return: new Table instance
"""
if not columns:
columns = {}
if not rows:
rows = []
self._columns = columns
self._rows = rows
self._colnames = set(columns.keys())
@property
def colnames(self):
"""
Get list of column names
:return: list of column names
"""
return self._colnames
def column(self, name):
"""
Get column, by iterating over rows and extracting specified column.
:param name: name of the column to extract
:return: content of the column as a list
"""
index = self._columns[name]['index']
return [row[index] for row in self._rows]
def post_ids(self):
"""
Convenience method to extract the ids from a Post Link column.
:return: list of post ids
"""
return [post_link['id'] for post_link in self.column('Post Link')]
def fetch_table(url):
"""
Fetch a URL using `_fetch_sede_soup` and extract to a Table.
:param url: the URL to download
:return: the Table representing the SEDE results, or None if fetch failed
"""
soup = _fetch_sede_soup(url)
if not soup:
return None
return _extract_table(soup)
def _fetch_sede_soup(url):
"""
Download the result page of a SEDE query and create a BeautifulSoup from it.
If the page contains results, cache it.
If the page doesn't contain results, use the cache instead.
Note: this happens when the SEDE query is not executed in the browser
for a few days.
:param url: the URL to download
:return: a BeautifulSoup instance from the URL
"""
label = _url_to_slug(url)
cache_path = os.path.join(CACHE_DIR, '{}.html'.format(label))
debug_cache_path = os.path.join(CACHE_DIR, '{}-debug.html'.format(label))
if not os.path.isdir(CACHE_DIR):
os.mkdir(CACHE_DIR)
logging.info('fetching {}'.format(url))
html = requests.get(url).text
soup = BeautifulSoup(html)
def is_valid(soup):
for script in soup.findAll('script'):
if 'resultSets' in script.text:
return True
return False
def write_cache(path, html):
with open(path, 'w') as fh:
fh.write(html)
if is_valid(soup):
logging.info('updating cache')
write_cache(cache_path, html)
return soup
else:
logging.warning('result not valid')
write_cache(debug_cache_path, html)
if os.path.exists(cache_path):
logging.info('using previous cache')
with open(cache_path) as fh:
return BeautifulSoup(fh)
else:
logging.error('no previous cache: you must download the page manually')
return None
def _url_to_slug(url):
"""
Convert a URL to a slug, if possible, the last readable path segment.
>>> _url_to_slug('http://data.stackexchange.com/codereview/query/412155/ripe-zombies')
'ripe-zombies'
>>> _url_to_slug('http://data.stackexchange.com/codereview/query/412155/ripe-zombies/')
'ripe-zombies'
>>> _url_to_slug('http://data.stackexchange.com/codereview/query/412155/ripe-zo?>m^&bies/')
'ripe-zombies'
>>> _url_to_slug('')
'cache'
>>> _url_to_slug('#$%^')
'cache'
:param url: the URL of a SEDE query
:return: a slug extracted from the URL
"""
try:
slug = re.sub(r'[^a-z-]', '', re.search(r'([^/]+)/*$', url).group(1))
if slug:
return slug
except AttributeError:
pass
return 'cache'
def _transform_columns_meta(se_columns_meta):
"""
Transform SE column meta data, for example,
from:
[
{'name': 'User Link', 'type': 'User'},
{'name': 'Post Link', 'type': 'Post'}
]
to:
{
'User Link': {'name': 'User Link', 'type': 'User', 'index': 0},
'Post Link': {'name': 'Post Link', 'type': 'Post', 'index': 1}
}
:param se_columns_meta: list of dictionaries describing the fields
:return: dictionary of dictionaries, with index added
"""
columns_meta = {}
for index, se_col_meta in enumerate(se_columns_meta):
col_meta = {'index': index}
col_meta.update(se_col_meta)
columns_meta[se_col_meta['name']] = col_meta
return columns_meta
def _extract_table(soup):
"""
Return a Table representing the SEDE results
:param soup: a bs4 (BeautifulSoup) object
:return: a Table object
"""
for script in soup.findAll('script'):
result_sets_col = 'resultSets'
if result_sets_col in script.text:
start = script.text.rindex('{', 0, script.text.index(result_sets_col))
end = script.text.index('}', script.text.index('querySetId')) + 1
data = json.loads(script.text[start:end])
results = data[result_sets_col][0]
columns = _transform_columns_meta(results['columns'])
rows = results['rows']
return Table(columns, rows)
return None
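One method from the design sketch above, getting rows as a list of dictionaries, is not in this listing. If needed, it could be added to Table along these lines (a minimal sketch; the name rows_as_dicts is hypothetical):

    def rows_as_dicts(self):
        """
        Get rows as a list of dictionaries, keyed by column name.

        :return: list of dicts mapping column name -> cell value
        """
        return [
            {name: row[meta['index']] for name, meta in self._columns.items()}
            for row in self._rows
        ]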
What is exposed:
- a Table class, with simple, intuitive functions
- a fetch_table function, that takes a URL and returns a Table instance
- the notion of a "soup" and a "label" are now (correctly) hidden, encapsulated in the implementation
The usage becomes simpler and more intuitive too:
table = sede.fetch_table('http://data.stackexchange.com/codereview/query/264586/naruto')
post_ids = table.post_ids()
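Other columns are just as easy to reach through the generic accessors (assuming the query exposes a 'User Link' column, as in the meta-data example earlier):

table = sede.fetch_table('http://data.stackexchange.com/codereview/query/264586/naruto')
if table:
    print(table.colnames)                  # the set of available column names
    user_links = table.column('User Link')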
The doc strings became simpler and clearer too, especially the ones on the public functions.