Hacker News recently released an official API. Unfortunately, it cannot be used to get the newest posts submitted. I wrote a script to provide a faux streaming-api like method (like praw reddit's submission_stream
) so that I can simply put it in a loop and do something like make a gtk-notification when title contains the string 'python'.
"""
Usage:
from hn import HNStream
h = HNStream()
for item in h.stream():
print "Title: ", item.title
print "By user: ", item.by
"""
from lxml import html as lh
import re
import requests
import threading
import time
import ujson as json
__all__ = ['HNStream']
class Item:
def __init__(self, **kwargs):
self.__dict__.update(kwargs)
def __str__(self):
return "{}: {}".format(self.type, self.title)
__repr__ = __str__
class HackerNews:
def __init__(self):
self.base_url = 'https://hacker-news.firebaseio.com/v0/{api}'
def get(self, uri):
response = requests.get(uri)
if response.status_code == requests.codes.ok:
return json.loads(response.text)
else:
raise Exception('HTTP Error {}'.format(response.status_code))
def item(self, item_id):
uri = self.base_url.format(**{'api': 'item/{}.json'.format(item_id)})
return Item(**self.get(uri))
class SimpleFIFO:
def __init__(self, length):
self.length = length
self.values = length * [None]
def contains(self, iid):
return iid in self.values
def append(self, value):
assert isinstance(value, str), "Only append strings"
self.values.append(value)
while len(self.values) > self.length:
self.values.pop(0)
assert len(self.values) == self.length
def __str__(self):
return str(self.values)
class HNStream(threading.Thread):
def __init__(self):
threading.Thread.__init__(self)
def stream(self):
itembuffer = SimpleFIFO(length=30)
while True:
r = requests.get('https://news.ycombinator.com/newest')
tree = lh.fromstring(r.text)
# http://stackoverflow.com/a/2756994
links = tree.xpath("//a[re:match(@href, 'item\?id=\d+')]/@href",
namespaces={"re": "http://exslt.org/regular-expressions"})
links = set(links)
for link in links:
iid = re.match(r'item\?id=(\d+)', str(link)).groups()[0]
if not itembuffer.contains(iid):
itembuffer.append(iid)
yield HackerNews().item(iid)
time.sleep(30)
1 Answer 1
I have a number of suggestions:
Don't use self.dict
def __init__(self, **kwargs):
self.__dict__.update(kwargs)
It is pretty unclear which attributes the class will have after calling the constructor, as it is basically up to the caller to decide this. This is not optimal, as you want to be in control of what data is set in an item. I'd suggest to use named arguments instead:
def __init__(self, id, type, title):
self.type = type
self.title = title
self.id = id
# Add other attributes that I don't know of?
While this may seem like blowing up the code, it makes it a lot easier to understand for an uninformed reader, and therefor should be preferred IMHO.
Append instead of format
Since you are basically just appending something to self.base_url
, you might as well append using +
instead of using format. I find it a lot easier to read, but that depends a lot on what you prefer. Consequently:
self.base_url = 'https://hacker-news.firebaseio.com/v0/'
Then
uri = self.base_url.format(**{'api': 'item/{}.json'.format(item_id)})
would look like this if using append:
uri = self.base_url + "item/{}.json".format(item_id)
HackerNews.get()
The get()
method's name does not explain what the method does. I'd suggest something offering a bit more documentation, for example: download_item()
or download_entry()
.
On a little side note: You can remove the else:
statement, as the raise Exception(..)
command will only be executed if the if
statement evaluates to False
. This depends a bit on personal taste though. I'd also change the ordering like this:
if response.status_code != requests.codes.ok:
raise Exception('HTTP Error {}'.format(response.status_code))
return json.loads(response.text)
I'd prefer this ordering as it has the return
statement at the end, which is where I'd look first when searching for it.
SimpleFIFO
Maybe you can use Python's Queue
class? See here for documentation: https://docs.python.org/2/library/queue.html
It also provides a max size, the only thing missing is the check for adding only strings.
HNStream.init()
You can use the super
keyword (see also: https://stackoverflow.com/questions/576169/understanding-python-super-and-init-methods) instead of explicitly calling the constructor like this:
def __init__(self):
super(HNSTream, self).__init__()
Or in Python 3:
def __init__(self):
super().__init__()
HNStream.stream()
Just a few naming suggestions:
r
-> resultiid
-> item_id
That was it, hope it helps!
-
\$\begingroup\$ Thanks a lot for the detailed review. I had fixed a few of these issues (collections.deque instead of SimpleFIFO, getting rid of dict.update) you can see the updated code here. I really like your explanation for having a return statement at the end, and semantic functions so I'll change the
get
method appropriately. Re:super
I want it to work with 2 and 3 sosuper(HNSTream, self).__init__()
seems to be a better choice. \$\endgroup\$pad– pad2014年11月23日 07:15:01 +00:00Commented Nov 23, 2014 at 7:15