Python Politico API attempt

Question 1

I love politics, and I love programming, so I figured, why not combine the two? I'm making a work-in-progress (but already runnable) Politico API that I call "pylitico":

from bs4 import BeautifulSoup
import requests
import time
import ast
import re

# Raw strings keep the regex escapes intact (neither pattern is used below yet)
story_link = re.compile(r'a href="(http:\/\/www.politico.com\/story.*)" target')
utag_regex = re.compile(r'var utag_data = \n(\{.*);')
today = time.strftime("%m/%d/%y")


class Article():
    def __init__(self, content_id, tags, author,
                 datestamp, section, headline, story):
        """
        :type tags: list
        :type content_id: str
        :type author: list
        :type datestamp: time.struct_time
        :type section: str
        :type headline: str
        :type story: str
        """
        self.content_id = content_id
        self.tags = tags
        self.author = author
        self.datestamp = datestamp
        self.section = section
        self.headline = headline
        self.story = story

    def __str__(self):
        return "{0}".format(self.headline)


class Pylitico():
    def __init__(self):
        """Creates a connection to Politico"""
        self.session = requests.Session()

    def most_read(self):
        """Collects the Most Read section of Politico, returns
        stories as list of Article class objects"""
        r = self.session.get('http://www.politico.com/congress/?tab=most-read')
        soup = BeautifulSoup(r.content, 'html.parser')
        most_read_frame = [i for i in soup.find_all('div',
                           {'class': 'dari-frame dari-frame-loaded'}) if
                           'most-read' in i.attrs.get('name')][0]
        links = [i.find('a').attrs.get('href') for i in
                 most_read_frame.find_all('article', {'class': 'story-frag format-xxs'})]
        stories = [self.story_parser(link) for link in links]
        return stories

    def todays_stories(self):
        """Collects stories posted on today's date, returns
        collected stories as list of Article class objects"""
        r = self.session.get('http://www.politico.com/search?q=')
        soup = BeautifulSoup(r.content, 'html.parser')
        summaries = soup.find_all('div', {'class': 'summary'})
        links = []
        for summary in summaries:
            if summary.find('time') and today in summary.find('time').text:
                links.append(summary.find('a').attrs.get('href'))
        stories = [self.story_parser(link) for link in links
                   if 'video' not in link and 'tipsheets' not in link]
        return stories

    def story_parser(self, link):
        """Turns a POLITICO story into an Article class object."""
        r = self.session.get(link)
        soup = BeautifulSoup(r.content, 'html.parser')
        template_story = soup.find('body', id="pageStory")
        try:
            # The first <script> in the body is expected to hold the utag_data dict on its third line
            content_dict = ast.literal_eval(str(template_story.find('script')).replace(';', '').splitlines()[2])
        except AttributeError:  # triggered if todays_stories() returns videos/other non-stories
            return
        all_divs = soup.find_all('div')
        for div in all_divs:
            try:
                if 'story-text' in div.attrs.get('class'):
                    story_div = div
            except TypeError:  # div has no class attribute
                continue
        story_text = []
        for i in story_div.find_all('p'):
            try:
                if 'byline' not in i.attrs.get('class'):
                    story_text.append(i.text)
            except TypeError:  # paragraph has no class attribute, so it is body text
                story_text.append(i.text)
        story_text = ' '.join(story_text)
        # Arguments follow the order of Article.__init__: datestamp comes before section
        a = Article(content_dict['content_id'], content_dict['content_tag'].split('|'),
                    content_dict['content_author'].split('|'),
                    time.strptime(content_dict['publication_date'], '%Y%m%d'),
                    content_dict['site_section'],
                    content_dict['current_headline'], story_text)
        return a


session = Pylitico()
most_read_stories = session.most_read()
for _ in most_read_stories[0:1]:
    print(_.headline)
    # Manafort denies reports of chaotic Trump campaign
todays_stories = session.todays_stories()
print(todays_stories[0].headline)
# More than two decades old, The Drudge Report hits a new traffic high

What do you guys think? See any optimizations that could be made? I know that BeautifulSoup parses a bit faster if you specify lxml instead of html.parser, but I thought that potential users may not have lxml.
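One way around that trade-off, as a rough sketch: check at runtime whether lxml is importable and fall back to html.parser otherwise. The `pick_parser` helper and the `PARSER` constant are hypothetical names, not part of pylitico; only the standard library's `importlib.util.find_spec` is used for the check.

```python
import importlib.util


def pick_parser():
    """Return 'lxml' when it is importable, otherwise the built-in 'html.parser'."""
    if importlib.util.find_spec('lxml') is not None:
        return 'lxml'
    return 'html.parser'


# Decide once at import time, then reuse the result for every BeautifulSoup call.
PARSER = pick_parser()
print(PARSER)
```

Each `BeautifulSoup(r.content, 'html.parser')` call could then become `BeautifulSoup(r.content, PARSER)`, so users without lxml lose nothing and users with it get the faster parser.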

Answer 1

most_read_frame = [i for i in soup.find_all('div',
                   {'class': 'dari-frame dari-frame-loaded'}) if
                   'most-read' in i.attrs.get('name')][0]

can be made more efficient by using a generator expression with islice:

from itertools import islice
most_read_frame_gen = (i for i in soup.find_all('div',
                       {'class': 'dari-frame dari-frame-loaded'}) if
                       'most-read' in i.attrs.get('name'))
# islice returns an iterator, so next() is needed to pull out the element itself
most_read_frame = next(islice(most_read_frame_gen, 0, 1))

as it will stop filtering after it finds the first match. (find_all() itself still scans the whole document; only the filtering is cut short.)
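To see the early stop without hitting the network, here is a minimal sketch that uses plain dictionaries as stand-ins for the parsed div tags. The `frames` data and the `first_match` helper are hypothetical, not part of pylitico or BeautifulSoup. It uses `next()` with a default value, which also avoids an exception when nothing matches:

```python
def first_match(items, predicate, default=None):
    """Return the first item satisfying predicate, or default if none does."""
    return next((item for item in items if predicate(item)), default)


# Hypothetical stand-ins for the <div> tags that soup.find_all() would return.
frames = [
    {'name': 'latest-news'},
    {'name': 'most-read-frame'},
    {'name': 'most-read-sidebar'},
]

checked = []  # records which frames the predicate actually examined


def is_most_read(frame):
    checked.append(frame['name'])
    return 'most-read' in frame.get('name', '')


most_read_frame = first_match(frames, is_most_read)
print(most_read_frame)  # the first frame whose name contains 'most-read'
print(checked)          # the third frame is never examined
```

The list comprehension with `[0]` would have run the predicate on all three frames; the generator version stops after the second.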

Also, this is a bit of bad form:

for _ in most_read_stories[0:1]:
    print(_.headline)

By convention, _ is reserved for throwaway variables whose value is never used, but here the loop body does use it. It'd be more readable to call it something like story or even just s:

for story in most_read_stories[0:1]:
    print(story.headline)

In general, though, it looks good. You do realize you're in a bit of an 'arms race', though, right? If Politico changes the format of its site, you'll have to change your code to match. In that vein, I suggest you document the date on which you last verified that it works, so potential users can judge whether it's too out of date to be worth bothering with.
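As one possible way to record that date in the code itself, here is a small sketch. The `LAST_VERIFIED` constant, both helper functions, and the 180-day threshold are hypothetical; the date shown is only a placeholder and should be replaced with the day the scraper was actually checked against the live site.

```python
from datetime import date

# Placeholder value: the date on which the scraper was last checked against the live site.
LAST_VERIFIED = date(2016, 8, 21)


def days_since_verified(today=None):
    """Number of days elapsed since the scraper was last verified."""
    today = today or date.today()
    return (today - LAST_VERIFIED).days


def is_probably_stale(today=None, max_age_days=180):
    """True if the last verification is older than max_age_days (an arbitrary threshold)."""
    return days_since_verified(today) > max_age_days


print("Last verified {0} ({1} days ago)".format(LAST_VERIFIED, days_since_verified()))
```

A library could emit a warning when `is_probably_stale()` returns True, so users learn that the selectors may no longer match before they start debugging empty results.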

pjz · Accepted Answer · 2016-08-21 06:56:58Z

