5
\$\begingroup\$

I love politics, and I love programming, so I figured: why not combine the two? I'm making a work-in-progress (but runnable at this stage) Politico API that I call "pylitico":

import ast
import re
import time

import requests
from bs4 import BeautifulSoup

# Raw strings avoid invalid-escape warnings for the backslashes in the patterns
story_link = re.compile(r'a href="(http:\/\/www.politico.com\/story.*)" target')
utag_regex = re.compile(r'var utag_data = \n(\{.*);')
today = time.strftime("%m/%d/%y")


class Article:
    def __init__(self, content_id, tags, author,
                 datestamp, section, headline, story):
        """
        :type tags: list
        :type content_id: str
        :type author: list
        :type datestamp: time.struct_time
        :type section: str
        :type headline: str
        :type story: str
        """
        self.content_id = content_id
        self.tags = tags
        self.author = author
        self.datestamp = datestamp
        self.section = section
        self.headline = headline
        self.story = story

    def __str__(self):
        return "{0}".format(self.headline)


class Pylitico:
    def __init__(self):
        """Creates a session used for all requests to Politico"""
        self.session = requests.Session()

    def most_read(self):
        """Collects the Most Read section of Politico, returns
        stories as list of Article class objects"""
        r = self.session.get('http://www.politico.com/congress/?tab=most-read')
        soup = BeautifulSoup(r.content, 'html.parser')
        most_read_frame = [i for i in soup.find_all('div',
                           {'class': 'dari-frame dari-frame-loaded'}) if
                           'most-read' in i.attrs.get('name')][0]
        links = [i.find('a').attrs.get('href') for i in
                 most_read_frame.find_all('article', {'class': 'story-frag format-xxs'})]
        stories = [self.story_parser(link) for link in links]
        return stories

    def todays_stories(self):
        """Collects stories posted on today's date, returns
        collected stories as list of Article class objects"""
        r = self.session.get('http://www.politico.com/search?q=')
        soup = BeautifulSoup(r.content, 'html.parser')
        summaries = soup.find_all('div', {'class': 'summary'})
        links = []
        for summary in summaries:
            if summary.find('time') and today in summary.find('time').text:
                links.append(summary.find('a').attrs.get('href'))
        stories = [self.story_parser(link) for link in links
                   if 'video' not in link and 'tipsheets' not in link]
        return stories

    def story_parser(self, link):
        """Turns a POLITICO story into an Article class object."""
        r = self.session.get(link)
        soup = BeautifulSoup(r.content, 'html.parser')
        template_story = soup.find('body', id="pageStory")
        try:
            # Line index 2 of the first script tag's text is parsed as a dict literal
            content_dict = ast.literal_eval(
                str(template_story.find('script')).replace(';', '').splitlines()[2])
        except AttributeError:  # triggered if todays_stories() returns videos/other non-stories
            return
        story_div = None
        all_divs = soup.find_all('div')
        for div in all_divs:
            try:
                if 'story-text' in div.attrs.get('class'):
                    story_div = div
            except TypeError:
                continue
        if story_div is None:  # no story text on the page; avoids a NameError below
            return
        story_text = []
        for i in story_div.find_all('p'):
            try:
                if 'byline' not in i.attrs.get('class'):
                    story_text.append(i.text)
            except TypeError:
                story_text.append(i.text)
        story_text = ' '.join(story_text)
        # Keyword arguments keep each value matched to the right Article field;
        # the positional call had the section and the datestamp swapped.
        a = Article(content_id=content_dict['content_id'],
                    tags=content_dict['content_tag'].split('|'),
                    author=content_dict['content_author'].split('|'),
                    datestamp=time.strptime(content_dict['publication_date'], '%Y%m%d'),
                    section=content_dict['site_section'],
                    headline=content_dict['current_headline'],
                    story=story_text)
        return a


session = Pylitico()
most_read_stories = session.most_read()
for _ in most_read_stories[0:1]:
    print(_.headline)
    # Manafort denies reports of chaotic Trump campaign
todays_stories = session.todays_stories()
print(todays_stories[0].headline)
# More than two decades old, The Drudge Report hits a new traffic high

What do you guys think? See any optimizations that could be made? I know that BeautifulSoup parses a bit faster if you specify lxml instead of html.parser, but I thought that potential users may not have lxml.
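One way to get the faster parser without requiring it might look like the sketch below. The helper name pick_parser is hypothetical and not part of pylitico; it only checks whether lxml is importable and falls back to the built-in parser otherwise.

```python
import importlib.util


def pick_parser():
    """Return 'lxml' if the lxml package is importable, else 'html.parser'."""
    # find_spec returns None when the top-level package is not installed.
    if importlib.util.find_spec('lxml') is not None:
        return 'lxml'
    return 'html.parser'


# The result would be passed as the second argument to BeautifulSoup,
# e.g. BeautifulSoup(r.content, PARSER) instead of a hard-coded name.
PARSER = pick_parser()
print(PARSER)
```

Note that the two parsers can build slightly different trees from malformed HTML, so the selectors would need checking against both.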

200_success
asked Aug 15, 2016 at 14:33
\$\endgroup\$

1 Answer

3
\$\begingroup\$
most_read_frame = [i for i in soup.find_all('div',
                   {'class': 'dari-frame dari-frame-loaded'}) if
                   'most-read' in i.attrs.get('name')][0]

can be made more efficient by using islice:

from itertools import islice
most_read_frame_gen = (i for i in soup.find_all('div',
                       {'class': 'dari-frame dari-frame-loaded'}) if
                       'most-read' in i.attrs.get('name'))
# islice returns an iterator, so next() is needed to pull out the element itself
most_read_frame = next(islice(most_read_frame_gen, 0, 1))

as it will stop filtering after it gets the first match (find_all itself still builds the full list of divs, so only the filtering step is cut short).
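The same early-exit idea can also be written with the built-in next() and a default, which avoids an exception when nothing matches. Here is a minimal sketch using plain dicts as stand-ins for the divs' attributes; first_match, is_most_read and the sample data are hypothetical names made up for illustration.

```python
def first_match(items, predicate, default=None):
    """Return the first item satisfying predicate, or default if none does."""
    # The generator is lazy: items after the first match are never tested.
    return next((item for item in items if predicate(item)), default)


# Stand-in data playing the role of each div's attrs (hypothetical values).
frames = [
    {'name': None},
    {'name': 'tab-most-read'},
    {'name': 'tab-most-read-2'},
]

checked = []  # records which frames the predicate actually looked at


def is_most_read(frame):
    checked.append(frame)
    # "or ''" guards against a missing name, which would otherwise raise TypeError
    return 'most-read' in (frame.get('name') or '')


most_read_frame = first_match(frames, is_most_read)
print(most_read_frame)  # {'name': 'tab-most-read'}
print(len(checked))     # 2
```

The third frame is never examined, and an empty result comes back as None instead of an IndexError.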

Also, this is a bit of bad form:

for _ in most_read_stories[0:1]:
    print(_.headline)

By convention, _ is reserved for throwaway variables whose value is never used. Here the value is used, so it would be more readable to call it something like story or even just s:

for story in most_read_stories[0:1]:
    print(story.headline)

In general, though, it looks good. You do realize you're in a bit of an 'arms race' though, right? If Politico changes the format of its site, you'll have to change your code, etc, etc. In that vein, I suggest you document what date you made it work, so potential users can judge whether it's too out of date to be worth bothering with.
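One possible way to record that date is a module-level constant plus a staleness check, sketched below. LAST_VERIFIED, MAX_AGE_DAYS, is_stale and warn_if_stale are all hypothetical names, and both the date and the 180-day threshold are placeholders, not values taken from pylitico.

```python
import warnings
from datetime import date

# Placeholder: the date the selectors were last checked against the live site.
LAST_VERIFIED = date(2016, 8, 15)
# Arbitrary threshold chosen for illustration.
MAX_AGE_DAYS = 180


def is_stale(today=None):
    """Return True if the last verification is older than MAX_AGE_DAYS."""
    today = today or date.today()
    return (today - LAST_VERIFIED).days > MAX_AGE_DAYS


def warn_if_stale(today=None):
    """Emit a warning so users know the scraper may no longer match the site."""
    if is_stale(today):
        warnings.warn(
            "Selectors were last verified on {0}; the site layout may have "
            "changed since then.".format(LAST_VERIFIED.isoformat()))


print(is_stale(date(2016, 8, 20)))  # False
print(is_stale(date(2017, 8, 15)))  # True
```

Calling warn_if_stale() from the scraper's constructor would surface the date to users without them having to read the source.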

answered Aug 21, 2016 at 6:56
\$\endgroup\$
