
I've built a web crawler using the BeautifulSoup library that pulls stock ticker data from CSV files on Yahoo Finance and charts the data using matplotlib. I'm wondering if there are any ways to improve the code I've written, because there are some parts that I think could be a lot better.

import urllib.request
from matplotlib import pyplot as plt
from bs4 import BeautifulSoup
import requests

def chartStocks(*tickers):
    # Run loop for each ticker passed in as an argument
    for ticker in tickers:
        # Convert URL into text for parsing
        url = "http://finance.yahoo.com/q/hp?s=" + str(ticker) + "+Historical+Prices"
        sourceCode = requests.get(url)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText, "html.parser")
        # Find all links on the page
        for link in soup.findAll('a'):
            href = link.get('href')
            link = []
            for c in href[:48]:
                link.append(c)
            link = ''.join(link)
            # Find the URL for the stock ticker CSV file and convert the data to text
            if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
                csv_url = href
                res = urllib.request.urlopen(csv_url)
                csv = res.read()
                csv_str = str(csv)
                # Parse the CSV to create a list of data points
                point = []
                points = []
                curDay = 0
                day = []
                commas = 0
                lines = csv_str.split("\\n")
                lineOne = True
                for line in lines:
                    commas = 0
                    if lineOne == True:
                        lineOne = False
                    else:
                        for c in line:
                            if c == ",":
                                commas += 1
                            if commas == 4:
                                point.append(c)
                            elif commas == 5:
                                for x in point:
                                    if x == ",":
                                        point.remove(x)
                                point = ''.join(point)
                                point = float(point)
                                points.append(point)
                                day.append(curDay)
                                curDay += 1
                                point = []
                                commas = 0
                points = list(reversed(points))
                # Plot the data
                plt.plot(day, points)
                plt.ylabel(ticker)
                plt.show()
asked Dec 21, 2015 at 6:01

1 Answer


Composing small functions

chartStocks would be more readable if you split it into a handful of smaller functions, roughly like this:

def chartStocks(*tickers):
    for ticker in tickers:
        page = getTickerPage(ticker)
        csv_url = findCSVUrl(page)
        csv = getCSV(csv_url)
        day, points = parseCSV(csv)
        plot_data(ticker, day, points)
        # Or, if you're allergic to temporary variables:
        # day, points = parseCSV(getCSV(findCSVUrl(getTickerPage(ticker))))

This approach lets you clearly see the "pipeline" your data is passing through, and lets you test and reuse smaller pieces individually.
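The fetching and plotting helpers named in that pipeline aren't spelled out here; as a rough sketch, reusing the imports from your original code (getTickerPage, getCSV, and plot_data are just the hypothetical names from the pipeline above):

def getTickerPage(ticker):
    # Fetch the historical-prices page and hand back the parsed HTML
    url = "http://finance.yahoo.com/q/hp?s=" + str(ticker) + "+Historical+Prices"
    return BeautifulSoup(requests.get(url).text, "html.parser")

def getCSV(csv_url):
    # Download the CSV and decode the bytes to text
    # (avoids the str(csv) problem that left literal "\n" in the data)
    with urllib.request.urlopen(csv_url) as res:
        return res.read().decode('utf-8')

def plot_data(ticker, day, points):
    # The same plotting your original code did inline
    plt.plot(day, points)
    plt.ylabel(ticker)
    plt.show()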

Arguably it would be even cleaner to define def chartStock(ticker) to handle the case of one ticker, so that chartStocks is just

def chartStocks(*tickers):
    for ticker in tickers:
        chartStock(ticker)

The only thing to be careful of here is error handling: either check that each return value is not None before calling the next function, or let each function accept None as a parameter and simply return None in that case.
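For instance, a minimal sketch of the first option, with chartStock bailing out as soon as a step comes up empty:

def chartStock(ticker):
    page = getTickerPage(ticker)
    if page is None:
        return
    csv_url = findCSVUrl(page)  # returns None when no CSV link is found
    if csv_url is None:
        return
    day, points = parseCSV(getCSV(csv_url))
    plot_data(ticker, day, points)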

str.startswith

This:

# Find all links on the page
for link in soup.findAll('a'):
    href = link.get('href')
    link = []
    for c in href[:48]:
        link.append(c)
    link = ''.join(link)
    # Find the URL for the stock ticker CSV file and convert the data to text
    if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
        # ...

could be simplified with str.startswith:

def findCSVUrl(soupPage):
    CSV_URL_PREFIX = 'http://real-chart.finance.yahoo.com/table.csv?s='
    for link in soupPage.findAll('a'):
        href = link.get('href', '')
        if href.startswith(CSV_URL_PREFIX):
            return href

I've also provided a not-found value of '' so that if link has no href, startswith won't get called on None.
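To see why that default matters: an <a> tag with no href attribute (a named anchor, say) makes link.get('href') return None, and None has no startswith method:

>>> None.startswith('http')
Traceback (most recent call last):
  ...
AttributeError: 'NoneType' object has no attribute 'startswith'
>>> ''.startswith('http')
False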

Skipping first row

Instead of using a lineOne flag when looping over lines:

lineOne = True
for line in lines:
 if lineOne == True:
 lineOne = False
 else:
 # continue parsing line...

you can just start from after the first row with a slice:

for line in lines[1:]:
    # ... continue parsing line
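The same idea carries over once you switch to the csv module (below): either slice the lines before handing them to the reader, or call next() on the reader to throw the header row away:

csv_rows = csv.reader(lines)
next(csv_rows, None)  # discard the header row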

CSV parsing

Python has a built-in csv module that could simplify a lot of your parsing. It does the splitting by commas for you (including quoted fields, which naive splitting gets wrong) and returns either a list or a dict of fields for each row, depending on which reader you ask for. You would end up with roughly this:

import csv

def parseCSV(csv_text):
    csv_rows = csv.reader(csv_text.split('\n'))
    next(csv_rows, None)  # skip the header row
    days = []
    points = []
    for day, row in enumerate(csv_rows):
        close = float(row[4])  # the Close column
        days.append(day)
        points.append(close)
    return days, points

where the enumerate function would give you the same zero-based days list as you currently have.
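If you'd rather address fields by name, csv.DictReader reads the column names from the header row itself, so no manual skipping is needed. A sketch, assuming the file has a Close column as Yahoo's table.csv does:

def parseCSV(csv_text):
    csv_rows = csv.DictReader(csv_text.splitlines())
    days = []
    points = []
    for day, row in enumerate(csv_rows):
        days.append(day)
        points.append(float(row['Close']))
    return days, points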

Actually, since it seems like days will just be the list [0 .. len(points)], you could skip the enumerate and simply define days after you've parsed all the points, and toss in a list comprehension if you'd like for good measure:

def parseCSV(csv_text):
    csv_rows = csv.reader(csv_text.split('\n'))
    next(csv_rows, None)  # skip the header row
    points = [float(row[4]) for row in csv_rows if row]  # guard against blank trailing lines
    days = list(range(len(points)))
    return days, points
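With those pieces in place, the top-level call stays exactly what you had (the tickers here are just examples):

chartStocks('AAPL', 'GOOG', 'MSFT')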
answered Dec 21, 2015 at 7:33