I've built a web crawler using the BeautifulSoup library that pulls stock ticker data from CSV files on Yahoo Finance and charts the data using matplotlib. I'm wondering if there are any ways to improve the code I've written, because there are some parts that I think could be a lot better.
import urllib.request
from matplotlib import pyplot as plt
from bs4 import BeautifulSoup
import requests

def chartStocks(*tickers):
    # Run loop for each ticker passed in as an argument
    for ticker in tickers:
        # Convert URL into text for parsing
        url = "http://finance.yahoo.com/q/hp?s=" + str(ticker) + "+Historical+Prices"
        sourceCode = requests.get(url)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText, "html.parser")
        # Find all links on the page
        for link in soup.findAll('a'):
            href = link.get('href')
            link = []
            for c in href[:48]:
                link.append(c)
            link = ''.join(link)
            # Find the URL for the stock ticker CSV file and convert the data to text
            if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
                csv_url = href
                res = urllib.request.urlopen(csv_url)
                csv = res.read()
                csv_str = str(csv)
                # Parse the CSV to create a list of data points
                point = []
                points = []
                curDay = 0
                day = []
                commas = 0
                lines = csv_str.split("\\n")
                lineOne = True
                for line in lines:
                    commas = 0
                    if lineOne == True:
                        lineOne = False
                    else:
                        for c in line:
                            if c == ",":
                                commas += 1
                            if commas == 4:
                                point.append(c)
                            elif commas == 5:
                                for x in point:
                                    if x == ",":
                                        point.remove(x)
                                point = ''.join(point)
                                point = float(point)
                                points.append(point)
                                day.append(curDay)
                                curDay += 1
                                point = []
                                commas = 0
                points = list(reversed(points))
                # Plot the data
                plt.plot(day, points)
                plt.ylabel(ticker)
                plt.show()
1 Answer
Composing small functions
chartStocks would be more readable if you split it into a handful of smaller functions, roughly like this:
def chartStocks(*tickers):
    for ticker in tickers:
        page = getTickerPage(ticker)
        csv_url = findCSVUrl(page)
        csv = getCSV(csv_url)
        day, points = parseCSV(csv)
        plot_data(ticker, day, points)

        # Or, if you're allergic to temporary variables:
        day, points = parseCSV(getCSV(findCSVUrl(getTickerPage(ticker))))
This approach lets you clearly see the "pipeline" your data is passing through, and lets you test and reuse smaller pieces individually.
Arguably it would be even cleaner to define def chartStock(ticker) to handle the case of one ticker, so that chartStocks is just:

def chartStocks(*tickers):
    for ticker in tickers:
        chartStock(ticker)
The only thing to be careful of here is to make sure to design your functions to handle errors properly -- either check that each return value is not None before calling the next function, or allow them to take None as a parameter and just return nothing in that case.
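One way to make that None handling systematic (a sketch of mine, not part of the original answer): a tiny pipe helper that feeds a value through each stage and bails out as soon as any stage returns None, so no later function is ever called on None:

```python
def pipe(value, *functions):
    # Feed value through each function in turn, stopping early
    # as soon as any step produces None.
    for fn in functions:
        if value is None:
            return None
        value = fn(value)
    return value

# chartStock could then be expressed as (hypothetical helpers from above):
# day_points = pipe(ticker, getTickerPage, findCSVUrl, getCSV, parseCSV)
```

With this shape a failed page fetch or a missing CSV link short-circuits the rest of the pipeline instead of raising an AttributeError on None.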
str.startswith
This:

# Find all links on the page
for link in soup.findAll('a'):
    href = link.get('href')
    link = []
    for c in href[:48]:
        link.append(c)
    link = ''.join(link)
    # Find the URL for the stock ticker CSV file and convert the data to text
    if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
        # ...

could be simplified with str.startswith:
def findCSVUrl(soupPage):
    CSV_URL_PREFIX = 'http://real-chart.finance.yahoo.com/table.csv?s='
    for link in soupPage.findAll('a'):
        href = link.get('href', '')
        if href.startswith(CSV_URL_PREFIX):
            return href
I've also provided a not-found value of '' so that if link has no href, startswith won't get called on None.
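Tag.get here behaves like dict.get, so the same pattern can be illustrated with a plain dictionary standing in for an &lt;a&gt; tag that lacks an href (my illustration, not from the answer):

```python
link_attrs = {"class": "nav"}  # pretend this is an <a> tag with no href attribute

no_default = link_attrs.get("href")        # → None
# no_default.startswith(...) would raise AttributeError here

with_default = link_attrs.get("href", "")  # → ""
safe = with_default.startswith("http://")  # → False, no crash
```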
Skipping first row
Instead of using a lineOne flag when looping over lines:

lineOne = True
for line in lines:
    if lineOne == True:
        lineOne = False
    else:
        # continue parsing line...
you can just start from after the first row with a slice:
for line in lines[1:]:
    # ... continue parsing line
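To see the slice in action, here are a couple of invented rows in the old Yahoo CSV layout (Date,Open,High,Low,Close,Volume,Adj Close -- the values are made up for illustration):

```python
lines = ["Date,Open,High,Low,Close,Volume,Adj Close",
         "2015-01-05,108.29,108.65,105.41,106.25,64285500,99.68",
         "2015-01-02,111.39,111.44,107.35,109.33,53204600,102.57"]

# The [1:] slice drops the header, so only data rows are parsed.
closes = [line.split(",")[4] for line in lines[1:]]
# closes → ['106.25', '109.33']
```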
CSV parsing
Python has a built-in CSV parsing module that could simplify a lot of your parsing. It would do the splitting-by-commas for you and return either a list or dict of fields for each row, depending on what you ask for. You would end up with roughly this:
import csv  # built-in module; note it would shadow the csv variable in the original code

def parseCSV(csv_text):
    # Skip the header row and drop blank lines so float() only sees data fields
    csv_rows = csv.reader(line for line in csv_text.split('\n')[1:] if line)
    days = []
    points = []
    for day, row in enumerate(csv_rows):
        close = float(row[4])
        days.append(day)
        points.append(close)
    return days, points
where the enumerate function would give you the same zero-based days list as you currently have.

Actually, since it seems like days will just be the list [0 .. len(points)], you could skip the enumerate and simply define days after you've parsed all the points, and toss in a list comprehension if you'd like for good measure:
def parseCSV(csv_text):
    # As above, skip the header row and drop blank lines before parsing
    csv_rows = csv.reader(line for line in csv_text.split('\n')[1:] if line)
    points = [float(row[4]) for row in csv_rows]
    days = list(range(len(points)))
    return days, points
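As a quick, runnable sanity check (sample rows invented for illustration; this self-contained version skips the header row and drops blank lines so float() only ever sees data fields):

```python
import csv

def parseCSV(csv_text):
    # Skip the header row and drop blank lines (e.g. from a trailing newline).
    csv_rows = csv.reader(line for line in csv_text.split('\n')[1:] if line)
    points = [float(row[4]) for row in csv_rows]
    days = list(range(len(points)))
    return days, points

# Made-up rows in the old Yahoo layout: Date,Open,High,Low,Close,Volume,Adj Close
sample = ("Date,Open,High,Low,Close,Volume,Adj Close\n"
          "2015-01-05,108.29,108.65,105.41,106.25,64285500,99.68\n"
          "2015-01-02,111.39,111.44,107.35,109.33,53204600,102.57\n")

days, points = parseCSV(sample)
# days → [0, 1], points → [106.25, 109.33]
```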