edited May 23, 2017 at 12:41

Community Bot

The right tool

As you've said, you are not using the right tool for this task: you can't parse HTML with regexps.

A better approach would be to use an already existing parser like BeautifulSoup.
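To give a rough idea of what a real parser buys you, here is a minimal sketch. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup, only so that it runs without installing anything; BeautifulSoup would make the same job shorter. The `TableRows` class and the sample row are made up for illustration, and the column layout (ticker, name, filings, sector, sub-industry) is an assumption based on the offsets used in the code further down.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True: entities arrive decoded
        self.rows = []
        self._cells = None  # cells of the row being read, or None
        self._text = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None


# Hypothetical sample row, not taken from the real page.
sample = """
<table>
<tr>
<td><a href="http://www.nasdaq.com/symbol/aapl">AAPL</a></td>
<td><a href="/wiki/Apple_Inc.">Apple Inc.</a></td>
<td>reports</td>
<td>Information Technology</td>
<td>Technology Hardware, Storage &amp; Peripherals</td>
</tr>
</table>
"""

parser = TableRows()
parser.feed(sample)
parser.close()
print(parser.rows)
```

Each row comes back as one list of cell texts, with `&amp;` already decoded, so there is no need for line offsets or per-cell regexps. If the link targets are needed (for instance to tell the exchanges apart), they are available through `attrs` in `handle_starttag`.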

A simpler container

At the moment, you are putting data in multiple lists to zip them all at the very end. It can be a very nice technique, but in this case you are putting things that actually belong together into different containers. You also run the risk of adding one element too many to a list, so that information gets zipped with information that belongs to a different row. An easier option is to have a single list where each element contains everything you've parsed for one row.

Also, you can take this chance to rewrite in a more straightforward way the parts where you add something to a list and then refer to it with my_list[-1].
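The misalignment risk is easy to reproduce. The snippet below is only an illustration with made-up values: one stray entry in a single list shifts every later pairing, and `zip` reports nothing.

```python
# Made-up data: `names` picked up one entry too many while parsing.
tickers = ["AAPL", "MSFT"]
names = ["Apple Inc.", "Stray entry", "Microsoft Corp."]

# zip pairs items by position and silently stops at the shortest list.
rows = list(zip(tickers, names))
print(rows)  # [('AAPL', 'Apple Inc.'), ('MSFT', 'Stray entry')]
```

Building one tuple per row at the moment it is parsed avoids this, because a row is either appended whole or not at all.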

    company_data = []
    for i in range(len(table)):
        # re.search returns None when nothing matches, so no bool() is needed
        if re.search(pattern=tick_ident_nasdaq, string=table[i]):
            exchange = "NASDAQ"
        elif re.search(pattern=tick_ident_nyse, string=table[i]):
            exchange = "NYSE"
        else:
            exchange = None
        if exchange:
            # The cells of one row sit on consecutive lines of `table`
            ticker = re.search(pattern=name_grab, string=table[i]).group(1)
            name = re.search(pattern=name_grab, string=table[i + 1]).group(1)
            name = re.sub(pattern="&amp;", repl="&", string=name)
            cig = re.search(pattern=cigs_grab, string=table[i + 3]).group(1)
            cig = re.sub(pattern="&amp;", repl="&", string=cig)
            cig_sub = re.search(pattern=cigs_grab, string=table[i + 4]).group(1)
            cig_sub = re.sub(pattern="&amp;", repl="&", string=cig_sub)
            company_data.append((ticker, exchange, name, cig, cig_sub))

Compile your regexp

You can compile a regexp if you plan to reuse it many times. It is more efficient, and the compiled pattern can be stored and passed around like any other Python object.

    import re

    # Define the regexes used for parsing, compiled once up front.
    # Raw strings avoid invalid escapes such as "\/"; literal dots are escaped.
    tick_ident_nasdaq = re.compile(r'href="http://www\.nasdaq\.com/symbol/')
    tick_ident_nyse = re.compile(r'href="https://www\.nyse\.com/quote/')
    name_grab = re.compile(r'">(.+)</a></td>')
    cigs_grab = re.compile(r'^<td>(.+)</td>')
    amp_re = re.compile("&amp;")

    company_data = []
    for i in range(len(table)):
        if tick_ident_nasdaq.search(table[i]):
            exchange = "NASDAQ"
        elif tick_ident_nyse.search(table[i]):
            exchange = "NYSE"
        else:
            exchange = None
        if exchange:
            ticker = name_grab.search(table[i]).group(1)
            name = name_grab.search(table[i + 1]).group(1)
            name = amp_re.sub("&", name)
            cig = cigs_grab.search(table[i + 3]).group(1)
            cig = amp_re.sub("&", cig)
            cig_sub = cigs_grab.search(table[i + 4]).group(1)
            cig_sub = amp_re.sub("&", cig_sub)
            company_data.append((ticker, exchange, name, cig, cig_sub))

"&amp;" and "&"

What you are trying to do when substituting "&amp;" with "&":

- deserves to be put in a function of its own
- actually corresponds to a common problem that is already solved: HTML entity decoding.
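As a minimal sketch of both points, assuming Python 3.4 or later: the standard library's `html.unescape` decodes all named and numeric entities, not just `&amp;`. The helper name `clean_cell` is made up for this example.

```python
import html


def clean_cell(text):
    """Decode HTML entities (e.g. '&amp;' -> '&') in a scraped cell."""
    return html.unescape(text)


print(clean_cell("Technology Hardware, Storage &amp; Peripherals"))
# Technology Hardware, Storage & Peripherals
```

With such a helper, each of the `re.sub` calls on `name`, `cig` and `cig_sub` becomes a single `clean_cell(...)` call, and entities other than `&amp;` are handled as well.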

answered Feb 26, 2017 at 8:30

SylvainD
