There's a lot of help on here, but some of it goes over my head, so I'm hoping that asking my own question and getting a tailored answer will help me understand better.
So far I have managed to connect to a website, authenticate as a user, fill in a form, and pull down the HTML. The HTML contains a table I want. I just want to say something like:
read the HTML; when you reach the table start tag, keep going until you reach the table end tag, and then display that, or write it to a new HTML file and open it, keeping the tags so it's formatted for me.
Here is the code I have so far.
import requests

# LOGINURL, APURL, login and findaps are defined elsewhere (not shown).

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    s.post(LOGINURL, data=login)
    # print
    r = s.get(LOGINURL)
    print r.url
    # An authorised request.
    r = s.get(APURL)
    print r.url
    # etc...
    s.post(APURL)
    #
    r = s.post(APURL, data=findaps)
    r = s.get(APURL)
    #print r.text

f = open("makethisfile.html", "w")
f.write('\n'.join(['<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">',
    '<html>',
    ' <head>',
    ' <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">',
    ' <title>THE TITLE</title>',
    ' <link rel="stylesheet" href="css/displayEventLists.css" type="text/css">',
    r.text  # this just does everything, I need to get the table.
    ])
)
f.close()
You should use at least HTMLParser (docs.python.org/2/library/markup.html), or maybe even something more powerful. (pmod, Apr 1, 2014 at 14:45)
Take a look also at BeautifulSoup: stackoverflow.com/questions/17196018/… (pmod, Apr 1, 2014 at 14:54)
1 Answer
Although it's best to parse the file properly, a quick-and-dirty method uses a regex.
import re

# Non-greedy (.+?) stops at the first </table>; re.S lets '.' match newlines.
m = re.search(r"<table.*?>(.+?)</table>", r.text, re.S)
if m:
    print m.group()
else:
    print "Error: table not found"
As an example of why parsing is better, the regex as written will fail with the following (rather contrived!) example:
<!-- <table> --> blah blah <table> this is the actual table </table>
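To make the failure concrete, here is a small sketch that runs a pattern of the same shape as the regex in this answer against that contrived string. The names TABLE_RE, sample and matched are made up for illustration, and the non-greedy body is an assumption on my part; the point is only that the match starts at the commented-out tag rather than at the real table.

```python
import re

# Assumed pattern: same shape as the regex in the answer, with a non-greedy body.
TABLE_RE = re.compile(r"<table.*?>(.+?)</table>", re.S)

# The contrived input: the first "<table>" sits inside an HTML comment.
sample = "<!-- <table> --> blah blah <table> this is the actual table </table>"

m = TABLE_RE.search(sample)
matched = m.group()

# The match begins at the commented-out tag, so it drags in the comment's
# closing "-->" and the stray text before the real table.
print(matched)
# <table> --> blah blah <table> this is the actual table </table>
```

The regex has no idea what a comment is, which is exactly the kind of thing a real parser handles for you.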
Also, as written, the regex returns only the first table in the file. That's not a problem, though: you could loop to get the second and later ones (for example with re.finditer), or make the regex specific to the table you want if possible.
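For comparison, here is a minimal sketch of the "parse it properly" route using only the standard library's HTMLParser, as suggested in the comments. The class name TableExtractor and the helper extract_tables are hypothetical names chosen for this example, not part of any library. It keeps the tags, ignores anything inside HTML comments, and returns every top-level table as a string, so treat it as a starting point under those assumptions rather than a finished solution.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TableExtractor(HTMLParser):
    """Collect the markup of every top-level <table> element (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        # Keep entities such as &amp; as written instead of decoding them.
        self.convert_charrefs = False
        self.tables = []  # finished tables, as HTML strings
        self._parts = []  # pieces of the table currently being read
        self._depth = 0   # > 0 while inside a table; counts nested tables

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
        if self._depth:
            # get_starttag_text() returns the tag exactly as it appeared.
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: copy them through once.
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._parts.append("</%s>" % tag)
        if tag == "table":
            self._depth -= 1
            if self._depth == 0:
                self.tables.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._depth:
            self._parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self._depth:
            self._parts.append("&#%s;" % name)


def extract_tables(html):
    """Return a list of HTML strings, one per top-level table in `html`."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


sample = "<!-- <table> --> blah blah <table> this is the actual table </table>"
print(extract_tables(sample)[0])
# <table> this is the actual table </table>
```

With the question's code, you would call extract_tables(r.text) and write the first element of the result into the file in place of r.text. The second table, if there is one, is simply the next element of the list.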