I just completed level 2 of The Python Challenge on pythonchallenge.com. I am still learning Python, so please bear with any silly mistakes I may have made.
I am looking for some feedback about what I could have done better in my code. Two areas specifically:
- How could I have identified the comment section of the HTML file more easily? I used a roundabout method that scans backward from the end of the file until it reaches the start of the comment. It leaves some extra characters that I recognized and anticipated (the extra "-->" and "-"). What condition would have found this comment more directly, so I could put it in a new string to be counted?
This is what I wrote:
from collections import Counter
import requests
page = requests.get('http://www.pythonchallenge.com/pc/def/ocr.html')
pagetext = ""
pagetext = (page.text)
#find out what number we are going back to
i = 1
x = 4
testchar = ""
testcharstring = ""
while x == 4:
    testcharstring = pagetext[-i:]
    testchar = testcharstring[0]
    if testchar == "-":
        testcharstring = pagetext[-(i+1)]
        testchar = testcharstring[0]
        if testchar == "-":
            testcharstring = pagetext[-(i+2)]
            testchar = testcharstring[0]
            if testchar == "!":
                testcharstring = pagetext[-(i+3)]
                testchar = testcharstring[0]
                if testchar == "<":
                    x = 3
            else:
                i += 1
                x = 4
        else:
            i += 1
            x = 4
    else:
        i += 1
print(i)
newstring = pagetext[-i:]
charcount = Counter(newstring)
print(charcount)
And this is the source HTML:
<html>
<head>
<title>ocr</title>
<link rel="stylesheet" type="text/css" href="../style.css">
</head>
<body>
<center><img src="ocr.jpg">
<br><font color="#c03000">
recognize the characters. maybe they are in the book, <br>but MAYBE they
are in the page source.</center>
<br>
<br>
<br>
<font size="-1" color="gold">
General tips:
<li>Use the hints. They are helpful, most of the times.</li>
<li>Investigate the data given to you.</li>
<li>Avoid looking for spoilers.</li>
<br>
Forums: <a href="http://www.pythonchallenge.com/forums"/>Python Challenge Forums</a>,
read before you post.
<br>
IRC: irc.freenode.net #pythonchallenge
<br><br>
To see the solutions to the previous level, replace pc with pcc, i.e. go
to: http://www.pythonchallenge.com/pcc/def/ocr.html
</body>
</html>
<!--
find rare characters in the mess below:
-->
<!--
Followed by thousands of characters and the comment concludes with '-->'
Comment by yedpodtrzitko (Oct 13, 2020 at 12:12): Use an HTML parser (e.g. BeautifulSoup), which lets you find nodes that are comments: stackoverflow.com/a/33139458/217723
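As a rough illustration of the parser idea in the comment above, here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup, so it runs without extra installs. The class name CommentCollector and the sample_html string are made up for this example; the real page would come from page.text.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment the parser encounters."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->", markers excluded.
        self.comments.append(data)


# Hypothetical stand-in for the real page source.
sample_html = (
    "<html><body>hi</body></html>\n"
    "<!--\nfind rare characters in the mess below:\n-->\n"
    "<!--\n%$@a_^#b{}\n-->\n"
)

collector = CommentCollector()
collector.feed(sample_html)
collector.close()

# The challenge text lives in the last comment on the page.
last_comment = collector.comments[-1]
print(repr(last_comment))
```

With this approach there are no leftover "-->" or "-" characters to strip, because the parser hands over only the comment body.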
2 Answers
I don’t have enough reputation to comment, so I must say this in an answer. It looks clunky to use
while x == 4:
and then do
x = 3
whenever you want to break out of the loop. It looks better to do
while True:
and when you want to break out of the loop do
break
Cheers!
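To make the suggestion concrete, here is a small sketch of a backward scan written with while True and break instead of a flag variable. The function name last_comment_start and the sample string are invented for illustration; this is not the asker's exact loop, only the same idea in compact form.

```python
def last_comment_start(text):
    """Return the index where the last "<!--" starts, or -1 if absent."""
    i = len(text) - 4  # last position where a 4-character marker could start
    while True:
        if i < 0:
            return -1  # scanned past the beginning without a match
        if text[i:i + 4] == "<!--":
            break  # found the marker: leave the loop, no flag needed
        i -= 1
    return i


sample = "<!-- a --><!-- b -->"
print(last_comment_start(sample))
```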
Redundant Code
pagetext = ""
pagetext = (page.text)
The first line assigns an empty string to pagetext. The second line ignores the contents already in pagetext and assigns a different value to the variable.
Why bother with the first statement? It simply makes the code longer, slower, and harder to understand.
Why bother with the (...) around page.text? They also are not serving any purpose.
Variable Names
Variables like i are a double-edged sword. You're using it as a loop index, and then you're using it to reference a found location after the loop terminates. But i by itself doesn't have much meaning. posn might be clearer. last_comment_posn would be much clearer, though very verbose.
PEP 8 recommends using underscores to separate words in variable names: i.e., use char_count, not charcount, etc.
Searching for a string of characters
Python strings have built-in functions for searching for a substring in a larger string. For instance, str.find could rapidly find the first occurrence of <!-- in the page text.
i = pagetext.find("<!--")
But you're not looking for the first one; you're looking for the last one. Python again has you covered, with the reverse find function: str.rfind.
i = pagetext.rfind("<!--")
But this returns the index at which the last occurrence starts. You want the characters after the comment marker, so we need to skip forward 4 characters, the length of <!--:
if i >= 0:
    newstring = pagetext[i+4:]
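The difference between str.find and str.rfind is easy to check on a short string. The sketch below uses a made-up sample string rather than the real page, and relies only on the documented behavior that both methods return -1 when the substring is absent.

```python
sample = "<!-- first --><p>text</p><!-- second -->"
marker = "<!--"

first = sample.find(marker)     # index where the first marker starts
last = sample.rfind(marker)     # index where the last marker starts
missing = sample.rfind("<?--")  # -1, because this substring never occurs

if last >= 0:
    # Skip the marker itself to get everything after it.
    tail = sample[last + len(marker):]
    print(first, last, repr(tail))
```

Using len(marker) instead of a literal 4 keeps the offset correct if the marker ever changes.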
Improved code
import requests
from collections import Counter
page = requests.get('http://www.pythonchallenge.com/pc/def/ocr.html')
page.raise_for_status() # Crash if the request didn't succeed
page_text = page.text
posn = page_text.rfind("<!--")
print(posn)
if posn >= 0:
    comment_text = page_text[posn+4:] # Fix! This is to end of string, not end of comment!
    char_count = Counter(comment_text)
    print(char_count)
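One possible way to address the "Fix!" note above is to search forward from the marker for the closing "-->" and slice up to it. This is only a sketch under that assumption: the helper name last_comment_body and the sample_page string are hypothetical, and the real input would be page.text.

```python
from collections import Counter


def last_comment_body(page_text):
    """Return the text inside the last HTML comment, or None if there is none."""
    start = page_text.rfind("<!--")
    if start < 0:
        return None
    start += len("<!--")  # step past the opening marker
    end = page_text.find("-->", start)
    if end < 0:
        end = len(page_text)  # unterminated comment: take the rest of the text
    return page_text[start:end]


# Hypothetical stand-in for the real page source.
sample_page = "<html></html>\n<!--\nhint\n-->\n<!--\n%%a$$b@@\n-->\n"

body = last_comment_body(sample_page)
char_count = Counter(body)
print(char_count)
```

Because the slice stops before "-->", the counts no longer include the stray "-" and ">" characters from the closing marker.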