Counting Characters from an HTML File with Python

Question 1

I just completed level 2 of The Python Challenge on pythonchallenge.com and I am in the process of learning Python, so please bear with me and any silly mistakes I may have made.

I am looking for some feedback about what I could have done better in my code. Two areas specifically:

How could I have more easily identified the comment section of the HTML file? I used a roundabout method that found the beginning of the comment by scanning backwards from the end of the page. It gave me some extra characters that I recognized and anticipated (the extra "-->" and "-"). What condition would have better found this comment so I could put it in a new string to be counted?

This is what I wrote:

from collections import Counter
import requests
page = requests.get('http://www.pythonchallenge.com/pc/def/ocr.html')
pagetext = ""
pagetext = (page.text)
#find out what number we are going back to
i = 1
x = 4
testchar = ""
testcharstring = ""
while x == 4:
    testcharstring = pagetext[-i:]
    testchar = testcharstring[0]
    if testchar == "-":
        testcharstring = pagetext[-(i+1)]
        testchar = testcharstring[0]
        if testchar == "-":
            testcharstring = pagetext[-(i+2)]
            testchar = testcharstring[0]
            if testchar == "!":
                testcharstring = pagetext[-(i+3)]
                testchar = testcharstring[0]
                if testchar == "<":
                    x = 3
            else:
                i += 1
                x = 4
        else:
            i += 1
            x = 4
    else:
        i += 1
print(i)
newstring = pagetext[-i:]
charcount = Counter(newstring)
print(charcount)

And this is the source HTML:

<html>
<head>
 <title>ocr</title>
 <link rel="stylesheet" type="text/css" href="../style.css">
</head>
<body>
<center><img src="ocr.jpg">
<br><font color="#c03000">
recognize the characters. maybe they are in the book, <br>but MAYBE they 
are in the page source.</center>
<br>
<br>
<br>
<font size="-1" color="gold">
General tips:
<li>Use the hints. They are helpful, most of the times.</li>
<li>Investigate the data given to you.</li>
<li>Avoid looking for spoilers.</li>
<br>
Forums: <a href="http://www.pythonchallenge.com/forums"/>Python Challenge Forums</a>, 
read before you post.
<br>
IRC: irc.freenode.net #pythonchallenge
<br><br>
To see the solutions to the previous level, replace pc with pcc, i.e. go 
to: http://www.pythonchallenge.com/pcc/def/ocr.html
</body>
</html>
<!--
find rare characters in the mess below:
-->
<!--

Followed by thousands of characters and the comment concludes with '-->'

Comment

Use an HTML parser (e.g. BeautifulSoup), which lets you find the nodes that are comments: stackoverflow.com/a/33139458/217723
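
To illustrate the parser idea, here is a minimal sketch. It swaps BeautifulSoup for the standard library's html.parser so that it runs without third-party installs. The class name CommentCollector, the helper last_comment, and the sample_page string are all made up for this example; the sample is a stand-in for the downloaded page, not the real puzzle data.

```python
from collections import Counter
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the body of every <!-- ... --> comment it encounters."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # `data` is the text between "<!--" and "-->", markers excluded.
        self.comments.append(data)


def last_comment(html_text):
    """Return the text of the last comment in html_text, or None if absent."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    return collector.comments[-1] if collector.comments else None


# Hypothetical stand-in for the downloaded page, not the real puzzle data.
sample_page = "<html><body>hi</body></html>\n<!--\nhint\n-->\n<!--\n%%$@a$%b\n-->\n"
mess = last_comment(sample_page)
print(Counter(mess))
```

Because the parser hands over only the comment body, the stray "-->" and "-" characters never reach the counter.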

Answer 1 · fartgeek · 2020-10-13 13:36:48Z

I don’t have enough reputation to comment, so I must say this in an answer. It looks clunky to use

    while x == 4:

and then do

    x = 3

whenever you want to break out of the loop. It looks better to do

    while True:

and when you want to break out of the loop do

    break

Cheers!
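
As a rough illustration of the while True / break pattern applied to this problem, the sketch below scans backwards for a marker without any sentinel variable. The function name find_last_marker and the sample string are invented for the example, and it is not meant as a replacement for the built-in string search methods.

```python
def find_last_marker(text, marker="<!--"):
    """Scan backwards and return the index where the last `marker` starts, or -1."""
    start = len(text) - len(marker)
    while True:
        if start < 0:
            start = -1  # ran off the front: marker not present
            break
        if text[start:start + len(marker)] == marker:
            break  # found it; no sentinel variable needed
        start -= 1
    return start


# Hypothetical sample with two comments.
sample = "<!-- first --> body <!-- second -->"
print(find_last_marker(sample))
```

Each exit condition sits next to its own break, so the reader does not have to track what value of x keeps the loop alive.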

Welcome to CodeReview, you will fit right in ;)
– konijn, commented Oct 13, 2020 at 14:54

Thx man. I appreciate it.
– fartgeek, commented Oct 13, 2020 at 15:26

Answer 2 · AJNeufeld · 2020-10-13 22:51:30Z

Redundant Code

pagetext = ""
pagetext = (page.text)

The first line assigns an empty string to pagetext. The second line ignores the contents already in pagetext and assigns a different value to the variable.

Why bother with the first statement? It simply makes the code longer, slower, and harder to understand.

Why bother with the (...) around page.text? They also are not serving any purpose.

Variable Names

Variables like i are a double-edged sword. You're using it as a loop index, and then you're using it to reference a found location after the loop terminates. But i by itself doesn't have much meaning. posn might be clearer. last_comment_posn would be much clearer, though very verbose.

PEP-8 recommends using underscores to separate words in variable names, i.e. use char_count rather than charcount, page_text rather than pagetext, and so on.

Searching for a string of characters

Python strings have built-in methods for searching for a substring in a larger string. For instance, str.find returns the index of the first occurrence of <!-- in the page text, or -1 if there is none.

i = pagetext.find("<!--")

But you're not looking for the first one; you're looking for the last one. Python again has you covered, with the reverse find function: str.rfind.

i = pagetext.rfind("<!--")

But this only gives the index where the last occurrence begins. You want the characters after the comment marker, so we need to skip forward past its 4 characters:

if i >= 0:
    newstring = pagetext[i+4:]
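
A small runnable sketch of the difference between the two methods, using a made-up page_text string in place of the real download:

```python
# Hypothetical miniature of the page: two comments, the second holds the data.
page_text = "<p>intro</p><!-- hint --><!--ab%c-->"

first = page_text.find("<!--")    # index where the first marker starts
last = page_text.rfind("<!--")    # index where the last marker starts
missing = page_text.find("<?")    # both methods return -1 when nothing matches

marker_len = len("<!--")          # 4, the offset skipped in the snippet above
tail = page_text[last + marker_len:]

print(first, last, missing)
print(tail)
```

Note that tail still ends with the closing "-->", since the slice runs to the end of the string.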

Improved code

import requests
from collections import Counter

page = requests.get('http://www.pythonchallenge.com/pc/def/ocr.html')
page.raise_for_status()  # Crash if the request didn't succeed
page_text = page.text
posn = page_text.rfind("<!--")
print(posn)
if posn >= 0:
    comment_text = page_text[posn+4:]  # Fix! This is to end of string, not end of comment!
    char_count = Counter(comment_text)
    print(char_count)
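
One possible way to address the "Fix!" note in the code above is to look for the closing marker as well and slice between the two. This is only a sketch: the helper name extract_last_comment is made up, and sample_text is a hypothetical stand-in for page.text so the example runs without a network request.

```python
from collections import Counter


def extract_last_comment(page_text):
    """Return the text inside the last <!-- ... --> pair, or None if not found."""
    start = page_text.rfind("<!--")
    if start < 0:
        return None
    start += len("<!--")
    end = page_text.find("-->", start)  # first closing marker after the opener
    if end < 0:
        return None  # unterminated comment
    return page_text[start:end]


# Hypothetical stand-in for page.text; the real page would come from requests.
sample_text = "<html></html>\n<!--\nhint\n-->\n<!--\n%%$q@@$%u\n-->\n"
comment_text = extract_last_comment(sample_text)
char_count = Counter(comment_text)
print(char_count)
```

With the slice bounded on both sides, the counts no longer include the "-" and ">" characters from the closing marker.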
