Here is a simple web crawler I wrote in Python 3 that counts the words on each page of a domain, displays the word count per page, and sums them up.
I have tested it on a multitude of web pages, and it has worked correctly so far.
What do you think of my code?
```python
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup

base_url = "https://independencerpgs.com/"
html = urlopen(base_url)
bsObj = BeautifulSoup(html.read(), features="html.parser");

urls = []
for link in bsObj.find_all('a'):
    if link.get('href') not in urls:
        urls.append(link.get('href'))
    else:
        pass
print(urls)

words = 0
for url in urls:
    if url not in ["NULL", "_blank", "None", None, "NoneType"]:
        if url[0] == "/":
            url = url[1:]
        if base_url in url:
            if base_url == url:
                pass
            if base_url != url and "https://" in url:
                url = url[len(base_url)-1:]
        if "http://" in url:
            specific_url = url
        elif "https://" in url:
            specific_url = url
        else:
            specific_url = base_url + url
        r = requests.get(specific_url)
        soup = BeautifulSoup(r.text, features="html.parser")
        for script in soup(["script", "style"]):
            script.extract()
        text = soup.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        text_list = text.split()
        print(f"{specific_url}: {len(text_list)} words")
        words += len(text_list)
    else:
        pass

print(f"Total: {words} words")
```
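As an aside, the scheme and leading-slash branching above could also be handled by `urllib.parse.urljoin` from the standard library, which resolves relative hrefs against a base URL and leaves absolute ones untouched. A minimal sketch (the example paths here are illustrative, not taken from the site):

```python
from urllib.parse import urljoin

base_url = "https://independencerpgs.com/"

# Relative hrefs are resolved against the base URL.
print(urljoin(base_url, "/about"))          # https://independencerpgs.com/about
print(urljoin(base_url, "contact.html"))    # https://independencerpgs.com/contact.html

# Absolute hrefs are passed through unchanged.
print(urljoin(base_url, "https://example.com/page"))  # https://example.com/page
```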
- "How resilient/correct of a program are you aiming for? The main difficulty I see is finding and validating all the URLs. Also, what do you mean by domain, exactly?" (AMC, Mar 13, 2020 at 22:54)
- "I might have chosen the wrong term; the idea is to scan a whole site, find all the htm/html pages on it, and count the words in each." (Omer G. Joel, Mar 13, 2020 at 23:27)
- "*the idea is to scan a whole site* Your code appears to be parsing the links found on a single page, though. *all the htm/html pages on it* Can you be more specific still? *count the words in each* The issue of defining what a word is still stands." (AMC, Mar 13, 2020 at 23:31)
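To make the last comment's point concrete, here is a small sketch of one possible definition of a "word" (a regex for runs of letters with internal apostrophes or hyphens; this is one arbitrary choice, not the only reasonable one) compared with plain whitespace splitting:

```python
import re

text = "It's state-of-the-art -- really!"

# Whitespace splitting keeps punctuation glued to tokens and
# counts the bare "--" as a word.
whitespace_words = text.split()

# Regex definition: letter runs joined by apostrophes or hyphens.
regex_words = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)

print(len(whitespace_words))  # 4
print(len(regex_words))       # 3
```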
1 Answer
It's readable and gets the job done as a script, but there is some redundancy in your `if` cases. Here's a streamlined version:
```python
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup

base_url = "https://independencerpgs.com/"
html = urlopen(base_url)
bsObj = BeautifulSoup(html.read(), features="html.parser")

# Sets handle duplicate items automatically
urls = set()
for link in bsObj.find_all('a'):
    urls.add(link.get('href'))
print(urls)

words = 0
for url in urls:
    # Having the logic this way saves 4 spaces of indent on all the logic
    if url in ["NULL", "_blank", "None", None, "NoneType", base_url]:
        continue
    # The == base_url case is handled in the above `if`
    if url[0] == "/":
        specific_url = base_url + url  # requests.get does not care about the number of '/'
    else:
        specific_url = url
    r = requests.get(specific_url)
    soup = BeautifulSoup(r.text, features="html.parser")
    for script in soup(["script", "style"]):
        # Use clear rather than extract
        script.clear()
    # text is already plain text; you don't need to preprocess it just yet
    text = soup.get_text()
    print(f"{specific_url}: {len(text)} words")
    words += len(text)

print(f"Total: {words} words")
```
Depending on what you intend to do with this code, you might want to put it into functions or even build a class around it.
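As a rough sketch of that refactor (the function boundaries and names here are my own illustration, not a prescribed structure):

```python
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup


def collect_links(base_url):
    """Return the set of href values found on the base page."""
    html = urlopen(base_url)
    soup = BeautifulSoup(html.read(), features="html.parser")
    return {link.get('href') for link in soup.find_all('a')}


def count_words_in_html(html_text):
    """Count whitespace-separated words in a page's visible text."""
    soup = BeautifulSoup(html_text, features="html.parser")
    for tag in soup(["script", "style"]):
        tag.clear()
    return len(soup.get_text().split())


def crawl(base_url):
    """Print a per-page word count for every link on the base page."""
    total = 0
    for url in collect_links(base_url):
        if url in ["NULL", "_blank", "None", None, "NoneType", base_url]:
            continue
        specific_url = base_url + url if url.startswith("/") else url
        count = count_words_in_html(requests.get(specific_url).text)
        print(f"{specific_url}: {count} words")
        total += count
    print(f"Total: {total} words")
```

Calling `crawl(base_url)` then reproduces the script's behavior, and `count_words_in_html` can be tested on its own without any network access.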
- "Thanks! Will probably include it in a function once I create a UI for it, where the user enters the URL rather than have it coded into the script itself." (Omer G. Joel, Mar 13, 2020 at 7:39)
- "Looking at the `words += len(text)` statement, this script counts characters, not words. You need `words += len(text.split())`. The OP has this correctly." (Fijoy Vadakkumpadan, Dec 4, 2022 at 2:14)
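A minimal illustration of the difference that comment describes:

```python
text = "the quick brown fox"

# len(text) counts characters, spaces included.
print(len(text))          # 19

# len(text.split()) counts whitespace-separated words.
print(len(text.split()))  # 4
```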