\$\begingroup\$

Here is a simple web crawler I wrote in Python 3 that counts the words in each page in a domain, displays word count per page, and sums them up.

Tested on a multitude of web pages, works correctly so far.

What do you think of my code?

import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
base_url = "https://independencerpgs.com/"
html = urlopen(base_url)
bsObj = BeautifulSoup(html.read(), features="html.parser");
urls=[]
for link in bsObj.find_all('a'):
    if link.get('href') not in urls:
        urls.append(link.get('href'))
    else:
        pass
print(urls)
words=0
for url in urls:
    if url not in ["NULL", "_blank", "None", None, "NoneType"]:
        if url[0] == "/":
            url=url[1:]
        if base_url in url:
            if base_url == url:
                pass
            if base_url != url and "https://"in url:
                url=url[len(base_url)-1:]
        if "http://" in url:
            specific_url=url
        elif "https://" in url:
            specific_url = url
        else:
            specific_url = base_url + url
        r = requests.get(specific_url)
        soup = BeautifulSoup(r.text, features="html.parser")
        for script in soup(["script", "style"]):
            script.extract()
        text = soup.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        text_list = text.split()
        print(f"{specific_url}: {len(text_list)} words")
        words += len(text_list)
    else:
        pass
print(f"Total: {words} words")
asked Mar 12, 2020 at 18:38
\$\endgroup\$
  • \$\begingroup\$ How resilient/correct of a program are you aiming for? The main difficulty I see is finding and validating all the URLs. Also, what do you mean by domain, exactly? \$\endgroup\$ Commented Mar 13, 2020 at 22:54
  • \$\begingroup\$ I might have chosen the wrong term; the idea is to scan a whole site, find all the htm/html pages on it, and count the words in each. \$\endgroup\$ Commented Mar 13, 2020 at 23:27
  • 1
    \$\begingroup\$ "the idea is to scan a whole site": Your code appears to be parsing the links found on a single page, though. "all the htm/html pages on it": Can you be more specific still? "count the words in each": The issue of defining what a word is still stands. \$\endgroup\$ Commented Mar 13, 2020 at 23:31
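
The two points raised in these comments (validating the URLs, and following links beyond the first page) can be sketched roughly as below. This is only an illustration under assumptions, not the asker's code: `normalize_link`, `crawl_site`, and the `get_links` callback are hypothetical names, and `get_links` stands in for whatever fetches a page and returns its raw `href` values. Links are resolved with `urllib.parse.urljoin` instead of manual string slicing, and the walk is a plain breadth-first search restricted to the starting host.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(page_url, href):
    """Resolve a raw href against the page it was found on.

    Returns an absolute http(s) URL without its #fragment, or None
    when the href is missing or uses another scheme (mailto:, etc.).
    """
    if not href:
        return None
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


def crawl_site(start_url, get_links):
    """Breadth-first walk over pages on the same host as start_url.

    get_links is a hypothetical callable (page URL -> iterable of raw
    hrefs); in a real script it would download and parse the page.
    Returns the visited URLs in the order they were reached.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        page = queue.popleft()
        visited.append(page)
        for href in get_links(page):
            link = normalize_link(page, href)
            # Skip unusable links, other hosts, and pages already queued.
            if link and urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    # A made-up three-page site, so the sketch runs without network access.
    fake_site = {
        "https://example.com/": ["/a.html", "https://other.org/x", None],
        "https://example.com/a.html": ["/", "b.html#notes"],
        "https://example.com/b.html": [],
    }
    print(crawl_site("https://example.com/", lambda url: fake_site.get(url, [])))
```

Passing the link fetcher in as a parameter keeps the traversal logic testable on its own; whether a page counts as "htm/html" would still need a separate check, for example on the response's content type.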

1 Answer

\$\begingroup\$

It's readable and gets the job done as a script, but there is some redundancy in your if cases. Here's a streamlined version:

import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
base_url = "https://independencerpgs.com/"
html = urlopen(base_url)
bsObj = BeautifulSoup(html.read(), features="html.parser");
# Sets handle duplicate items automatically
urls=set()
for link in bsObj.find_all('a'):
    urls.add(link.get('href'))
print(urls)
words=0
for url in urls:
    # Having the logic this way saves 4 spaces indent on all the logic
    if url in ["NULL", "_blank", "None", None, "NoneType", base_url]:
        continue
    # The == base_url case is handled in the above `if`
    if url[0] == "/":
        specific_url = base_url + url # requests.get does not care about the num of '/'
    else:
        specific_url = url
    r = requests.get(specific_url)
    soup = BeautifulSoup(r.text, features="html.parser")
    for script in soup(["script", "style"]):
        # Use clear rather than extract
        script.clear()
    # text is text you don't need to preprocess it just yet.
    text = soup.get_text()
    print(f"{specific_url}: {len(text)} words")
    words += len(text)
print(f"Total: {words} words")

Depending on what you intend to do with that code, you might want to put it into functions or even create a class around it too.
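
As a rough idea of what "putting it into functions" could look like, here is a minimal sketch of the per-page counting step as a standalone function. It is an assumption-laden illustration rather than a drop-in replacement: the names `_VisibleText` and `count_words_in_html` are made up for this example, and it deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages. A "word" here is simply any whitespace-separated token.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def count_words_in_html(html):
    """Return the number of whitespace-separated tokens in the page text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.parts).split())


if __name__ == "__main__":
    sample = "<html><body><h1>Hello world</h1><p>Three more words</p></body></html>"
    print(count_words_in_html(sample))
```

Because the function takes an HTML string and returns a number, the download step (`requests.get`) and the printing stay outside it, which makes the counting easy to check against small hand-written pages.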

answered Mar 12, 2020 at 23:07
\$\endgroup\$
  • \$\begingroup\$ Thanks! Will probably include it in a function once I create a UI for it, where the user enters the URL rather than have it coded into the script itself. \$\endgroup\$ Commented Mar 13, 2020 at 7:39
  • \$\begingroup\$ Looking at the "words += len(text)" statement, this script counts characters, not words. You need "words += len(text.split())". The OP has this correctly. \$\endgroup\$ Commented Dec 4, 2022 at 2:14
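
To make the difference described in the last comment concrete, here is a small sketch comparing the two counts on a plain string. The helper names are invented for this example, and the third function is just one possible alternative definition of a "word" (runs of letters, digits, and underscores), not something either post uses.

```python
import re


def count_chars(text):
    """What len(text) measures: every character, spaces included."""
    return len(text)


def count_words(text):
    """Whitespace-separated tokens, as in the question's len(text.split())."""
    return len(text.split())


def count_alnum_words(text):
    """A stricter, assumed definition: runs of word characters only."""
    return len(re.findall(r"\w+", text))


if __name__ == "__main__":
    sample = "Roll two dice - then add them."
    print(count_chars(sample), count_words(sample), count_alnum_words(sample))
```

On the sample string the lone dash counts as a word under `split()` but not under the regular expression, which is the kind of ambiguity the earlier comment about defining a word points at.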
