Here is a simple web crawler I wrote in Python 3 that counts the words on each page of a domain, displays the word count per page, and sums them up.
I have tested it on a multitude of web pages, and it has worked correctly so far.
What do you think of my code?
```python
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup

base_url = "https://independencerpgs.com/"
html = urlopen(base_url)
bsObj = BeautifulSoup(html.read(), features="html.parser");

urls = []
for link in bsObj.find_all('a'):
    if link.get('href') not in urls:
        urls.append(link.get('href'))
    else:
        pass
print(urls)

words = 0
for url in urls:
    if url not in ["NULL", "_blank", "None", None, "NoneType"]:
        if url[0] == "/":
            url = url[1:]
        if base_url in url:
            if base_url == url:
                pass
            if base_url != url and "https://" in url:
                url = url[len(base_url)-1:]
        if "http://" in url:
            specific_url = url
        elif "https://" in url:
            specific_url = url
        else:
            specific_url = base_url + url
        r = requests.get(specific_url)
        soup = BeautifulSoup(r.text, features="html.parser")
        for script in soup(["script", "style"]):
            script.extract()
        text = soup.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        text_list = text.split()
        print(f"{specific_url}: {len(text_list)} words")
        words += len(text_list)
    else:
        pass

print(f"Total: {words} words")
```
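As an aside, the scheme and leading-slash branching above could also be handled by `urllib.parse.urljoin` from the standard library, which resolves relative hrefs against a base URL and leaves absolute ones untouched. A minimal sketch (the example paths here are illustrative, not taken from the site):

```python
from urllib.parse import urljoin

base_url = "https://independencerpgs.com/"

# Relative hrefs are resolved against the base URL.
print(urljoin(base_url, "/about"))          # https://independencerpgs.com/about
print(urljoin(base_url, "contact.html"))    # https://independencerpgs.com/contact.html

# Absolute hrefs are passed through unchanged.
print(urljoin(base_url, "https://example.com/page"))  # https://example.com/page
```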
- "How resilient/correct of a program are you aiming for? The main difficulty I see is finding and validating all the URLs. Also, what do you mean by domain, exactly?" (AMC, Mar 13, 2020 at 22:54)
- "I might have chosen the wrong term; the idea is to scan a whole site, find all the htm/html pages on it, and count the words in each." (Omer G. Joel, Mar 13, 2020 at 23:27)
- "*the idea is to scan a whole site* Your code appears to be parsing the links found on a single page, though. *all the htm/html pages on it* Can you be more specific still? *count the words in each* The issue of defining what a word is still stands." (AMC, Mar 13, 2020 at 23:31)
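To make the last comment's point concrete, here is a small sketch of one possible definition of a "word" (a regex for runs of letters with internal apostrophes or hyphens; this is one arbitrary choice, not the only reasonable one) compared with plain whitespace splitting:

```python
import re

text = "It's state-of-the-art -- really!"

# Whitespace splitting keeps punctuation glued to tokens and
# counts the bare "--" as a word.
whitespace_words = text.split()

# Regex definition: letter runs joined by apostrophes or hyphens.
regex_words = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)

print(len(whitespace_words))  # 4
print(len(regex_words))       # 3
```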
1 Answer
It's readable and gets the job done as a script, but there is some redundancy in your `if` cases. Here's a streamlined version:
```python
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup

base_url = "https://independencerpgs.com/"
html = urlopen(base_url)
bsObj = BeautifulSoup(html.read(), features="html.parser")

# Sets handle duplicate items automatically
urls = set()
for link in bsObj.find_all('a'):
    urls.add(link.get('href'))
print(urls)

words = 0
for url in urls:
    # Having the logic this way saves 4 spaces of indent on all the logic
    if url in ["NULL", "_blank", "None", None, "NoneType", base_url]:
        continue
    # The == base_url case is handled in the above `if`
    if url[0] == "/":
        specific_url = base_url + url  # requests.get does not care about the number of '/'
    else:
        specific_url = url
    r = requests.get(specific_url)
    soup = BeautifulSoup(r.text, features="html.parser")
    for script in soup(["script", "style"]):
        # Use clear rather than extract
        script.clear()
    # text is already plain text; you don't need to preprocess it just yet
    text = soup.get_text()
    print(f"{specific_url}: {len(text)} words")
    words += len(text)

print(f"Total: {words} words")
```
Depending on what you intend to do with this code, you might want to put it into functions or even build a class around it.
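As a rough sketch of that refactor (the function boundaries and names here are my own illustration, not a prescribed structure):

```python
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup


def collect_links(base_url):
    """Return the set of href values found on the base page."""
    html = urlopen(base_url)
    soup = BeautifulSoup(html.read(), features="html.parser")
    return {link.get('href') for link in soup.find_all('a')}


def count_words_in_html(html_text):
    """Count whitespace-separated words in a page's visible text."""
    soup = BeautifulSoup(html_text, features="html.parser")
    for tag in soup(["script", "style"]):
        tag.clear()
    return len(soup.get_text().split())


def crawl(base_url):
    """Print a per-page word count for every link on the base page."""
    total = 0
    for url in collect_links(base_url):
        if url in ["NULL", "_blank", "None", None, "NoneType", base_url]:
            continue
        specific_url = base_url + url if url.startswith("/") else url
        count = count_words_in_html(requests.get(specific_url).text)
        print(f"{specific_url}: {count} words")
        total += count
    print(f"Total: {total} words")
```

Calling `crawl(base_url)` then reproduces the script's behavior, and `count_words_in_html` can be tested on its own without any network access.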
- "Thanks! Will probably include it in a function once I create a UI for it, where the user enters the URL rather than have it coded into the script itself." (Omer G. Joel, Mar 13, 2020 at 7:39)
- "Looking at the `words += len(text)` statement, this script counts characters, not words. You need `words += len(text.split())`. The OP has this correctly." (Fijoy Vadakkumpadan, Dec 4, 2022 at 2:14)
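A minimal illustration of the difference that comment describes:

```python
text = "the quick brown fox"

# len(text) counts characters, spaces included.
print(len(text))          # 19

# len(text.split()) counts whitespace-separated words.
print(len(text.split()))  # 4
```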