Context
I'm returning to Python development after a break and actively learning AI/ML. As part of my learning journey, I'm building a dozen pet projects to strengthen my skills. This is my first project: textnano - a minimal text dataset builder inspired by lazynlp.
Purpose: Educational tool for ML students who want to quickly create clean text datasets from various sources (Wikipedia, Reddit, Project Gutenberg, etc.) without dealing with complex dependencies.
Next steps: I plan to build a similar library for crawling audio data, with the end goal of training a voice cloning model from scratch.
Project Repository: https://github.com/Rustem/textnano
What I'm Looking For
I would appreciate feedback on:
- Code organization and structure
- Python best practices and idioms
- Error handling and edge cases
- Function API design and usability
- Performance considerations
- Any security concerns with web scraping
Key Design Principles
- Zero dependencies - Uses only Python standard library
- Simple API - Easy for beginners to understand
- Educational focus - Code should be readable and well-commented
- Lightweight - ~200 lines of code total
Installation:
# Install from source
pip install -e .
# Or install from PyPI (when published)
pip install textnano
Usage:
# Wikipedia (requires wikiextractor preprocessing)
# 1. Install wikiextractor: pip install wikiextractor
# 2. Extract from dump: python -m wikiextractor.WikiExtractor enwiki-latest.xml.bz2 --json -o wiki_json/
# 3. Extract URLs:
textnano wikipedia wiki_json/ --output wikipedia_urls.txt --max 10000
# 4. Build dataset:
textnano urls wikipedia_urls.txt wiki_dataset/
# Reddit (from pre-extracted URL files)
# 1. Download from: https://drive.google.com/file/d/1hRtA3zZ0K5UHKOQ0_8d0BIc_1VyxgY51/view
# 2. Extract and merge URLs:
textnano reddit reddit_urls/ --output reddit_urls.txt --max 5000
# 3. Build dataset:
textnano urls reddit_urls.txt reddit_dataset/
# Project Gutenberg
# 1. Generate URLs (checks each book ID):
textnano gutenberg --output gutenberg_urls.txt --max-id 1000
# 2. Build dataset:
textnano urls gutenberg_urls.txt books_dataset/
Please review core.py:
#!/usr/bin/env python3
"""
textnano.py - Minimal text dataset builder (nano lazynlp)
A single-file library to build text datasets from web URLs.
Perfect for ML students who just want clean text quickly.
Usage:
python textnano.py urls.txt output/
Or in code:
import textnano
textnano.download_and_clean('urls.txt', 'output/')
Dependencies: ZERO (pure Python stdlib)
Lines of code: ~200
"""
import os
import re
import html
import urllib.request
import hashlib
import ssl
from pathlib import Path
from .config import DEFAULT_EXCLUDE_DOMAINS, DEFAULT_EXCLUDE_EXTENSIONS
from .utils import print_stats, estimate_dataset_size, merge_datasets
# =============================================================================
# DOWNLOAD
# =============================================================================
def download_text(url, timeout=30):
"""Download and extract text from a URL.
Returns:
str or None: Cleaned text content, or None if failed
"""
try:
# Download
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=headers)
# Create SSL context that doesn't verify certificates
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
with urllib.request.urlopen(req, timeout=timeout, context=context) as response:
content = response.read().decode('utf-8', errors='ignore')
# Basic HTML cleaning
text = clean_html(content)
return text if text.strip() else None
except Exception:
return None
# =============================================================================
# CLEANING
# =============================================================================
def clean_html(html_content):
"""Remove HTML tags and clean text.
Args:
html_content: Raw HTML string
Returns:
str: Clean text
"""
# Unescape HTML entities
text = html.unescape(html_content)
# Remove script and style tags
text = re.sub(r'<script[^>]*>.*?</script>', '', text, flags=re.DOTALL | re.IGNORECASE)
text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL | re.IGNORECASE)
# Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Normalize whitespace
text = re.sub(r'\s+', ' ', text)
text = re.sub(r'\n\s*\n', '\n\n', text)
# Remove leading/trailing whitespace
text = text.strip()
return text
# =============================================================================
# DEDUPLICATION
# =============================================================================
def text_fingerprint(text, n=8):
"""Create fingerprint of text using first N words.
Args:
text: Input text
n: Number of words to use (default: 8)
Returns:
str: MD5 hash of first N words
"""
words = text.lower().split()[:n]
fingerprint_text = ' '.join(words)
return hashlib.md5(fingerprint_text.encode()).hexdigest()
def is_duplicate(text, seen_fingerprints, threshold=0.8):
"""Check if text is duplicate based on fingerprint.
Args:
text: Text to check
seen_fingerprints: Set of seen fingerprints
threshold: Not used in this simple version
Returns:
bool: True if duplicate
"""
fp = text_fingerprint(text)
if fp in seen_fingerprints:
return True
seen_fingerprints.add(fp)
return False
# =============================================================================
# MAIN PIPELINE
# =============================================================================
def download_and_clean(url_file, output_dir, min_words=50, max_urls=None,
exclude_domains=None, exclude_extensions=None,
use_default_excludes=True):
"""Download text from URLs, clean, and deduplicate.
Args:
url_file: Path to file with one URL per line
output_dir: Directory to save text files
min_words: Minimum words per document (default: 50)
max_urls: Maximum URLs to process (default: None = all)
exclude_domains: List of domains to exclude (default: None, uses defaults if use_default_excludes=True)
exclude_extensions: List of file extensions to exclude (default: None, uses defaults if use_default_excludes=True)
use_default_excludes: Use default exclusion lists (default: True)
Output structure:
output_dir/
├── 0001.txt # Text files
├── 0002.txt
├── success.txt # Successfully processed URLs
└── failed.txt # Failed URLs
Returns:
dict: Statistics {success: int, failed: int, duplicates: int}
"""
# Setup
os.makedirs(output_dir, exist_ok=True)
# Normalize filters
if use_default_excludes:
exclude_domains = set(exclude_domains or []) | set(DEFAULT_EXCLUDE_DOMAINS)
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or [])) | set(DEFAULT_EXCLUDE_EXTENSIONS)
else:
exclude_domains = set(exclude_domains or [])
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or []))
# Read URLs
with open(url_file) as f:
urls = [line.strip() for line in f if line.strip()]
if max_urls:
urls = urls[:max_urls]
# Open log files
success_log = open(os.path.join(output_dir, 'success.txt'), 'w')
failed_log = open(os.path.join(output_dir, 'failed.txt'), 'w')
# Deduplication
seen_fingerprints = set()
# Counters
stats = {'success': 0, 'failed': 0, 'duplicates': 0, 'too_short': 0, 'excluded': 0}
# Process each URL
print(f"Processing {len(urls)} URLs...")
for idx, url in enumerate(urls, 1):
print(f"[{idx}/{len(urls)}] {url[:60]}...")
# Check exclusion filters
from urllib.parse import urlparse
parsed = urlparse(url)
# Check domain exclusion
if exclude_domains and any(domain in parsed.netloc for domain in exclude_domains):
failed_log.write(f"{url}\texcluded_domain\n")
stats['excluded'] += 1
print(" ⊘ Excluded domain")
continue
# Check extension exclusion
if exclude_extensions:
path_lower = parsed.path.lower()
if any(path_lower.endswith(f'.{ext}') for ext in exclude_extensions):
failed_log.write(f"{url}\texcluded_extension\n")
stats['excluded'] += 1
print(" ⊘ Excluded extension")
continue
# Download
text = download_text(url)
if not text:
failed_log.write(f"{url}\n")
stats['failed'] += 1
print(" ✗ Failed to download")
continue
# Check length
word_count = len(text.split())
if word_count < min_words:
failed_log.write(f"{url}\ttoo_short:{word_count}\n")
stats['too_short'] += 1
print(f" ⊘ Too short ({word_count} words)")
continue
# Check duplicate
if is_duplicate(text, seen_fingerprints):
stats['duplicates'] += 1
print(" ⊘ Duplicate")
continue
# Save
output_file = os.path.join(output_dir, f"{stats['success']+1:04d}.txt")
with open(output_file, 'w') as f:
f.write(f"{url}\n\n") # First line = URL
f.write(text)
success_log.write(f"{url}\n")
stats['success'] += 1
print(f" ✓ Saved ({word_count} words)")
# Cleanup
success_log.close()
failed_log.close()
# Print summary
print_stats(stats)
return stats
# =============================================================================
# CLI
# =============================================================================
def main():
"""Command-line interface."""
import sys
import argparse
# Check for simple commands (backward compatibility)
if len(sys.argv) >= 2 and sys.argv[1] == 'stats':
if len(sys.argv) < 3:
print("Usage: textnano stats <dir>")
sys.exit(1)
stats = estimate_dataset_size(sys.argv[2])
print(f"Files: {stats['files']}")
print(f"Words: {stats['words']:,}")
print(f"Size: {stats['mb']:.1f} MB")
print(f"Avg/file: {stats['avg_words_per_file']} words")
return
if len(sys.argv) >= 2 and sys.argv[1] == 'merge':
if len(sys.argv) < 4:
print("Usage: textnano merge <dir1> <dir2> ... <output_dir>")
sys.exit(1)
output = sys.argv[-1]
inputs = sys.argv[2:-1]
merge_datasets(*inputs, output_dir=output, is_duplicate_func=is_duplicate)
return
# Parse arguments
parser = argparse.ArgumentParser(
description='textnano - Minimal text dataset builder',
formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument('url_file', help='File with URLs (one per line)')
parser.add_argument('output_dir', help='Output directory')
parser.add_argument('max_urls', nargs='?', type=int, default=None,
help='Maximum URLs to process')
parser.add_argument('--exclude-domains', '-ed', nargs='+',
help='Additional domains to exclude (adds to defaults)')
parser.add_argument('--exclude-extensions', '-ee', nargs='+',
help='Additional file extensions to exclude (adds to defaults)')
parser.add_argument('--no-default-excludes', action='store_true',
help='Disable default exclusion lists (only use custom excludes)')
args = parser.parse_args()
# Download command
stats = download_and_clean(
args.url_file,
args.output_dir,
max_urls=args.max_urls,
exclude_domains=args.exclude_domains,
exclude_extensions=args.exclude_extensions,
use_default_excludes=not args.no_default_excludes
)
# Show dataset stats
dataset_stats = estimate_dataset_size(args.output_dir)
print(f"\nDataset: {dataset_stats['files']} files, "
f"{dataset_stats['words']:,} words, "
f"{dataset_stats['mb']:.1f} MB")
if __name__ == '__main__':
main()
# =============================================================================
# USAGE EXAMPLES (copy these to test)
# =============================================================================
"""
# Example 1: Basic usage
python textnano.py urls.txt dataset/
# Example 2: Limit to 100 URLs
python textnano.py urls.txt dataset/ 100
# Example 3: In Python
import textnano
textnano.download_and_clean('urls.txt', 'output/')
stats = textnano.estimate_dataset_size('output/')
print(f"Got {stats['words']:,} words")
# Example 4: Create sample URLs file
cat > urls.txt << EOF
https://en.wikipedia.org/wiki/Machine_learning
https://en.wikipedia.org/wiki/Deep_learning
https://en.wikipedia.org/wiki/Natural_language_processing
https://en.wikipedia.org/wiki/Computer_vision
https://www.gutenberg.org/files/1342/1342-h/1342-h.htm
EOF
# Example 5: Get stats
python textnano.py stats dataset/
# Example 6: Merge datasets
python textnano.py merge dataset1/ dataset2/ merged/
"""
Comments:
- Not sure if it is in the scope of code review, but using a non-descriptive user agent and not following robots.txt is bad behavior. (jpa, Oct 29 at 16:46)
- Addressed most of the feedback. (Rustem K, Oct 30 at 5:50)
- Thank you everyone! (Rustem K, Oct 30 at 5:52)
- @RustemK: Since you found the answers helpful, feel free to Accept one of them. (toolic, Nov 1 at 12:54)
4 Answers
comments
Lines of code: ~200
I'm sure that was true when you wrote it. Looks like you've added a hundred or so lines since then. Comments tend to bit rot as the codebase evolves, which is why we might be reluctant to put too many specifics into them.
deps
Dependencies: ZERO
This is certainly true.
I get it that we had a design goal.
I'm not sure I agree with the goal.
It's the rare project that doesn't need its own .venv/
Someone who uses this will likely want to pull in
the good old familiar requests package before long.
And I can't imagine why you wouldn't want some help
with throttling request rate or obeying /robots.txt,
since they're essential and they are not this project's core concerns.
It's just table stakes -- a big crawler has to be a good netizen.
I get the sense that you wished to avoid depending upon
lazynlp.
But I wish that you had.
More on that below.
Then we would have an up-to-date TLDextract dep, plus justext,
which I think is the big thing you wanted to evict from the deps
and which seems a nice enough library to me.
BTW, though it's not published on pypi,
you can still use a GitHub repo URL to depend on lazynlp.
You can even bake in a particular immutable commit hash.
pypi
Recommend you publish version 0.1.0 sooner than later, to reserve the project name.
Recommend that you git rm setup.py.
In the modern packaging ecosystem
we simply rely on the pyproject.toml config,
which looks good to me.
Keeping redundant
boilerplate in two files instead of one seems undesirable.
ASCII art
Recommend you avoid long decorative ========== lines.
Plus, most of those comments are redundant with
the (well chosen) function names.
The only place where they're helpfully organizing things is for dedup.
Recommend that you take advantage of the language's ability
to organize code concepts,
by evicting text_fingerprint() and is_duplicate()
to a new dedup.py module.
BTW thank you, the utils.py module looks well organized.
type annotation
def download_text(url, timeout=30):
"""Download and extract text from a URL.
Returns:
str or None: Cleaned text content, or None if failed
"""
Rather than offer some "str or None" narrative text
for humans to read, prefer to put -> str | None: in the signature.
Then everyone can read it, including mypy and pyrefly.
Down in clean_html() it's enough to say we accept and return str values.
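A sketch of what that could look like, assuming Python 3.10+ for the | union syntax (older interpreters would use typing.Optional[str]):
def download_text(url: str, timeout: int = 30) -> str | None:
    """Download and extract text from a URL; None means the fetch failed."""
    ...

def clean_html(html_content: str) -> str:
    """Remove HTML tags and normalize whitespace."""
    ...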
design of Public API
It's unclear why return None is an advantage to the caller.
Sounds like it's just one more thing to check, one more possible pitfall.
The single call site already does essentially:
if text is None or len(text)==0: # report download failure
But of course a URL could plausibly give an empty 200 Success document.
I'm just not seeing how distinguishing None from "" helps us.
request pool
We don't appear to be holding open a port 443 connection
when e.g. we ask Gutenberg for Tom Sawyer and then for Huck Finn.
Given that this project is fundamentally about interacting
with web servers, I have to disagree with your decision to
jettison the familiar and helpful requests package.
The only reason it's not in Batteries Included is it needs
to release more often than annual interpreter releases,
given how Internet protocols and conditions keep changing so often.
Creating lots of SSL contexts seems needlessly painful
and nitty gritty. Just pass in a verify=False parameter
and let the network library sweat those details.
Consider using import protego
for help with robots.txt and request rate.
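A rough sketch of what that could look like, assuming the requests and protego packages; polite_get and allowed_by_robots are invented names, and the user-agent string is just a placeholder:
import time
import requests
from urllib.parse import urlparse
from protego import Protego

USER_AGENT = "textnano/0.1 (+https://github.com/Rustem/textnano)"   # descriptive, identifiable UA

session = requests.Session()                  # reuses the TLS connection between requests
session.headers["User-Agent"] = USER_AGENT
_robots_cache = {}                            # host -> parsed robots.txt

def allowed_by_robots(url):
    """Fetch and cache robots.txt for the URL's host, then ask protego."""
    host = urlparse(url).netloc
    if host not in _robots_cache:
        try:
            body = session.get(f"https://{host}/robots.txt", timeout=10).text
        except requests.RequestException:
            body = ""                         # no robots.txt reachable: allow everything
        _robots_cache[host] = Protego.parse(body)
    return _robots_cache[host].can_fetch(url, USER_AGENT)

def polite_get(url, timeout=30):
    """Return page text, or None if robots.txt disallows it or the request fails."""
    if not allowed_by_robots(url):
        return None
    delay = _robots_cache[urlparse(url).netloc].crawl_delay(USER_AGENT)
    if delay:
        time.sleep(delay)                     # honour Crawl-delay when the site sets one
    try:
        resp = session.get(url, timeout=timeout)
        resp.raise_for_status()               # certificates are verified by default
        return resp.text
    except requests.RequestException:
        return None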
async
Ship version 0.1.0 to pypi using the current design.
But consider relying on import httpx in a subsequent iteration,
so you can have e.g. a connection to Reddit and a connection to Gutenberg
getting useful download work done concurrently.
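A sketch of that later iteration, assuming httpx as the client (fetch_all is an invented name):
import asyncio
import httpx

async def fetch_all(urls, concurrency=8):
    """Download several URLs concurrently; a None value marks a failed fetch."""
    limits = httpx.Limits(max_connections=concurrency)
    async with httpx.AsyncClient(limits=limits, timeout=30.0,
                                 headers={"User-Agent": "textnano/0.1"}) as client:

        async def fetch(url):
            try:
                resp = await client.get(url, follow_redirects=True)
                resp.raise_for_status()
                return url, resp.text
            except httpx.HTTPError:
                return url, None

        pairs = await asyncio.gather(*(fetch(u) for u in urls))
    return dict(pairs)

# pages = asyncio.run(fetch_all(["https://www.gutenberg.org/files/74/74-h/74-h.htm", "..."]))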
error handler
except Exception:
return None
Consider logging elapsed time, url, and the numeric 400- or 500-status that came back. For example, we may want to know the site throttled us due to too many requests.
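A sketch of what that could record, assuming a module-level logger and keeping the broad except as the backstop (headers and SSL context elided for brevity; clean_html is the existing helper):
import logging
import time
import urllib.error
import urllib.request

logger = logging.getLogger("textnano")

def download_text(url, timeout=30):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            content = response.read().decode("utf-8", errors="ignore")
    except urllib.error.HTTPError as exc:      # 4xx/5xx: the status code is available
        logger.warning("HTTP %s from %s after %.1fs", exc.code, url, time.monotonic() - start)
        return None
    except Exception as exc:                   # DNS failure, timeout, TLS trouble, ...
        logger.warning("%r from %s after %.1fs", exc, url, time.monotonic() - start)
        return None
    return clean_html(content)                 # clean_html as defined in core.py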
slop comments
# Unescape HTML entities
...
# Remove script and style tags
...
# Remove HTML tags
...
# Normalize whitespace
...
# Remove leading/trailing whitespace
Yeah, yeah, we were vibing with an LLM, I get it.
But those remarks are vacuous and are redundant with
what the source code assignments eloquently state.
Delete such remarks prior to a git commit.
unit tests
I didn't see any. Automated tests should be exercising those regexes, to verify they behave as you think they behave.
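A starting point, assuming pytest and that the functions are importable as textnano.core:
# test_clean_html.py
from textnano.core import clean_html, text_fingerprint

def test_strips_script_and_style():
    page = "<html><script>var x = 1;</script><style>p {color: red}</style><p>Hello</p></html>"
    assert clean_html(page) == "Hello"

def test_collapses_whitespace():
    assert clean_html("<p>a</p>\n\n   <p>b</p>") == "a b"

def test_fingerprint_is_case_insensitive():
    assert text_fingerprint("Hello World") == text_fingerprint("hello world")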
fingerprints
Summarizing a blog post or a novel by just its eight initial words is, ummmm, surprising. I'm especially worried that boilerplate website navbar, or repeated license / copyright notice, will make many documents on a given site appear "identical".
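A stdlib-only alternative, at the cost of only catching exact (post-normalization) duplicates rather than near-duplicates:
import hashlib

def text_fingerprint(text):
    """Hash the whole document after light normalization, not just the first 8 words."""
    normalized = " ".join(text.lower().split())        # case-fold and collapse whitespace
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()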
unused param
def is_duplicate(text, seen_fingerprints, threshold=0.8):
Remove the unused threshold, please.
No need to have an IDE auto-complete that for some hapless app author.
one-liner
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or [])) | set(DEFAULT_EXCLUDE_EXTENSIONS)
This is not at all easy to read. Prefer to let black -S *.py worry about laying out code within some reasonable line width:
exclude_extensions = (
set(ext.lower().lstrip('.') for ext in (exclude_extensions or []))
) | set(DEFAULT_EXCLUDE_EXTENSIONS)
design of Public API
Please elide the use_default_excludes parameter.
Better to let caller pass in DEFAULT_EXCLUDE_DOMAINS
and/or DEFAULT_EXCLUDE_EXTENSIONS.
Also, I wish we were pulling one or both of those in from lazynlp, rather than copy-n-pasting them into this project.
pathlib
... = open(os.path.join(output_dir, 'success.txt'), 'w')
... = open(os.path.join(output_dir, 'failed.txt'), 'w')
The ancient os.path module works well enough.
But when authoring new code, prefer from pathlib import Path.
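For instance (a sketch; _open_logs is an invented helper name):
from pathlib import Path

def _open_logs(output_dir):
    """Same file layout as before, expressed with pathlib."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)     # replaces os.makedirs(output_dir, exist_ok=True)
    return (out / "success.txt").open("w"), (out / "failed.txt").open("w")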
imports at top
# Check exclusion filters
from urllib.parse import urlparse
parsed = urlparse(url)
No.
Put the import where it belongs.
Also, the comment suggests that we should Extract Helper function,
in the hopes of letting the body of the for loop appear in a single screenful
without vertical scrolling.
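Something along these lines, with the import hoisted to the top of the module (should_exclude is an invented name; it mirrors the existing filter logic):
from urllib.parse import urlparse    # at module top, next to the other imports

def should_exclude(url, exclude_domains, exclude_extensions):
    """Return a reason string if the URL should be skipped, else None."""
    parsed = urlparse(url)
    if any(domain in parsed.netloc for domain in exclude_domains):
        return "excluded_domain"
    path_lower = parsed.path.lower()
    if any(path_lower.endswith(f".{ext}") for ext in exclude_extensions):
        return "excluded_extension"
    return None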
def main():
"""Command-line interface."""
import sys
import argparse
Again, no.
Those two belong at top of module.
import logging
failed_log.write(f"{url}\texcluded_domain\n")
Consider doing that with a logger,
so you consistently get automatic timestamped log entries.
That will help you understand why a given crawl was fast or slow.
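A minimal sketch, assuming the stdlib logging module alongside (or instead of) the failed.txt file:
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("textnano")

url = "https://example.com/page"               # placeholder
logger.info("excluded_domain %s", url)         # timestamp added automatically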
cracking argv
if len(sys.argv) >= 2 and sys.argv[1] == 'stats':
if len(sys.argv) < 3:
...
if len(sys.argv) >= 2 and sys.argv[1] == 'merge':
if len(sys.argv) < 4:
I imagine those work properly?
But it seems like you're working too hard.
Why didn't argparse deal with optional items for you already?
When I import typer I always get appropriate CLI diagnostics
displayed automatically, without jumping through such hoops.
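Sticking with the stdlib, argparse sub-commands already cover this; a sketch (build_parser is an invented name, and the sub-command names follow the README):
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="textnano",
                                     description="Minimal text dataset builder")
    sub = parser.add_subparsers(dest="command", required=True)

    p_urls = sub.add_parser("urls", help="build a dataset from a URL list")
    p_urls.add_argument("url_file")
    p_urls.add_argument("output_dir")
    p_urls.add_argument("max_urls", nargs="?", type=int, default=None)

    p_stats = sub.add_parser("stats", help="show dataset statistics")
    p_stats.add_argument("dir")

    p_merge = sub.add_parser("merge", help="merge datasets, deduplicating")
    p_merge.add_argument("inputs", nargs="+")
    p_merge.add_argument("output_dir")

    return parser

args = build_parser().parse_args()   # then dispatch on args.command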
be lazy
After you (quickly) ship version 0.1.0,
I urge you to consider using uv to manage dependencies
listed in pyproject.toml, such as httpx.
Consider adding a make install Makefile, or a shell script,
that shows how to pull in deps and assemble a small text corpus.
This project should focus on its core value-add, which is managing large text datasets. To the extent that you can outsource any of the network minutiae to some well tested library that has already worked out the details, I encourage you to do so.
Comments:
- Added logging; switched to argparse. (Rustem K, Oct 30 at 5:16)
- Added parallel implementation with httpx. (Rustem K, Oct 30 at 5:50)
- "remarks are vacuous and are redundant with what the source code assignments eloquently state" I disagree. The source code doesn't actually do what the comments say, so the comments are the way to know the current behavior is a BUG not intended. (Ben Voigt, Oct 31 at 15:19)
You have docstrings and comments that are very useful. Your code also seems to be very well organized and structured. The API is simple enough. My only questions/suggestions are:
Security Concerns
You specify:
# Create SSL context that doesn't verify certificates
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
This exposes you to man-in-the-middle attacks. You don't care?
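If occasional certificate failures were the motivation, a narrower approach is to keep verification on by default and make any opt-out explicit and per-call; a sketch (make_context is an invented name):
import ssl

def make_context(verify=True):
    """Verify certificates and hostnames by default; allow a visible opt-out."""
    context = ssl.create_default_context()
    if not verify:
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context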
Performance
Function download_and_clean reads in a number of URLs and proceeds to download each URL serially. Performance could be greatly enhanced if you used a multithreading pool (either class multiprocessing.pool.ThreadPool or concurrent.futures.ThreadPoolExecutor) to download the URLs more concurrently (N downloads running concurrently where you can choose the value of N).
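A sketch with concurrent.futures (download_many is an invented name; download_text is the existing function). Note that the numbering, deduplication, and file writing should stay in the single consumer loop:
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_many(urls, workers=8):
    """Yield (url, text_or_None) pairs as downloads finish, N at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(download_text, url): url for url in urls}
        for future in as_completed(futures):
            yield futures[future], future.result()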
Edge Cases
In your desire to depend only on the standard library you have implemented function clean_html to "remove HTML tags and clean text." I can only assume your purpose is to extract the text from the HTML, e.g. the textual contents of a <p> tag. But this cannot be done correctly with regular expressions; you need an actual HTML parser (the standard library does ship a bare-bones one in html.parser, while third-party parsers such as lxml or BeautifulSoup are far more forgiving).
The first thing you do is html.unescape the text. The user may have in part:
<p>
You should not use the &lt;H1&gt; tag. For example, &lt;H1&gt;Some Title&lt;/H1&gt;
<p>
After un-escaping you end up with:
<p>
You should not use the <H1> tag. For example, <H1>Some Title</H1>
<p>
But then you execute on this result text = re.sub(r'<[^>]+>', '', text), which produces:
You should not use the tag. For example, Some Title
This is probably not what you would want. If you were to first remove tags and then un-escape the results, it would be an improvement -- but far from perfect.
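If staying dependency-free matters, the standard library's html.parser at least tokenizes tags and decodes entities in the right order (though it is far less forgiving of broken markup than lxml or BeautifulSoup). A sketch of a drop-in replacement for clean_html:
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)      # entities arrive here already decoded, as text

    def text(self):
        return " ".join(" ".join(self._chunks).split())

def clean_html(html_content):
    parser = _TextExtractor()
    parser.feed(html_content)
    return parser.text()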
Function text_fingerprint
Why create a digest of only the first 8 "words" (i.e. tokens created by splitting on whitespace)? Is it too "costly" to create a digest on the entire text? You do recover some processing time on not having to split the text. Thanks to Stef for making the point that it is not necessary to split the entire text just to get the first N words:
words = text.lower().split(maxsplit=n)[:n]
Function download_and_clean
You have:
# Normalize filters
if use_default_excludes:
exclude_domains = set(exclude_domains or []) | set(DEFAULT_EXCLUDE_DOMAINS)
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or [])) | set(DEFAULT_EXCLUDE_EXTENSIONS)
else:
exclude_domains = set(exclude_domains or [])
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or []))
Based on the DRY Principle this should be expressed as:
# Normalize filters
exclude_domains = set(exclude_domains or [])
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or []))
if use_default_excludes:
exclude_domains |= set(DEFAULT_EXCLUDE_DOMAINS)
exclude_extensions |= set(DEFAULT_EXCLUDE_EXTENSIONS)
Later on you have:
# Open log files
success_log = open(os.path.join(output_dir, 'success.txt'), 'w')
failed_log = open(os.path.join(output_dir, 'failed.txt'), 'w')
...
# Cleanup
success_log.close()
failed_log.close()
You have been using context managers to ensure files are closed properly even after exceptions. Why not here?
with open(os.path.join(output_dir, 'success.txt'), 'w') as success_log, \
open(os.path.join(output_dir, 'failed.txt'), 'w') as failed_log:
...
This function has:
# Print summary
print_stats(stats)
return stats
Your main function calls this function but does not do anything with the returned value. Do you really need to both print and return stats?
Comments:
- Regarding text_fingerprint: words = text.lower().split()[:n] can be replaced with words = text.lower().split(maxsplit=n)[:n] to avoid splitting the whole text when only the first few words are wanted. (Stef, Oct 29 at 13:08)
- @Stef Good point! Thanks. (Booboo, Oct 29 at 14:54)
- Although I don't actually know how much this actually saves: it still cuts the text into n+1 strings, and presumably the last string, which we don't need, will be very long, and I don't know whether it's copied or if Python is smart enough to use the same underlying char array as the original text. (Stef, Oct 30 at 9:00)
- Further evidence in favor of "cannot be done correctly without using an actual HTML parser" is what the code currently does if there are multiple script or multiple style tags, not contiguous. (Ben Voigt, Oct 31 at 15:22)
Overview
The code layout is good, and you added ample documentation with usage examples.
try/except
In the download_text function, the except statements are many lines away from the try lines.
PEP 8 recommends that you limit the try clause to the absolute minimum amount
of code necessary to avoid masking bugs. It is hard to keep track of what
line (or lines) are expected to result in the exception.
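A sketch of the same function with the try clause wrapped tightly around the network call only (clean_html is the existing helper):
import ssl
import urllib.error
import urllib.request

def download_text(url, timeout=30):
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = urllib.request.Request(url, headers=headers)
    context = ssl.create_default_context()
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=context) as response:
            content = response.read().decode('utf-8', errors='ignore')
    except (urllib.error.URLError, TimeoutError):
        return None
    text = clean_html(content)                  # nothing past this point is expected to raise
    return text if text.strip() else None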
Import
The ruff tool identifies this line as unused:
from pathlib import Path
It can be deleted.
Consider moving the import line from the download_and_clean function to the top of the code:
from urllib.parse import urlparse
Documentation
It is great that you have docstrings for your functions, as recommended by the PEP 8 style guide.
Also consider using type hints to describe input and return types for the functions to make the code more self-documenting.
Command line
It seems redundant to use both argv and argparse. I see the comment about
backward compatibility, but I think you should try to only use argparse.
Portability
I'm not a big fan of fancy Unicode characters in source code,
like the characters in the download_and_clean function docstring.
Sometimes they don't render well in editors, and other times
they don't render well in output generated by the code.
Returning None
If we look at download_text there is an opportunity.
def download_text(url, timeout=30):
    """Download and extract text from a URL.
    Returns:
        str or None: Cleaned text content, or None if failed
    """
    try:
        # Download
        headers = {'User-Agent': 'Mozilla/5.0'}
        req = urllib.request.Request(url, headers=headers)
        # Create SSL context that doesn't verify certificates
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        with urllib.request.urlopen(req, timeout=timeout, context=context) as response:
            content = response.read().decode('utf-8', errors='ignore')
        # Basic HTML cleaning
        text = clean_html(content)
        return text if text.strip() else None
    except Exception:
        return None
When control flow hits the end of a Python function without hitting an explicit return of some value, None is returned. Thus your code can be:
def download_text(url, timeout=30):
"""Download and extract text from a URL.
Returns:
str or None: Cleaned text content, or None if failed
"""
try:
# Download
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=headers)
# Create SSL context that doesn't verify certificates
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
with urllib.request.urlopen(req, timeout=timeout, context=context) as response:
content = response.read().decode('utf-8', errors='ignore')
# Basic HTML cleaning
text = clean_html(content)
if text.strip():
return text
except Exception:
pass