Context
I'm returning to Python development after a break and actively learning AI/ML. As part of my learning journey, I'm building a dozen pet projects to strengthen my skills. This is my first project: textnano - a minimal text dataset builder inspired by lazynlp.
Purpose: Educational tool for ML students who want to quickly create clean text datasets from various sources (Wikipedia, Reddit, Project Gutenberg, etc.) without dealing with complex dependencies.
Next steps: I plan to build a similar library for crawling audio data, with the end goal of training a voice cloning model from scratch.
Project Repository: https://github.com/Rustem/textnano
What I'm Looking For
I would appreciate feedback on:
- Code organization and structure
- Python best practices and idioms
- Error handling and edge cases
- Function API design and usability
- Performance considerations
- Any security concerns with web scraping
Key Design Principles
- Zero dependencies - Uses only Python standard library
- Simple API - Easy for beginners to understand
- Educational focus - Code should be readable and well-commented
- Lightweight - ~200 lines of code total
Installation:
# Install from source
pip install -e .
# Or install from PyPI (when published)
pip install textnano
Usage:
# Wikipedia (requires wikiextractor preprocessing)
# 1. Install wikiextractor: pip install wikiextractor
# 2. Extract from dump: python -m wikiextractor.WikiExtractor enwiki-latest.xml.bz2 --json -o wiki_json/
# 3. Extract URLs:
textnano wikipedia wiki_json/ --output wikipedia_urls.txt --max 10000
# 4. Build dataset:
textnano urls wikipedia_urls.txt wiki_dataset/
# Reddit (from pre-extracted URL files)
# 1. Download from: https://drive.google.com/file/d/1hRtA3zZ0K5UHKOQ0_8d0BIc_1VyxgY51/view
# 2. Extract and merge URLs:
textnano reddit reddit_urls/ --output reddit_urls.txt --max 5000
# 3. Build dataset:
textnano urls reddit_urls.txt reddit_dataset/
# Project Gutenberg
# 1. Generate URLs (checks each book ID):
textnano gutenberg --output gutenberg_urls.txt --max-id 1000
# 2. Build dataset:
textnano urls gutenberg_urls.txt books_dataset/
Please review core.py:
#!/usr/bin/env python3
"""
textnano.py - Minimal text dataset builder (nano lazynlp)
A single-file library to build text datasets from web URLs.
Perfect for ML students who just want clean text quickly.
Usage:
python textnano.py urls.txt output/
Or in code:
import textnano
textnano.download_and_clean('urls.txt', 'output/')
Dependencies: ZERO (pure Python stdlib)
Lines of code: ~200
"""
import os
import re
import html
import urllib.request
import hashlib
import ssl
from pathlib import Path
from .config import DEFAULT_EXCLUDE_DOMAINS, DEFAULT_EXCLUDE_EXTENSIONS
from .utils import print_stats, estimate_dataset_size, merge_datasets
# =============================================================================
# DOWNLOAD
# =============================================================================
def download_text(url, timeout=30):
"""Download and extract text from a URL.
Returns:
str or None: Cleaned text content, or None if failed
"""
try:
# Download
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=headers)
# Create SSL context that doesn't verify certificates
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
with urllib.request.urlopen(req, timeout=timeout, context=context) as response:
content = response.read().decode('utf-8', errors='ignore')
# Basic HTML cleaning
text = clean_html(content)
return text if text.strip() else None
except Exception:
return None
# =============================================================================
# CLEANING
# =============================================================================
def clean_html(html_content):
"""Remove HTML tags and clean text.
Args:
html_content: Raw HTML string
Returns:
str: Clean text
"""
# Unescape HTML entities
text = html.unescape(html_content)
# Remove script and style tags
text = re.sub(r'<script[^>]*>.*?</script>', '', text, flags=re.DOTALL | re.IGNORECASE)
text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL | re.IGNORECASE)
# Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Normalize whitespace
text = re.sub(r'\s+', ' ', text)
text = re.sub(r'\n\s*\n', '\n\n', text)
# Remove leading/trailing whitespace
text = text.strip()
return text
# =============================================================================
# DEDUPLICATION
# =============================================================================
def text_fingerprint(text, n=8):
"""Create fingerprint of text using first N words.
Args:
text: Input text
n: Number of words to use (default: 8)
Returns:
str: MD5 hash of first N words
"""
words = text.lower().split()[:n]
fingerprint_text = ' '.join(words)
return hashlib.md5(fingerprint_text.encode()).hexdigest()
def is_duplicate(text, seen_fingerprints, threshold=0.8):
"""Check if text is duplicate based on fingerprint.
Args:
text: Text to check
seen_fingerprints: Set of seen fingerprints
threshold: Not used in this simple version
Returns:
bool: True if duplicate
"""
fp = text_fingerprint(text)
if fp in seen_fingerprints:
return True
seen_fingerprints.add(fp)
return False
# =============================================================================
# MAIN PIPELINE
# =============================================================================
def download_and_clean(url_file, output_dir, min_words=50, max_urls=None,
exclude_domains=None, exclude_extensions=None,
use_default_excludes=True):
"""Download text from URLs, clean, and deduplicate.
Args:
url_file: Path to file with one URL per line
output_dir: Directory to save text files
min_words: Minimum words per document (default: 50)
max_urls: Maximum URLs to process (default: None = all)
exclude_domains: List of domains to exclude (default: None, uses defaults if use_default_excludes=True)
exclude_extensions: List of file extensions to exclude (default: None, uses defaults if use_default_excludes=True)
use_default_excludes: Use default exclusion lists (default: True)
Output structure:
output_dir/
├── 0001.txt # Text files
├── 0002.txt
├── success.txt # Successfully processed URLs
└── failed.txt # Failed URLs
Returns:
dict: Statistics {success: int, failed: int, duplicates: int}
"""
# Setup
os.makedirs(output_dir, exist_ok=True)
# Normalize filters
if use_default_excludes:
exclude_domains = set(exclude_domains or []) | set(DEFAULT_EXCLUDE_DOMAINS)
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or [])) | set(DEFAULT_EXCLUDE_EXTENSIONS)
else:
exclude_domains = set(exclude_domains or [])
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or []))
# Read URLs
with open(url_file) as f:
urls = [line.strip() for line in f if line.strip()]
if max_urls:
urls = urls[:max_urls]
# Open log files
success_log = open(os.path.join(output_dir, 'success.txt'), 'w')
failed_log = open(os.path.join(output_dir, 'failed.txt'), 'w')
# Deduplication
seen_fingerprints = set()
# Counters
stats = {'success': 0, 'failed': 0, 'duplicates': 0, 'too_short': 0, 'excluded': 0}
# Process each URL
print(f"Processing {len(urls)} URLs...")
for idx, url in enumerate(urls, 1):
print(f"[{idx}/{len(urls)}] {url[:60]}...")
# Check exclusion filters
from urllib.parse import urlparse
parsed = urlparse(url)
# Check domain exclusion
if exclude_domains and any(domain in parsed.netloc for domain in exclude_domains):
failed_log.write(f"{url}\texcluded_domain\n")
stats['excluded'] += 1
print(" ⊘ Excluded domain")
continue
# Check extension exclusion
if exclude_extensions:
path_lower = parsed.path.lower()
if any(path_lower.endswith(f'.{ext}') for ext in exclude_extensions):
failed_log.write(f"{url}\texcluded_extension\n")
stats['excluded'] += 1
print(" ⊘ Excluded extension")
continue
# Download
text = download_text(url)
if not text:
failed_log.write(f"{url}\n")
stats['failed'] += 1
print(" ✗ Failed to download")
continue
# Check length
word_count = len(text.split())
if word_count < min_words:
failed_log.write(f"{url}\ttoo_short:{word_count}\n")
stats['too_short'] += 1
print(f" ⊘ Too short ({word_count} words)")
continue
# Check duplicate
if is_duplicate(text, seen_fingerprints):
stats['duplicates'] += 1
print(" ⊘ Duplicate")
continue
# Save
output_file = os.path.join(output_dir, f"{stats['success']+1:04d}.txt")
with open(output_file, 'w') as f:
f.write(f"{url}\n\n") # First line = URL
f.write(text)
success_log.write(f"{url}\n")
stats['success'] += 1
print(f" ✓ Saved ({word_count} words)")
# Cleanup
success_log.close()
failed_log.close()
# Print summary
print_stats(stats)
return stats
# =============================================================================
# CLI
# =============================================================================
def main():
"""Command-line interface."""
import sys
import argparse
# Check for simple commands (backward compatibility)
if len(sys.argv) >= 2 and sys.argv[1] == 'stats':
if len(sys.argv) < 3:
print("Usage: textnano stats <dir>")
sys.exit(1)
stats = estimate_dataset_size(sys.argv[2])
print(f"Files: {stats['files']}")
print(f"Words: {stats['words']:,}")
print(f"Size: {stats['mb']:.1f} MB")
print(f"Avg/file: {stats['avg_words_per_file']} words")
return
if len(sys.argv) >= 2 and sys.argv[1] == 'merge':
if len(sys.argv) < 4:
print("Usage: textnano merge <dir1> <dir2> ... <output_dir>")
sys.exit(1)
output = sys.argv[-1]
inputs = sys.argv[2:-1]
merge_datasets(*inputs, output_dir=output, is_duplicate_func=is_duplicate)
return
# Parse arguments
parser = argparse.ArgumentParser(
description='textnano - Minimal text dataset builder',
formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument('url_file', help='File with URLs (one per line)')
parser.add_argument('output_dir', help='Output directory')
parser.add_argument('max_urls', nargs='?', type=int, default=None,
help='Maximum URLs to process')
parser.add_argument('--exclude-domains', '-ed', nargs='+',
help='Additional domains to exclude (adds to defaults)')
parser.add_argument('--exclude-extensions', '-ee', nargs='+',
help='Additional file extensions to exclude (adds to defaults)')
parser.add_argument('--no-default-excludes', action='store_true',
help='Disable default exclusion lists (only use custom excludes)')
args = parser.parse_args()
# Download command
stats = download_and_clean(
args.url_file,
args.output_dir,
max_urls=args.max_urls,
exclude_domains=args.exclude_domains,
exclude_extensions=args.exclude_extensions,
use_default_excludes=not args.no_default_excludes
)
# Show dataset stats
dataset_stats = estimate_dataset_size(args.output_dir)
print(f"\nDataset: {dataset_stats['files']} files, "
f"{dataset_stats['words']:,} words, "
f"{dataset_stats['mb']:.1f} MB")
if __name__ == '__main__':
main()
# =============================================================================
# USAGE EXAMPLES (copy these to test)
# =============================================================================
"""
# Example 1: Basic usage
python textnano.py urls.txt dataset/
# Example 2: Limit to 100 URLs
python textnano.py urls.txt dataset/ 100
# Example 3: In Python
import textnano
textnano.download_and_clean('urls.txt', 'output/')
stats = textnano.estimate_dataset_size('output/')
print(f"Got {stats['words']:,} words")
# Example 4: Create sample URLs file
cat > urls.txt << EOF
https://en.wikipedia.org/wiki/Machine_learning
https://en.wikipedia.org/wiki/Deep_learning
https://en.wikipedia.org/wiki/Natural_language_processing
https://en.wikipedia.org/wiki/Computer_vision
https://www.gutenberg.org/files/1342/1342-h/1342-h.htm
EOF
# Example 5: Get stats
python textnano.py stats dataset/
# Example 6: Merge datasets
python textnano.py merge dataset1/ dataset2/ merged/
"""
Comments:
- Not sure if it is in the scope of code review, but using a non-descriptive user agent and not following robots.txt is bad behavior. (jpa, Oct 29 at 16:46)
- Addressed most of the feedback. (Rustem K, Oct 30 at 5:50)
- Thank you everyone! (Rustem K, Oct 30 at 5:52)
- @RustemK: Since you found the answers helpful, feel free to Accept one of them. (toolic, Nov 1 at 12:54)
4 Answers
comments
Lines of code: ~200
I'm sure that was true when you wrote it. Looks like you've added a hundred or so lines since then. Comments tend to bit rot as the codebase evolves, which is why we might be reluctant to put too many specifics into them.
deps
Dependencies: ZERO
This is certainly true.
I get it that we had a design goal.
I'm not sure I agree with the goal.
It's the rare project that doesn't need its own .venv/
Someone who uses this will likely want to pull in
the good old familiar requests package before long.
And I can't imagine why you wouldn't want some help
with throttling request rate or obeying /robots.txt,
since they're essential and they are not this project's core concerns.
It's just table stakes -- a big crawler has to be a good netizen.
I get the sense that you wished to avoid depending upon
lazynlp.
But I wish that you had.
More on that below.
Then we would have an up-to-date TLDextract dep, plus justext,
which I think is the big thing you wanted to evict from the deps
and which seems a nice enough library to me.
BTW, though it's not published on pypi,
you can still use a GitHub repo URL to depend on lazynlp.
You can even bake in a particular immutable commit hash.
pypi
Recommend you publish version 0.1.0 sooner than later, to reserve the project name.
Recommend that you git rm setup.py.
In the modern packaging ecosystem
we simply rely on the pyproject.toml config,
which looks good to me.
Keeping redundant
boilerplate in two files instead of one seems undesirable.
ASCII art
Recommend you avoid long decorative ========== lines.
Plus, most of those comments are redundant with
the (well chosen) function names.
The only place where they're helpfully organizing things is for dedup.
Recommend that you take advantage of the language's ability
to organize code concepts,
by evicting text_fingerprint() and is_duplicate()
to a new dedup.py module.
BTW thank you, the utils.py module looks well organized.
type annotation
def download_text(url, timeout=30):
"""Download and extract text from a URL.
Returns:
str or None: Cleaned text content, or None if failed
"""
Rather than offer some "str or None" narrative text
for humans to read, prefer to put -> str | None: in the signature.
Then everyone can read it, including mypy and pyrefly.
Down in clean_html() it's enough to say we accept and return str values.
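A sketch of what that could look like, assuming Python 3.10+ for the | union syntax (older interpreters would use typing.Optional[str]):
def download_text(url: str, timeout: int = 30) -> str | None:
    """Download and extract text from a URL; None means the fetch failed."""
    ...

def clean_html(html_content: str) -> str:
    """Remove HTML tags and normalize whitespace."""
    ...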
design of Public API
It's unclear why return None is an advantage to the caller.
Sounds like it's just one more thing to check, one more possible pitfall.
The single call site already does essentially:
if text is None or len(text)==0: # report download failure
But of course a URL could plausibly give an empty 200 Success document.
I'm just not seeing how distinguishing None from "" helps us.
request pool
We don't appear to be holding open a port 443 connection
when e.g. we ask Gutenberg for Tom Sawyer and then for Huck Finn.
Given that this project is fundamentally about interacting
with web servers, I have to disagree with your decision to
jettison the familiar and helpful requests package.
The only reason it's not in Batteries Included is it needs
to release more often than annual interpreter releases,
given how Internet protocols and conditions keep changing so often.
Creating lots of SSL contexts seems needlessly painful
and nitty gritty. Just pass in a verify=False parameter
and let the network library sweat those details.
Consider using import protego
for help with robots.txt and request rate.
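A rough sketch of what that could look like, assuming the requests and protego packages; polite_get and allowed_by_robots are invented names, and the user-agent string is just a placeholder:
import time
import requests
from urllib.parse import urlparse
from protego import Protego

USER_AGENT = "textnano/0.1 (+https://github.com/Rustem/textnano)"   # descriptive, identifiable UA

session = requests.Session()                  # reuses the TLS connection between requests
session.headers["User-Agent"] = USER_AGENT
_robots_cache = {}                            # host -> parsed robots.txt

def allowed_by_robots(url):
    """Fetch and cache robots.txt for the URL's host, then ask protego."""
    host = urlparse(url).netloc
    if host not in _robots_cache:
        try:
            body = session.get(f"https://{host}/robots.txt", timeout=10).text
        except requests.RequestException:
            body = ""                         # no robots.txt reachable: allow everything
        _robots_cache[host] = Protego.parse(body)
    return _robots_cache[host].can_fetch(url, USER_AGENT)

def polite_get(url, timeout=30):
    """Return page text, or None if robots.txt disallows it or the request fails."""
    if not allowed_by_robots(url):
        return None
    delay = _robots_cache[urlparse(url).netloc].crawl_delay(USER_AGENT)
    if delay:
        time.sleep(delay)                     # honour Crawl-delay when the site sets one
    try:
        resp = session.get(url, timeout=timeout)
        resp.raise_for_status()               # certificates are verified by default
        return resp.text
    except requests.RequestException:
        return None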
async
Ship version 0.1.0 to pypi using the current design.
But consider relying on import httpx in a subsequent iteration,
so you can have e.g. a connection to Reddit and a connection to Gutenberg
getting useful download work done concurrently.
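A sketch of that later iteration, assuming httpx as the client (fetch_all is an invented name):
import asyncio
import httpx

async def fetch_all(urls, concurrency=8):
    """Download several URLs concurrently; a None value marks a failed fetch."""
    limits = httpx.Limits(max_connections=concurrency)
    async with httpx.AsyncClient(limits=limits, timeout=30.0,
                                 headers={"User-Agent": "textnano/0.1"}) as client:

        async def fetch(url):
            try:
                resp = await client.get(url, follow_redirects=True)
                resp.raise_for_status()
                return url, resp.text
            except httpx.HTTPError:
                return url, None

        pairs = await asyncio.gather(*(fetch(u) for u in urls))
    return dict(pairs)

# pages = asyncio.run(fetch_all(["https://www.gutenberg.org/files/74/74-h/74-h.htm", "..."]))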
error handler
except Exception:
return None
Consider logging elapsed time, url, and the numeric 400- or 500-status that came back. For example, we may want to know the site throttled us due to too many requests.
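A sketch of what that could record, assuming a module-level logger and keeping the broad except as the backstop (headers and SSL context elided for brevity; clean_html is the existing helper):
import logging
import time
import urllib.error
import urllib.request

logger = logging.getLogger("textnano")

def download_text(url, timeout=30):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            content = response.read().decode("utf-8", errors="ignore")
    except urllib.error.HTTPError as exc:      # 4xx/5xx: the status code is available
        logger.warning("HTTP %s from %s after %.1fs", exc.code, url, time.monotonic() - start)
        return None
    except Exception as exc:                   # DNS failure, timeout, TLS trouble, ...
        logger.warning("%r from %s after %.1fs", exc, url, time.monotonic() - start)
        return None
    return clean_html(content)                 # clean_html as defined in core.py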
slop comments
# Unescape HTML entities
...
# Remove script and style tags
...
# Remove HTML tags
...
# Normalize whitespace
...
# Remove leading/trailing whitespace
Yeah, yeah, we were vibing with an LLM, I get it.
But those remarks are vacuous and are redundant with
what the source code assignments eloquently state.
Delete such remarks prior to a git commit.
unit tests
I didn't see any. Automated tests should be exercising those regexes, to verify they behave as you think they behave.
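A starting point, assuming pytest and that the functions are importable as textnano.core:
# test_clean_html.py
from textnano.core import clean_html, text_fingerprint

def test_strips_script_and_style():
    page = "<html><script>var x = 1;</script><style>p {color: red}</style><p>Hello</p></html>"
    assert clean_html(page) == "Hello"

def test_collapses_whitespace():
    assert clean_html("<p>a</p>\n\n   <p>b</p>") == "a b"

def test_fingerprint_is_case_insensitive():
    assert text_fingerprint("Hello World") == text_fingerprint("hello world")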
fingerprints
Summarizing a blog post or a novel by just its eight initial words is, ummmm, surprising. I'm especially worried that boilerplate website navbar, or repeated license / copyright notice, will make many documents on a given site appear "identical".
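A stdlib-only alternative, at the cost of only catching exact (post-normalization) duplicates rather than near-duplicates:
import hashlib

def text_fingerprint(text):
    """Hash the whole document after light normalization, not just the first 8 words."""
    normalized = " ".join(text.lower().split())        # case-fold and collapse whitespace
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()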
unused param
def is_duplicate(text, seen_fingerprints, threshold=0.8):
Remove the unused threshold, please.
No need to have an IDE auto-complete that for some hapless app author.
one-liner
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or [])) | set(DEFAULT_EXCLUDE_EXTENSIONS)
This is not at all easy to read. Prefer to let black -S *.py worry about laying out code within some reasonable line width:
exclude_extensions = (
set(ext.lower().lstrip('.') for ext in (exclude_extensions or []))
) | set(DEFAULT_EXCLUDE_EXTENSIONS)
design of Public API
Please elide the use_default_excludes parameter.
Better to let caller pass in DEFAULT_EXCLUDE_DOMAINS
and/or DEFAULT_EXCLUDE_EXTENSIONS.
Also, I wish we were pulling one or both of those in from lazynlp, rather than copy-n-pasting them into this project.
pathlib
... = open(os.path.join(output_dir, 'success.txt'), 'w')
... = open(os.path.join(output_dir, 'failed.txt'), 'w')
The ancient os.path module works well enough.
But when authoring new code, prefer from pathlib import Path.
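For instance (a sketch; _open_logs is an invented helper name):
from pathlib import Path

def _open_logs(output_dir):
    """Same file layout as before, expressed with pathlib."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)     # replaces os.makedirs(output_dir, exist_ok=True)
    return (out / "success.txt").open("w"), (out / "failed.txt").open("w")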
imports at top
# Check exclusion filters
from urllib.parse import urlparse
parsed = urlparse(url)
No.
Put the import where it belongs.
Also, the comment suggests that we should Extract Helper function,
in the hopes of letting the body of the for loop appear in a single screenful
without vertical scrolling.
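Something along these lines, with the import hoisted to the top of the module (should_exclude is an invented name; it mirrors the existing filter logic):
from urllib.parse import urlparse    # at module top, next to the other imports

def should_exclude(url, exclude_domains, exclude_extensions):
    """Return a reason string if the URL should be skipped, else None."""
    parsed = urlparse(url)
    if any(domain in parsed.netloc for domain in exclude_domains):
        return "excluded_domain"
    path_lower = parsed.path.lower()
    if any(path_lower.endswith(f".{ext}") for ext in exclude_extensions):
        return "excluded_extension"
    return None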
def main():
"""Command-line interface."""
import sys
import argparse
Again, no.
Those two belong at top of module.
import logging
failed_log.write(f"{url}\texcluded_domain\n")
Consider doing that with a logger,
so you consistently get automatic timestamped log entries.
That will help you understand why a given crawl was fast or slow.
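A minimal sketch, assuming the stdlib logging module alongside (or instead of) the failed.txt file:
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("textnano")

url = "https://example.com/page"               # placeholder
logger.info("excluded_domain %s", url)         # timestamp added automatically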
cracking argv
if len(sys.argv) >= 2 and sys.argv[1] == 'stats':
if len(sys.argv) < 3:
...
if len(sys.argv) >= 2 and sys.argv[1] == 'merge':
if len(sys.argv) < 4:
I imagine those work properly?
But it seems like you're working too hard.
Why didn't argparse deal with optional items for you already?
When I import typer I always get appropriate CLI diagnostics
displayed automatically, without jumping through such hoops.
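Sticking with the stdlib, argparse sub-commands already cover this; a sketch (build_parser is an invented name, and the sub-command names follow the README):
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="textnano",
                                     description="Minimal text dataset builder")
    sub = parser.add_subparsers(dest="command", required=True)

    p_urls = sub.add_parser("urls", help="build a dataset from a URL list")
    p_urls.add_argument("url_file")
    p_urls.add_argument("output_dir")
    p_urls.add_argument("max_urls", nargs="?", type=int, default=None)

    p_stats = sub.add_parser("stats", help="show dataset statistics")
    p_stats.add_argument("dir")

    p_merge = sub.add_parser("merge", help="merge datasets, deduplicating")
    p_merge.add_argument("inputs", nargs="+")
    p_merge.add_argument("output_dir")

    return parser

args = build_parser().parse_args()   # then dispatch on args.command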
be lazy
After you (quickly) ship version 0.1.0,
I urge you to consider using uv to manage dependencies
listed in pyproject.toml, such as httpx.
Consider adding a make install Makefile, or a shell script,
that shows how to pull in deps and assemble a small text corpus.
This project should focus on its core value-add, which is managing large text datasets. To the extent that you can outsource any of the network minutiae to some well tested library that has already worked out the details, I encourage you to do so.
Comments:
- Added logging; switched to argparse. (Rustem K, Oct 30 at 5:16)
- Added parallel implementation with httpx. (Rustem K, Oct 30 at 5:50)
- "remarks are vacuous and are redundant with what the source code assignments eloquently state" I disagree. The source code doesn't actually do what the comments say, so the comments are the way to know the current behavior is a BUG not intended. (Ben Voigt, Oct 31 at 15:19)
You have docstrings and comments that are very useful. Your code also seems to be very well organized and structured. The API is simple enough. My only questions/suggestions are:
Security Concerns
You specify:
# Create SSL context that doesn't verify certificates
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
This exposes you to man-in-the-middle attacks. You don't care?
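If occasional certificate failures were the motivation, a narrower approach is to keep verification on by default and make any opt-out explicit and per-call; a sketch (make_context is an invented name):
import ssl

def make_context(verify=True):
    """Verify certificates and hostnames by default; allow a visible opt-out."""
    context = ssl.create_default_context()
    if not verify:
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context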
Performance
Function download_and_clean reads in a number of URLs and proceeds to download each URL serially. Performance could be greatly enhanced if you used a multithreading pool (either class multiprocessing.pool.ThreadPool or concurrent.futures.ThreadPoolExecutor) to download the URLs more concurrently (N downloads running concurrently where you can choose the value of N).
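A sketch with concurrent.futures (download_many is an invented name; download_text is the existing function). Note that the numbering, deduplication, and file writing should stay in the single consumer loop:
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_many(urls, workers=8):
    """Yield (url, text_or_None) pairs as downloads finish, N at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(download_text, url): url for url in urls}
        for future in as_completed(futures):
            yield futures[future], future.result()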
Edge Cases
In your desire to depend only on the standard library you have implemented function clean_html to "remove HTML tags and clean text." I can only assume your purpose is to extract the text from the HTML, e.g. the textual contents of a <p> tag. But this cannot be done correctly with regular expressions; you need an actual HTML parser (the standard library does ship a bare-bones one in html.parser, while third-party parsers such as lxml or BeautifulSoup are far more forgiving).
The first thing you do is html.unescape the text. The user may have in part:
<p>
You should not use the &lt;H1&gt; tag. For example, &lt;H1&gt;Some Title&lt;/H1&gt;
<p>
After un-escaping you end up with:
<p>
You should not use the <H1> tag. For example, <H1>Some Title</H1>
<p>
But then you execute on this result text = re.sub(r'<[^>]+>', '', text), which produces:
You should not use the tag. For example, Some Title
This is probably not what you would want. If you were to first remove tags and then un-escape the results, it would be an improvement -- but far from perfect.
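If staying dependency-free matters, the standard library's html.parser at least tokenizes tags and decodes entities in the right order (though it is far less forgiving of broken markup than lxml or BeautifulSoup). A sketch of a drop-in replacement for clean_html:
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)      # entities arrive here already decoded, as text

    def text(self):
        return " ".join(" ".join(self._chunks).split())

def clean_html(html_content):
    parser = _TextExtractor()
    parser.feed(html_content)
    return parser.text()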
Function text_fingerprint
Why create a digest of only the first 8 "words" (i.e. tokens created by splitting on whitespace)? Is it too "costly" to create a digest on the entire text? You do recover some processing time on not having to split the text. Thanks to Stef for making the point that it is not necessary to split the entire text just to get the first N words:
words = text.lower().split(maxsplit=n)[:n]
Function download_and_clean
You have:
# Normalize filters
if use_default_excludes:
exclude_domains = set(exclude_domains or []) | set(DEFAULT_EXCLUDE_DOMAINS)
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or [])) | set(DEFAULT_EXCLUDE_EXTENSIONS)
else:
exclude_domains = set(exclude_domains or [])
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or []))
Based on the DRY Principle this should be expressed as:
# Normalize filters
exclude_domains = set(exclude_domains or [])
exclude_extensions = set(ext.lower().lstrip('.') for ext in (exclude_extensions or []))
if use_default_excludes:
exclude_domains |= set(DEFAULT_EXCLUDE_DOMAINS)
exclude_extensions |= set(DEFAULT_EXCLUDE_EXTENSIONS)
Later on you have:
# Open log files
success_log = open(os.path.join(output_dir, 'success.txt'), 'w')
failed_log = open(os.path.join(output_dir, 'failed.txt'), 'w')
...
# Cleanup
success_log.close()
failed_log.close()
You have been using context managers to ensure files are closed properly even after exceptions. Why not here?
with open(os.path.join(output_dir, 'success.txt'), 'w') as success_log, \
open(os.path.join(output_dir, 'failed.txt'), 'w') as failed_log:
...
This function has:
# Print summary
print_stats(stats)
return stats
Your main function calls this function but does not do anything with the returned value. Do you really need to both print and return stats?
Comments:
- Regarding text_fingerprint: words = text.lower().split()[:n] can be replaced with words = text.lower().split(maxsplit=n)[:n] to avoid splitting the whole text when only the first few words are wanted. (Stef, Oct 29 at 13:08)
- @Stef Good point! Thanks. (Booboo, Oct 29 at 14:54)
- Although I don't actually know how much this actually saves: it still cuts the text into n+1 strings, and presumably the last string, which we don't need, will be very long, and I don't know whether it's copied or if Python is smart enough to use the same underlying char array as the original text. (Stef, Oct 30 at 9:00)
- Further evidence in favor of "cannot be done correctly without using an actual HTML parser" is what the code currently does if there are multiple script or multiple style tags, not contiguous. (Ben Voigt, Oct 31 at 15:22)
Overview
The code layout is good, and you added ample documentation with usage examples.
try/except
In the download_text function, the except statements are many lines away from the try lines.
PEP 8 recommends that you limit the try clause to the absolute minimum amount
of code necessary to avoid masking bugs. It is hard to keep track of what
line (or lines) are expected to result in the exception.
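A sketch of the same function with the try clause wrapped tightly around the network call only (clean_html is the existing helper):
import ssl
import urllib.error
import urllib.request

def download_text(url, timeout=30):
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = urllib.request.Request(url, headers=headers)
    context = ssl.create_default_context()
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=context) as response:
            content = response.read().decode('utf-8', errors='ignore')
    except (urllib.error.URLError, TimeoutError):
        return None
    text = clean_html(content)                  # nothing past this point is expected to raise
    return text if text.strip() else None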
Import
The ruff tool identifies this line as unused:
from pathlib import Path
It can be deleted.
Consider moving the import line from the download_and_clean function to the top of the code:
from urllib.parse import urlparse
Documentation
It is great that you have docstrings for your functions, as recommended by the PEP 8 style guide.
Also consider using type hints to describe input and return types for the functions to make the code more self-documenting.
Command line
It seems redundant to use both argv and argparse. I see the comment about
backward compatibility, but I think you should try to only use argparse.
Portability
I'm not a big fan of fancy Unicode characters in source code,
like the characters in the download_and_clean function docstring.
Sometimes they don't render well in editors, and other times
they don't render well in output generated by the code.
Returning None
If we look at download_text there is an opportunity.
def download_text(url, timeout=30):
    """Download and extract text from a URL.
    Returns:
        str or None: Cleaned text content, or None if failed
    """
    try:
        # Download
        headers = {'User-Agent': 'Mozilla/5.0'}
        req = urllib.request.Request(url, headers=headers)
        # Create SSL context that doesn't verify certificates
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        with urllib.request.urlopen(req, timeout=timeout, context=context) as response:
            content = response.read().decode('utf-8', errors='ignore')
        # Basic HTML cleaning
        text = clean_html(content)
        return text if text.strip() else None
    except Exception:
        return None
When control flow hits the end of a Python function without hitting an explicit return of some value, None is returned. Thus your code can be:
def download_text(url, timeout=30):
"""Download and extract text from a URL.
Returns:
str or None: Cleaned text content, or None if failed
"""
try:
# Download
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=headers)
# Create SSL context that doesn't verify certificates
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
with urllib.request.urlopen(req, timeout=timeout, context=context) as response:
content = response.read().decode('utf-8', errors='ignore')
# Basic HTML cleaning
text = clean_html(content)
if text.strip():
return text
except Exception:
pass