coURLan

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

Description

Avoid wasting bandwidth and processing time on webpages that are probably not worth the effort. This library provides an additional brain for web crawling, scraping and management of Internet archives. Specific functionality for crawlers: stay away from pages with little text content, or explicitly target synoptic pages to gather links.
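To make the idea of URL-level cleaning, filtering and deduplication concrete, here is a minimal sketch that relies only on the Python standard library. It is not coURLan's implementation or API: the function names (`normalize`, `keep`, `filter_urls`) and the two lists of tracking prefixes and skipped file extensions are hypothetical and chosen purely for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical lists for illustration only, not taken from coURLan
TRACKING_PREFIXES = ("utm_",)
SKIPPED_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip", ".mp4")


def normalize(url):
    """Return a canonical form of the URL, or None if it is not http(s)."""
    parts = urlsplit(url.strip())
    if parts.scheme.lower() not in ("http", "https") or not parts.netloc:
        return None
    # Keep query parameters except those that look like tracking parameters
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path or "/"
    # Lowercase scheme and host, and drop the fragment
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def keep(url):
    """Reject URLs whose path ends with a file type unlikely to hold text."""
    return not urlsplit(url).path.lower().endswith(SKIPPED_EXTENSIONS)


def filter_urls(urls):
    """Normalize, filter and deduplicate URLs, preserving first-seen order."""
    seen, result = set(), []
    for url in urls:
        cleaned = normalize(url)
        if cleaned and keep(cleaned) and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


candidates = [
    "HTTP://Example.org/page?utm_source=x&id=1#top",
    "http://example.org/page?id=1",
    "http://example.org/img.JPG",
    "ftp://example.org/file",
]
print(filter_urls(candidates))
```

Normalizing before deduplicating is the key design choice in this sketch: two URLs that differ only in letter case, fragment or tracking parameters collapse into one entry, so the page is fetched once.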

This navigation aid targets text-based documents (currently web pages expected to be in HTML format) and tries to guess the language of pages to allow for language-focused collection. Additional functions include straightforward domain name extraction and URL sampling.
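Host extraction and URL sampling can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the library's own code: `host_of` and `sample_per_host` are hypothetical names, and the sketch groups by host name only, whereas determining the registered domain reliably would require a public suffix list.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host name without port or a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def sample_per_host(urls, max_per_host, seed=None):
    """Keep at most max_per_host URLs for each host, chosen at random."""
    groups = defaultdict(list)
    for url in urls:
        groups[host_of(url)].append(url)
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample = []
    for host in sorted(groups):
        candidates = groups[host]
        if len(candidates) > max_per_host:
            candidates = rng.sample(candidates, max_per_host)
        sample.extend(candidates)
    return sample


urls = [f"https://www.example.org/page/{i}" for i in range(10)]
urls.append("http://blog.example.net:8080/a")
print(sample_per_host(urls, max_per_host=3, seed=0))
```

Capping the number of URLs per host keeps a few very large sites from dominating a crawl, which is one plausible reason to sample before downloading.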

Programming language: Python
License: Apache License 2.0
Latest version: v0.6.0
