Name	Name	Last commit message	Last commit date
Latest commit History 266 Commits
hanzo	hanzo
.gitignore	.gitignore
.hgignore	.hgignore
.hgtags	.hgtags
.travis.yml	.travis.yml
LICENSE	LICENSE
README.md	README.md
make-deb.sh	make-deb.sh
pylint.rc	pylint.rc
pyproject.toml	pyproject.toml
tox.ini	tox.ini
uv.lock	uv.lock
warcunpack_ia.py	warcunpack_ia.py

Warctools

WARC (Web ARChive) file tools for python 2/3 based on the WARC 1.0 spec and compatible with the Internet Archive's ARC File Format originally developed by Hanzo Archives.

Install

pip install warctools

Python Usage

from hanzo import warctools

Python Examples

Write a WARC file:

import os
from hanzo import warctools
def write():
 headers = [
 (b'WARC-Type', b'warcinfo'),
 (b'WARC-Date', b'2019-11-19T23:08:51.182451Z'),
 (b'WARC-Filename', b'CRAWL-20191119230851-00000-hostname.warc.gz'),
 (b'WARC-Record-ID', b'<urn:uuid:8cc5dcae-0b21-11ea-842b-525476278032>')
 ]
 content_type = b'application/warc-fields'
 content = 'This\nis\nonly\na\ntest\n'.encode()
 fname = 'test.warc.gz'
 mode = 'ab'
 if not os.path.exists(fname):
 mode = 'wb'
 with open(fname, mode) as _fh:
 content = (content_type, content)
 record = warctools.WarcRecord(headers=headers, content=content)
 record.write_to(_fh, gzip="record")

Command-line Usage

warcvalid

Returns 0 if the arguments are all valid W/ARC files, non-zero on error.

[warctools] $ warcvalid -h
Usage: warcvalid [options] warc warc warc
Options:
 -h, --help show this help message and exit
 -l LIMIT, --limit=LIMIT
 -I INPUT_FORMAT, --input=INPUT_FORMAT
 -L LOG_LEVEL, --log-level=LOG_LEVEL

warcdump

Writes human readable summary of warcfiles. Autodetects input format when filenames are passed, i.e recordgzip vs plaintext, WARC vs ARC. Assumes uncompressed warc on stdin if no args.

[warctools] $ warcdump -h
Usage: warcdump [options] warc warc warc
Options:
 -h, --help show this help message and exit
 -l LIMIT, --limit=LIMIT
 -I INPUT_FORMAT, --input=INPUT_FORMAT
 -L LOG_LEVEL, --log-level=LOG_LEVEL

warcfilter

Searches all headers for regex pattern. Autodetects and stdin like warcdump. Prints out a WARC format by default. Use -i to invert search. Use -U to constrain to url. Use -T to constrain to record type. Use -C to constrain to content-type.

$ warcfilter -h
Usage: warcfilter [options] pattern warc warc warc
Options:
 -h, --help show this help message and exit
 -l LIMIT, --limit=LIMIT
 limit (ignored)
 -I INPUT_FORMAT, --input=INPUT_FORMAT
 input format (ignored)
 -i, --invert invert match
 -U, --url match on url
 -T, --type match on (warc) record type
 -C, --content-type match on (warc) record content type
 -H, --http-content-type
 match on http payload content type
 -D, --warc-date match on WARC-Date header
 -L LOG_LEVEL, --log-level=LOG_LEVEL
 log level(ignored)

warc2warc

Autodetects compression on file args. Assumes uncompressed stdin if none. Use -Z to write compressed output, i.e warc2warc -Z input > input.gz. Should ignore buggy records in input.

[warctools] $ warc2warc -h
Usage: warc2warc [options] url (url ...)
Options:
 -h, --help show this help message and exit
 -o OUTPUT, --output=OUTPUT
 output warc file
 -l LIMIT, --limit=LIMIT
 -I INPUT_FORMAT, --input=INPUT_FORMAT
 (ignored)
 -Z, --gzip compress output, record by record
 -D, --decode_http decode http messages (strip chunks, gzip)
 -L LOG_LEVEL, --log-level=LOG_LEVEL
 --wget-chunk-fix skip transfer-encoding headers in http records, when
 decoding them (-D)

arc2warc

Creates a crappy WARC file from arc files on input. A handful of headers are preserved. Use -Z to write compressed output, i.e arc2warc -Z input.arc > input.warc.gz

[warctools] $ arc2warc -h
Usage: arc2warc [options] arc (arc ...)
Options:
 -h, --help show this help message and exit
 -o OUTPUT, --output=OUTPUT
 output warc file
 -l LIMIT, --limit=LIMIT
 -Z, --gzip compress
 -L LOG_LEVEL, --log-level=LOG_LEVEL
 --description=DESCRIPTION
 --operator=OPERATOR
 --publisher=PUBLISHER
 --audience=AUDIENCE
 --resource=RESOURCE
 --response=RESPONSE

warcindex

DEPRECATED, use CDX-writer branch.

#WARC-filename offset warc-type warc-subject-uri warc-record-id content-type content-length
warccrap/mywarc.warc 1196018 request /images/slides/hanzo_markm__wwwoh.pdf <urn:uuid:fd1255a8-d07c-11df-b125-12313b0a18c6> application/http;msgtype=request 193
warccrap/mywarc.warc 1196631 response http://www.hanzoarchives.com/images/slides/hanzo_markm__wwwoh.pdf <urn:uuid:fd2614f8-d07c-11df-b125-12313b0a18c6> application/http;msgtype=response 3279474

Notes

arc2warc uses the conversion rules from the earlier arc2warc.c as a starter for converting the headers
I haven't profiled the code yet (and don't plan to until it falls over)
Warcvalid barely skirts some of the iso standard, missing things:
- strict whitespace
- required headers check
- mime quoted printable header encoding
- treating headers as utf8

ToDo

Lots more testing
Support pre-1.0 WARC files
Add more documentation
Support more commandline options for output and filenames
S3 urls

Credits

Originally developed by "tef" thomas.figg@hanzoarchives.com.

@internetarchive

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

internetarchive/warctools

Folders and files

Latest commit

History

Repository files navigation

Warctools

Install

Python Usage

Python Examples

Command-line Usage

warcvalid

warcdump

warcfilter

warc2warc

arc2warc

warcindex

Notes

ToDo

Credits

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Warctools

Install

Python Usage

Python Examples

Command-line Usage

warcvalid

warcdump

warcfilter

warc2warc

arc2warc

warcindex

Notes

ToDo

Credits

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages