Re: Parsing big compressed XML files
- Subject: Re: Parsing big compressed XML files
- From: KHMan <keinhong@...>
- Date: 7 Apr 2014 10:31:27 +0800
On 4/7/2014 7:03 AM, Valerio Schiavoni wrote:
And for your curiosity, on one of the smaller files, I get noticeable differences between 7z and bz2:
$ time 7z e -so -bd enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.7z 2>/dev/null > /dev/null
7z e -so -bd enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.7z  6.10s user 0.02s system 99% cpu 6.120 total

$ time bzcat < enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2 > /dev/null
bzcat < enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2 > /dev/null  61.26s user 0.14s system 99% cpu 1:01.41 total
It's strange that Wikipedia has not moved to xz and is still using a mix of
7z and bzip2, when even the Linux kernel has moved to tar.xz. Both 7z
and xz use the newer LZMA2.
Unfortunately, BWT+Huffman in bzip2 has roughly symmetrical times
for compression and decompression. 7z/xz decompression is not
symmetrical to compression, and will always be a lot faster than
bzip2 at big block sizes. For multiple runs, recompressing the whole
kaboodle to 7z/xz will probably greatly improve your runtimes; see
the sketch below. Things like LZO are a lot faster still, but will
likely compress text data to only 50% or so.
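For example, one way to do that recompression (a rough sketch only; the xz
compression level and the final parser command are illustrative placeholders,
not something from Valerio's setup):

    # One-off recompression from bzip2 to xz; -6 is the default level, pick your own
    bzcat enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2 \
      | xz -6 > enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.xz

    # Later runs then pay only the much cheaper xz decompression cost
    xz -dc enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.xz | your-xml-parser

You pay the (slow) xz compression once, and every subsequent pass over the
data streams through the fast decompressor instead of bzip2.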
--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia