Re: Parsing big compressed XML files
- Subject: Re: Parsing big compressed XML files
- From: KHMan <keinhong@...>
- Date: 7 Apr 2014 10:31:27 +0800
On 4/7/2014 7:03 AM, Valerio Schiavoni wrote:
And for your curiosity, on one of the smaller files, I get noticeable differences between 7z and bz2:
$ time 7z e -so -bd enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.7z 2>/dev/null > /dev/null
7z e -so -bd enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.7z  6.10s user 0.02s system 99% cpu 6.120 total

$ time bzcat < enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2 > /dev/null
bzcat < enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2 > /dev/null  61.26s user 0.14s system 99% cpu 1:01.41 total
It's strange that Wikipedia has not moved to xz and is still using a mix of
7z and bzip2, when even the Linux kernel has moved to tar.xz. Both 7z
and xz use the newer LZMA2.
Unfortunately, BWT+Huffman in bzip2 has roughly symmetrical times
for compression and decompression. 7z/xz decompression is not
symmetrical to compression, and will always be a lot faster than
bzip2 at big block sizes. For multiple runs, recompressing the whole
kaboodle to 7z/xz will probably greatly improve your runtimes; see
the sketch below. Things like LZO are a lot faster still, but will
likely compress text data to only 50% or so.
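For example, one way to do that recompression (a rough sketch only; the xz
compression level and the final parser command are illustrative placeholders,
not something from Valerio's setup):

    # One-off recompression from bzip2 to xz; -6 is the default level, pick your own
    bzcat enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2 \
      | xz -6 > enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.xz

    # Later runs then pay only the much cheaper xz decompression cost
    xz -dc enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.xz | your-xml-parser

You pay the (slow) xz compression once, and every subsequent pass over the
data streams through the fast decompressor instead of bzip2.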
--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia