I came across this blog post (and an archived version with better readability), by Adam Drake from around a year ago, which is now making the rounds again.
I made some improvements to his code, but wish to see if there are additional tweaks that could be made to make it run even faster.
The task is to extract chess game results from PGN files. The files contain sequences of games, each with a header that contains a Result line like one of these:

[Result "1-0"]
[Result "0-1"]
[Result "1/2-1/2"]
These three results indicate a white win, a black win, and a draw, respectively. The task is to simply collect and report a summary of these results.
Here is my solution to be reviewed:
find . -type f -name '*.pgn' -print0 |
xargs -0 mawk -F '[-"]' '/Result/ { ++a[2ドル]; }
END { print a["1"]+a["0"]+a["1/2"], a["1"], a["0"], a["1/2"] }'
I was skeptical of using find over just listing the files in the reference data set, but my timings indicate that this is actually faster than a shell wildcard (Bash 4.3.11(1)-release).
tripleee@xvbvntv:ChessData$ time find . -type f -name '*.pgn' | wc -l
3025
real 0m0.014s
user 0m0.008s
sys 0m0.011s
tripleee@xvbvntv:ChessData$ time printf '%s\n' */*.pgn | wc -l
3025
real 0m0.037s
user 0m0.032s
sys 0m0.010s
The optimization I originally had in mind was to close each data file after reading the Result line, but as it turns out, the reference data set files contain multiple games, and thus multiple results (and the game portion is a lot smaller than I thought it would be).
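For illustration, here is roughly what that abandoned early-exit idea would have looked like, using nextfile (available in gawk and recent mawk builds); as explained above, it would be wrong for this data set, because it counts only the first result in each file:

find . -type f -name '*.pgn' -print0 |
xargs -0 mawk -F '[-"]' '/Result/ { ++a[2ドル]; nextfile }
END { print a["1"]+a["0"]+a["1/2"], a["1"], a["0"], a["1/2"] }'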
tripleee@xvbvntv:ChessData$ time find . -type f -name '*.pgn' -print0 |
> xargs -0 mawk -F '[-"]' '/Result/ { ++a[2ドル]; }
> END { print a["1"]+a["0"]+a["1/2"], a["1"], a["0"], a["1/2"] }'
6829065 2602614 1974505 2251946
real 0m50.232s
user 0m19.820s
sys 0m2.542s
This is as far as I got. (An earlier version, based on the blog post, attempted parallel processing, but removing that was the biggest performance improvement I made.) I don't think switching from Awk to a "bigger" language would buy me any serious benefits — one of the strengths of Awk is that it's quick to write, parse, and execute. (Compiled code would probably be a tad faster, but I don't think I want to go there; let's try to keep a realistic cost/benefit ratio.) Are there any additional improvements to be made, to make it go even faster?
Here is the data set referenced in the blog, which I used to obtain my results. There are also other datasets like Lumbra's Giga Base, the FICS Games Database and the lichess.org open database that you could use.
Sadly, the Hadoop experiment which is linked from Adam's blog is now a 404, but as usual an archived snapshot is available on the Internet Archive. He quotes the original author as clocking 26 minutes to process the smaller data set on seven c1.medium instances. His own code took 12 seconds, but I was unable to reproduce that; with this data set, it took 2 minutes and 30 seconds on my computer, so I have improved over that by some 60%.
4 Answers
I don't see the reason for chaining with -print0 | xargs -0. It's better and simpler to use -exec:
find . -type f -name '*.pgn' -exec mawk -F '[-"]' '...' {} +
I don't see a way to make the AWK code faster, but:
- Some of the double-quoting is unnecessary
- I would add spaces around operators for somewhat better readability
- A semicolon can be dropped
Like this:
/Result/ { ++a[2ドル] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }
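Putting the two pieces together, the full command (just the question's program with these cleanups applied; not separately benchmarked) would be:

find . -type f -name '*.pgn' -exec mawk -F '[-"]' '
/Result/ { ++a[2ドル] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }' {} +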
- -exec \; runs the command once for every file, so that will be many more awk processes being run. -exec + will behave more like xargs and bunch them up, so that might be an improvement, but I doubt it will be a significant one, as the single process is likely not a major contributing factor to the runtime. (Etan Reisner, Jan 20, 2015)
- @EtanReisner You're right, I changed \; to +. (janos, Jan 20, 2015)
- I kept all array keys quoted for consistency and legibility. I doubt it matters for performance; but if it does, I would guess that explicit strings are faster (integer keys would have to be coerced to strings, then converted to hashes for associative array lookup). It's just the final END calculation, so it's not going to be measurable. (tripleee, Jan 21, 2015)
It might be interesting to see whether it helps any if, instead of using xargs, you passed the filenames to awk as input and then manually used getline on them (or added them to ARGV and let awk handle them normally).
I assume from your explanation that xargs is only spawning one command to handle the files, but I wonder how much time and effort it puts into doing that work, and whether that would exceed the extra processing time awk requires to replace it.
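A rough sketch of the getline variant of this idea (assuming no newlines in the file names; not benchmarked here):

find . -type f -name '*.pgn' |
mawk '{
    # Each input line ($0) is a file name; scan that file ourselves.
    while ((getline line < 0ドル) > 0)
        if (line ~ /Result/) { split(line, f, /[-"]/); ++a[f[2]] }
    close(0ドル)
}
END { print a["1"] + a["0"] + a["1/2"], a["1"], a["0"], a["1/2"] }'

The ARGV approach would instead append each name to ARGV and bump ARGC in a first rule, letting awk open the files normally.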
- xargs doesn't process files, per se. It just converts lines from stdin to command-line arguments: just string processing and command launching. (200_success, Jan 20, 2015)
- @200_success I know. (Did I write something that indicated otherwise?) But it needs to read all the input and assemble the entire command line before it can even start running awk, and that is going to take a bit of time and memory that piping straight to awk does not. Hence the idea. (Etan Reisner, Jan 20, 2015)
TL;DR
frawk -F ':' '/w:/{ white = 2ドル }
/b:/{ black = 2ドル }
/d:/{ draw = 2ドル }
END { print white + black + draw, white, black, draw }' \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1-0"\]$' |
frawk '{ s += 0ドル } END { print "w:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "0-1"\]$' |
frawk '{ s += 0ドル } END { print "b:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1/2-1/2"\]$' |
frawk '{ s += 0ドル } END { print "d:" s }')
It finishes in 2.53 seconds, versus the OP's (20.05 seconds), janos's (19.71 seconds) and Adam's (6.48 seconds) solutions.
Potential Bugs
Your solution, while able to eliminate one piped command relative to Adam's, introduces a new limitation: the output may consist of multiple intermediate result lines rather than a single final summary.
From the xargs man page:
-s max-chars, --max-chars=max-chars
Use at most max-chars characters per command line, including the command
and initial-arguments and the terminating nulls at the ends of the
argument strings. The largest allowed value is system-dependent, and is
calculated as the argument length limit for exec, less the size of your
environment, less 2048 bytes of headroom. If this value is more than
128KiB, 128Kib is used as the default value; otherwise, the default
value is the maximum. 1KiB is 1024 bytes. xargs automatically adapts
to tighter constraints.
I've inadvertently hit the default maximum max-chars limit while testing (a quick way to check the limits appears after the list below). You may force the issue by:
- creating a directory with a very long path name to hold the data set and running your commands from there,
- increasing the size of your environment by populating it with random data:
export __=$(perl -e 'print "x" x $ARGV[0]' 100000)
- decreasing the maximum stack size temporarily by running ulimit -S -s STACK_SIZE, where STACK_SIZE is some number lower than the default 8192, e.g. 256,
or some combination of the above.
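Here is a quick sanity check (GNU findutils only; a sketch, not part of the benchmarks) to see the budget xargs is working with, and into how many separate mawk invocations it would split a given file list:

# Print the command-line size limits GNU xargs will apply:
xargs -r --show-limits </dev/null

# Count how many batches, i.e. separate mawk invocations, xargs would create:
find . -type f -name '*.pgn' -print0 |
xargs -0 sh -c 'echo batch' _ | wc -l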
I'm quite surprised that your attempt at parallel execution slowed you down. If I adapt your changes to Adam's code, it actually ranks among the fastest solutions I've tried. Here's what you'd have come up with:
# op2.sh
n=$(nproc)
find . -type f -name '*.pgn' -print0 |
xargs --null -n $n --max-procs=$n mawk -F '[-"]' \
'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
/Result/{ ++a[2ドル] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }' |
mawk '{ games += 1ドル; white += 2ドル; black += 3ドル; draw += 4ドル; }
END { print games, white, black, draw }'
Notice here I've added a BEGIN rule to initialize the array. This is done to ensure the statistics will be correct: if the collection of files fed to one awk invocation is devoid of certain outcomes (white wins, black wins, or draws), the slots counting the missing outcomes would output empty strings, which, when passed to the final aggregating AWK program, would result in misgrouping and tainted counts.
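A tiny demonstration of the failure mode, using made-up counts where one worker happened to see no black wins:

mawk 'BEGIN { a["1"] = 5; a["1/2"] = 3      # a["0"] never gets set
              print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'
# prints "8 5  3": the empty a[0] field collapses away, so the aggregating
# mawk reads the draw count (3) as 3,ドル i.e. as black wins.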
You can see this effect come into play by changing the -n4 option of xargs to -n1 or -n2 in Adam's solution, reproduced below:
# adam.sh
find . -type f -name '*.pgn' -print0 |
xargs -0 -n4 -P4 mawk \
'/Result/{
split(0,ドル a, "-");
res = substr(a[1], length(a[1]), 1);
if (res == 1) white++;
if (res == 0) black++;
if (res == 2) draw++;
}
END { print white + black + draw, white, black, draw }' |
mawk '{ games += 1ドル; white += 2ドル; black += 3ドル; draw += 4ドル; }
END { print games, white, black, draw }'
A fixed-string search for Result, while fast and simple, is a bit naive. In a prior version of the file Britbase/197008bcf.pgn, there is a comment in the movetext that includes both the string Result and a score 1-0:
1. e4 c6 2. d4 g6 3. Nc3 Bg7 4. Be3 d6 5. Qd2 Nd7 6. f3 Qa5 7. Bc4 b5 8. Bb3 b4
9. Nd1 Ba6 10. Ne2 Ngf6 11. c3 c5 12. cxb4 cxb4 13. Nf2 O-O 14. O-O Rfc8 15.
Rfc1 Nb6 16. Nf4 Nc4 17. Bxc4 Bxc4 18. N4d3 Rab8 19. a3 Qb5 20. Nxb4 a5 21.
Nbd3 Nd7 22. b4 a4 23. Nb2 Bb3 24. Rxc8+ Rxc8 25. Rc1 Rxc1+ 26. Qxc1 Qe2 27.
Nfd1 {Result given as 1-0 in the tournament bulletin but it is clear from
published results and crosstables, as well as the position on the board (where
Black has mate in one) that the true result was 0-1 - JS, BritBase} 0-1
This is a rare occurrence, indeed, but I'd personally prefer the more explicit search pattern $'^\[Result "(1-0|0-1|1/2-1/2)"\]\r*$', which adheres to the PGN export format described in the PGN standard document. It is still possible to have false negatives, but trying to match the more permissive PGN import format is likely to hurt performance and needlessly increase complexity.
Better Tools
Now onto my first observation about possible performance improvements. I noticed that the result lines we're interested in make up only a tiny fraction (less than 5%) of the contents of those PGN files, so I figured that filtering the input for the result lines first with a dedicated search tool like grep might be more efficient than feeding the files directly to awk, which does unnecessary bookkeeping that can accumulate considerable overhead. This was actually Adam's first optimization idea, too. I just took it further by applying his last idea, finding faster replacements for the tools, which got me to ripgrep. ripgrep is an amazing piece of software that searches in parallel, which was Adam's second optimization idea, and I get it for free!
Then I wondered: after all these years, is mawk still the fastest AWK? After 20 years of inactivity, Mike Brennan, the author of mawk, released mawk 1.9.9.6, a beta release for mawk 2 based on mawk 1.3.3's code. Sadly, it didn't outperform mawk 1.3.4 in my testing. There is also frawk, a fast and mostly compatible AWK implementation written in Rust. It requires more steps to install, and in the end, with the exception of one very simple case of summing numbers (one per line), it ran slower in every backend, optimization and parallelization configuration I tried.
After ample experimentation, I came up with two solutions that are blazingly fast on modern hardware:
# rg.sh
rg --glob='*.pgn' --text --no-filename --no-line-number --crlf --no-unicode \
'^\[Result "(1-0|0-1|1/2-1/2)"\]$' |
mawk -F '[-"]' 'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
{ ++a[2ドル] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'
# rg3.sh
frawk -F ':' '/w:/{ white = 2ドル }
/b:/{ black = 2ドル }
/d:/{ draw = 2ドル }
END { print white + black + draw, white, black, draw }' \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1-0"\]$' |
frawk '{ s += 0ドル } END { print "w:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "0-1"\]$' |
frawk '{ s += 0ドル } END { print "b:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1/2-1/2"\]$' |
frawk '{ s += 0ドル } END { print "d:" s }')
Compression
Before I present my benchmark results, I want to touch on one more idea that might improve performance, especially in setups with slower disks: searching through compressed archives. This essentially trades disk seek time for CPU time. A reduction in file size and count (from many small files to one big compressed tar file) means a reduction in disk I/O, which translates to faster execution if it is the right trade-off.
We may download the data set on GitHub as a gzip file and decompress it using pigz (parallel gzip):
tar --extract --use-compress-program='pigz' \
--file='./ChessData-master.tar.gz'
This runs in 45 seconds on my machine. GZIP is not the best lossless compression format for this job, nor is pigz able to parallelize decompression particularly well. We could switch to Zstandard by recompressing the data set with pzstd:
tar --create --use-compress-program='pzstd' \
--file='./ChessData-master.tar.zst' ./ChessData-master
This runs in 30 seconds on my machine and produces a 3.5 GB zstd file, 200 MB smaller than the 3.7 GB gzip source. Note that pzstd cannot fully benefit from parallel decompression if the zstd file was compressed with plain zstd, as zstd does not insert the additional markers needed for parallel decompression.
A solution to the problem statement utilizing pzstd is:
# pzstd.sh
tar --extract --to-stdout --use-compress-program='pzstd' \
--wildcards --wildcards-match-slash \
--file='./ChessData-master.tar.zst' '*.pgn' |
mawk -F '[-"]' 'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
/Result/{ ++a[2ドル] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'
The Benchmarks
Benchmark PC1 specs:
- CPU: 8 cores
- RAM: 16 GB
- Disk: NVMe SSD
- OS: Fedora 40
Data sets:
- ChessData-master
- LumbrasGigaBase (pre-1899 to 2023, Version 2024年04月09日)
Benchmark types:
- Normal: before each run, instruct the OS to drop caches.
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
- Cached: do not drop caches before each run.
- POSIX commands: only POSIX utilities are allowed; substitute mawk and frawk with awk (gawk), and rg with grep. Due to incompatibilities in the command lines, rg (ripgrep) could not simply be replaced with grep, so rg.sh and rg3.sh had to be rewritten as grep.sh and grep3.sh.
Note: adam2.sh is adam.sh patched to utilize all available cores (-n4 -P4 becomes -n$n -P$n, where n=$(nproc)). I didn't test for the best possible value for -n, as that would be ad hoc micro-optimization, so I just made it the same as the number for -P.
Let's examine the median running times, in seconds, of each contender in the benchmarks.
ChessData-master:
rg3.sh 2.53
rg.sh 3.25
op2.sh 3.67
adam2.sh 4.00
adam.sh 6.48
janos.sh 19.71
op.sh 20.05
pzstd.sh 25.88
ChessData-master, cached:
rg3.sh 0.91
rg.sh 1.69
op2.sh 2.04
adam2.sh 2.41
adam.sh 4.35
op.sh 12.52
janos.sh 13.11
pzstd.sh 21.54
ChessData-master, POSIX commands:
op2.sh 3.74
adam2.sh 3.94
adam.sh 6.54
grep.sh 14.09
janos.sh 19.49
op.sh 19.90
grep3.sh 25.02
pzstd.sh 55.96
LumbrasGigaBase:
rg3.sh 4.02
rg.sh 4.47
adam.sh 12.09
op2.sh 14.20
adam2.sh 15.82
op.sh 20.76
janos.sh 21.26
pzstd.sh 24.36
The clear winner is rg3.sh.
There's an easily fixed bug if not all the possible results are encountered by the program. Instead of printing empty strings for missing results, we should force them to be converted to the number 0 using the unary + operator:
print +a["1"]+a["0"]+a["1/2"], +a["1"], +a["0"], +a["1/2"]
A more serious bug is caused by passing all the file names via xargs. If it needs to run more than one AWK process due to command-line length limits, we will get output lines from each process, which we'll need to accumulate using something like this:
| mawk '{ g+=1ドル; b+=2ドル; w+=3ドル; d+=4ドル } END { print g, b, w, d }'
Alternatively, use grep commands to extract the result lines from the files, and pipe that stream into a single AWK process. We can make it faster still by not splitting each line: just count whole lines, and only split the distinct lines when reading out the counts in the END block. At this point we probably want a separate file for the AWK program:
#!/usr/bin/mawk -f
{ ++count[0ドル] }
END {
for (i in count) split(i, r, "[-\"]") && a[r[2]] += count[i];
delete a["*"]; # ignore unfinished games
for (i in a) total += a[i];
print +total, +a["1"], +a["0"], +a["1/2"];
}
find . -type f -name '*.pgn' -exec grep -h '^\[Result ' {} + | ./78086.awk
Note that some of the corpus files have carriage-return characters, causing excess entries in the count array; that's why I use += count[i] rather than simple assignment (still faster than filtering the grep output through tr -d '\r', in my tests).
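For reference, the tr variant mentioned above would be something like:

find . -type f -name '*.pgn' -exec grep -h '^\[Result ' {} + |
tr -d '\r' | ./78086.awk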
Splitting these commands shows that AWK is now no longer the bottleneck:
$ time find . -type f -name '*.pgn' -exec grep -h '^\[Result ' {} + >78086.in
real 0m6.985s
user 0m5.230s
sys 0m1.218s
$ time ./78086.awk <78086.in
10316025 3956979 3035935 3323111
real 0m0.632s
user 0m0.616s
sys 0m0.016s
Any further work ought to focus on creating a faster replacement for the grep command.
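Short of writing a custom tool, one ready-made candidate is ripgrep from Gao's answer; a minimal drop-in for the grep stage might look like this (assuming rg is installed):

rg --glob='*.pgn' --no-filename --no-line-number '^\[Result ' . | ./78086.awk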
- Interesting. With your modifications, I've almost closed the speed gap between rg.sh and rg3.sh. (Gao, Sep 29, 2024)
- @Gao, I wrote an "unordered prefix-fgrep" program: Find all lines with given prefix from a list of files. (Toby Speight, Oct 2, 2024)
- ... find can just print matching paths as it encounters them. That would likely explain why find is faster.