I came across this blog post (and an archived version with better readability), by Adam Drake from around a year ago, which is now making the rounds again.
I made some improvements to his code, but wish to see if there are additional tweaks that could be made to make it run even faster.
The task is to extract chess game results from PGN files. The files contain sequences of games, each with a header that contains a Result line like one of these:

[Result "1-0"]
[Result "0-1"]
[Result "1/2-1/2"]
These three results indicate a white win, a black win, and a draw, respectively. The task is to simply collect and report a summary of these results.
Here is my solution to be reviewed:
find . -type f -name '*.pgn' -print0 |
xargs -0 mawk -F '[-"]' '/Result/ { ++a[2ドル]; }
END { print a["1"]+a["0"]+a["1/2"], a["1"], a["0"], a["1/2"] }'
I was skeptical of using find over just listing the files in the reference data set, but my timings indicate that this is actually faster than a shell wildcard (Bash 4.3.11(1)-release).
tripleee@xvbvntv:ChessData$ time find . -type f -name '*.pgn' | wc -l
3025
real 0m0.014s
user 0m0.008s
sys 0m0.011s
tripleee@xvbvntv:ChessData$ time printf '%s\n' */*.pgn | wc -l
3025
real 0m0.037s
user 0m0.032s
sys 0m0.010s
The optimization I originally had in mind was to close each data file after reading the Result line, but as it turns out, the reference data set files contain multiple games, and thus multiple results (and the game portion is a lot smaller than I thought it would be).
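For illustration, here is roughly what that abandoned early-exit idea would have looked like, using nextfile (available in gawk and recent mawk builds); as explained above, it would be wrong for this data set, because it counts only the first result in each file:

find . -type f -name '*.pgn' -print0 |
xargs -0 mawk -F '[-"]' '/Result/ { ++a[2ドル]; nextfile }
END { print a["1"]+a["0"]+a["1/2"], a["1"], a["0"], a["1/2"] }'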
tripleee@xvbvntv:ChessData$ time find . -type f -name '*.pgn' -print0 |
> xargs -0 mawk -F '[-"]' '/Result/ { ++a[2ドル]; }
> END { print a["1"]+a["0"]+a["1/2"], a["1"], a["0"], a["1/2"] }'
6829065 2602614 1974505 2251946
real 0m50.232s
user 0m19.820s
sys 0m2.542s
This is as far as I got. (An earlier version, based on the blog post, attempted parallel processing, but removing that was the biggest performance improvement I made.) I don't think switching from Awk to a "bigger" language would buy me any serious benefits — one of the strengths of Awk is that it's quick to write, parse, and execute. (Compiled code would probably be a tad faster, but I don't think I want to go there; let's try to keep a realistic cost/benefit ratio.) Are there any additional improvements to be made, to make it go even faster?
Here is the data set referenced in the blog, which I used to obtain my results. There are also other datasets like Lumbra's Giga Base, the FICS Games Database and the lichess.org open database that you could use.
Sadly, the Hadoop experiment which is linked from Adam's blog is now a 404, but as usual an archived snapshot is available on the Internet Archive. He quotes the original author as clocking 26 minutes to process the smaller data set on seven c1.medium instances. His own code took 12 seconds, but I was unable to reproduce that; with this data set, it took 2 minutes and 30 seconds on my computer, so I have improved over that by some 60%.
4 Answers
I don't see the reason for chaining with -print0 | xargs -0. It's better and simpler to use -exec:
find . -type f -name '*.pgn' -exec mawk -F '[-"]' '...' {} +
I don't see a way to make the AWK code faster, but:
- Some of the double-quoting is unnecessary
- I would add spaces around operators for somewhat better readability
- A semicolon can be dropped
Like this:
/Result/ { ++a[2ドル] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }
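Putting the two pieces together, the full command (just the question's program with these cleanups applied; not separately benchmarked) would be:

find . -type f -name '*.pgn' -exec mawk -F '[-"]' '
/Result/ { ++a[2ドル] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }' {} +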
- -exec \; runs the command once for every file, so that will be many more awk processes being run. -exec + will behave more like xargs and bunch them up, so that might be an improvement, but I doubt it will be a significant one, as the single process is likely not a major contributing factor to the runtime. (Etan Reisner, Jan 20, 2015)
- @EtanReisner You're right, I changed \; to +. (janos, Jan 20, 2015)
- I kept all array keys quoted for consistency and legibility. I doubt it matters for performance; but if it does, I would guess that explicit strings are faster (integer keys would have to be coerced to strings, then converted to hashes for associative array lookup). It's just the final END calculation, so it's not going to be measurable. (tripleee, Jan 21, 2015)
It might be interesting to see whether it helps any if, instead of using xargs, you passed the filenames to awk as input and then manually used getline on them (or added them to ARGV and let awk handle them normally).
I assume from your explanation that xargs is only spawning one command to handle the files, but I wonder how much time and effort it puts into doing that work, and whether that would exceed the extra processing time awk requires to replace it.
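A rough sketch of the getline variant of this idea (assuming no newlines in the file names; not benchmarked here):

find . -type f -name '*.pgn' |
mawk '{
    # Each input line ($0) is a file name; scan that file ourselves.
    while ((getline line < 0ドル) > 0)
        if (line ~ /Result/) { split(line, f, /[-"]/); ++a[f[2]] }
    close(0ドル)
}
END { print a["1"] + a["0"] + a["1/2"], a["1"], a["0"], a["1/2"] }'

The ARGV approach would instead append each name to ARGV and bump ARGC in a first rule, letting awk open the files normally.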
- xargs doesn't process files, per se. It just converts lines from stdin to command-line arguments: just string processing and command launching. (200_success, Jan 20, 2015)
- @200_success I know. (Did I write something that indicated otherwise?) But it needs to read all the input and assemble the entire command line before it can even start running awk, and that is going to take a bit of time and memory that piping straight to awk does not. Hence the idea. (Etan Reisner, Jan 20, 2015)
TL;DR
frawk -F ':' '/w:/{ white = 2ドル }
/b:/{ black = 2ドル }
/d:/{ draw = 2ドル }
END { print white + black + draw, white, black, draw }' \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1-0"\]$' |
frawk '{ s += 0ドル } END { print "w:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "0-1"\]$' |
frawk '{ s += 0ドル } END { print "b:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1/2-1/2"\]$' |
frawk '{ s += 0ドル } END { print "d:" s }')
It finishes in 2.53 seconds, versus the OP's (20.05 seconds), janos's (19.71 seconds) and Adam's (6.48 seconds) solutions.
Potential Bugs
Your solution, while able to eliminate one piped command relative to Adam's, introduces a new limitation: the output may consist of multiple intermediate result lines rather than a single final summary.
From the xargs man page:
-s max-chars, --max-chars=max-chars
Use at most max-chars characters per command line, including the command
and initial-arguments and the terminating nulls at the ends of the
argument strings. The largest allowed value is system-dependent, and is
calculated as the argument length limit for exec, less the size of your
environment, less 2048 bytes of headroom. If this value is more than
128KiB, 128Kib is used as the default value; otherwise, the default
value is the maximum. 1KiB is 1024 bytes. xargs automatically adapts
to tighter constraints.
I've inadvertently hit the default maximum max-chars limit while testing (a quick way to check the limits appears after the list below). You may force the issue by:
- creating a directory with a very long path name to hold the data set and running your commands from there,
- increasing the size of your environment by populating it with random data:
export __=$(perl -e 'print "x" x $ARGV[0]' 100000)
- decreasing the maximum stack size temporarily by running ulimit -S -s STACK_SIZE, where STACK_SIZE is some number lower than the default 8192, e.g. 256,
or some combination of the above.
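Here is a quick sanity check (GNU findutils only; a sketch, not part of the benchmarks) to see the budget xargs is working with, and into how many separate mawk invocations it would split a given file list:

# Print the command-line size limits GNU xargs will apply:
xargs -r --show-limits </dev/null

# Count how many batches, i.e. separate mawk invocations, xargs would create:
find . -type f -name '*.pgn' -print0 |
xargs -0 sh -c 'echo batch' _ | wc -l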
I'm quite surprised that your attempt at parallel execution slowed you down. If I adapt your changes to Adam's code, it actually ranks among the fastest solutions I've tried. Here's what you'd have come up with:
# op2.sh
n=$(nproc)
find . -type f -name '*.pgn' -print0 |
xargs --null -n $n --max-procs=$n mawk -F '[-"]' \
'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
/Result/{ ++a[2ドル] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }' |
mawk '{ games += 1ドル; white += 2ドル; black += 3ドル; draw += 4ドル; }
END { print games, white, black, draw }'
Notice here I've added a BEGIN rule to initialize the array. This is done to ensure the statistics will be correct: if the collection of files fed to one awk invocation is devoid of certain outcomes (white wins, black wins, or draws), the slots counting the missing outcomes would output empty strings, which, when passed to the final aggregating AWK program, would result in misgrouping and tainted counts.
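A tiny demonstration of the failure mode, using made-up counts where one worker happened to see no black wins:

mawk 'BEGIN { a["1"] = 5; a["1/2"] = 3      # a["0"] never gets set
              print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'
# prints "8 5  3": the empty a[0] field collapses away, so the aggregating
# mawk reads the draw count (3) as 3,ドル i.e. as black wins.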
You can see this effect come into play by changing the -n4 option of xargs to -n1 or -n2 in Adam's solution, reproduced below:
# adam.sh
find . -type f -name '*.pgn' -print0 |
xargs -0 -n4 -P4 mawk \
'/Result/{
split(0,ドル a, "-");
res = substr(a[1], length(a[1]), 1);
if (res == 1) white++;
if (res == 0) black++;
if (res == 2) draw++;
}
END { print white + black + draw, white, black, draw }' |
mawk '{ games += 1ドル; white += 2ドル; black += 3ドル; draw += 4ドル; }
END { print games, white, black, draw }'
A fixed-string search for Result, while fast and simple, is a bit naive. In a prior version of the file Britbase/197008bcf.pgn, there is a comment in the movetext that includes both the string Result and a score 1-0:
1. e4 c6 2. d4 g6 3. Nc3 Bg7 4. Be3 d6 5. Qd2 Nd7 6. f3 Qa5 7. Bc4 b5 8. Bb3 b4
9. Nd1 Ba6 10. Ne2 Ngf6 11. c3 c5 12. cxb4 cxb4 13. Nf2 O-O 14. O-O Rfc8 15.
Rfc1 Nb6 16. Nf4 Nc4 17. Bxc4 Bxc4 18. N4d3 Rab8 19. a3 Qb5 20. Nxb4 a5 21.
Nbd3 Nd7 22. b4 a4 23. Nb2 Bb3 24. Rxc8+ Rxc8 25. Rc1 Rxc1+ 26. Qxc1 Qe2 27.
Nfd1 {Result given as 1-0 in the tournament bulletin but it is clear from
published results and crosstables, as well as the position on the board (where
Black has mate in one) that the true result was 0-1 - JS, BritBase} 0-1
This is a rare occurrence, indeed, but I'd personally prefer the more explicit search pattern $'^\[Result "(1-0|0-1|1/2-1/2)"\]\r*$', which adheres to the PGN export format described in the PGN standard document. It is still possible to have false negatives, but trying to match the more permissive PGN import format is likely to hurt performance and needlessly increase complexity.
Better Tools
Now onto my first observation about possible performance improvements. I noticed that the result lines we're interested in make up only a tiny fraction (less than 5%) of the contents of those PGN files, so I figured that filtering the input for the result lines first with a dedicated search tool like grep might be more efficient than feeding the files directly to awk, which does unnecessary bookkeeping that can accumulate considerable overhead. This was actually Adam's first optimization idea, too. I just took it further by applying his last idea, finding faster replacements for the tools, which got me to ripgrep. ripgrep is an amazing piece of software that searches in parallel, which was Adam's second optimization idea, and I get it for free!
Then I wondered: after all these years, is mawk still the fastest AWK? After 20 years of inactivity, Mike Brennan, the author of mawk, released mawk 1.9.9.6, a beta release for mawk 2 based on mawk 1.3.3's code. Sadly, it didn't outperform mawk 1.3.4 in my testing. There is also frawk, a fast and mostly compatible AWK implementation written in Rust. It requires more steps to install, and in the end, with the exception of one very simple case of summing numbers (one per line), it ran slower in every backend, optimization and parallelization configuration I tried.
After ample experimentation, I came up with two solutions that are blazingly fast on modern hardware:
# rg.sh
rg --glob='*.pgn' --text --no-filename --no-line-number --crlf --no-unicode \
'^\[Result "(1-0|0-1|1/2-1/2)"\]$' |
mawk -F '[-"]' 'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
{ ++a[2ドル] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'
# rg3.sh
frawk -F ':' '/w:/{ white = 2ドル }
/b:/{ black = 2ドル }
/d:/{ draw = 2ドル }
END { print white + black + draw, white, black, draw }' \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1-0"\]$' |
frawk '{ s += 0ドル } END { print "w:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "0-1"\]$' |
frawk '{ s += 0ドル } END { print "b:" s }') \
<(rg --glob='*.pgn' --text --no-filename --count --line-buffered \
--crlf --no-unicode '^\[Result "1/2-1/2"\]$' |
frawk '{ s += 0ドル } END { print "d:" s }')
Compression
Before I present my benchmark results, I want to touch on one more idea that might improve performance, especially in setups with slower disks: searching through compressed archives. This essentially trades disk seek time for CPU time. A reduction in file size and count (from many small files to one big compressed tar file) means a reduction in disk I/O, which translates to faster execution if it is the right trade-off.
We may download the data set on GitHub as a gzip file and decompress it using pigz (parallel gzip):
tar --extract --use-compress-program='pigz' \
--file='./ChessData-master.tar.gz'
This runs in 45 seconds on my machine. GZIP is not the best lossless compression format for this job, nor is pigz able to parallelize decompression particularly well. We could switch to Zstandard by recompressing the data set with pzstd:
tar --create --use-compress-program='pzstd' \
--file='./ChessData-master.tar.zst' ./ChessData-master
This runs in 30 seconds on my machine and produces a 3.5 GB zstd file, 200 MB smaller than the 3.7 GB gzip source. Note that pzstd cannot fully benefit from parallel decompression if the zstd file was compressed with plain zstd, as zstd does not insert the additional markers needed for parallel decompression.
A solution to the problem statement utilizing pzstd is:
# pzstd.sh
tar --extract --to-stdout --use-compress-program='pzstd' \
--wildcards --wildcards-match-slash \
--file='./ChessData-master.tar.zst' '*.pgn' |
mawk -F '[-"]' 'BEGIN { a[1] = a[0] = a["1/2"] = 0 }
/Result/{ ++a[2ドル] }
END { print a[1] + a[0] + a["1/2"], a[1], a[0], a["1/2"] }'
The Benchmarks
Benchmark PC1 specs:
- CPU: 8 cores
- RAM: 16 GB
- Disk: NVMe SSD
- OS: Fedora 40
Data sets:
- ChessData-master
- LumbrasGigaBase (pre-1899 to 2023, Version 2024年04月09日)
Benchmark types:
- Normal: before each run, instruct the OS to drop caches.
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
- Cached: do not drop caches before each run.
- POSIX commands: only POSIX utilities are allowed; substitute mawk and frawk with awk (gawk), and rg with grep. Due to incompatibilities in the command lines, rg (ripgrep) could not simply be replaced with grep, so rg.sh and rg3.sh had to be rewritten as grep.sh and grep3.sh.
Note: adam2.sh is adam.sh patched to utilize all available cores (-n4 -P4 becomes -n$n -P$n, where n=$(nproc)). I didn't test for the best possible value for -n, as that would be ad hoc micro-optimization, so I just made it the same as the number for -P.
Let's examine the median running times, in seconds, of each contender in the benchmarks.
ChessData-master:
rg3.sh 2.53
rg.sh 3.25
op2.sh 3.67
adam2.sh 4.00
adam.sh 6.48
janos.sh 19.71
op.sh 20.05
pzstd.sh 25.88
ChessData-master, cached:
rg3.sh 0.91
rg.sh 1.69
op2.sh 2.04
adam2.sh 2.41
adam.sh 4.35
op.sh 12.52
janos.sh 13.11
pzstd.sh 21.54
ChessData-master, POSIX commands:
op2.sh 3.74
adam2.sh 3.94
adam.sh 6.54
grep.sh 14.09
janos.sh 19.49
op.sh 19.90
grep3.sh 25.02
pzstd.sh 55.96
LumbrasGigaBase:
rg3.sh 4.02
rg.sh 4.47
adam.sh 12.09
op2.sh 14.20
adam2.sh 15.82
op.sh 20.76
janos.sh 21.26
pzstd.sh 24.36
The clear winner is rg3.sh.
There's an easily fixed bug if not all the possible results are encountered by the program. Instead of printing empty strings for missing results, we should force them to be converted to the number 0 using the unary + operator:
print +a["1"]+a["0"]+a["1/2"], +a["1"], +a["0"], +a["1/2"]
A more serious bug is caused by passing all the file names via xargs. If it needs to run more than one AWK process due to command-line length limits, we will get output lines from each process, which we'll need to accumulate using something like this:
| mawk '{ g+=1ドル; b+=2ドル; w+=3ドル; d+=4ドル } END { print g, b, w, d }'
Alternatively, use grep commands to extract the result lines from the files, and pipe that stream into a single AWK process. We can make it faster still by not splitting each line: just count whole lines, and only split the distinct lines when reading out the counts in the END block. At this point we probably want a separate file for the AWK program:
#!/usr/bin/mawk -f
{ ++count[0ドル] }
END {
for (i in count) split(i, r, "[-\"]") && a[r[2]] += count[i];
delete a["*"]; # ignore unfinished games
for (i in a) total += a[i];
print +total, +a["1"], +a["0"], +a["1/2"];
}
find . -type f -name '*.pgn' -exec grep -h '^\[Result ' {} + | ./78086.awk
Note that some of the corpus files have carriage-return characters, causing excess entries in the count array; that's why I use += count[i] rather than simple assignment (still faster than filtering the grep output through tr -d '\r', in my tests).
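For reference, the tr variant mentioned above would be something like:

find . -type f -name '*.pgn' -exec grep -h '^\[Result ' {} + |
tr -d '\r' | ./78086.awk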
Splitting these commands shows that AWK is now no longer the bottleneck:
$ time find . -type f -name '*.pgn' -exec grep -h '^\[Result ' {} + >78086.in
real 0m6.985s
user 0m5.230s
sys 0m1.218s
$ time ./78086.awk <78086.in
10316025 3956979 3035935 3323111
real 0m0.632s
user 0m0.616s
sys 0m0.016s
Any further work ought to focus on creating a faster replacement for the grep command.
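Short of writing a custom tool, one ready-made candidate is ripgrep from Gao's answer; a minimal drop-in for the grep stage might look like this (assuming rg is installed):

rg --glob='*.pgn' --no-filename --no-line-number '^\[Result ' . | ./78086.awk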
- Interesting. With your modifications, I've almost closed the speed gap between rg.sh and rg3.sh. (Gao, Sep 29, 2024)
- @Gao, I wrote an "unordered prefix-fgrep" program: Find all lines with given prefix from a list of files. (Toby Speight, Oct 2, 2024)
- ... find can just print matching paths as it encounters them. That would likely explain why find is faster.