Wednesday, September 16, 2015
Coding Update
After about two weeks of spare-time effort, I've translated about 75% of my core processing into Python. The result is about 6K SLOC. The Common Lisp version is about 14K SLOC, but that count includes a lot of "dead code" that I dropped during translation, as well as the remaining 25% of the functionality.
Most of the translated functions have validated exactly against the Common Lisp originals, but one of the core algorithms is producing different (but still valid-looking) numbers. I spent a little time trying to debug the difference but couldn't find any immediate culprits. So (after some fits and starts) I processed the entire historical game database through the new Python code and used that to train a model. The model had the same performance and accuracy as the model built from the Common Lisp-processed data, so apparently the differences are irrelevant to the model. I suspect the Python version is probably "more correct" than the Common Lisp version, because this is a matrix-manipulation-heavy part of the code, and it is expressed much more succinctly and clearly in Python.
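For matrix-heavy code like this, "validated exactly" is often too strict a bar anyway, since two correct implementations can disagree at the level of floating-point rounding. A minimal sketch of the kind of tolerance-based comparison that applies here (the function name and tolerances are my own, not from the actual codebase):

```python
import numpy as np

def validate_against_reference(python_result, lisp_result,
                               rel_tol=1e-9, abs_tol=1e-12):
    """Compare a translated function's output against the Common Lisp
    reference output, allowing for floating-point drift between the
    two implementations."""
    python_result = np.asarray(python_result, dtype=float)
    lisp_result = np.asarray(lisp_result, dtype=float)
    if python_result.shape != lisp_result.shape:
        return False, "shape mismatch"
    if np.allclose(python_result, lisp_result, rtol=rel_tol, atol=abs_tol):
        return True, "match within tolerance"
    # Report the worst element-wise discrepancy to guide debugging.
    diff = np.abs(python_result - lisp_result)
    worst = np.unravel_index(np.argmax(diff), diff.shape)
    return False, f"max difference {diff[worst]:.3e} at index {worst}"
```

A check like this distinguishes harmless rounding drift from a real algorithmic divergence, which is the question the model-retraining experiment answered indirectly.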
At some point I'm going to have to decide whether I want to carry the Play-By-Play processing code forward. I spent a lot of time on the (original) code, and you can pull some interesting data out of the Play-By-Play. On the other hand, the coverage is poor (especially before about 2012) and the data is full of errors. Many (most?) games have statistics that don't agree between the Play-By-Play data and the box score. A surprising number of games have different final scores. So it's hard to put a lot of faith in the data.
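The kind of cross-check that surfaces these disagreements is straightforward to sketch. This is an illustration only, with hypothetical record shapes (`home_points`, `away_points`, etc.) rather than the actual data format:

```python
# Hypothetical record shapes for illustration: each game carries a box-score
# final, plus a list of play-by-play scoring events.
def pbp_final_score(plays):
    """Reconstruct the final score by summing play-by-play scoring events."""
    home, away = 0, 0
    for play in plays:
        home += play.get("home_points", 0)
        away += play.get("away_points", 0)
    return home, away

def find_score_mismatches(games):
    """Yield IDs of games where the play-by-play total disagrees
    with the box-score final."""
    for game in games:
        if pbp_final_score(game["plays"]) != (game["home_score"],
                                              game["away_score"]):
            yield game["id"]
```

Running a check like this over the database is what makes the error rate concrete: any game it flags has play-by-play data that cannot be trusted without reconciliation.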