I am currently writing my bachelor's thesis about on-line handwriting recognition. This is not OCR, because I have the information about how a symbol is written, as a list of pen trajectory coordinates (x, y).
I call it hwrt - handwriting recognition toolkit. It has documentation, and a friend of mine got the "first steps" to work on his computer.
However, this is the first time I have written a Python package that I hope others might use, so I would like to get general feedback about the project.
The project is hosted at GitHub and has the following structure:
.
├── bin
├── dist
├── docs
├── hwrt
│   ├── misc
│   └── templates
└── tests
    └── symbols
It has some nosetests (not enough yet; I am working on it).
One of the files in bin is view.py. It allows users to take a look at the data they have previously downloaded (see "first steps" in my documentation).
setup.py
try:
    from setuptools import setup
except ImportError:
    from distutils.core import setup

config = {
    'name': 'hwrt',
    'version': '0.1.125',
    'author': 'Martin Thoma',
    'author_email': '[email protected]',
    'packages': ['hwrt'],
    'scripts': ['bin/backup.py', 'bin/view.py', 'bin/download.py',
                'bin/test.py', 'bin/train.py', 'bin/analyze_data.py',
                'bin/hwrt', 'bin/record.py'],
    'package_data': {'hwrt': ['templates/*', 'misc/*']},
    'url': 'https://github.com/MartinThoma/hwrt',
    'license': 'MIT',
    'description': 'Handwriting Recognition Tools',
    'long_description': """A toolkit for handwriting recognition. It was
    developed as part of the bachelor's thesis of Martin Thoma.""",
    'install_requires': [
        "argparse",
        "theano",
        "nose",
        "natsort",
        "PyYAML",
        "matplotlib",
        "shapely"
    ],
    'keywords': ['HWRT', 'recognition', 'handwriting', 'on-line'],
    'download_url': 'https://github.com/MartinThoma/hwrt',
    'classifiers': ['Development Status :: 3 - Alpha',
                    'Environment :: Console',
                    'Intended Audience :: Developers',
                    'Intended Audience :: Science/Research',
                    'License :: OSI Approved :: MIT License',
                    'Natural Language :: English',
                    'Programming Language :: Python :: 2.7',
                    'Programming Language :: Python :: 3',
                    'Topic :: Scientific/Engineering :: Artificial Intelligence',
                    'Topic :: Software Development',
                    'Topic :: Utilities'],
    'zip_safe': False,
    'test_suite': 'nose.collector'
}

setup(**config)
view.py
#!/usr/bin/env python

"""
Display a recorded handwritten symbol as well as the preprocessing methods
and the data multiplication steps that get applied.
"""

import sys
import os
import logging
logging.basicConfig(format='%(asctime)s %(levelname)s %(message)s',
                    level=logging.DEBUG,
                    stream=sys.stdout)
import yaml
try:  # Python 2
    import cPickle as pickle
except ImportError:  # Python 3
    import pickle

# My modules
import hwrt
from hwrt import HandwrittenData
sys.modules['HandwrittenData'] = HandwrittenData
import hwrt.utils as utils
import hwrt.preprocessing as preprocessing
import hwrt.features as features
import hwrt.data_multiplication as data_multiplication


def _fetch_data_from_server(raw_data_id):
    """Get the data from raw_data_id from the server.
    :returns: The ``data`` if fetching worked, ``None`` if it failed."""
    import MySQLdb
    import MySQLdb.cursors

    # Import configuration file
    cfg = utils.get_database_configuration()
    if cfg is None:
        return None

    # Establish database connection
    connection = MySQLdb.connect(host=cfg[args.mysql]['host'],
                                 user=cfg[args.mysql]['user'],
                                 passwd=cfg[args.mysql]['passwd'],
                                 db=cfg[args.mysql]['db'],
                                 cursorclass=MySQLdb.cursors.DictCursor)
    cursor = connection.cursor()

    # Download dataset
    sql = ("SELECT `id`, `data` "
           "FROM `wm_raw_draw_data` WHERE `id`=%i") % raw_data_id
    cursor.execute(sql)
    return cursor.fetchone()


def _get_data_from_rawfile(path_to_data, raw_data_id):
    """Get a HandwrittenData object that has ``raw_data_id`` from a pickle file
    ``path_to_data``.
    :returns: The HandwrittenData object if ``raw_data_id`` is in
              path_to_data, otherwise ``None``."""
    loaded = pickle.load(open(path_to_data))
    raw_datasets = loaded['handwriting_datasets']
    for raw_dataset in raw_datasets:
        if raw_dataset['handwriting'].raw_data_id == raw_data_id:
            return raw_dataset['handwriting']
    return None


def _list_ids(path_to_data):
    """List raw data IDs grouped by symbol ID from a pickle file
    ``path_to_data``."""
    loaded = pickle.load(open(path_to_data))
    raw_datasets = loaded['handwriting_datasets']
    raw_ids = {}
    for raw_dataset in raw_datasets:
        raw_data_id = raw_dataset['handwriting'].raw_data_id
        if raw_dataset['formula_id'] not in raw_ids:
            raw_ids[raw_dataset['formula_id']] = [raw_data_id]
        else:
            raw_ids[raw_dataset['formula_id']].append(raw_data_id)
    for symbol_id in sorted(raw_ids):
        print("%i: %s" % (symbol_id, sorted(raw_ids[symbol_id])))


def _get_description(prev_description):
    """Get the parsed description file (a dictionary) from another
    parsed description file."""
    current_desc_file = os.path.join(utils.get_project_root(),
                                     prev_description['data-source'],
                                     "info.yml")
    if not os.path.isfile(current_desc_file):
        logging.error("You are probably not in the folder of a model, because "
                      "%s is not a file.", current_desc_file)
        sys.exit(-1)
    with open(current_desc_file, 'r') as ymlfile:
        current_description = yaml.load(ymlfile)
    return current_description


def _get_system(model_folder):
    """Return the preprocessing description, the feature description and the
    model description."""
    # Get model description
    model_description_file = os.path.join(model_folder, "info.yml")
    if not os.path.isfile(model_description_file):
        logging.error("You are probably not in the folder of a model, because "
                      "%s is not a file. (-m argument)",
                      model_description_file)
        sys.exit(-1)
    with open(model_description_file, 'r') as ymlfile:
        model_desc = yaml.load(ymlfile)

    # Get the feature and the preprocessing description
    feature_desc = _get_description(model_desc)
    preprocessing_desc = _get_description(feature_desc)
    return (preprocessing_desc, feature_desc, model_desc)


def display_data(raw_data_string, raw_data_id, model_folder):
    """Print ``raw_data_id`` with the content ``raw_data_string`` after
    applying the preprocessing of ``model_folder`` to it."""
    print("## Raw Data (ID: %i)" % raw_data_id)
    print("```")
    print(raw_data_string)
    print("```")

    preprocessing_desc, feature_desc, _ = _get_system(model_folder)

    # Print model
    print("## Model")
    print("%s\n" % model_folder)

    # Print preprocessing queue
    print("## Preprocessing")
    print("```")
    tmp = preprocessing_desc['queue']
    preprocessing_queue = preprocessing.get_preprocessing_queue(tmp)
    for algorithm in preprocessing_queue:
        print("* " + str(algorithm))
    print("```")

    feature_list = features.get_features(feature_desc['features'])
    input_features = sum(map(lambda n: n.get_dimension(), feature_list))
    print("## Features (%i)" % input_features)
    print("```")
    for algorithm in feature_list:
        print("* %s" % str(algorithm))
    print("```")

    # Get Handwriting
    recording = HandwrittenData.HandwrittenData(raw_data_string,
                                                raw_data_id=raw_data_id)

    # Get the preprocessing queue
    tmp = preprocessing_desc['queue']
    preprocessing_queue = preprocessing.get_preprocessing_queue(tmp)
    recording.preprocessing(preprocessing_queue)

    # Get feature values as list of floats, rounded to 3 decimal places
    tmp = feature_desc['features']
    feature_list = features.get_features(tmp)
    feature_values = recording.feature_extraction(feature_list)
    feature_values = [round(el, 3) for el in feature_values]
    print("Features:")
    print(feature_values)

    # Get the list of data multiplication algorithms
    mult_queue = data_multiplication.get_data_multiplication_queue(
        feature_desc['data-multiplication'])

    # Multiply traing_set
    training_set = [recording]
    for algorithm in mult_queue:
        new_trning_set = []
        for recording in training_set:
            samples = algorithm(recording)
            for sample in samples:
                new_trning_set.append(sample)
        training_set = new_trning_set

    # Display it
    for recording in training_set:
        recording.show()


def get_parser():
    """Return the parser object for this script."""
    from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
    parser = ArgumentParser(description=__doc__,
                            formatter_class=ArgumentDefaultsHelpFormatter)
    parser.add_argument("-i", "--id", dest="id", default=292293,
                        type=int,
                        help="which RAW_DATA_ID do you want?")
    parser.add_argument("--mysql", dest="mysql", default='mysql_online',
                        help="which mysql configuration should be used?")
    parser.add_argument("-m", "--model",
                        dest="model",
                        help="where is the model folder (with a info.yml)?",
                        metavar="FOLDER",
                        type=lambda x: utils.is_valid_folder(parser, x),
                        default=utils.default_model())
    parser.add_argument("-l", "--list",
                        dest="list",
                        help="list all raw data IDs / symbol IDs",
                        action='store_true',
                        default=False)
    parser.add_argument("-s", "--server",
                        dest="server",
                        help="contact the MySQL server",
                        action='store_true',
                        default=False)
    return parser


if __name__ == '__main__':
    args = get_parser().parse_args()
    if args.list:
        preprocessing_desc, _, _ = _get_system(args.model)
        raw_datapath = os.path.join(utils.get_project_root(),
                                    preprocessing_desc['data-source'])
        _list_ids(raw_datapath)
    else:
        if args.server:
            data = _fetch_data_from_server(args.id)
            print("hwrt version: %s" % hwrt.__version__)
            display_data(data['data'], data['id'], args.model)
        else:
            logging.info("RAW_DATA_ID %i does not exist or "
                         "database connection did not work.", args.id)
            # The data was not on the server / the connection to the server
            # did not work. So try it again with the model data
            preprocessing_desc, _, _ = _get_system(args.model)
            raw_datapath = os.path.join(utils.get_project_root(),
                                        preprocessing_desc['data-source'])
            handwriting = _get_data_from_rawfile(raw_datapath, args.id)
            if handwriting is None:
                logging.info("Recording with ID %i was not found in %s",
                             args.id,
                             raw_datapath)
            else:
                print("hwrt version: %s" % hwrt.__version__)
                display_data(handwriting.raw_data_json,
                             handwriting.formula_id,
                             args.model)
As I wrote, I would like to get general feedback about the project. However, I am not experienced with packaging, so I copied the setup.py. I am especially not sure whether my choice of zip_safe: False was correct.

I think I follow PEP8 everywhere and I use pylint to improve my code. However, for view.py I don't understand the following style warnings / I don't know how to fix them (in a good way):
W:115, 4: Redefining name 'preprocessing_desc' from outer scope (line 218) (redefined-outer-name)
W:128, 4: Redefining name 'preprocessing_desc' from outer scope (line 218) (redefined-outer-name)
R:120, 0: Too many local variables (19/15) (too-many-locals)
C:216, 4: Invalid constant name "args" (invalid-name)
C:218, 8: Invalid constant name "preprocessing_desc" (invalid-name)
C:219, 8: Invalid constant name "raw_datapath" (invalid-name)
C:224,12: Invalid constant name "data" (invalid-name)
C:232,12: Invalid constant name "preprocessing_desc" (invalid-name)
C:233,12: Invalid constant name "raw_datapath" (invalid-name)
C:235,12: Invalid constant name "handwriting" (invalid-name)
1 Answer
Overall I think this is good quality, if somewhat dense, code - by which I mean that since there is a lot of functionality there, it's not the easiest code to follow. One suggestion I have for that is more grouping according to topic, so that e.g. the utilities could have a couple of sections for files, string formatting, calling external programs and so on.
Since you work with a lot of external files, I imagine some sort of wrapper object to handle a project or so could work as well.
(Frankly, I may just submit pull requests now, that's easier, heh.)
Code
While running the tests, I've found that the import of open from future.builtins doesn't work (Python 2.7.9), since there is no such module / future_builtins doesn't have open either, which means that nntoolkit can't be loaded and serve.py:19 also throws an error. I don't know what causes this; as you already have Travis CI set up, I'll see if I can get to the root cause of that.
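In case it helps while debugging: one workaround I sometimes use (an assumption on my part - I haven't checked whether it fits nntoolkit/serve.py) is to fall back to io.open, which behaves like Python 3's open on Python 2:

try:
    from future.builtins import open  # provided by the "future" package
except ImportError:
    from io import open  # io.open is the Python-3-style open on Python 2.6+

# Hypothetical usage, just to show the encoding-aware signature.
with open("info.yml", encoding="utf-8") as handle:
    print(handle.read())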
IMO pickle isn't the nicest format for long-term data files; however, at this point this is more of a reflex for me - if it works for you, then sure, why not (although you already have at least one workaround, the sys.modules part, so keep that in mind I think).

For speed you can use ujson as a drop-in replacement for json.
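For illustration, the swap could look roughly like this (a sketch that assumes only the dumps/loads subset of the json API is used):

try:
    import ujson as json  # faster C implementation with a json-like API
except ImportError:
    import json  # fall back to the standard library

record = json.loads('{"x": [12, 44], "y": [30, 7]}')  # made-up recording data
print(json.dumps(record))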
In data_analyzation_metrics.py:119 a raw escape string for color(?) is used. I'd additionally want a global flag to disable colors, and it would be good to use a library for the formatting (I saw colorterm and termcolor; there may be others).
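As a rough sketch of what I mean (termcolor is just one option; USE_COLOR and highlight are made-up names, not part of hwrt):

from termcolor import colored

USE_COLOR = True  # global switch so colors can be disabled in one place


def highlight(text, color="red"):
    """Return colored text unless colors are globally disabled."""
    return colored(text, color) if USE_COLOR else text


print(highlight("this metric looks suspicious", "yellow"))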
For something like "%s" % str(x) the str isn't necessary.

You're already doing this in some places, so I'd suggest using with open("foo") as file: all the time (if possible).

Instead of self.__repr__(), repr(self) looks cleaner.
In features.py:174 the factors 2 and 3 should be extracted, e.g. something like draw_width = 3 if self.pen_down else 2 or so; in general, extracting common subexpressions (even len(x)) can eliminate a lot of code, so I won't look for other examples here.
In general, if you have if foo: return, you don't need an else - just remove that indentation; also, returning early can remove lots of indentation.
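A tiny before/after sketch of what I mean (the function and the names in it are made up for illustration, not taken from hwrt):

# Before: the else only adds a level of indentation after the guard clause.
def get_handwriting(raw_dataset):
    if raw_dataset is None:
        return None
    else:
        return raw_dataset['handwriting']

# After: returning early keeps the main path flat.
def get_handwriting(raw_dataset):
    if raw_dataset is None:
        return None
    return raw_dataset['handwriting']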
For HandwrittenData.py:208 the whole method could just be reduced to:

def __eq__(self, other):
    return isinstance(other, self.__class__) \
        and self.__dict__ == other.__dict__
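One thing to keep in mind while the package still supports Python 2: there, __ne__ is not derived from __eq__ automatically, so it may be worth adding the mirror method as well (a small sketch):

def __ne__(self, other):
    # Python 2 does not infer "not equal" from __eq__, so mirror it explicitly.
    return not self.__eq__(other)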
euclidean_distance from preprocessing.py:30 is also defined as scipy.spatial.distance.euclidean, so if in some cases you run that over more than a few elements you could consider using that.
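For example (a self-contained sketch with made-up points, just to show that the call is a drop-in for plain (x, y) tuples):

import math
from scipy.spatial.distance import euclidean

p1, p2 = (0.0, 0.0), (3.0, 4.0)
print(math.hypot(p2[0] - p1[0], p2[1] - p1[1]))  # 5.0, hand-rolled
print(euclidean(p1, p2))                         # 5.0, scipy's implementation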
preprocessing.py:497 could be neater with for index, point in enumerate(pointlist):.
The which function in selfcheck.py should already exist somewhere else, shouldn't it ...?
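If the point is that such a helper already exists: the standard library has shutil.which on Python 3.3+ and distutils.spawn.find_executable on Python 2, so a hand-written version could probably be dropped (this is my assumption about what selfcheck.py needs, I haven't verified it):

try:
    from shutil import which  # Python 3.3+
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2 fallback

print(which("convert"))  # e.g. the ImageMagick binary; None if it is missing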
Instead of using zip you can also use itertools.izip, the generator variant, to use less memory when possible, i.e. when only iterating over something with a for loop instead of storing the resulting list.
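A small compatibility sketch (assuming the pairs are only iterated over and never indexed or stored; on Python 3 the built-in zip is already lazy, so nothing changes there):

try:
    from itertools import izip as zip_  # Python 2: generator-based zip
except ImportError:
    zip_ = zip  # Python 3: the built-in zip already returns an iterator

xs = [1, 2, 3]
ys = [4, 5, 6]
for x, y in zip_(xs, ys):
    print(x, y)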
Packaging
The setup.py has a version number, but the Git repository has no corresponding tags/releases. If you start now and add e.g. 0.1.207 as the first release (or so), that'd make it easier to refer to specific versions, i.e. from install scripts.
The long_description has a newline, which might look weird in some circumstances.
install_requires lists at least one package which is not in the requirements.txt, so I'd sync those - unless it's really only required for installation, in which case this point is moot.
The keywords look good, except that I'd doubt that including 'HWRT' helps if the package is already named that way. (I had also suggested that 'on-line' probably doesn't need the hyphen, but it does - see the comments.)
The classifiers are good; you could also list more specific versions of Python 3.
The requirements.txt has no version requirements. I'd say that at least some lower bound (i.e. your currently installed packages or so) would be useful to have. Of course you might not precisely know which versions to go for, but for someone trying to get it running this would be helpful nonetheless.
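For example, loose lower bounds could look roughly like this (the version numbers here are placeholders, not recommendations):

PyYAML>=3.10
matplotlib>=1.3
natsort>=3.5
nose>=1.3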
I don't exactly know the process for external requirements, but you could just note somewhere that ImageMagick is a dependency.
You also fixed the PEP8 stuff, so now there are only a few too-long lines left; you could also add pep8 as a pre-commit hook, so you wouldn't be able to check in without everything fixed - I use that for library code at least. The same goes for pylint; I'd maybe disable a few of those checks (and add it to the Makefile, or again as a pre-commit hook).
Tests
Great! There is a bit of duplication, e.g. compare_pointlists is implemented three times. If possible I'd move that into (yet another, heh) utils package just to get it out of the way.
Using nosetests and the addition of the Makefile is nice as well.
Future ideas
Well, I like PostgreSQL, so I think this will come up at some point; if you don't have a pressing reason to use MySQL exclusively, using a database independent library would be cool.
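As an illustration of what "database independent" could mean in practice, here is a rough sketch with SQLAlchemy (one possible library; the table and column names come from the MySQL query in view.py, the connection URL is a placeholder):

from sqlalchemy import create_engine, text

# Switching to PostgreSQL would then only change the URL, e.g.
# "postgresql://user:passwd@host/db" instead of "mysql://user:passwd@host/db".
engine = create_engine("mysql://user:passwd@host/db")
with engine.connect() as connection:
    row = connection.execute(
        text("SELECT id, data FROM wm_raw_draw_data WHERE id = :id"),
        {"id": 292293},
    ).fetchone()
    print(row)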
- Woooha, this is awesome! Thank you for your detailed feedback! It will take me some time to go through it and I'll probably come back and ask questions in the comments, but I just wanted to let you know that I really like your feedback. – Martin Thoma, Dec 16, 2014 at 21:14
- Glad to help. High-level architectural suggestions are unfortunately much more difficult and I skipped on that, so maybe someone else will add something for that. – ferada, Dec 16, 2014 at 21:18
- "'on-line' probably doesn't need the hyphen" - It does need the hyphen; it is a very specific term. In this case on-line does not mean web-based, it is used in contrast to OCR. (I've just improved the tests by factoring test helpers out. I also added my first git tag :-) ) – Martin Thoma, Dec 16, 2014 at 22:05
- ... args to ARGS and it should be fixed. However, I have never seen that. Additionally, the optparse documentation also calls it args, so I think that I shouldn't name it ARGS.
- ... if __name__ == '__main__': into a function.
- ... # pylint: disable=redefined-outer-name
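To make the last two comment fragments concrete, here is a minimal sketch of wrapping the __main__ block in a function (main is just the conventional name, not something view.py already defines); that keeps args and friends out of module scope, which silences the invalid-name and redefined-outer-name messages. Note that _fetch_data_from_server currently reads the global args, so it would need the MySQL configuration name passed in explicitly:

def main(args):
    """Run the action selected on the command line (sketch only)."""
    if args.list:
        preprocessing_desc, _, _ = _get_system(args.model)
        raw_datapath = os.path.join(utils.get_project_root(),
                                    preprocessing_desc['data-source'])
        _list_ids(raw_datapath)
    # ... the remaining branches of the former __main__ block go here ...


if __name__ == '__main__':
    main(get_parser().parse_args())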