The lmtool builds a consistent set of lexical and language model files for decoders. The target decoders are the Sphinx family, though any system that can read ARPA-format files can use them.
Currently lmtool is configured for English (American English in particular). If you upload a corpus in a different language, the output is likely to be unpredictable. We are working on this. The current version also does not deal gracefully with Unicode; this is being worked on as well.
FIRST, CREATE A CORPUS FILE consisting of all sentences you would like the decoder to recognize. Put one sentence on each line, without punctuation symbols. You may not need to exhaustively list all possible sentences: the decoder will allow fragments to recombine into new sentences, but the sentences you provide will be preferred. For example:
THIS IS AN EXAMPLE SENTENCE
EACH LINE IS SOMETHING THAT YOU'D WANT YOUR SYSTEM TO RECOGNIZE
ACRONYMS PRONOUNCED AS LETTERS ARE BEST ENTERED AS A T_L_A
NUMBERS AND ABBREVIATIONS OUGHT TO BE SPELLED OUT FOR EXAMPLE
TWO HUNDRED SIXTY THREE ET CETERA
YOU CAN UPLOAD A FEW THOUSAND SENTENCES
BUT THERE IS A LIMIT
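If your sentences start out as ordinary text, a small script can bring them into the shape of the example corpus above. The following is only a sketch in Python, not part of lmtool: the function names are made up for illustration, and the cleaning rules (uppercase everything, keep apostrophes and underscores, drop other punctuation, ASCII letters only) are assumptions read off the example rather than documented requirements. Digits are left in place so that you can find them and spell the numbers out by hand.

```python
import re
from typing import List

# Assumption drawn from the example corpus: a token may contain letters,
# apostrophes (YOU'D) and underscores (T_L_A). Digits are kept so they can
# be flagged later. Everything else is treated as punctuation.
_DISALLOWED = re.compile(r"[^A-Z0-9'_ ]+")


def normalize_line(line: str) -> str:
    """Uppercase one sentence and replace punctuation with single spaces."""
    cleaned = _DISALLOWED.sub(" ", line.upper())
    # split()/join collapses runs of whitespace and trims both ends.
    return " ".join(cleaned.split())


def normalize_corpus(text: str) -> List[str]:
    """Return the cleaned, non-empty sentences, one per list element."""
    cleaned = (normalize_line(raw) for raw in text.splitlines())
    return [line for line in cleaned if line]


def lines_with_digits(lines: List[str]) -> List[str]:
    """Return the lines that still contain digits and need spelling out."""
    return [line for line in lines if any(ch.isdigit() for ch in line)]


if __name__ == "__main__":
    raw = "This is an example sentence.\nMeet me in room 263, please!\n"
    corpus = normalize_corpus(raw)
    print("\n".join(corpus))
    for line in lines_with_digits(corpus):
        print("needs spelling out:", line)
```

Writing the result of normalize_corpus to a text file, one element per line, gives a file in the same layout as the example. Because the pattern only keeps ASCII letters, accented characters are dropped, so this sketch is only suitable for English text.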
[26 january 2010]
Version 3 is now ready for public use. lmtool has been reorganized internally to make use of the Logios package. This will make lmtool easier to maintain in the future and will allow it to take advantage of ongoing development in Logios. These changes should be transparent to regular users. Please give it a try. If you have any problems, or discover bugs, let the maintainer know. If things look good (i.e., I stop getting bug reports) this will become the standard version.
NOTE: If you have automated the use of this tool, you will need to update your code. The main difference is that the name of the target script has changed. The old script will still be available, so nothing will break immediately, but it is unlikely to be maintained going forward. Also, file links are no longer tagged in the HTML. Please let me know if you make use of this feature and I'll find a fix.