Stanford Log-linear Part-Of-Speech Tagger

About

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and each other token). Computational applications generally use more fine-grained POS tags, like 'noun-plural'. This software is a Java implementation of the log-linear part-of-speech taggers described in these papers (if citing just one paper, cite the 2003 one):

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.
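
To make "log-linear" concrete, here is a minimal sketch of how a log-linear model turns feature weights into a probability distribution over tags for one token. It is not the tagger's actual feature set or inference procedure: the class name, the feature names, and the weights are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy log-linear classifier over POS tags for a single token.
// Feature names and weights are hypothetical, not taken from the tagger.
class LogLinearSketch {

    // P(tag | features) = exp(score(tag)) / sum over all tags of exp(score(tag')),
    // where score(tag) sums the weights of the active "feature|tag" pairs.
    static Map<String, Double> tagDistribution(List<String> activeFeatures,
                                               List<String> tags,
                                               Map<String, Double> weights) {
        Map<String, Double> scores = new LinkedHashMap<>();
        double max = Double.NEGATIVE_INFINITY;
        for (String tag : tags) {
            double score = 0.0;
            for (String feature : activeFeatures) {
                score += weights.getOrDefault(feature + "|" + tag, 0.0);
            }
            scores.put(tag, score);
            max = Math.max(max, score);
        }
        // Subtract the maximum score before exponentiating for numerical stability.
        double normalizer = 0.0;
        for (double score : scores.values()) {
            normalizer += Math.exp(score - max);
        }
        Map<String, Double> probs = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            probs.put(e.getKey(), Math.exp(e.getValue() - max) / normalizer);
        }
        return probs;
    }

    // Returns the tag with the highest probability.
    static String bestTag(Map<String, Double> probs) {
        String best = null;
        double bestP = -1.0;
        for (Map.Entry<String, Double> e : probs.entrySet()) {
            if (e.getValue() > bestP) {
                best = e.getKey();
                bestP = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new HashMap<>();
        weights.put("suffix=s|NNS", 1.5);   // hypothetical weight
        weights.put("suffix=s|VBZ", 1.0);   // hypothetical weight
        weights.put("prevTag=DT|NNS", 2.0); // hypothetical weight
        weights.put("prevTag=DT|NN", 1.0);  // hypothetical weight

        Map<String, Double> probs = tagDistribution(
                Arrays.asList("suffix=s", "prevTag=DT"),
                Arrays.asList("NN", "NNS", "VBZ"),
                weights);
        for (Map.Entry<String, Double> e : probs.entrySet()) {
            System.out.printf("%s %.3f%n", e.getKey(), e.getValue());
        }
        System.out.println("best: " + bestTag(probs));
    }
}
```

The real tagger combines many such features and, per the 2003 paper, uses context on both sides of a word. This sketch only shows the normalization step for a single local decision.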

The tagger was originally written by Kristina Toutanova. Since that time, Dan Klein, Christopher Manning, William Morgan, Anna Rafferty, Michel Galley, and John Bauer have improved its speed, performance, usability, and support for other languages.

The system requires Java 8+ to be installed. Depending on whether you're running 32 or 64 bit Java and the complexity of the tagger model, you'll need somewhere between 60 and 200 MB of memory to run a trained tagger (i.e., you may need to give Java an option like java -mx200m). Training a tagger needs considerably more memory. The amount again depends on the complexity of the model, but at least 1 GB is usually needed, often more.
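
As a rough way to check whether the running JVM was given enough heap, the following sketch compares the JVM's reported maximum heap with the 200 MB upper figure quoted above. The class and method names are illustrative, and the JVM may report slightly less than the value passed with -mx, so treat the result as a hint rather than a guarantee.

```java
// Rough heap check against the memory figures quoted in the text.
// Class and method names are illustrative only.
class HeapCheck {

    // 200 MB: the upper end of the range quoted for running a trained tagger.
    static final long SUGGESTED_RUN_BYTES = 200L * 1024 * 1024;

    // True if the available maximum heap is at least the required amount.
    static boolean heapLooksSufficient(long maxHeapBytes, long requiredBytes) {
        return maxHeapBytes >= requiredBytes;
    }

    public static void main(String[] args) {
        // maxMemory() is the most heap the JVM will try to use; it can be
        // somewhat lower than the -mx setting depending on the garbage collector.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %d MB%n", maxHeap / (1024 * 1024));
        if (!heapLooksSufficient(maxHeap, SUGGESTED_RUN_BYTES)) {
            System.out.println("Heap may be too small; consider an option like -mx200m.");
        }
    }
}
```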

Current downloads contain three trained tagger models for English, two each for Chinese and Arabic, and one each for French, German, and Spanish. The tagger can be retrained on any language, given POS-annotated training text for the language.

Part-of-speech name abbreviations: The English taggers use the Penn Treebank tag set. Here are some links to documentation of the Penn Treebank English POS tag set: 1993 Computational Linguistics article in PDF, Chameleon Metadata list (which includes recent additions to the set). The French, German, and Spanish models all use the UD (v2) tagset. See the included README-Models.txt in the models directory for more information about the tagset for each language.
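
For orientation, here is a small lookup of a few common Penn Treebank tags and their meanings. It is only a sample; the class name is made up for this example, and the documentation linked above remains the authoritative list.

```java
import java.util.HashMap;
import java.util.Map;

// A small, incomplete glossary of Penn Treebank POS tags.
// See the tag set documentation for the full list.
class PennTagGlossary {

    static final Map<String, String> GLOSSARY = new HashMap<>();
    static {
        GLOSSARY.put("DT", "determiner");
        GLOSSARY.put("IN", "preposition or subordinating conjunction");
        GLOSSARY.put("JJ", "adjective");
        GLOSSARY.put("NN", "noun, singular or mass");
        GLOSSARY.put("NNS", "noun, plural");
        GLOSSARY.put("NNP", "proper noun, singular");
        GLOSSARY.put("RB", "adverb");
        GLOSSARY.put("VB", "verb, base form");
        GLOSSARY.put("VBD", "verb, past tense");
        GLOSSARY.put("VBZ", "verb, 3rd person singular present");
    }

    // Returns a description, or a marker string for tags missing from this sample.
    static String describe(String tag) {
        return GLOSSARY.getOrDefault(tag, "not in this sample: " + tag);
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"DT", "NNS", "VBZ", "XYZ"}) {
            System.out.println(tag + "\t" + describe(tag));
        }
    }
}
```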

The tagger code is dual licensed (in a similar manner to MySQL, etc.). The tagger is licensed under the GNU General Public License (v2 or later), which allows many free uses. Source is included. The package includes components for command-line invocation, running as a server, and a Java API. For distributors of proprietary software, commercial licensing is available. If you don't need a commercial license, but would like to support maintenance of these tools, we welcome gift funding.

Questions

For documentation, first take a look at the included README.txt.

Galal Aly wrote a tagging tutorial focused on usage in Java with Eclipse.

For more details, look at our included javadocs, particularly the javadoc for MaxentTagger.

There is a FAQ.

Matthew Jockers kindly produced an example and tutorial for running the tagger. This particularly concentrates on command-line usage with XML and (Mac OS X) xGrid.

Have a support question? Ask us on Stack Overflow using the tag stanford-nlp.

Feedback and bug reports / fixes can be sent to our mailing lists.

Recipes

Tag text from a file text.txt, producing tab-separated-column output:

java -cp "*" edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/english-left3words-distsim.tagger -textFile text.txt -outputFormat tsv -outputFile text.tag
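
The resulting file can then be read back in Java. The sketch below assumes the tsv output has one token per line in the form word, tab, tag, with a blank line between sentences; check your own output file before relying on that layout. The class names are illustrative and not part of the tagger's API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Reads tab-separated tagger output into sentences of (word, tag) pairs.
// Assumed layout: "word<TAB>tag" per line, blank line between sentences.
class TsvTagReader {

    static final class TaggedToken {
        final String word;
        final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    static List<List<TaggedToken>> parse(List<String> lines) {
        List<List<TaggedToken>> sentences = new ArrayList<>();
        List<TaggedToken> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // A blank line closes the current sentence, if any.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                throw new IllegalArgumentException("Expected word<TAB>tag, got: " + line);
            }
            current.add(new TaggedToken(cols[0], cols[1]));
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        // With a path argument (e.g. text.tag) read that file; otherwise use sample lines.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)
                : Arrays.asList("The\tDT", "dog\tNN", "barks\tVBZ", "", "Run\tVB");
        List<List<TaggedToken>> sentences = parse(lines);
        for (List<TaggedToken> sentence : sentences) {
            StringBuilder sb = new StringBuilder();
            for (TaggedToken t : sentence) {
                sb.append(t.word).append('/').append(t.tag).append(' ');
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```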

Mailing Lists

We have 3 mailing lists for the Stanford POS Tagger, all of which are shared with other JavaNLP tools (with the exception of the parser). Each address is at @lists.stanford.edu:

java-nlp-user This is the best list to post to in order to send feature requests, make announcements, or for discussion among JavaNLP users. (Please ask support questions on Stack Overflow using the stanford-nlp tag.)
You have to subscribe to be able to use this list. Join the list via this webpage or by emailing java-nlp-user-join@lists.stanford.edu. (Leave the subject and message body empty.) You can also look at the list archives.
java-nlp-announce This list will be used only to announce new versions of Stanford JavaNLP tools. So it will be very low volume (expect 1-3 messages a year). Join the list via this webpage or by emailing java-nlp-announce-join@lists.stanford.edu. (Leave the subject and message body empty.)
java-nlp-support This list goes only to the software maintainers. It's a good address for licensing questions, etc. For general use and support questions, you're better off joining and using java-nlp-user. You cannot join java-nlp-support, but you can mail questions to java-nlp-support@lists.stanford.edu.

Download

Download Stanford Tagger version 4.2.0 [75 MB]

The full download is a 75 MB zipped file including models for English, Arabic, Chinese, French, Spanish, and German. If you unpack the zip file, you should have everything needed. This software provides a GUI demo, a command-line interface, and an API. Simple scripts are included to invoke the tagger. For more information on use, see the included README.txt.

Extensions

Other models for the Stanford Tagger

Twitter English: An English Twitter POS tagger model is available by Leon Derczynski and others at Sheffield.

Packages for using the Stanford POS tagger from other programming languages (by other people)

Docker: Cuzzo Yahn provides a docker image for the Stanford POS tagger with the XMLRPC service (docker registry).
F#/C#/.NET: Sergey Tihon has ported the Stanford POS tagger to F# (.NET), using IKVM. See his blog post.
GATE: GATE includes a Stanford POS tagger plugin, and the GATE team at the University of Sheffield produced a Twitter tagger model and tagged data set compatible with version 3.3.1.
Go: Kamil Drążkiewicz wrote Go-Stanford-NLP as an interface to the Stanford POS tagger.
Javascript (node.js): Cuzzo Yahn wrote a node.js client for interacting with the Stanford POS tagger, using the XML-RPC service (npm page). Ralf Engelschall wrote another: Stanford-POSTagger.
Matlab: József Vass makes available on GitHub a good package for using the Stanford POS Tagger in Matlab. Earlier, Utkarsh Upadhyay also provided a Matlab function for accessing the Stanford POS tagger. But note that it loads the tagger each time it is called, and you don't want to do that! You should load the tagger only once and then re-use it. Rojbir Pabla also contributed a simple script, which is on the MathWorks site.
PHP: Patrick Schur in 2017 wrote a PHP wrapper for the Stanford POS and NER taggers. Also on packagist. Other choices: PHP wrapper by Anthony Gentile; PHP wrapper by Charles Hays (on GitHub).
Python: 2020s advice: You should always use a Python interface to the CoreNLPServer for performant use in Python. For NLTK, use the nltk.parse.corenlp module. Historically, NLTK (2.0+) contains an interface to the Stanford POS tagger. The original version was written by Nitin Madnani: documentation (note: in old versions, manually set the character encoding or you get ASCII!), code, on Github. After a while there was a better CoreNLPPOSTagger class.
Ruby: tiendung has written a Ruby Binding for the Stanford POS tagger and Named Entity Recognizer.
XML-RPC: Ali Afshar wrote an XML-RPC service interface to the Stanford POS tagger.
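
The advice in the Matlab entry above (load the tagger only once and then re-use it) applies to any wrapper. A minimal Java sketch of that pattern follows. The LoadOnce class is hypothetical and not part of the tagger; in real use the loader would wrap the expensive model load, while here a string stands in for the model.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper: runs an expensive loader at most once and caches the result.
// The loader must not return null, or it will be invoked again on the next call.
class LoadOnce<T> implements Supplier<T> {

    private final Supplier<T> loader;
    private volatile T value;

    LoadOnce(Supplier<T> loader) {
        this.loader = loader;
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {
            // Double-checked locking so concurrent callers trigger only one load.
            synchronized (this) {
                result = value;
                if (result == null) {
                    result = loader.get();
                    value = result;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        // Stand-in for loading a tagger model from disk.
        LoadOnce<String> model = new LoadOnce<>(() -> {
            loads.incrementAndGet();
            return "stand-in model";
        });
        model.get();
        model.get();
        model.get();
        System.out.println("number of loads: " + loads.get());
    }
}
```

Wrappers that start a new Java process per call cannot benefit from this. For those, a long-running service, such as the server mode mentioned above, is the usual way to pay the loading cost only once.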

Release History

Version	Date	Description	Downloads
4.2.0	2020-11-17	Add currency data for English models	Full
4.1.0	2020-08-06	Missing tagger extractor class added, Spanish tokenization improvements	Full
4.0.0	2020-04-19	Model tokenization updated to UDv2.0	Full
3.9.2	2018-10-16	New English models, better currency symbol handling	English / Full
3.9.1	2018-02-27	New French UD model	English / Full
3.8.0	2017-06-09	New Spanish and French UD models	English / Full
3.7.0	2016-10-31	Update for compatibility, German UD model	English / Full
3.6.0	2015-12-09	Updated for compatibility	English / Full
3.5.2	2015-04-20	Updated for compatibility	English / Full
3.5.1	2015-01-29	General bugfixes	English / Full
3.5.0	2014-10-26	Upgrade to Java 8	English / Full
3.4.1	2014-08-27	Add Spanish model	English / Full
3.4	2014-06-16	French model uses CC tagset	English / Full
3.3.1	2014-01-04	Bugfix release	English / Full
3.3.0	2013-11-12	Imperatives included in English model	English / Full
3.2.0	2013-06-20	Improved speed & size of all models	English / Full
3.1.5	2013-04-04	ctb7 model, -nthreads option, improved speed	English / Full
3.1.4	2012-11-11	Improved Chinese model	English / Full
3.1.3	2012-07-09	Minor bug fixes	English / Full
3.1.2	2012-05-22	Included some "tech" words in the latest model	English / Full
3.1.1	2012-03-09	Caseless models added for English	English / Full
3.1.0	2012-01-06	French tagger added, tagging speed improved	English / Full
3.0.4	2011-09-14	Compatible with other recent Stanford releases.	English / Full
3.0.3	2011-06-19	Compatible with other recent Stanford releases.	English / Full
3.0.2	2011-05-15	Addition of TSV input format.	English / Full
3.0.1	2011-04-20	Faster Arabic and German models. Compatible with other recent Stanford releases.	English / Full
3.0	2010-05-21	Tagger is now re-entrant. New tagger objects are loaded with tagger = new MaxentTagger(path) and then used with tagger.tagMethod...	English / Full
2.0	2009-12-24	An order of magnitude faster, slightly more accurate best model, more options for training and deployment.	English / Full
1.6	2008-09-28	A fraction better, a fraction faster, more flexible model specification, and quite a few less bugs.	English / Full
1.5.1	2008-06-06	Tagger properties are now saved with the tagger, making taggers more portable; tagger can be trained off of treebank data or tagged text; fixes classpath bugs in 2 June 2008 patch; new foreign language taggers released on 7 July 2008 and packaged with 1.5.1.	English / Full / Updated models
1.5	2008-05-21	Added taggers for several languages, support for reading from and writing to XML, better support for changing the encoding, distributional similarity options, and many more small changes; patched on 2 June 2008 to fix a bug with tagging pre-tokenized text.	English / Full
1.0	2006-01-10	First cleaned-up release after Kristina graduated.	Old School
0.1	2004-08-16	First release.