A Java 8 implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.
The implementation is based on the Python one from https://github.com/aneesha/RAKE (though some changes have been made). The source code is released under the GPL v3 license.
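As background, the core idea of RAKE can be sketched in a few lines of plain Java. This is only an illustration of the scoring described in Rose et al. (2010) — candidate phrases are split on stopwords, and each phrase is scored as the sum of deg(w)/freq(w) over its words. It is not this library's API, and it omits the POS-based filtering this implementation relies on.

```java
import java.util.*;

// Minimal sketch of RAKE scoring: NOT this library's API.
public class RakeSketch {

    public static Map<String, Double> extract(String text, Set<String> stopWords) {
        // Split the text into candidate phrases at stopwords and punctuation
        List<List<String>> phrases = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String raw : text.toLowerCase().split("[^a-z0-9]+")) {
            if (raw.isEmpty() || stopWords.contains(raw)) {
                if (!current.isEmpty()) {
                    phrases.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(raw);
            }
        }
        if (!current.isEmpty()) {
            phrases.add(current);
        }

        // freq(w): word occurrences; deg(w): sum of lengths of phrases containing w
        Map<String, Integer> freq = new HashMap<>();
        Map<String, Integer> deg = new HashMap<>();
        for (List<String> phrase : phrases) {
            for (String w : phrase) {
                freq.merge(w, 1, Integer::sum);
                deg.merge(w, phrase.size(), Integer::sum);
            }
        }

        // Phrase score = sum over its words of deg(w) / freq(w)
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> phrase : phrases) {
            double s = 0.0;
            for (String w : phrase) {
                s += deg.get(w) / (double) freq.get(w);
            }
            scores.merge(String.join(" ", phrase), s, Math::max);
        }
        return scores;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(
                Arrays.asList("of", "a", "the", "and", "in", "over"));
        Map<String, Double> result = extract(
                "Compatibility of systems of linear constraints over the set of natural numbers",
                stop);
        // Prints each candidate phrase with its score
        System.out.println(result);
    }
}
```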
Add this repository to your pom.xml if you want to use it with Maven:
```xml
<repository>
    <id>galan-maven-repo</id>
    <name>galan-maven-repo-releases</name>
    <url>http://galan.ehu.es/artifactory/ext-release-local</url>
</repository>
```
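You will also need a `<dependency>` entry pointing at the artifact itself. The coordinates below are purely hypothetical placeholders — the README does not state the actual groupId, artifactId, or version, so check the repository above for the real values:

```xml
<dependency>
    <!-- Hypothetical coordinates: replace with the real ones from the repository -->
    <groupId>edu.ehu.galan</groupId>
    <artifactId>rake</artifactId>
    <version>VERSION</version>
</dependency>
```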
This implementation requires a POS tagger in order to work. For English, for example, the Illinois POS Tagger can be used:
http://cogcomp.cs.illinois.edu/page/software_view/POS
For Spanish or other languages:
FreeLing --> http://nlp.lsi.upc.edu/freeling/
or the Stanford POS Tagger --> http://nlp.stanford.edu/software/tagger.shtml
The implementation is in a beta state.
TODO:
- More testing
Here is an example parser for English that provides the required data (using the Illinois POS Tagger):
```java
import LBJ2.nlp.SentenceSplitter;
import LBJ2.nlp.WordSplitter;
import LBJ2.nlp.seg.PlainToTokenParser;
import LBJ2.parse.Parser;
import edu.illinois.cs.cogcomp.lbj.chunk.Chunker;
import edu.illinois.cs.cogcomp.lbj.pos.POSTagger;
import edu.ehu.galan.cvalue.model.Token;

......

List<LinkedList<Token>> tokenizedSentenceList = new ArrayList<>();
List<String> sentenceList = new ArrayList<>();
POSTagger tagger = new POSTagger();
Chunker chunker = new Chunker();
boolean first = true;
Parser parser = new PlainToTokenParser(new WordSplitter(new SentenceSplitter(pFile)));
String sentence = "";
LinkedList<Token> tokenList = null;
for (LBJ2.nlp.seg.Token word = (LBJ2.nlp.seg.Token) parser.next(); word != null;
        word = (LBJ2.nlp.seg.Token) parser.next()) {
    String chunked = chunker.discreteValue(word);
    tagger.discreteValue(word);
    if (first) {
        // Start a new token list for each sentence
        tokenList = new LinkedList<>();
        tokenizedSentenceList.add(tokenList);
        first = false;
    }
    tokenList.add(new Token(word.form, word.partOfSpeech, null, chunked));
    sentence = sentence + " " + word.form;
    if (word.next == null) {
        // End of sentence reached
        sentenceList.add(sentence);
        first = true;
        sentence = "";
    }
}
parser.reset();
```
Then RAKE can be run:
```java
// Read and tokenize the source text (English example using the LBJ reader)
PlainTextDocumentReaderLBJEn parser = new PlainTextDocumentReaderLBJEn();
parser.readSource("testCorpus/textAstronomy");

// Build the document from the sentence and token lists produced by the parser
Document doc = new Document(full_path, name);
doc.setSentenceList(sentences);
doc.setTokenList(tokenized_sentences);

// Configure RAKE with a stopword list and a punctuation stopword list
RakeAlgorithm ex = new RakeAlgorithm();
ex.loadStopWordsList("resources/lite/stopWordLists/RakeStopLists/SmartStopListEn");
ex.loadPunctStopWord("resources/lite/stopWordLists/RakeStopLists/RakePunctDefaultStopList");

// Run the algorithm and retrieve the extracted keywords
ex.init(doc);
ex.runAlgorithm();
doc.getTermList();
```