forked from Neuw84/RAKE-Java

A Java implementation of the Rapid Automatic Keyword Extraction Framework ( RAKE )

simicon/RAKE-Java

RAKE-Java

A Java 8 implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.

The implementation is based on the Python one at https://github.com/aneesha/RAKE, with some changes. The source code is released under the GPL v3 license.
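For readers unfamiliar with RAKE, the scoring idea from the paper can be sketched in a few lines. This is only an illustrative sketch, not the code in this repository: the class `RakeSketch`, its method names, and the tiny stop word list are hypothetical, and it does not use a POS tagger. It splits the text into candidate phrases at stop words and punctuation, then scores each phrase as the sum of degree(word) / frequency(word) over its words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of RAKE-Java.
public class RakeSketch {

    /** Splits text into candidate phrases at stop words and punctuation. */
    static List<List<String>> candidatePhrases(String text, Set<String> stopWords) {
        List<List<String>> phrases = new ArrayList<>();
        // Punctuation always ends a phrase, so split on it first.
        for (String fragment : text.toLowerCase().split("[\\p{Punct}\\n]+")) {
            List<String> current = new ArrayList<>();
            for (String word : fragment.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                if (stopWords.contains(word)) {
                    // A stop word closes the phrase collected so far.
                    if (!current.isEmpty()) {
                        phrases.add(current);
                    }
                    current = new ArrayList<>();
                } else {
                    current.add(word);
                }
            }
            if (!current.isEmpty()) {
                phrases.add(current);
            }
        }
        return phrases;
    }

    /** Scores each phrase as the sum of degree(w) / frequency(w) over its words. */
    static Map<String, Double> score(List<List<String>> phrases) {
        Map<String, Integer> freq = new HashMap<>();
        Map<String, Integer> degree = new HashMap<>();
        for (List<String> phrase : phrases) {
            for (String word : phrase) {
                freq.merge(word, 1, Integer::sum);
                // Degree counts co-occurrences within the phrase, the word itself included.
                degree.merge(word, phrase.size(), Integer::sum);
            }
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> phrase : phrases) {
            double total = 0.0;
            for (String word : phrase) {
                total += (double) degree.get(word) / freq.get(word);
            }
            scores.put(String.join(" ", phrase), total);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy stop word list, for illustration only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("of", "the", "a", "is", "for", "and"));
        String text = "Keyword extraction is the task of finding key phrases. "
                + "Keyword extraction helps search.";
        score(candidatePhrases(text, stopWords))
                .forEach((phrase, s) -> System.out.println(phrase + " -> " + s));
    }
}
```

Because a word's degree grows with the length of the phrases it appears in, longer candidate phrases tend to receive higher scores than single words. Splitting on every punctuation character also breaks hyphenated words, which a real implementation would probably want to handle differently.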

To use it with Maven, add this repository to your pom.xml:

    <repository>
        <id>galan-maven-repo</id>
        <name>galan-maven-repo-releases</name>
        <url>http://galan.ehu.es/artifactory/ext-release-local</url>
    </repository>

This implementation requires a POS tagger in order to work. For English, for example, the Illinois POS tagger can be used:

http://cogcomp.cs.illinois.edu/page/software_view/POS

For Spanish or other languages:

FreeLing --> http://nlp.lsi.upc.edu/freeling/

or the Stanford POS tagger --> http://nlp.stanford.edu/software/tagger.shtml

The implementation is in beta.

TODO:

 - More testing 

An example parser for English that provides the required data (using the Illinois POS tagger):

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import LBJ2.nlp.SentenceSplitter;
    import LBJ2.nlp.WordSplitter;
    import LBJ2.nlp.seg.PlainToTokenParser;
    import LBJ2.parse.Parser;
    import edu.illinois.cs.cogcomp.lbj.chunk.Chunker;
    import edu.illinois.cs.cogcomp.lbj.pos.POSTagger;
    import edu.ehu.galan.cvalue.model.Token;
    ......
    // One token list per sentence, plus the plain text of each sentence
    List<LinkedList<Token>> tokenizedSentenceList = new ArrayList<>();
    List<String> sentenceList = new ArrayList<>();
    POSTagger tagger = new POSTagger();
    Chunker chunker = new Chunker();
    boolean first = true;
    // pFile is the input file handed to the sentence splitter
    Parser parser = new PlainToTokenParser(new WordSplitter(new SentenceSplitter(pFile)));
    StringBuilder sentence = new StringBuilder();
    LinkedList<Token> tokenList = null;
    for (LBJ2.nlp.seg.Token word = (LBJ2.nlp.seg.Token) parser.next(); word != null;
            word = (LBJ2.nlp.seg.Token) parser.next()) {
        String chunked = chunker.discreteValue(word);
        tagger.discreteValue(word);
        if (first) {
            // First word of a sentence: start a new token list
            tokenList = new LinkedList<>();
            tokenizedSentenceList.add(tokenList);
            first = false;
        }
        tokenList.add(new Token(word.form, word.partOfSpeech, null, chunked));
        sentence.append(" ").append(word.form);
        if (word.next == null) {
            // Last word of the sentence: store the text and reset
            sentenceList.add(sentence.toString());
            first = true;
            sentence.setLength(0);
        }
    }
    parser.reset();
 

RAKE can then be run on the parsed document:

    // full_path and name identify the source document
    Document doc = new Document(full_path, name);
    // Sentences and token lists produced by the parser
    doc.setSentenceList(sentenceList);
    doc.setTokenList(tokenizedSentenceList);
    RakeAlgorithm ex = new RakeAlgorithm();
    ex.loadStopWordsList("resources/lite/stopWordLists/RakeStopLists/SmartStopListEn");
    ex.loadPunctStopWord("resources/lite/stopWordLists/RakeStopLists/RakePunctDefaultStopList");
    PlainTextDocumentReaderLBJEn parser = new PlainTextDocumentReaderLBJEn();
    parser.readSource("testCorpus/textAstronomy");
    ex.init(doc);
    ex.runAlgorithm();
    // The extracted terms are stored in the document
    doc.getTermList();
