FSharp.NLP.Stanford.Parser justification or StackOverflow questions understanding.
A few weeks ago I announced FSharp.NLP.Stanford.Parser. Now I want to clarify the goals of the project and show a usage example.
First of all, this is not an attempt to re-implement any functionality of Stanford Parser. It is just a thin layer that aims to simplify interaction with Java collections (especially the Iterable interface) and to bring the power of F# constructs (like pattern matching and discriminated unions) to code that deals with tagging results.
Task
Let’s start with a sample NLP task: we want to show related questions before a user asks a new one (as StackOverflow does). There are many possible solutions. Let’s look at one that first tries to extract the key phrases that identify the question and then runs a search using them.
Approach
First of all, let’s choose some real questions from StackOverflow to analyze:
- How to make an F# project work with the object browser
- How can I build WebSharper on Mono 3.0 on Mac?
- Adding extra methods as type extensions in F#
- How to get MonoDevelop to compile F# projects?
Now we can use Stanford Parser GUI to visualize the structure of these questions:
Notice that all the phrases we selected are parts of noun phrases (NP). As a first solution, we can analyze the tags in the tree and select the NP nodes that contain word-level noun tags (NN, NNS, NNP, NNPS).
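The selection rule described above can be sketched without Stanford Parser at all. In this minimal illustration the `Node` record, the helper functions and the hand-built tree are all assumptions made for the example; they are not part of the parser's API, and the tree is not real parser output.

```fsharp
// Hypothetical, simplified parse tree: a label plus child nodes.
// Words (leaves) are nodes without children.
type Node = { Label : string; Children : Node list }

let leaf word = { Label = word; Children = [] }
let node label children = { Label = label; Children = children }

// Word-level noun tags from the Penn Treebank tag set.
let nounTags = set [ "NN"; "NNS"; "NNP"; "NNPS" ]

// An NP qualifies when at least one direct child carries a noun tag.
let isNPwithNoun (n : Node) =
    n.Label = "NP"
    && n.Children |> List.exists (fun c -> nounTags.Contains c.Label)

// Collect all qualifying NP nodes, visiting a node before its children.
let rec keyPhrases (n : Node) : Node list =
    [ if isNPwithNoun n then yield n
      for c in n.Children do
          yield! keyPhrases c ]

// All words under a node, left to right.
let rec words (n : Node) : string list =
    match n.Children with
    | [] -> [ n.Label ]
    | cs -> cs |> List.collect words

// Hand-built tree for "compile F# projects" (assumed shape).
let sample =
    node "VP"
        [ node "VB" [ leaf "compile" ]
          node "NP"
              [ node "NN" [ leaf "F" ]
                node "#" [ leaf "#" ]
                node "NNS" [ leaf "projects" ] ] ]

sample |> keyPhrases |> List.map words |> printfn "%A"
```

The script in the Solution section applies the same rule to real `Tree` objects returned by the parser, where children have to be read through Java collections.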
Solution
#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.Runtime.dll"
#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.OpenJDK.Core.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\ejml-0.19-nogui.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\stanford-parser.dll"
open edu.stanford.nlp.parser.lexparser
open edu.stanford.nlp.trees
open System
let model = @"d:\englishPCFG.ser.gz";
let options = [|"-maxLength"; "500";"-retainTmpSubcategories"; "-MAX_ITEMS"; "500000";"-outputFormat"; "penn,typedDependenciesCollapsed"|]
let lp = LexicalizedParser.loadModel(model, options)
let tlp = PennTreebankLanguagePack();
let gsf = tlp.grammaticalStructureFactory();
open java.util
let toSeq (iter:Iterator) =
    // Check hasNext() before every next() so that an empty iterator
    // produces an empty sequence instead of throwing.
    seq {
        while iter.hasNext() do
            yield iter.next()
    }
let getTree question =
    let toke = tlp.getTokenizerFactory().getTokenizer(new java.io.StringReader(question));
    let sentence = toke.tokenize();
    lp.apply(sentence)
let getKeyPhrases (tree:Tree) =
    let isNPwithNNx (node:Tree)=
        if (node.label().value() <> "NP") then false
        else
            node.getChildrenAsList().iterator()
            |> toSeq
            |> Seq.cast<Tree>
            |> Seq.exists (fun x->
                let y = x.label().value()
                y = "NN" || y = "NNS" || y = "NNP" || y = "NNPS")
    let rec foldTree acc (node:Tree) =
        let acc =
            if (node.isLeaf()) then acc
            else
                node.getChildrenAsList().iterator()
                |> toSeq
                |> Seq.cast<Tree>
                |> Seq.fold (fun state x -> foldTree state x) acc
        if isNPwithNNx node
        then node :: acc
        else acc
    foldTree [] tree
let questions =
    [|"How to make an F# project work with the object browser";
      "How can I build WebSharper on Mono 3.0 on Mac?";
      "Adding extra methods as type extensions in F#";
      "How to get MonoDevelop to compile F# projects?"|]
questions
|> Seq.iter (fun question ->
    printfn "Question : %s" question
    question
    |> getTree
    |> getKeyPhrases
    |> List.rev
    |> List.iter (fun p ->
        p.getLeaves().iterator()
        |> toSeq
        |> Seq.cast<Tree>
        |> Seq.map(fun x-> x.label().value())
        |> Seq.toArray
        |> printfn "\t%A")
)
If you run this script, you will see the following:
Question : How to make an F# project work with the object browser
    [|"an"; "F"; "#"; "project"; "work"|]
    [|"the"; "object"; "browser"|]
Question : How can I build WebSharper on Mono 3.0 on Mac?
    [|"WebSharper"|]
    [|"Mono"; "3.0"|]
    [|"Mac"|]
Question : Adding extra methods as type extensions in F#
    [|"extra"; "methods"|]
    [|"type"; "extensions"|]
    [|"F"; "#"|]
Question : How to get MonoDevelop to compile F# projects?
    [|"MonoDevelop"|]
    [|"F"; "#"; "projects"|]
This is almost what we expected. The results are good enough, but we can simplify the code and make it more readable using FSharp.NLP.Stanford.Parser.
#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.Runtime.dll"
#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.OpenJDK.Core.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\ejml-0.19-nogui.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\stanford-parser.dll"
#r @"..\packages\FSharp.NLP.Stanford.Parser.0.0.3\lib\FSharp.NLP.Stanford.Parser.dll"
open edu.stanford.nlp.parser.lexparser
open edu.stanford.nlp.trees
open System
open FSharp.IKVM.Util
open FSharp.NLP.Stanford.Parser
let model = @"d:\englishPCFG.ser.gz";
let options = [|"-maxLength"; "500";"-retainTmpSubcategories"; "-MAX_ITEMS"; "500000";"-outputFormat"; "penn,typedDependenciesCollapsed"|]
let lp = LexicalizedParser.loadModel(model, options)
let tlp = PennTreebankLanguagePack();
let gsf = tlp.grammaticalStructureFactory();
let getTree question =
    let toke = tlp.getTokenizerFactory().getTokenizer(new java.io.StringReader(question));
    let sentence = toke.tokenize();
    lp.apply(sentence)
let getKeyPhrases (tree:Tree) =
    let isNNx = function
        | Label NN | Label NNS | Label NNP | Label NNPS -> true
        | _ -> false
    let isNPwithNNx = function
        | Label NP as node
            when node.getChildrenAsList() |> Iterable.castToSeq<Tree> |> Seq.exists isNNx
            -> true
        | _ -> false
    let rec foldTree acc (node:Tree) =
        let acc =
            if (node.isLeaf()) then acc
            else
                node.getChildrenAsList()
                |> Iterable.castToSeq<Tree>
                |> Seq.fold (fun state x -> foldTree state x) acc
        if isNPwithNNx node
        then node :: acc
        else acc
    foldTree [] tree
let questions =
    [|"How to make an F# project work with the object browser";
      "How can I build WebSharper on Mono 3.0 on Mac?";
      "Adding extra methods as type extensions in F#";
      "How to get MonoDevelop to compile F# projects?"|]
questions
|> Seq.iter (fun question ->
    printfn "Question : %s" question
    question
    |> getTree
    |> getKeyPhrases
    |> List.rev
    |> List.iter (fun p ->
        p.getLeaves()
        |> Iterable.castToArray<Tree>
        |> Array.map(fun x-> x.label().value())
        |> printfn "\t%A")
)
Look more closely at the getKeyPhrases function. All tags are now strongly typed. You can be sure that you will not mistype a tag name, and the code is more readable and self-explanatory:
[Image: STTags]
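To illustrate how strongly typed tags and a `Label` pattern can fit together, here is a minimal sketch. It is not the library's actual implementation: the `Tag` union below covers only a hypothetical subset of tags, and this `Label` pattern works on plain strings, whereas the script above applies it to `Tree` nodes.

```fsharp
// Hypothetical subset of Penn Treebank tags as a discriminated union.
type Tag =
    | NP
    | NN
    | NNS
    | NNP
    | NNPS
    | Other of string

// Map a raw label string to a typed tag.
let parseTag (s : string) : Tag =
    match s with
    | "NP" -> NP
    | "NN" -> NN
    | "NNS" -> NNS
    | "NNP" -> NNP
    | "NNPS" -> NNPS
    | other -> Other other

// Total active pattern: lets a string label be matched as a typed tag.
let (|Label|) (s : string) : Tag = parseTag s

// Same shape as isNNx in the script, but over strings.
let isNoun = function
    | Label NN | Label NNS | Label NNP | Label NNPS -> true
    | _ -> false

[ "NN"; "VB"; "NNPS" ] |> List.map isNoun |> printfn "%A"
```

With this design the string comparison lives in one place (`parseTag`), and the rest of the code matches on union cases instead of string literals.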
Published by Sergey Tihon
8 thoughts on “FSharp.NLP.Stanford.Parser justification or StackOverflow questions understanding.”
Hi Sergey, amazing work getting NLP as a NuGet package, it’s so easy to use now. Can you help me get “function tags” working? E.g. I get (NP (NN yesterday)) for “yesterday”, but I have seen some people get (NP-TMP (NN yesterday)), showing its temporal function. I am using this C# code:
static LexicalizedParser lp = LexicalizedParser.loadModel("c:\\englishPCFG.ser.gz");
public static string Parse(string sent)
{
    CoreLabelTokenFactory cltf = new CoreLabelTokenFactory();
    TokenizerFactory tokenizerFactory = PTBTokenizer.factory(cltf, "");
    StringReader sent2Reader = new StringReader(sent);
    List rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize();
    Tree parse = lp.apply(rawWords2);
    string output = parse.pennString();
    return output;
}
Thanks!
Hi, this is the temporal NP feature of Stanford Parser. You need to call LexicalizedParser.loadModel with the "-retainTmpSubcategories" option (as my samples do).
More about this is here: http://nlp.stanford.edu/software/parser-faq.shtml#s
Hi Mr. Tihon,
I’m interested in the phrase chunking extension to Stanford Parser in this article. Unfortunately, I’ve never programmed in F#, and I still have problems understanding lambda expressions after going through some basic tutorials. Do you have the solution in C# by any chance?
Thanks,
Ellen
Hello, all C# samples are available here: http://sergey-tihon.github.io/Stanford.NLP.NET/
I’m looking specifically into the translation of the following lambda expression in C#:
– toSeq (iter:iterator)
– getKeyPhrases (tree: Tree)
I did try to go through C# samples posted in GitHub to find the matching methods, but I failed to find them. Could you kindly link me to the exact location if the source is publicly available?
Sorry, but I do not have C# equivalents for these methods.
toSeq converts a Java iterator to a .NET IEnumerable. It should be easy to rewrite it in C#.
getKeyPhrases will be a bit harder to rewrite: it is so short and simple thanks to the power of F#.