I have the following scenario: I create a Lucene document from a potentially very large text. Apart from indexing the text, I perform some analysis on the document, for which I need the document's term frequency vector. The results of this analysis also need to be stored in the Lucene document/index. Here is my current approach:
With the following method I compute the text's term vector (_analyzer is an instance of some Lucene.Net.Analysis.Analyzer):
public Dictionary<string, int> GetTermVector(string text)
{
    var termVector = new Dictionary<string, int>();
    using (var stringReader = new StringReader(text))
    {
        var tokenStream = _analyzer.TokenStream("", stringReader);
        var charTermAttribute = tokenStream.GetAttribute<ITermAttribute>();
        while (tokenStream.IncrementToken())
        {
            var term = charTermAttribute.Term;
            if (termVector.ContainsKey(term)) termVector[term]++;
            else termVector.Add(term, 1);
        }
        return termVector;
    }
}
The above method is used by a method that sends the 500 most frequent terms to a web service and returns the results.
public static class CategorizationService
{
    private static ScoringServiceClient _service;

    public static Dictionary<Guid, double> Categorize(string text, Language language)
    {
        var tokenizer = language == Language.English
            ? new StringTokenizer(new PrimaryAnalyzer())
            : new StringTokenizer(new PrimaryAnalyzer("German"));
        var termVector = tokenizer
            .GetTermVector(text)
            .OrderByDescending(p => p.Value)
            .ToDictionary(p => p.Key, p => p.Value)
            .Take(500);

        // Create string representation of the term vector which is consumed by the webservice
        var termVectorString = termVector.Aggregate(string.Empty, (s, pair) => s + $"{pair.Key};{pair.Value}\n");

        try
        {
            var isoLanguageString = language == Language.English ? "en" : "de";
            if (_service == null) _service = new ScoringServiceClient();
            _service.ClientCredentials.UserName.UserName = Program.UserName;
            _service.ClientCredentials.UserName.Password = Program.UserPassword;
            var categories = _service.Categorize(termVectorString, isoLanguageString);
            return categories;
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);
            _service.SafeDispose();
            _service = null;
            throw;
        }
    }
}
Afterwards, everything is packaged into a Lucene document which I add to the index. The last step is committing the index changes. Here is the method which calls the analysis and creates the document:
var itemContent = ReadItemContent();
var deContent = language == Language.German ? itemContent : string.Empty;
var enContent = language == Language.English ? itemContent : string.Empty;

try
{
    categories = CategorizationService.Categorize(itemContent, language);
    isCategorized = true;
}
catch (Exception) {}

// Transform analysis results into an indexable string
var categoriesString = categories.Aggregate(string.Empty, (seed, categorization) => $"{seed} \r\n {categorization.Key} {categorization.Value}");

var document = new Document();
document.Add(new Field("Id", $"localfile:{Guid.NewGuid()}", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.YES));
// More fields added here

Console.WriteLine("Adding document...");
// In this step, Lucene analyzes the document again
_writer.AddDocument(document);
Console.WriteLine("Committing index...");
_writer.Commit();
Console.WriteLine("Reopening reader...");
_reader = _reader.Reopen();
Console.WriteLine("Reading term vector");
Obviously there is some waste involved in this approach. I am now thinking about how to avoid running the analyzer over the entire text twice (the first time for my own analysis, the second time for indexing the document). That would mean, however, that I would need to index the document first, then perform my own analysis, and then update the document to contain the analysis results. I am not sure whether this leads to double indexing again...
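For reference, the "index first, then read the frequencies back" idea could be sketched as follows, assuming Lucene.Net 3.x and that the analyzed text lives in a field indexed with Field.TermVector.YES; the field name "Content" and the way the document id is obtained are illustrative placeholders, not part of my actual code:

// Hypothetical sketch: read the stored term vector back from the index
// instead of tokenizing the text a second time.
var docId = _reader.MaxDoc - 1; // id of the document just added (illustrative)
var vector = _reader.GetTermFreqVector(docId, "Content");
if (vector != null)
{
    var terms = vector.GetTerms();             // the distinct terms of the field
    var freqs = vector.GetTermFrequencies();   // parallel array of frequencies
    var termVector = new Dictionary<string, int>();
    for (var i = 0; i < terms.Length; i++)
    {
        termVector[terms[i]] = freqs[i];
    }
}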
So what are your ideas to perform this operation flow as efficiently as possible?
Comment: @Heslacher: It returns a Dictionary<Guid, double>. In the method above I simply return the webservice's return value. – Marc, Nov 6, 2015
2 Answers
Seeing Dictionary.ContainsKey() when the value will be used afterwards hurts my eyes. Please use TryGetValue() like so:
public Dictionary<string, int> GetTermVector(string text)
{
    var termVector = new Dictionary<string, int>();
    using (var stringReader = new StringReader(text))
    {
        var tokenStream = _analyzer.TokenStream("", stringReader);
        var charTermAttribute = tokenStream.GetAttribute<ITermAttribute>();
        while (tokenStream.IncrementToken())
        {
            var term = charTermAttribute.Term;
            int value;
            termVector.TryGetValue(term, out value);
            termVector[term] = value + 1;
        }
        return termVector;
    }
}
Comment: Thanks for the hint! Any ideas regarding the question? – Marc, Nov 6, 2015
I think this part can be optimized:
var termVector = tokenizer
    .GetTermVector(text)
    .OrderByDescending(p => p.Value)
    .ToDictionary(p => p.Key, p => p.Value)
    .Take(500);
in a way that you first take the 500 items and then push them into a dictionary, instead of creating a dictionary for the entire collection and then taking only the first 500 items:
var termVector = tokenizer
    .GetTermVector(text)
    .OrderByDescending(p => p.Value)
    .Take(500)
    .ToDictionary(p => p.Key, p => p.Value);
A minor one but perhaps it makes a difference ;-)
Also, this section is not the prettiest:
var tokenizer = language == Language.English
    ? new StringTokenizer(new PrimaryAnalyzer())
    : new StringTokenizer(new PrimaryAnalyzer("German"));
why not simply like this?
var tokenizer = new StringTokenizer(new PrimaryAnalyzer(language.ToString()));
And another one:
var isoLanguageString = language == Language.English ? "en" : "de";
I think you should either create a dictionary for it or create custom attributes for the enum:
enum Language
{
    [TwoLetterIsoCode("en")]
    English,
    [TwoLetterIsoCode("de")]
    German
}
For more information about enums and custom attributes refer to How to get Custom Attribute values for enums? on Stack Overflow.
With a small extension method you could then do the following
var isoLanguageString = language.GetTwoLetterIsoCode();
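A minimal sketch of such an attribute and extension method could look like this (reflection-based and not cached; the names TwoLetterIsoCodeAttribute, LanguageExtensions, and GetTwoLetterIsoCode are illustrative):

// Attribute carrying the two-letter ISO code for an enum member.
[AttributeUsage(AttributeTargets.Field)]
public sealed class TwoLetterIsoCodeAttribute : Attribute
{
    public string Code { get; private set; }
    public TwoLetterIsoCodeAttribute(string code) { Code = code; }
}

public static class LanguageExtensions
{
    // Reads the attribute off the enum member via reflection.
    public static string GetTwoLetterIsoCode(this Language language)
    {
        var member = typeof(Language).GetField(language.ToString());
        var attribute = (TwoLetterIsoCodeAttribute)Attribute.GetCustomAttribute(
            member, typeof(TwoLetterIsoCodeAttribute));
        return attribute.Code;
    }
}

For performance-sensitive code you would want to cache the reflection result, e.g. in a static dictionary keyed by the enum value.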
There is also a line with misleading names in it:
var categoriesString = categories.Aggregate(string.Empty, (seed, categorization) => $"{seed} \r\n {categorization.Key} {categorization.Value}");
You call the lambda parameter seed, but it's not the seed: here string.Empty is the seed, and your seed parameter is the current accumulated value (the accumulator).
public static TAccumulate Aggregate<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)
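Renaming the parameter to match the signature makes the line read correctly; as an alternative sketch, string.Join with Select is arguably clearer for building such a string and avoids the repeated string concatenation (variable names here are illustrative):

// acc is the accumulated string; categorization is the current element.
var categoriesString = categories.Aggregate(
    string.Empty,
    (acc, categorization) => $"{acc} \r\n {categorization.Key} {categorization.Value}");

// Alternative: project each pair to a string and join once.
var joined = string.Join("\r\n",
    categories.Select(c => $"{c.Key} {c.Value}"));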