I have the following scenario: I create a Lucene document from a potentially very large text. Apart from indexing the text, I perform some analysis on the document, for which I need the document's term frequency vector. The results of this analysis also need to be stored in the Lucene document/index. Here is my current approach:
With the following method I compute the text's term vector (_analyzer is an instance of some Lucene.Net.Analysis.Analyzer):
public Dictionary<string, int> GetTermVector(string text)
{
    var termVector = new Dictionary<string, int>();
    using (var stringReader = new StringReader(text))
    {
        var tokenStream = _analyzer.TokenStream("", stringReader);
        var charTermAttribute = tokenStream.GetAttribute<ITermAttribute>();
        while (tokenStream.IncrementToken())
        {
            var term = charTermAttribute.Term;
            if (termVector.ContainsKey(term)) termVector[term]++;
            else termVector.Add(term, 1);
        }
        return termVector;
    }
}
The above method is used by a method that sends the 500 most frequent terms to a web service and returns the results.
public static class CategorizationService
{
    private static ScoringServiceClient _service;

    public static Dictionary<Guid, double> Categorize(string text, Language language)
    {
        var tokenizer = language == Language.English
            ? new StringTokenizer(new PrimaryAnalyzer())
            : new StringTokenizer(new PrimaryAnalyzer("German"));
        var termVector = tokenizer
            .GetTermVector(text)
            .OrderByDescending(p => p.Value)
            .ToDictionary(p => p.Key, p => p.Value)
            .Take(500);

        // Create string representation of the term vector which is consumed by the webservice
        var termVectorString = termVector.Aggregate(string.Empty, (s, pair) => s + $"{pair.Key};{pair.Value}\n");

        try
        {
            var isoLanguageString = language == Language.English ? "en" : "de";
            if (_service == null) _service = new ScoringServiceClient();
            _service.ClientCredentials.UserName.UserName = Program.UserName;
            _service.ClientCredentials.UserName.Password = Program.UserPassword;
            var categories = _service.Categorize(termVectorString, isoLanguageString);
            return categories;
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);
            _service.SafeDispose();
            _service = null;
            throw;
        }
    }
}
Afterwards, everything is packaged into a Lucene document which I add to the index. The last step is committing the index changes. Here is the method which calls the analysis and creates the document:
var itemContent = ReadItemContent();
var deContent = language == Language.German ? itemContent : string.Empty;
var enContent = language == Language.English ? itemContent : string.Empty;

try
{
    categories = CategorizationService.Categorize(itemContent, language);
    isCategorized = true;
}
catch (Exception) {}

// Transform analysis results into an indexable string
var categoriesString = categories.Aggregate(string.Empty, (seed, categorization) => $"{seed} \r\n {categorization.Key} {categorization.Value}");

var document = new Document();
document.Add(new Field("Id", $"localfile:{Guid.NewGuid()}", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.YES));
// More fields added here

Console.WriteLine("Adding document...");
// In this step, Lucene analyzes the document again
_writer.AddDocument(document);
Console.WriteLine("Committing index...");
_writer.Commit();
Console.WriteLine("Reopening reader...");
_reader = _reader.Reopen();
Console.WriteLine("Reading term vector");
Obviously there is some waste involved in this approach. I am now thinking about how to avoid running the analyzer over the entire text twice (the first time for my own analysis, the second time for indexing the document). That would mean, however, that I would need to index the document first, then perform my own analysis, and then update the document to contain the analysis results. I am not sure whether this leads to double indexing again...
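For reference, the "index first, then read the frequencies back" idea could be sketched as follows, assuming Lucene.Net 3.x and that the analyzed text lives in a field indexed with Field.TermVector.YES; the field name "Content" and the way the document id is obtained are illustrative placeholders, not part of my actual code:

// Hypothetical sketch: read the stored term vector back from the index
// instead of tokenizing the text a second time.
var docId = _reader.MaxDoc - 1; // id of the document just added (illustrative)
var vector = _reader.GetTermFreqVector(docId, "Content");
if (vector != null)
{
    var terms = vector.GetTerms();             // the distinct terms of the field
    var freqs = vector.GetTermFrequencies();   // parallel array of frequencies
    var termVector = new Dictionary<string, int>();
    for (var i = 0; i < terms.Length; i++)
    {
        termVector[terms[i]] = freqs[i];
    }
}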
So what are your ideas to perform this operation flow as efficiently as possible?
Comment: @Heslacher: It returns a Dictionary<Guid, double>. In the method above I simply return the webservice's return value. – Marc, Nov 6, 2015
2 Answers
Seeing Dictionary.ContainsKey() when the value will be used afterwards hurts my eyes. Please use TryGetValue() like so:
public Dictionary<string, int> GetTermVector(string text)
{
    var termVector = new Dictionary<string, int>();
    using (var stringReader = new StringReader(text))
    {
        var tokenStream = _analyzer.TokenStream("", stringReader);
        var charTermAttribute = tokenStream.GetAttribute<ITermAttribute>();
        while (tokenStream.IncrementToken())
        {
            var term = charTermAttribute.Term;
            int value;
            termVector.TryGetValue(term, out value);
            termVector[term] = value + 1;
        }
        return termVector;
    }
}
Comment: Thanks for the hint! Any ideas regarding the question? – Marc, Nov 6, 2015
I think this part can be optimized:
var termVector = tokenizer
    .GetTermVector(text)
    .OrderByDescending(p => p.Value)
    .ToDictionary(p => p.Key, p => p.Value)
    .Take(500);
in a way that you first take the 500 items and then push them into a dictionary, instead of creating a dictionary for the entire collection and then taking only the first 500 items:
var termVector = tokenizer
    .GetTermVector(text)
    .OrderByDescending(p => p.Value)
    .Take(500)
    .ToDictionary(p => p.Key, p => p.Value);
A minor one but perhaps it makes a difference ;-)
Also, this section is not the prettiest:
var tokenizer = language == Language.English
    ? new StringTokenizer(new PrimaryAnalyzer())
    : new StringTokenizer(new PrimaryAnalyzer("German"));
why not simply like this?
var tokenizer = new StringTokenizer(new PrimaryAnalyzer(language.ToString()));
And another one:
var isoLanguageString = language == Language.English ? "en" : "de";
I think you should either create a dictionary for it or create custom attributes for the enum:
enum Language
{
    [TwoLetterIsoCode("en")]
    English,
    [TwoLetterIsoCode("de")]
    German
}
For more information about enums and custom attributes refer to How to get Custom Attribute values for enums? on Stack Overflow.
With a small extension method you could then do the following
var isoLanguageString = language.GetTwoLetterIsoCode();
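A minimal sketch of such an attribute and extension method could look like this (reflection-based and not cached; the names TwoLetterIsoCodeAttribute, LanguageExtensions, and GetTwoLetterIsoCode are illustrative):

// Attribute carrying the two-letter ISO code for an enum member.
[AttributeUsage(AttributeTargets.Field)]
public sealed class TwoLetterIsoCodeAttribute : Attribute
{
    public string Code { get; private set; }
    public TwoLetterIsoCodeAttribute(string code) { Code = code; }
}

public static class LanguageExtensions
{
    // Reads the attribute off the enum member via reflection.
    public static string GetTwoLetterIsoCode(this Language language)
    {
        var member = typeof(Language).GetField(language.ToString());
        var attribute = (TwoLetterIsoCodeAttribute)Attribute.GetCustomAttribute(
            member, typeof(TwoLetterIsoCodeAttribute));
        return attribute.Code;
    }
}

For performance-sensitive code you would want to cache the reflection result, e.g. in a static dictionary keyed by the enum value.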
There is also a line with misleading names in it:
var categoriesString = categories.Aggregate(string.Empty, (seed, categorization) => $"{seed} \r\n {categorization.Key} {categorization.Value}");
You call the lambda parameter seed, but it's not the seed: here string.Empty is the seed, and your seed parameter is the current accumulated value (the accumulator).
public static TAccumulate Aggregate<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)
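Renaming the parameter to match the signature makes the line read correctly; as an alternative sketch, string.Join with Select is arguably clearer for building such a string and avoids the repeated string concatenation (variable names here are illustrative):

// acc is the accumulated string; categorization is the current element.
var categoriesString = categories.Aggregate(
    string.Empty,
    (acc, categorization) => $"{acc} \r\n {categorization.Key} {categorization.Value}");

// Alternative: project each pair to a string and join once.
var joined = string.Join("\r\n",
    categories.Select(c => $"{c.Key} {c.Value}"));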