Build a sentence from tokens / words in a String-Array
I'm facing an interesting issue at the moment:
My Situation:
I have (in Java) String arrays like the following (more complicated, of course). Each String array represents one sentence (I can't change the representation):
String[] tokens = {"This", "is", "just", "an", "example", "."};
My Problem:
I want to rebuild the original sentences from these String arrays. This doesn't sound that hard at first, but it becomes really complex because sentence structure has many cases. Sometimes you need whitespace between two tokens and sometimes you don't.
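To illustrate why a plain join isn't enough, here is a minimal sketch (the class and method names are just placeholders for this illustration): joining with single spaces leaves the punctuation detached from the preceding word.

```java
class NaiveJoin {
    // Joins all tokens with a single space, ignoring punctuation rules.
    static String naiveJoin(String[] tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        String[] tokens = {"This", "is", "just", "an", "example", "."};
        // Prints "This is just an example ." (note the space before the period)
        System.out.println(naiveJoin(tokens));
    }
}
```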
My Approach:
I've implemented a method that handles most cases when rebuilding a sentence from its String array. As you can see, it's already quite complicated. It works "okay" for now, but I don't know how to improve it.
import java.util.regex.Pattern;

// Compiled once instead of on every ":" token.
private static final Pattern DIGIT = Pattern.compile("\\d");

public static String detokenize(String[] tokens) {
    StringBuilder sentence = new StringBuilder();
    boolean sentenceInQuotation = false;
    boolean firstWordInQuotationSentence = false;
    boolean firstWordInParenthisis = false;
    boolean date = false;
    for (int i = 0; i < tokens.length; i++) {
        String token = tokens[i];
        if (token.equals(".") || token.equals(";") || token.equals(",") || token.equals("?") || token.equals("!")) {
            // sentence punctuation attaches directly to the previous token
            sentence.append(token);
        }
        else if (token.equals(":")) {
            // a colon after a token containing a digit is treated as part of a time/date,
            // so the next word is appended without a space; i > 0 guards against a leading ":"
            if (i > 0 && DIGIT.matcher(tokens[i - 1]).find()) {
                date = true;
            }
            sentence.append(token);
        }
        else if (token.equals("(")) {
            sentence.append(" ");
            sentence.append(token);
            firstWordInParenthisis = true;
        }
        else if (token.equals(")")) {
            sentence.append(token);
            firstWordInParenthisis = false;
        }
        else if (token.equals("\"")) {
            if (!sentenceInQuotation) {
                // opening quote: space before it, none after it
                sentence.append(" ");
                sentence.append(token);
                sentenceInQuotation = true;
                firstWordInQuotationSentence = true;
            }
            else {
                // closing quote attaches to the previous token
                sentence.append(token);
                sentenceInQuotation = false;
            }
        }
        else if (token.equals("&") || token.equals("+") || token.equals("=")) {
            sentence.append(" ");
            sentence.append(token);
        }
        //words
        else {
            if (sentenceInQuotation) {
                if (firstWordInQuotationSentence) {
                    sentence.append(token);
                    firstWordInQuotationSentence = false;
                }
                else if (firstWordInParenthisis) {
                    sentence.append(token);
                    firstWordInParenthisis = false;
                }
                else {
                    sentence.append(" ");
                    sentence.append(token);
                }
            }
            else if (firstWordInParenthisis) {
                sentence.append(token);
                firstWordInParenthisis = false;
            }
            else if (date) {
                sentence.append(token);
                date = false;
            }
            else {
                sentence.append(" ");
                sentence.append(token);
            }
        }
    }
    // Drop only a leading space; replaceFirst(" ", "") would remove an inner
    // space when the sentence does not start with one.
    if (sentence.length() > 0 && sentence.charAt(0) == ' ') {
        sentence.deleteCharAt(0);
    }
    return sentence.toString();
}
As I said, this works quite well, but not perfectly. I suggest you copy/paste my method, try it, and see for yourself.
Do you have ANY ideas or a better solution for my problem?
Examples:
For example, while trying out some texts I noticed that I don't yet handle tokens like "[" and "]", or the different types of quotation marks (straight " versus typographic ones). I've also heard that it can make a difference whether you use ... (three separate dots) or the single Unicode ellipsis character … (select it and you'll see it's one character). So it becomes more and more complex.
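One possible way to keep the number of cases down is to map such variants onto a single representation before detokenizing. The following is only a sketch: the class and method names are made up for this illustration, the list of characters is not complete, and it assumes that losing the original typographic characters is acceptable.

```java
class TokenNormalizer {
    // Maps a few typographic variants onto plain ASCII tokens.
    // Assumption: the original characters do not need to be preserved.
    static String normalize(String token) {
        switch (token) {
            case "\u201C": // left double quotation mark
            case "\u201D": // right double quotation mark
            case "\u201E": // double low-9 quotation mark
                return "\"";
            case "\u2026": // horizontal ellipsis as a single character
                return "...";
            default:
                return token;
        }
    }

    // Returns a new array; the input array is left unchanged.
    static String[] normalizeAll(String[] tokens) {
        String[] out = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = normalize(tokens[i]);
        }
        return out;
    }
}
```

The normalized array could then be passed to detokenize, which would still need its own rule for a "..." token.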