Validating a clustering document

Question 1

My goal in the method is to write a simple clustering program that determines if a clustering document is valid based on these rules:

All clusterings start with < and end with >. Every start clustering <something> must be followed (eventually) by an end clustering </something>.
Clusterings must be properly nested as follows <a>data</a>. This is not allowed: ~~(削除) <a>data</a> (削除ここまで)~~.
The first clustering of the document is the root clustering. The document can have only one root clustering (and corresponding end clustering). Everything else must be enclosed inside the root's start and end clusterings. In the example above, <songdata> is the root.
All clusterings are case-sensitive. This means <song> is not the same as <Song>, which is not the same as <SONG>, etc.
Start and end clusterings do not have to occur on the same line, but each clustering can not be broken between two lines. So you can't do this:
```
<dura
tion>
```
The clustering name can contain only letters.
Everything that is not a clustering is considered document data.
Extra white space is ignored. Thus, <a> data </a> is the same as <a>data</a>.

This is the code that I'm using to accomplish my task:

data=in.readLine();
++line;
nextDataValue=data.trim();//removes whitespace from both ends
if(nextDataValue.charAt(0)!='<') error("No enclosing root pillow.", line);//Everything else must be enclosed inside the root's start and end pillows.
 StringTokenizer tokenizer1=new StringTokenizer(nextDataValue, ">");
 while(tokenizer1.hasMoreTokens())
 {
 nextDataValue=tokenizer1.nextToken();//extract each data element of the string
 if (nextDataValue.indexOf('<') == -1) break;// no < in line, so get next
 pillow=nextDataValue.substring(nextDataValue.indexOf('<')+1, nextDataValue.length());// pillow=get < to > exclusive in nextDataValue
 if(pillow.charAt(0)!='/')
 {
 stack.push(pillow);//start
 start_pillow_pushed=true;
 }
 else if(pillow.charAt(0)=='/')//end
 {
 start_pillow_pushed=false;
 if(!stack.isEmpty()) 
 {
 if(!nextDataValue.equals(stack.pop())) error("Mismatched start pillow "+nextDataValue+">", line);// For each pillow you find, pop the stack to see if the matching start pillow is there.
 }
 else error("Missing start pillow "+nextDataValue+">", line);//empty stack
 }
 }
for(; (data=in.readLine())!=null;)
{
++line;
nextDataValue=data.trim();//removes whitespace from both ends
if(nextDataValue.charAt(0)!='<') error("No enclosing root pillow.", line);//Everything else must be enclosed inside the root's start and end pillows.
 StringTokenizer tokenizer2=new StringTokenizer(nextDataValue, ">");
 while(tokenizer2.hasMoreTokens())
 {
 nextDataValue.trim();
 nextDataValue=tokenizer2.nextToken();//extract each data element of the string
 if (nextDataValue.indexOf('<') == -1) break;// no < in line, so get next
 pillow=nextDataValue.substring(nextDataValue.indexOf('<')+1, nextDataValue.length());// pillow=get < to > exclusive in nextDataValue
 if(pillow.charAt(0)!='/')
 {
 stack.push(pillow);//start
 start_pillow_pushed=true;
 }
 else if(pillow.charAt(0)=='/')//end
 {
 start_pillow_pushed=false;
 if(!stack.isEmpty()) 
 {
 if(!nextDataValue.equals(stack.pop())) error("Mismatched start pillow "+nextDataValue+">", line);// For each pillow you find, pop the stack to see if the matching start pillow is there.
 }
 else error("Missing start pillow "+nextDataValue+">", line);//empty stack
 }
 }
}
if(stack.isEmpty()) System.out.println("Input clustering document is valid.");//reports the document is valid based on the clustering rules above
else error("Missing end pillow for "+nextDataValue+">", line);

Please indicate what can be improved in this.

Question 2

The headline of your question.

Question 3

1/ It may not be homework, but it does feel like it since you're basically writing a parser of a subset of XML (if it is homework, please add the homework tag). 2/ "clustering" is surprising choice for talking about what is an element. (And clustering already has a lot of meanings.) 3/ What is a pillow?

Question 4

+1 to everything @Cygal said. I would just add that you can probably make the code cleaner and more concise by using recursion, rather than an explicit stack.

Question 5

Thanks for sharing your code! Let's see what can be improved.

Design

You're mixing a lot of things in your code. This makes it difficult to read and write. It also violates the Single Responsibility Principle. I'd recommend separating those three "responsibilities" in different functions/classes. The easiest way to do it is to have three functions:

The handling of multiple lines (String -> String)

readfile("<a> <b>\ndata</b>\n </a>") == "<a> <b> data</b> </a>"

The tokenizing of the tag names (String -> List)

tokenize("<a> <b> data</b> </a>") == ["<a>", "<b>", "data", "</b>", "</a>"]

The use of the stack to ensure that the document is valid (List -> bool)
```
isValid(["<a>", "", "data", "", "</a>"]) == true
```

If you're worried about performance (you shouldn't, most XML parsers keep the whole tree in memory), you can always group the first and second function in a Iterator, which will decouple the reading/tokenizing from the validation logic, and you will still be doing stream-based parsing. This simply reduces memory usage, and can be a bit faster.

Code

You're repeating yourself before and after the loop. Use a do { } while() construct to avoid this issue.
Be careful about your indentation. For example, if(nextDataValue.charAt(0)!='<') error("No enclosing root pillow.", line); is not the starting of a block, and you should not indent what's after. Java IDEs can indent code if you don't know how to do it.
When checking for pillow.charAt(0), don't use else if but else for the second block. It makes it clear that the two conditions cover every case.
The start_pillow_pushed variable is not used anywhere. Never keep dead code around, it simply confuses the reader.

Quentin Pradet Quentin Pradet 7,0641 gold badge25 silver badges44 bronze badges · Answer 1 · 2012-03-02 12:43:45Z

Thanks for sharing your code! Let's see what can be improved.

Design

You're mixing a lot of things in your code. This makes it difficult to read and write. It also violates the Single Responsibility Principle. I'd recommend separating those three "responsibilities" in different functions/classes. The easiest way to do it is to have three functions:

The handling of multiple lines (String -> String)

readfile("<a> <b>\ndata</b>\n </a>") == "<a> <b> data</b> </a>"

The tokenizing of the tag names (String -> List)

tokenize("<a> <b> data</b> </a>") == ["<a>", "<b>", "data", "</b>", "</a>"]

The use of the stack to ensure that the document is valid (List -> bool)
```
isValid(["<a>", "", "data", "", "</a>"]) == true
```

If you're worried about performance (you shouldn't, most XML parsers keep the whole tree in memory), you can always group the first and second function in a Iterator, which will decouple the reading/tokenizing from the validation logic, and you will still be doing stream-based parsing. This simply reduces memory usage, and can be a bit faster.

Code

You're repeating yourself before and after the loop. Use a do { } while() construct to avoid this issue.
Be careful about your indentation. For example, if(nextDataValue.charAt(0)!='<') error("No enclosing root pillow.", line); is not the starting of a block, and you should not indent what's after. Java IDEs can indent code if you don't know how to do it.
When checking for pillow.charAt(0), don't use else if but else for the second block. It makes it clear that the two conditions cover every case.
The start_pillow_pushed variable is not used anywhere. Never keep dead code around, it simply confuses the reader.

Stack Exchange Network

Validating a clustering document

1 Answer 1

Design

Code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Validating a clustering document

1 Answer 1

Design

Code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions