My goal in the method is to write a simple clustering program that determines if a clustering document is valid based on these rules:
- All clusterings start with
<
and end with>
. Every start clustering<something>
must be followed (eventually) by an end clustering</something>
. - Clusterings must be properly nested as follows
<a><b>data</b></a>
. This is not allowed:(削除).<a><b>data</a></b>
(削除ここまで) - The first clustering of the document is the root clustering. The document can have only one root clustering (and corresponding end clustering). Everything else must be enclosed inside the root's start and end clusterings. In the example above,
<songdata>
is the root. - All clusterings are case-sensitive. This means
<song>
is not the same as<Song>
, which is not the same as<SONG>
, etc. Start and end clusterings do not have to occur on the same line, but each clustering can not be broken between two lines. So you can't do this:
<dura tion>
The clustering name can contain only letters.
- Everything that is not a clustering is considered document data.
- Extra white space is ignored. Thus,
<a> <b>data</b> </a>
is the same as<a><b>data</b></a>
.
This is the code that I'm using to accomplish my task:
data=in.readLine();
++line;
nextDataValue=data.trim();//removes whitespace from both ends
if(nextDataValue.charAt(0)!='<') error("No enclosing root pillow.", line);//Everything else must be enclosed inside the root's start and end pillows.
StringTokenizer tokenizer1=new StringTokenizer(nextDataValue, ">");
while(tokenizer1.hasMoreTokens())
{
nextDataValue=tokenizer1.nextToken();//extract each data element of the string
if (nextDataValue.indexOf('<') == -1) break;// no < in line, so get next
pillow=nextDataValue.substring(nextDataValue.indexOf('<')+1, nextDataValue.length());// pillow=get < to > exclusive in nextDataValue
if(pillow.charAt(0)!='/')
{
stack.push(pillow);//start
start_pillow_pushed=true;
}
else if(pillow.charAt(0)=='/')//end
{
start_pillow_pushed=false;
if(!stack.isEmpty())
{
if(!nextDataValue.equals(stack.pop())) error("Mismatched start pillow "+nextDataValue+">", line);// For each pillow you find, pop the stack to see if the matching start pillow is there.
}
else error("Missing start pillow "+nextDataValue+">", line);//empty stack
}
}
for(; (data=in.readLine())!=null;)
{
++line;
nextDataValue=data.trim();//removes whitespace from both ends
if(nextDataValue.charAt(0)!='<') error("No enclosing root pillow.", line);//Everything else must be enclosed inside the root's start and end pillows.
StringTokenizer tokenizer2=new StringTokenizer(nextDataValue, ">");
while(tokenizer2.hasMoreTokens())
{
nextDataValue.trim();
nextDataValue=tokenizer2.nextToken();//extract each data element of the string
if (nextDataValue.indexOf('<') == -1) break;// no < in line, so get next
pillow=nextDataValue.substring(nextDataValue.indexOf('<')+1, nextDataValue.length());// pillow=get < to > exclusive in nextDataValue
if(pillow.charAt(0)!='/')
{
stack.push(pillow);//start
start_pillow_pushed=true;
}
else if(pillow.charAt(0)=='/')//end
{
start_pillow_pushed=false;
if(!stack.isEmpty())
{
if(!nextDataValue.equals(stack.pop())) error("Mismatched start pillow "+nextDataValue+">", line);// For each pillow you find, pop the stack to see if the matching start pillow is there.
}
else error("Missing start pillow "+nextDataValue+">", line);//empty stack
}
}
}
if(stack.isEmpty()) System.out.println("Input clustering document is valid.");//reports the document is valid based on the clustering rules above
else error("Missing end pillow for "+nextDataValue+">", line);
Please indicate what can be improved in this.
-
3\$\begingroup\$ The headline of your question. \$\endgroup\$user unknown– user unknown2012年03月02日 06:37:26 +00:00Commented Mar 2, 2012 at 6:37
-
\$\begingroup\$ 1/ It may not be homework, but it does feel like it since you're basically writing a parser of a subset of XML (if it is homework, please add the homework tag). 2/ "clustering" is surprising choice for talking about what is an element. (And clustering already has a lot of meanings.) 3/ What is a pillow? \$\endgroup\$Quentin Pradet– Quentin Pradet2012年03月02日 12:22:04 +00:00Commented Mar 2, 2012 at 12:22
-
\$\begingroup\$ +1 to everything @Cygal said. I would just add that you can probably make the code cleaner and more concise by using recursion, rather than an explicit stack. \$\endgroup\$Alex D– Alex D2012年03月02日 14:36:29 +00:00Commented Mar 2, 2012 at 14:36
1 Answer 1
Thanks for sharing your code! Let's see what can be improved.
Design
You're mixing a lot of things in your code. This makes it difficult to read and write. It also violates the Single Responsibility Principle. I'd recommend separating those three "responsibilities" in different functions/classes. The easiest way to do it is to have three functions:
The handling of multiple lines (String -> String)
readfile("<a> <b>\ndata</b>\n </a>") == "<a> <b> data</b> </a>"
The tokenizing of the tag names (String -> List)
tokenize("<a> <b> data</b> </a>") == ["<a>", "<b>", "data", "</b>", "</a>"]
The use of the stack to ensure that the document is valid (List -> bool)
isValid(["<a>", "<b>", "data", "</b>", "</a>"]) == true
If you're worried about performance (you shouldn't, most XML parsers keep the whole tree in memory), you can always group the first and second function in a Iterator, which will decouple the reading/tokenizing from the validation logic, and you will still be doing stream-based parsing. This simply reduces memory usage, and can be a bit faster.
Code
- You're repeating yourself before and after the loop. Use a
do { } while()
construct to avoid this issue. - Be careful about your indentation. For example,
if(nextDataValue.charAt(0)!='<') error("No enclosing root pillow.", line);
is not the starting of a block, and you should not indent what's after. Java IDEs can indent code if you don't know how to do it. - When checking for
pillow.charAt(0)
, don't useelse if
butelse
for the second block. It makes it clear that the two conditions cover every case. - The
start_pillow_pushed
variable is not used anywhere. Never keep dead code around, it simply confuses the reader.