Optimizing simple xHTML parser

Question 1

I'm writing a simple xHTML parser which parses a data without nested tags.

Example input code will look like:

<h2>Header</h2> 
<p> atsatat </p> 
<h2>asdsaad</2><p>s32532235</p>

I've wrote the following code:

#include <node.h>
#include <vector>
#include <string>
#include <sstream>
#include <iostream>
using namespace v8;
enum Tags {
 Closing = 0,
 Paragraph,
 Heading
};
int tagDetect(char *ptr){
 if (*ptr == '/') {
 return Tags::Closing;
 }
 if (*ptr == 'p') {
 return Tags::Paragraph;
 }
 if (*(ptr + 1) == '2' || *(ptr + 1) == '3')
 return Tags::Heading;
 return -1;
}
Handle<Value> Callback(const Arguments& args) {
 HandleScope scope;
 if (!args[0]->IsString() || !args[1]->IsFunction()) {
 return ThrowException(Exception::TypeError(
 String::New("Invalid arguments! First parameter must be [string], second must be a callback [function].")));
 }
 Local<Function> callback = Local<Function>::Cast(args[1]);
 String::Utf8Value param1(args[0]->ToString());
 std::string input = std::string(*param1);
 static Persistent<String> data_symbol = NODE_PSYMBOL("data");
 static Persistent<String> tag_symbol = NODE_PSYMBOL("tag");
 std::string::size_type pos = 0, closePos = 0;
 int openPos;
 int tagID, lastTag, id = 0, lastAppended = 0, clearStartPos;
 Local<Array> nodes = Array::New();
 Local<Object> node_obj;
 while ((pos = input.find('<', pos)) != std::string::npos) {
 pos++;
 switch (tagID = tagDetect(&input[pos])) {
 case Tags::Closing:
 tagID = tagDetect(&input[pos + 1]);
 if (tagID == lastTag && (pos - openPos > 10 || lastTag != Tags::Paragraph)) {
 /* Concat tags */
 if (lastAppended == lastTag) {
 if (lastTag == Tags::Heading) { /* Dont concat heading tags */
 continue;
 }
 node_obj = nodes->Get(id - 1)->ToObject();
 node_obj->Set(data_symbol, String::Concat(node_obj->Get(data_symbol)->ToString(), String::New(input.substr(openPos + (lastTag > 1 ? 3 : 2), pos - openPos - (lastTag > 1 ? 3 : 2) - 1).c_str())));
 continue;
 }
 node_obj = Object::New();
 node_obj->Set(data_symbol, String::New(input.substr(openPos + (lastTag > 1 ? 3 : 2), pos - openPos - (lastTag > 1 ? 3 : 2) - 1).c_str()));
 node_obj->Set(tag_symbol, Integer::New(lastTag));
 nodes->Set(id, node_obj);
 id++;
 lastAppended = lastTag;
 continue;
 }
 /* Clear useless tags */
 input.erase(pos - 1, input.find('>', pos - 1) - pos + 2);
 break;
 case Tags::Paragraph:
 case Tags::Heading:
 openPos = pos;
 lastTag = tagID;
 break;
 default:
 /* Clear useless tags */
 input.erase(pos - 1, input.find('>', pos - 1) - pos + 2);
 break;
 }
 }
 /* Remove last element from array if its heading */
 if (lastTag == Tags::Heading) {
 nodes->Delete(id - 1);
 }
 const unsigned argc = 2;
 Local<Value> argv[argc] = {
 Local<Value>::New(Null()),
 nodes
 };
 callback->Call(Context::GetCurrent()->Global(), argc, argv);
 return Undefined();
}
void RegisterModule(Handle<Object> target) {
 target->Set(String::NewSymbol("parse"),
 FunctionTemplate::New(Callback)->GetFunction());
}
NODE_MODULE(smartparser, RegisterModule);

I'm using this as Native NodeJS module. Since I'm really new in the C++ world, any tips on how to optimize/improve my code will be greatly appreciated.

Question 2

Too many pointers. You should try to replace them by standard library classes whenever possible :p

Question 3

Although optional, you can keep your #includes organized (alphabetically, for instance) so that it's easier to locate an individual one or to determine if you've already included a particular one.
You've included <sstream>, but I don't see any use of stringstreams. If you're not utilizing this library, then remove it.
Avoid using namespace X in global scope as it will expose it to the entire program or file, which can break code in cases such as name-clashing (using the same name for something in which something doesn't belong to the namespace). It could also make it difficult for readers to tell what is part of a particular namespace (I cannot tell what belongs to v8, so I'm having a bit more trouble reviewing this). If you insist on keeping a using, then keep it local, such as in a function.
For this enum:
```
enum Tags {
 Closing = 0,
 Paragraph,
 Heading
};
```
You don't need to explicitly set it to 0. This is already the default starting value for enums.
You don't need a char* here:
```
int tagDetect(char *ptr){
```
Looking at the calling code, you're passing in an std::string. You should just pass this object by itself. Also, because it's not being modified, pass it by const& to avoid a needless copy and any accidental modification:
```
int tagDetect(std::string const& str){
```
In C++ in general, you should avoid passing raw pointers to functions as you cannot keep track of ownership with them (C++11 smart pointers helps with that for when points are necessary). But in this case, you shouldn't need to deal with pointers.

With this change, you will need to modify the function code (I'll use different names for clarity):

This is accessing the first character:
```
if (*ptr == 'p')
```
so use front():
```
if (str.front() == 'p')
```
This is accessing the second one:
```
*(ptr + 1)
```
so you can use indices to make it more clear (while they are technically the same thing; yours is doing pointer arithmetic):
```
str[1]
```
I'd keep these on separate lines for readability:
```
int tagID, lastTag, id = 0, lastAppended = 0, clearStartPos;
```
They aren't entirely related to each other, and there could be some maintenance issues since some are declared and the others initialized.
```
int tagID;
int lastTag;
int id = 0;
int lastAppended = 0;
int clearStartPos;
```
You should also have these declared/initialized as close to their scope as possible. This will help determine if any variable is no longer used (keeping around unused variables will clutter the code). But if you cannot put them any closer, then do have each one on separate lines.

Question 4

Thanks! I really appreciate this. Just 1 more thing: Is there a way to improve the algorithm of finding tags?

Question 5

@Deepsy: I haven't quite looked at it yet, but I still can. I'm not too great with algorithms, so someone else may address that.

Question 6

I'm not that familiar with C++, so I can't comment too much on how idiomatic it is and such, but that said, there are a few concerns I've got.

Reading over the end of a buffer

Consider this input: <. You'll find that left angle bracket and then call tagDetect, passing it a pointer to the NUL terminator. In that function, you then compare that NUL against a few values (specifically, / and p), and then move on to checking the value after the NUL. By doing that, you're reading past the end of the string. In C, that would be undefined behavior. I imagine it's undefined behavior in C++ as well.

Using uninitialized variables

Consider this input: </p>. You'll go to your case Tags::Closing bit and compare to see if tagID (0) is equal to lastTag (uninitialized). That could be true or it could be false. Either way, it's not deterministic, and that's bad.

Later in the same condition, you check to see if pos - openPos > 10. Well, first I wonder why you chose 10, since it seems rather arbitrary, but openPos is similarly uninitialized. Again, you'll want to make sure that's initialized somehow and that it acts as expected.

Last in that condition, you compare lastTag against Tags::Paragraph. Again, lastTag was uninitialized, so the result could be either true or false.

Potential concatenation bug

If lastAppended == lastTag and lastTag == Tags::Heading, you continue without appending or concatenating. I imagine this is a bug.

Synchronous functions shouldn't use a callback

There's no reason a parser (this one, at least) should be asynchronous, and if it is synchronous (as it is here), there's no reason to use a callback. Just return the result instead.

Jamal Jamal 35.2k13 gold badges134 silver badges238 bronze badges · Accepted Answer · 2014-04-26 21:53:26Z

Although optional, you can keep your #includes organized (alphabetically, for instance) so that it's easier to locate an individual one or to determine if you've already included a particular one.
You've included <sstream>, but I don't see any use of stringstreams. If you're not utilizing this library, then remove it.
Avoid using namespace X in global scope as it will expose it to the entire program or file, which can break code in cases such as name-clashing (using the same name for something in which something doesn't belong to the namespace). It could also make it difficult for readers to tell what is part of a particular namespace (I cannot tell what belongs to v8, so I'm having a bit more trouble reviewing this). If you insist on keeping a using, then keep it local, such as in a function.
For this enum:
```
enum Tags {
 Closing = 0,
 Paragraph,
 Heading
};
```
You don't need to explicitly set it to 0. This is already the default starting value for enums.
You don't need a char* here:
```
int tagDetect(char *ptr){
```
Looking at the calling code, you're passing in an std::string. You should just pass this object by itself. Also, because it's not being modified, pass it by const& to avoid a needless copy and any accidental modification:
```
int tagDetect(std::string const& str){
```
In C++ in general, you should avoid passing raw pointers to functions as you cannot keep track of ownership with them (C++11 smart pointers helps with that for when points are necessary). But in this case, you shouldn't need to deal with pointers.

With this change, you will need to modify the function code (I'll use different names for clarity):

This is accessing the first character:
```
if (*ptr == 'p')
```
so use front():
```
if (str.front() == 'p')
```
This is accessing the second one:
```
*(ptr + 1)
```
so you can use indices to make it more clear (while they are technically the same thing; yours is doing pointer arithmetic):
```
str[1]
```
I'd keep these on separate lines for readability:
```
int tagID, lastTag, id = 0, lastAppended = 0, clearStartPos;
```
They aren't entirely related to each other, and there could be some maintenance issues since some are declared and the others initialized.
```
int tagID;
int lastTag;
int id = 0;
int lastAppended = 0;
int clearStartPos;
```
You should also have these declared/initialized as close to their scope as possible. This will help determine if any variable is no longer used (keeping around unused variables will clutter the code). But if you cannot put them any closer, then do have each one on separate lines.

Thanks! I really appreciate this. Just 1 more thing: Is there a way to improve the algorithm of finding tags?
@Deepsy: I haven't quite looked at it yet, but I still can. I'm not too great with algorithms, so someone else may address that.

Stack Exchange Network

Optimizing simple xHTML parser

2 Answers 2

Reading over the end of a buffer

Using uninitialized variables

Potential concatenation bug

Synchronous functions shouldn't use a callback

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Optimizing simple xHTML parser

2 Answers 2

Reading over the end of a buffer

Using uninitialized variables

Potential concatenation bug

Synchronous functions shouldn't use a callback

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions