I'm writing a simple xHTML parser which parses a data without nested tags.
Example input code will look like:
<h2>Header</h2>
<p> atsatat </p>
<h2>asdsaad</2><p>s32532235</p>
I've wrote the following code:
#include <node.h>
#include <vector>
#include <string>
#include <sstream>
#include <iostream>
using namespace v8;
enum Tags {
Closing = 0,
Paragraph,
Heading
};
int tagDetect(char *ptr){
if (*ptr == '/') {
return Tags::Closing;
}
if (*ptr == 'p') {
return Tags::Paragraph;
}
if (*(ptr + 1) == '2' || *(ptr + 1) == '3')
return Tags::Heading;
return -1;
}
Handle<Value> Callback(const Arguments& args) {
HandleScope scope;
if (!args[0]->IsString() || !args[1]->IsFunction()) {
return ThrowException(Exception::TypeError(
String::New("Invalid arguments! First parameter must be [string], second must be a callback [function].")));
}
Local<Function> callback = Local<Function>::Cast(args[1]);
String::Utf8Value param1(args[0]->ToString());
std::string input = std::string(*param1);
static Persistent<String> data_symbol = NODE_PSYMBOL("data");
static Persistent<String> tag_symbol = NODE_PSYMBOL("tag");
std::string::size_type pos = 0, closePos = 0;
int openPos;
int tagID, lastTag, id = 0, lastAppended = 0, clearStartPos;
Local<Array> nodes = Array::New();
Local<Object> node_obj;
while ((pos = input.find('<', pos)) != std::string::npos) {
pos++;
switch (tagID = tagDetect(&input[pos])) {
case Tags::Closing:
tagID = tagDetect(&input[pos + 1]);
if (tagID == lastTag && (pos - openPos > 10 || lastTag != Tags::Paragraph)) {
/* Concat tags */
if (lastAppended == lastTag) {
if (lastTag == Tags::Heading) { /* Dont concat heading tags */
continue;
}
node_obj = nodes->Get(id - 1)->ToObject();
node_obj->Set(data_symbol, String::Concat(node_obj->Get(data_symbol)->ToString(), String::New(input.substr(openPos + (lastTag > 1 ? 3 : 2), pos - openPos - (lastTag > 1 ? 3 : 2) - 1).c_str())));
continue;
}
node_obj = Object::New();
node_obj->Set(data_symbol, String::New(input.substr(openPos + (lastTag > 1 ? 3 : 2), pos - openPos - (lastTag > 1 ? 3 : 2) - 1).c_str()));
node_obj->Set(tag_symbol, Integer::New(lastTag));
nodes->Set(id, node_obj);
id++;
lastAppended = lastTag;
continue;
}
/* Clear useless tags */
input.erase(pos - 1, input.find('>', pos - 1) - pos + 2);
break;
case Tags::Paragraph:
case Tags::Heading:
openPos = pos;
lastTag = tagID;
break;
default:
/* Clear useless tags */
input.erase(pos - 1, input.find('>', pos - 1) - pos + 2);
break;
}
}
/* Remove last element from array if its heading */
if (lastTag == Tags::Heading) {
nodes->Delete(id - 1);
}
const unsigned argc = 2;
Local<Value> argv[argc] = {
Local<Value>::New(Null()),
nodes
};
callback->Call(Context::GetCurrent()->Global(), argc, argv);
return Undefined();
}
void RegisterModule(Handle<Object> target) {
target->Set(String::NewSymbol("parse"),
FunctionTemplate::New(Callback)->GetFunction());
}
NODE_MODULE(smartparser, RegisterModule);
I'm using this as Native NodeJS module. Since I'm really new in the C++ world, any tips on how to optimize/improve my code will be greatly appreciated.
-
1\$\begingroup\$ Too many pointers. You should try to replace them by standard library classes whenever possible :p \$\endgroup\$Morwenn– Morwenn2014年04月26日 22:36:35 +00:00Commented Apr 26, 2014 at 22:36
2 Answers 2
Although optional, you can keep your
#include
s organized (alphabetically, for instance) so that it's easier to locate an individual one or to determine if you've already included a particular one.You've included
<sstream>
, but I don't see any use of stringstreams. If you're not utilizing this library, then remove it.Avoid
using namespace X
in global scope as it will expose it to the entire program or file, which can break code in cases such as name-clashing (using the same name for something in which something doesn't belong to thenamespace
). It could also make it difficult for readers to tell what is part of a particularnamespace
(I cannot tell what belongs tov8
, so I'm having a bit more trouble reviewing this). If you insist on keeping ausing
, then keep it local, such as in a function.For this
enum
:enum Tags { Closing = 0, Paragraph, Heading };
You don't need to explicitly set it to
0
. This is already the default starting value forenum
s.You don't need a
char*
here:int tagDetect(char *ptr){
Looking at the calling code, you're passing in an
std::string
. You should just pass this object by itself. Also, because it's not being modified, pass it byconst&
to avoid a needless copy and any accidental modification:int tagDetect(std::string const& str){
In C++ in general, you should avoid passing raw pointers to functions as you cannot keep track of ownership with them (C++11 smart pointers helps with that for when points are necessary). But in this case, you shouldn't need to deal with pointers.
With this change, you will need to modify the function code (I'll use different names for clarity):
This is accessing the first character:
if (*ptr == 'p')
so use
front()
:if (str.front() == 'p')
This is accessing the second one:
*(ptr + 1)
so you can use indices to make it more clear (while they are technically the same thing; yours is doing pointer arithmetic):
str[1]
I'd keep these on separate lines for readability:
int tagID, lastTag, id = 0, lastAppended = 0, clearStartPos;
They aren't entirely related to each other, and there could be some maintenance issues since some are declared and the others initialized.
int tagID; int lastTag; int id = 0; int lastAppended = 0; int clearStartPos;
You should also have these declared/initialized as close to their scope as possible. This will help determine if any variable is no longer used (keeping around unused variables will clutter the code). But if you cannot put them any closer, then do have each one on separate lines.
-
\$\begingroup\$ Thanks! I really appreciate this. Just 1 more thing: Is there a way to improve the algorithm of finding tags? \$\endgroup\$Deepsy– Deepsy2014年04月26日 22:03:10 +00:00Commented Apr 26, 2014 at 22:03
-
\$\begingroup\$ @Deepsy: I haven't quite looked at it yet, but I still can. I'm not too great with algorithms, so someone else may address that. \$\endgroup\$Jamal– Jamal2014年04月26日 22:12:41 +00:00Commented Apr 26, 2014 at 22:12
I'm not that familiar with C++, so I can't comment too much on how idiomatic it is and such, but that said, there are a few concerns I've got.
Reading over the end of a buffer
Consider this input: <
. You'll find that left angle bracket and then call tagDetect
, passing it a pointer to the NUL
terminator. In that function, you then compare that NUL
against a few values (specifically, /
and p
), and then move on to checking the value after the NUL
. By doing that, you're reading past the end of the string. In C, that would be undefined behavior. I imagine it's undefined behavior in C++ as well.
Using uninitialized variables
Consider this input: </p>
. You'll go to your case Tags::Closing
bit and compare to see if tagID
(0) is equal to lastTag
(uninitialized). That could be true or it could be false. Either way, it's not deterministic, and that's bad.
Later in the same condition, you check to see if pos - openPos > 10
. Well, first I wonder why you chose 10, since it seems rather arbitrary, but openPos
is similarly uninitialized. Again, you'll want to make sure that's initialized somehow and that it acts as expected.
Last in that condition, you compare lastTag
against Tags::Paragraph
. Again, lastTag
was uninitialized, so the result could be either true or false.
Potential concatenation bug
If lastAppended == lastTag
and lastTag == Tags::Heading
, you continue
without appending or concatenating. I imagine this is a bug.
Synchronous functions shouldn't use a callback
There's no reason a parser (this one, at least) should be asynchronous, and if it is synchronous (as it is here), there's no reason to use a callback. Just return the result instead.
Explore related questions
See similar questions with these tags.