As part of a larger project I'm writing to learn C++, I'm trying to read through some almost-XML files generated by another program. Unfortunately this program uses its own custom escaping logic, so some of these files aren't actually valid XML. As far as I can tell, its escape logic works like this:
- Attribute values are always surrounded with double quotes. The only escape sequence is
\"
, evaluating to a double quote. Backslashes elsewhere are just interpreted literally; it's impossible for a value to end with a backslash. - Nested elements always have the opening outer tag on one line, each inner tag on its own line, and the closing outer tag on another.
- Tags with text content are always entirely on a single line, and interpret all characters between the opening tag and closing tag literally. It is impossible for a copy of the closing tag to be in the text content, but other tags may be.
In practice the files have a bit more structure than this, but I've been trying to work off of just these assumptions to keep the code flexible.
As an example, this is a possible, invalid XML, input:
<outer attr="&value\"">
<empty />
<inner><notnested></outer></inner>
</outer>
Which is equivalent to this valid XML:
<outer attr="&amp;value"">
<empty />
<inner><notnested></outer></inner>
</outer>
My approach to dealing with this is to do some simplified parsing over these files to convert them into valid XML, before feeding them into a more traditional XML parser (leaning towards pugixml, but open to change if it makes things easier). Here are the relevant functions in a small demo program:
#include <cctype>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
static void put_xml_escaped(char c, std::stringstream& stream) {
switch (c) {
case '"': stream << """; break;
case '\'': stream << "'"; break;
case '<': stream << "<"; break;
case '>': stream << ">"; break;
case '&': stream << "&"; break;
default: stream << c; break;
}
}
static std::stringstream preprocess_file(std::string filename) {
enum tag_parse_state {
// " <tag attr="val">content</tag> "
before_tag, // ^^^^
tag_name, // ^^^^
tag_body, // ^^^^^^^ ^
attr, // ^^^^
content, // ^^^^^^^
closing_tag, // ^^^^^^
closed // ^^^^
};
std::ifstream file{filename};
std::stringstream processed{};
for (std::string line; std::getline(file, line); ) {
tag_parse_state state = before_tag;
size_t tag_name_start;
size_t closing_tag_start, closing_tag_end;
for (size_t i = 0; i < line.size(); i++) {
auto c = line[i];
if (state == before_tag) {
if (c == '<') {
processed << c;
state = tag_name;
tag_name_start = i + 1;
} else if (!isspace(c)) {
throw std::runtime_error(
"Failed to parse line (text before inital tag): " + line
);
}
} else if (state == tag_name) {
processed << c;
if (c == ' ' || c == '>') {
// Nice little coincidence: if we're parsing over a closing tag, this will
// double up on the '/' and thus not find itself
auto closing_tag_str = (
"</" + line.substr(tag_name_start, i - tag_name_start) + ">"
);
closing_tag_start = line.rfind(closing_tag_str);
closing_tag_end = (
closing_tag_start == std::string::npos
? std::string::npos
: closing_tag_start + closing_tag_str.size() - 1
);
if (c == ' ') {
state = tag_body;
} else {
state = closing_tag_start == std::string::npos ? closed : content;
}
}
} else if (state == tag_body) {
processed << c;
if (c == '"') {
state = attr;
} else if (c == '>') {
state = closing_tag_start == std::string::npos ? closed : content;
}
} else if (state == attr) {
if (c == '"') {
processed << c;
state = tag_body;
} else if (c == '\\' && i + 1 < line.size() && line[i + 1] == '"') {
put_xml_escaped('"', processed);
i++;
} else {
put_xml_escaped(c, processed);
}
} else if (state == content) {
put_xml_escaped(c, processed);
if (i + 1 >= closing_tag_start) {
state = closing_tag;
}
} else if (state == closing_tag) {
processed << c;
if (i >= closing_tag_end) {
state = closed;
}
} else if (state == closed) {
if (!isspace(c)) {
throw std::runtime_error(
"Failed to parse line (text after closing tag): " + line
);
}
}
}
if (state != closed) {
throw std::runtime_error("Failed to parse line (missed closing tag): " + line);
}
}
return processed;
}
int main() {
std::cout << preprocess_file("example.xml").str() << "\n";
return 0;
}
I'm interested in feedback on:
- General best practices (minus code style)
- Edge cases I may have missed
- Algorithmic improvements - I don't think you can do much better than O(n) when using a state machine like this, but maybe there's a completely different approach.
-
\$\begingroup\$ This isn't really an XML question. It's about a proprietary data syntax that has some resemblances to XML, but you can't use any XML tooling, so this doesn't help you. \$\endgroup\$Michael Kay– Michael Kay2021年07月30日 16:31:22 +00:00Commented Jul 30, 2021 at 16:31
-
\$\begingroup\$ Fair enough, removed that tag. \$\endgroup\$apple1417– apple14172021年07月30日 22:20:06 +00:00Commented Jul 30, 2021 at 22:20
1 Answer 1
Use enum class
Consider using enum class
instead of a plain enum
for a little more type safety.
I would also recommend you use all-caps for the names of the enum
options, as this is commonly done and is a strong hint that those are constants and not variables.
Prefer using switch
-statements when enum
s are involved
Use a switch
-statement instead of a chain of if
-else
-statements when checking the value of an enum
variable. Most compilers will then be able to emit warnings if you forget a case
statement. It also makes the code a little bit more readable.
Make preprocess_file()
take istream
and ostream
s as parameters
Your preprocess_file()
has a filename as an argument, and returns a std::stringstream
. However, that means that now it is responsible for opening the file, and returning the result as a std::stringstream
is inefficient if you are just going to write it to std::out
afterwards. Consider having this function take two parameters: one for the input stream, and one for the output stream, like so:
void preprocess(std::istream &file, std::ostream &processed) {
...
}
Then in main()
you can write:
preprocess(std::ifstream("example.xml"), std::cout);
std::cout << "\n";
If you want to be able to write std::cout << preprocess(...)
, that is possible too, but then you have to make it a class
and add a friend operator<<(std::ostream &, preprocess &)
function, which might be a bit overkill for your application.
Missing error checking
I/O errors can happen at any time, both when reading from and writing to a file. You could add checks after every I/O operation, but that would quickly lead to a lot of messy code. Luckily, any error state persists, so what you can do is just add checks at the end of preprocess_file()
, like so:
if (file.fail()) {
throw std::runtime_error("Error while reading input");
}
if (processed.fail()) {
throw std::runtime_error("Error while writing output");
}
Performance
You are going over each character individually, most of the time just copying it to the output. For each state you only have a few characters you want to match. Consider using std::string::find_first_of()
to get the position of the first interesting character; the standard library might do that in a more optimal way.
-
1\$\begingroup\$ Lot of good tips, thanks. I realized I didn't mention it in the post, but for what it's worth I was looking at feeding pugixml from a stream, hence why I left it outputting the stringstream. I can definitely see how having the parent function manage ownership would be better though. \$\endgroup\$apple1417– apple14172021年07月30日 22:29:22 +00:00Commented Jul 30, 2021 at 22:29