Preprocessing invalid xml before feeding it to a parser

Question 1

As part of a larger project I'm writing to learn C++, I'm trying to read through some almost-XML files generated by another program. Unfortunately this program uses its own custom escaping logic, so some of these files aren't actually valid XML. As far as I can tell, its escape logic works like this:

Attribute values are always surrounded with double quotes. The only escape sequence is \", evaluating to a double quote. Backslashes elsewhere are just interpreted literally; it's impossible for a value to end with a backslash.
Nested elements always have the opening outer tag on one line, each inner tag on its own line, and the closing outer tag on another.
Tags with text content are always entirely on a single line, and interpret all characters between the opening tag and closing tag literally. It is impossible for a copy of the closing tag to be in the text content, but other tags may be.

In practice the files have a bit more structure than this, but I've been trying to work off of just these assumptions to keep the code flexible.

As an example, this is a possible, invalid XML, input:

<outer attr="&amp;value\"">
 <empty />
 <inner><notnested></outer></inner>
</outer>

Which is equivalent to this valid XML:

<outer attr="&amp;amp;value&quot;">
 <empty />
 <inner>&lt;notnested&gt;&lt;/outer&gt;</inner>
</outer>

My approach to dealing with this is to do some simplified parsing over these files to convert them into valid XML, before feeding them into a more traditional XML parser (leaning towards pugixml, but open to change if it makes things easier). Here are the relevant functions in a small demo program:

#include <cctype>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
static void put_xml_escaped(char c, std::stringstream& stream) {
 switch (c) {
 case '"': stream << "&quot;"; break;
 case '\'': stream << "&apos;"; break;
 case '<': stream << "&lt;"; break;
 case '>': stream << "&gt;"; break;
 case '&': stream << "&amp;"; break;
 default: stream << c; break;
 }
}
static std::stringstream preprocess_file(std::string filename) {
 enum tag_parse_state {
 // " <tag attr="val">content</tag> "
 before_tag, // ^^^^
 tag_name, // ^^^^
 tag_body, // ^^^^^^^ ^
 attr, // ^^^^
 content, // ^^^^^^^
 closing_tag, // ^^^^^^
 closed // ^^^^
 };
 std::ifstream file{filename};
 std::stringstream processed{};
 for (std::string line; std::getline(file, line); ) {
 tag_parse_state state = before_tag;
 size_t tag_name_start;
 size_t closing_tag_start, closing_tag_end;
 for (size_t i = 0; i < line.size(); i++) {
 auto c = line[i];
 if (state == before_tag) {
 if (c == '<') {
 processed << c;
 state = tag_name;
 tag_name_start = i + 1;
 } else if (!isspace(c)) {
 throw std::runtime_error(
 "Failed to parse line (text before inital tag): " + line
 );
 }
 } else if (state == tag_name) {
 processed << c;
 if (c == ' ' || c == '>') {
 // Nice little coincidence: if we're parsing over a closing tag, this will
 // double up on the '/' and thus not find itself
 auto closing_tag_str = (
 "</" + line.substr(tag_name_start, i - tag_name_start) + ">"
 );
 closing_tag_start = line.rfind(closing_tag_str);
 closing_tag_end = (
 closing_tag_start == std::string::npos
 ? std::string::npos
 : closing_tag_start + closing_tag_str.size() - 1
 );
 if (c == ' ') {
 state = tag_body;
 } else {
 state = closing_tag_start == std::string::npos ? closed : content;
 }
 }
 } else if (state == tag_body) {
 processed << c;
 if (c == '"') {
 state = attr;
 } else if (c == '>') {
 state = closing_tag_start == std::string::npos ? closed : content;
 }
 } else if (state == attr) {
 if (c == '"') {
 processed << c;
 state = tag_body;
 } else if (c == '\\' && i + 1 < line.size() && line[i + 1] == '"') {
 put_xml_escaped('"', processed);
 i++;
 } else {
 put_xml_escaped(c, processed);
 }
 } else if (state == content) {
 put_xml_escaped(c, processed);
 if (i + 1 >= closing_tag_start) {
 state = closing_tag;
 }
 } else if (state == closing_tag) {
 processed << c;
 if (i >= closing_tag_end) {
 state = closed;
 }
 } else if (state == closed) {
 if (!isspace(c)) {
 throw std::runtime_error(
 "Failed to parse line (text after closing tag): " + line
 );
 }
 }
 }
 if (state != closed) {
 throw std::runtime_error("Failed to parse line (missed closing tag): " + line);
 }
 }
 return processed;
}
int main() {
 std::cout << preprocess_file("example.xml").str() << "\n";
 return 0;
}

I'm interested in feedback on:

General best practices (minus code style)
Edge cases I may have missed
Algorithmic improvements - I don't think you can do much better than O(n) when using a state machine like this, but maybe there's a completely different approach.

Question 2

This isn't really an XML question. It's about a proprietary data syntax that has some resemblances to XML, but you can't use any XML tooling, so this doesn't help you.

Question 3

Fair enough, removed that tag.

Question 4

Use `enum class`

Consider using enum class instead of a plain enum for a little more type safety.

I would also recommend you use all-caps for the names of the enum options, as this is commonly done and is a strong hint that those are constants and not variables.

Prefer using `switch`-statements when `enum`s are involved

Use a switch-statement instead of a chain of if-else-statements when checking the value of an enum variable. Most compilers will then be able to emit warnings if you forget a case statement. It also makes the code a little bit more readable.

Make `preprocess_file()` take `istream` and `ostream`s as parameters

Your preprocess_file() has a filename as an argument, and returns a std::stringstream. However, that means that now it is responsible for opening the file, and returning the result as a std::stringstream is inefficient if you are just going to write it to std::out afterwards. Consider having this function take two parameters: one for the input stream, and one for the output stream, like so:

void preprocess(std::istream &file, std::ostream &processed) {
 ...
}

Then in main() you can write:

preprocess(std::ifstream("example.xml"), std::cout);
std::cout << "\n";

If you want to be able to write std::cout << preprocess(...), that is possible too, but then you have to make it a class and add a friend operator<<(std::ostream &, preprocess &) function, which might be a bit overkill for your application.

Missing error checking

I/O errors can happen at any time, both when reading from and writing to a file. You could add checks after every I/O operation, but that would quickly lead to a lot of messy code. Luckily, any error state persists, so what you can do is just add checks at the end of preprocess_file(), like so:

if (file.fail()) {
 throw std::runtime_error("Error while reading input");
}
if (processed.fail()) {
 throw std::runtime_error("Error while writing output");
}

Performance

You are going over each character individually, most of the time just copying it to the output. For each state you only have a few characters you want to match. Consider using std::string::find_first_of() to get the position of the first interesting character; the standard library might do that in a more optimal way.

Question 5

Lot of good tips, thanks. I realized I didn't mention it in the post, but for what it's worth I was looking at feeding pugixml from a stream, hence why I left it outputting the stringstream. I can definitely see how having the parent function manage ownership would be better though.

G. Sliepen G. Sliepen 68.7k3 gold badges74 silver badges179 bronze badges · Answer 1 · 2021-07-30 16:12:22Z

Use `enum class`

Consider using enum class instead of a plain enum for a little more type safety.

I would also recommend you use all-caps for the names of the enum options, as this is commonly done and is a strong hint that those are constants and not variables.

Prefer using `switch`-statements when `enum`s are involved

Use a switch-statement instead of a chain of if-else-statements when checking the value of an enum variable. Most compilers will then be able to emit warnings if you forget a case statement. It also makes the code a little bit more readable.

Make `preprocess_file()` take `istream` and `ostream`s as parameters

Your preprocess_file() has a filename as an argument, and returns a std::stringstream. However, that means that now it is responsible for opening the file, and returning the result as a std::stringstream is inefficient if you are just going to write it to std::out afterwards. Consider having this function take two parameters: one for the input stream, and one for the output stream, like so:

void preprocess(std::istream &file, std::ostream &processed) {
 ...
}

Then in main() you can write:

preprocess(std::ifstream("example.xml"), std::cout);
std::cout << "\n";

If you want to be able to write std::cout << preprocess(...), that is possible too, but then you have to make it a class and add a friend operator<<(std::ostream &, preprocess &) function, which might be a bit overkill for your application.

Missing error checking

I/O errors can happen at any time, both when reading from and writing to a file. You could add checks after every I/O operation, but that would quickly lead to a lot of messy code. Luckily, any error state persists, so what you can do is just add checks at the end of preprocess_file(), like so:

if (file.fail()) {
 throw std::runtime_error("Error while reading input");
}
if (processed.fail()) {
 throw std::runtime_error("Error while writing output");
}

Performance

You are going over each character individually, most of the time just copying it to the output. For each state you only have a few characters you want to match. Consider using std::string::find_first_of() to get the position of the first interesting character; the standard library might do that in a more optimal way.

Lot of good tips, thanks. I realized I didn't mention it in the post, but for what it's worth I was looking at feeding pugixml from a stream, hence why I left it outputting the stringstream. I can definitely see how having the parent function manage ownership would be better though.

Stack Exchange Network

Preprocessing invalid xml before feeding it to a parser

1 Answer 1

Use `enum class`

Prefer using `switch`-statements when `enum`s are involved

Make `preprocess_file()` take `istream` and `ostream`s as parameters

Missing error checking

Performance

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Preprocessing invalid xml before feeding it to a parser

1 Answer 1

Use enum class

Prefer using switch-statements when enums are involved

Make preprocess_file() take istream and ostreams as parameters

Missing error checking

Performance

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Use `enum class`

Prefer using `switch`-statements when `enum`s are involved

Make `preprocess_file()` take `istream` and `ostream`s as parameters