2
\$\begingroup\$

I'm working on an application that allows users to edit and fix XML. Part of this is formatting the XML for better readability.

As the XML might be invalid, the existing methods I found for formatting (like XmlWriter or XDocument) don't work for me.
There might be all sorts of problems with the XML, although the most common is unescaped special characters.

public static string FormatXml(string xml)
{
    var tags = xml
        .Split('<')
        .Select(tag => tag.TrimEnd().EndsWith(">") ? tag.TrimEnd() : tag); //Trim whitespace between tags, but not at the end of values
    var previousTag = tags.First(); //Preserve content before the first tag, e.g. if the initial < is missing
    var formattedXml = new StringBuilder(previousTag);
    var indention = 0;

    foreach (var tag in tags.Skip(1))
    {
        if (previousTag.EndsWith(">"))
        {
            formattedXml.AppendLine();
            if (tag.StartsWith("/"))
            {
                indention = Math.Max(indention - 1, 0);
                formattedXml.Append(new string('\t', indention));
            }
            else
            {
                formattedXml.Append(new string('\t', indention));
                if (!tag.EndsWith("/>"))
                {
                    indention++;
                }
            }
        }
        else
        {
            indention = Math.Max(indention - 1, 0);
        }
        formattedXml.Append("<");
        formattedXml.Append(tag);
        previousTag = tag;
    }
    return formattedXml.ToString();
}

So far, the method produces reasonable output for all the cases I came up with.

I'm mostly worried that I missed some special cases of valid XML that would get messed up.

asked Dec 2, 2020 at 11:36
\$\endgroup\$
2
  • \$\begingroup\$ Is the XML passed to the method before or after the user edits it? \$\endgroup\$ Commented Dec 2, 2020 at 15:01
  • \$\begingroup\$ @Heslacher: The method is invoked by the user through a 'Format XML' button. \$\endgroup\$ Commented Dec 2, 2020 at 16:13

1 Answer

5
\$\begingroup\$

There's a test suite of 2000 test cases available at https://www.w3.org/XML/Test/ - try it out.
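One possible way to use such test files is a round-trip check: parse the original, format it, parse the result, and compare the two trees. The sketch below is only an illustration. `RoundTripCheck` and `SurvivesFormatting` are hypothetical names, the formatter is passed in as a delegate, and only `XmlException` is treated as "not well-formed". With the default load options, whitespace-only text nodes are dropped, so pure indentation changes do not count as differences, while changes inside text values do.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class RoundTripCheck
{
    // Returns true if the formatted text parses to the same tree as the original,
    // false if it parses to a different tree or no longer parses at all,
    // and null if the original is not well-formed (nothing to compare against).
    public static bool? SurvivesFormatting(string xml, Func<string, string> format)
    {
        XDocument before;
        try
        {
            before = XDocument.Parse(xml);
        }
        catch (XmlException)
        {
            return null;
        }

        try
        {
            var after = XDocument.Parse(format(xml));
            return XNode.DeepEquals(before, after);
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Calling `RoundTripCheck.SurvivesFormatting(text, FormatXml)` for each file's content would then flag the well-formed inputs that the formatter alters.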

From a quick glance, it's not clear to me how you're handling content within comments or CDATA sections - which might be well-formed XML, or it might be something approximating to well-formed XML.
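To illustrate the concern: splitting on every `<` also splits inside comments and CDATA sections, so their content gets indented as if it were markup. A minimal sketch of one alternative is shown below. It is not the code from the question. `XmlChunker` and `Split` are hypothetical names, and the approach assumes that comments, CDATA sections and processing instructions should be passed through as single opaque chunks.

```csharp
using System;
using System.Collections.Generic;

public static class XmlChunker
{
    // Constructs whose content must not be scanned for tags.
    private static readonly (string Open, string Close)[] Opaque =
    {
        ("<!--", "-->"),
        ("<![CDATA[", "]]>"),
        ("<?", "?>"),
    };

    // Splits the input into text chunks, tag chunks and opaque chunks.
    // Concatenating the chunks always reproduces the input exactly.
    public static List<string> Split(string xml)
    {
        var chunks = new List<string>();
        var pos = 0;
        while (pos < xml.Length)
        {
            var next = xml.IndexOf('<', pos);
            if (next < 0) next = xml.Length;
            if (next > pos)
            {
                // Plain text up to the next '<' (or the end of the input).
                chunks.Add(xml.Substring(pos, next - pos));
                pos = next;
                continue;
            }

            var end = -1;
            foreach (var (open, close) in Opaque)
            {
                if (string.CompareOrdinal(xml, pos, open, 0, open.Length) != 0) continue;
                var closeAt = xml.IndexOf(close, pos + open.Length, StringComparison.Ordinal);
                // An unterminated comment/CDATA/PI swallows the rest of the input.
                end = closeAt < 0 ? xml.Length : closeAt + close.Length;
                break;
            }

            if (end < 0)
            {
                // Ordinary tag: ends at the next '>', or just before the next '<'
                // if the '>' is missing (tolerates broken input).
                var gt = xml.IndexOf('>', pos + 1);
                var lt = xml.IndexOf('<', pos + 1);
                if (gt >= 0 && (lt < 0 || gt < lt)) end = gt + 1;
                else end = lt < 0 ? xml.Length : lt;
            }

            chunks.Add(xml.Substring(pos, end - pos));
            pos = end;
        }
        return chunks;
    }
}
```

A formatter built on these chunks could indent only the ordinary tags and emit opaque chunks unchanged. Note that the sketch still treats a `>` inside an attribute value as the end of the tag.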

Another comment is that messing with whitespace is dangerous in mixed content. With inline markup (bold, italic etc) preserving whitespace as written may be important.
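As a rough illustration of what "mixed content" means in practice, the sketch below flags elements that have both child elements and non-whitespace text. It assumes the input happens to be well-formed (`XElement.Parse` throws an `XmlException` otherwise), so it could only serve as a guard for the well-formed case. `MixedContentCheck`, `IsMixed` and `ContainsMixedContent` are hypothetical names.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class MixedContentCheck
{
    // An element is treated as mixed if it has child elements
    // and at least one text node with non-whitespace characters.
    public static bool IsMixed(XElement element) =>
        element.Elements().Any() &&
        element.Nodes().OfType<XText>().Any(t => !string.IsNullOrWhiteSpace(t.Value));

    // True if any element in the (well-formed) input has mixed content.
    public static bool ContainsMixedContent(string wellFormedXml) =>
        XElement.Parse(wellFormedXml).DescendantsAndSelf().Any(IsMixed);
}
```

For example, `<p>Hello <b>big</b> world</p>` is mixed, and adding line breaks and tabs around `<b>` would change the text the reader sees, whereas `<a><b>x</b></a>` can be indented freely.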

answered Dec 2, 2020 at 16:36
\$\endgroup\$
1
  • \$\begingroup\$ +1 I'll have a look at the test cases. Mixed content might be problematic. In my specific use case it's not a concern, but in general my code would need some major changes to handle it. \$\endgroup\$ Commented Dec 3, 2020 at 10:23
