0
\$\begingroup\$

I have code that converts one XML structure to another. I remove unnecessary tags and temporarily replace some others. I think I have far too many replace operations, and I am wondering whether there is a way to optimize this. Does anyone have suggestions?

Note: I am not using regex to convert XML to plain text. I only use regex to strip certain tags that the target XML format does not support, and to temporarily replace a few inline elements, which I restore after the conversion (so that I don't need to parse the inline elements). After sanitizing the string, I parse it with jQuery's XML parser.

Here's the code snippet:

 str = str.replace(/\r?\n|\r/g, ''); // Remove all newline characters
 // Replace <strong> tags with placeholders so they can be restored at the end
 str = str.replace(/<strong>/g, '(strong)');
 str = str.replace(/<\/strong>/g, '(/strong)');
 // Replace <code> tags with placeholders so they can be restored at the end
 str = str.replace(/<code>/g, '(code)');
 str = str.replace(/<\/code>/g, '(/code)');
 // Remove these tags, as they aren't required
 str = str.replace(/<ac:rich-text-body>/g, '');
 str = str.replace(/<\/ac:rich-text-body>/g, '');
 // Remove the 'ac:' prefix from macro elements
 str = str.replace(/ac:/g, '');
 str = str.replace(/<\/*span.*?>/g, ''); // Remove all span tags
 str = str.replace(/<\/*div.*?>/g, ''); // Remove all div tags
 str = str.replace(/<br.*?>/g, ''); // Remove br tags
 str = str.replace(/&nbsp;/g, ''); // Remove non-breaking spaces
 str = str.replace(/<\/*a.*?>/g, ''); // Remove a tags
 str = str.replace(/<\/*u>/g, ''); // Remove u tags
 str = str.replace(/<\/*em>/g, ''); // Remove em tags
asked Jun 13, 2017 at 20:13
\$\endgroup\$
  • 2
    \$\begingroup\$ Could you provide more context about why you want to perform these transformations? \$\endgroup\$ Commented Jun 13, 2017 at 20:55
  • \$\begingroup\$ Replace multiple strings with multiple other strings \$\endgroup\$ Commented Jun 14, 2017 at 10:22
  • \$\begingroup\$ This doesn't look like converting markup to simpler XML but rather like removing markup and converting it to plain text. \$\endgroup\$ Commented Jun 14, 2017 at 16:39
  • \$\begingroup\$ Okay, I realize now that using regex for XML parsing is a huge debate and a sensitive area! I have added a clarification and updated my question to avoid the confusion. \$\endgroup\$ Commented Jun 15, 2017 at 5:48
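
On the narrower question of collapsing the chain of `replace` calls, one possible approach is a single pass with one alternation pattern and a replacement callback. The following is only a minimal sketch: `PLACEHOLDERS`, `STRIP_TAGS`, `PATTERN` and `sanitize` are hypothetical names that do not appear in the question. It is also not byte-for-byte equivalent to the original chain. It uses `[^>]*` instead of `.*?`, and it requires the tag name to be followed by whitespace, `/` or `>`, so it would leave `<abbr>` alone where the original `a` pattern would delete it, and it strips `<u>` and `<em>` even when they carry attributes. It shares the general fragility of any regex-based approach to markup.

```javascript
// Inline tags that are swapped for placeholders and restored later.
const PLACEHOLDERS = new Map([
  ['<strong>', '(strong)'],
  ['</strong>', '(/strong)'],
  ['<code>', '(code)'],
  ['</code>', '(/code)'],
]);

// Tags that are deleted outright (opening, closing and self-closing forms).
const STRIP_TAGS = ['ac:rich-text-body', 'span', 'div', 'br', 'a', 'u', 'em'];

// One alternation covering every rule; tag alternatives come before the bare
// 'ac:' prefix so that whole tags are matched first at each '<'.
const PATTERN = new RegExp(
  [
    '\\r?\\n|\\r',                                        // line breaks
    '&nbsp;',                                             // non-breaking spaces
    '<\\/?(?:strong|code)>',                              // placeholder tags
    `<\\/?(?:${STRIP_TAGS.join('|')})(?=[\\s/>])[^>]*>`,  // tags to delete
    'ac:',                                                // leftover prefix
  ].join('|'),
  'g'
);

// Single pass: placeholder tags map to their token, everything else to ''.
function sanitize(str) {
  return str.replace(PATTERN, (match) => PLACEHOLDERS.get(match) || '');
}

const sample =
  '<ac:rich-text-body><div class="x"><strong>Hi</strong>&nbsp;' +
  '<a href="#">link</a><br/>\n<ac:link>t</ac:link></div></ac:rich-text-body>';

console.log(sanitize(sample)); // (strong)Hi(/strong)link<link>t</link>
```

Keeping the rules in data (`PLACEHOLDERS` and `STRIP_TAGS`) means adding a tag is a one-line change, and the string is scanned once rather than once per rule.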

1 Answer

2
\$\begingroup\$

You are using the wrong tool for the job, as explained here:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Use XSLT for this job.

answered Jun 13, 2017 at 22:57
\$\endgroup\$
  • \$\begingroup\$ When I wrote the comment asking for more context, I was thinking that XSLT might be more appropriate. That said, I'm not a fan of SO question 1732348, whose top answer is just some silly unjustified rant. Anyway, the code in this question isn't actually trying to parse HTML. \$\endgroup\$ Commented Jun 13, 2017 at 23:09
  • \$\begingroup\$ Yes, the cited question is more of a rant than an explanation. But code that does str.replace(/<ac:rich-text-body>/g, '') is indeed trying to parse XML/HTML using regular expressions, and it will fail for the reasons cited in that post: (a) XML is not a regular language in the computer-science sense of the term, and (b) more practically, your code depends on accidental and unreliable properties of the input, such as the choice of namespace prefixes, the absence of whitespace in places where it is allowed, etc. Using a proper XML parser will avoid these problems. \$\endgroup\$ Commented Jun 14, 2017 at 8:24
  • \$\begingroup\$ (Of course, these criticisms might not apply if it's a one-off conversion of a single XML document and if you're planning to throw your code away once it's done its job.) \$\endgroup\$ Commented Jun 14, 2017 at 8:24
  • \$\begingroup\$ Okay, more context: This is markup from Confluence (storage format). I am indeed trying to parse it (using the jQuery XML parser) and then convert it to another, rather strict XML format. But before that I am doing two things: 1. Remove all unnecessary formatting added by span, em, u, and div tags to get pure XML out. 2. Replace <strong> and <code> tags temporarily, so that I don't need to worry about parsing inline tags and can parse only block elements. I then replace them back with other inline tags (specific to a customized DITA XML that I have). \$\endgroup\$ Commented Jun 14, 2017 at 9:29
  • 2
    \$\begingroup\$ A perfect use case for XSLT. \$\endgroup\$ Commented Jun 14, 2017 at 11:12
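
To make the fragility argument from the comments concrete, here is a small sketch that runs three of the question's patterns against inputs they were presumably not written for. The input strings and the `results` object are hypothetical, chosen only for illustration; the regular expressions are copied verbatim from the question.

```javascript
// Patterns copied verbatim from the question.
const strongOpen = /<strong>/g;
const anchorTag = /<\/*a.*?>/g;
const acPrefix = /ac:/g;

// Hypothetical inputs, chosen only to illustrate the failure modes.
const results = {
  // Whitespace before '>' is legal XML, but the literal pattern no longer matches.
  strongWithSpace: '<strong >bold</strong >'.replace(strongOpen, '(strong)'),
  // The anchor pattern also swallows any other element whose name starts with "a".
  abbr: '<abbr>HTML</abbr>'.replace(anchorTag, ''),
  // The prefix pattern cannot tell markup from ordinary text.
  plainText: '<p>See mac: notes</p>'.replace(acPrefix, ''),
};

console.log(results);
// strongWithSpace: '<strong >bold</strong >'  (unchanged, tag missed)
// abbr:            'HTML'                     (abbr element stripped)
// plainText:       '<p>See m notes</p>'       (text content damaged)
```

Whether any of these inputs actually occurs depends on the source document, which is the point of the one-off caveat above: the patterns work only as long as the input happens to avoid such cases.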
