I have code that converts one XML structure to another. It removes unnecessary tags and temporarily replaces some others. I think I have far too many replace operations and am wondering whether there is a way to optimize this. Can anyone suggest improvements?
Note: I am not using regex to convert XML to plain text. All I am doing with regex is stripping certain unwanted tags that the target XML format does not support, and temporarily replacing a few inline elements, which I restore after the conversion (so that I don't need to parse the inline elements). After sanitizing the string, I parse it using the jQuery XML parser.
Here's the code snippet:
str = str.replace(/\r?\n|\r/g, ''); //Remove all new line characters
// Replace <strong> tags to retain and convert them in the end
str = str.replace(/<strong>/g, '(strong)');
str = str.replace(/<\/strong>/g, '(/strong)');
// Replace <code> tags to retain and convert them in the end
str = str.replace(/<code>/g, '(code)');
str = str.replace(/<\/code>/g, '(/code)');
// Remove these tags as they aren't required
str = str.replace(/<ac:rich-text-body>/g, '');
str = str.replace(/<\/ac:rich-text-body>/g, '');
// Remove 'ac:' from macros elements
str = str.replace(/ac:/g, '');
str = str.replace(/<\/*span.*?>/g, ''); //Remove all span tags
str = str.replace(/<\/*div.*?>/g, ''); //Remove all div tags
str = str.replace(/<br.*?>/g, ''); //Remove br tags
str = str.replace(/&nbsp;|\u00a0/g, ''); //Remove non-breaking white spaces (entity or literal character)
str = str.replace(/<\/*a.*?>/g, ''); //Remove a tags
str = str.replace(/<\/*u>/g, ''); //Remove u tags
str = str.replace(/<\/*em>/g, ''); //Remove em tags
-
Could you provide more context about why you want to perform these transformations? – 200_success, Jun 13, 2017 at 20:55
-
Replace multiple strings with multiple other strings – woxxom, Jun 14, 2017 at 10:22
-
This doesn't look like converting markup to simpler XML but rather like removing markup and converting it to plain text. – t3chb0t, Jun 14, 2017 at 16:39
-
Okay, I realize now that using regex in XML parsing is a huge debate, and a sensitive area! I have added a clarification and updated my question to avoid the confusion. – Sejal Parikh, Jun 15, 2017 at 5:48
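As a rough illustration of the "replace multiple strings" idea linked in the comments, the pure deletions can be folded into one alternation and the placeholder swaps into one lookup-table callback. This is only a sketch: the names `placeholders`, `removals`, and `sanitize` are made up here, and the tag patterns are deliberately stricter than the originals (for example, `<a ...>` is matched without also matching every other tag whose name starts with `a`), so the behavior is similar to the question's code but not identical.

```javascript
// Placeholder text for inline tags that are restored after parsing.
// (Hypothetical table name; the values mirror the question's code.)
const placeholders = {
  '<strong>': '(strong)',
  '</strong>': '(/strong)',
  '<code>': '(code)',
  '</code>': '(/code)',
};

// One alternation covering everything that is simply deleted.
const removals = new RegExp([
  '\\r?\\n|\\r',                              // line breaks
  '<\\/?ac:rich-text-body>',                  // wrapper element
  '<\\/?(?:span|div|a|br)(?:\\s[^>]*)?\\/?>', // span, div, a, br, with or without attributes
  '<\\/?(?:u|em)>',                           // u and em tags
  '&nbsp;',                                   // non-breaking space entities
].join('|'), 'g');

function sanitize(str) {
  return str
    .replace(removals, '')                                         // pass 1: deletions
    .replace(/<\/?(?:strong|code)>/g, (tag) => placeholders[tag])  // pass 2: placeholders
    .replace(/ac:/g, '');                                          // pass 3: strip the 'ac:' prefix
}
```

This reduces sixteen passes over the string to three. It does not address the objection that regular expressions are a fragile way to process XML; it only shortens the existing approach.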
1 Answer
You are using the wrong tool for the job, as explained here:
https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Use XSLT for this job.
-
When I wrote the comment asking for more context, I was thinking that XSLT might be more appropriate. That said, I'm not a fan of SO question 1732348, whose top answer is just some silly unjustified rant. Anyway, the code in this question isn't actually trying to parse HTML. – 200_success, Jun 13, 2017 at 23:09
-
Yes, the cited question is more of a rant than an explanation. But code that does
str.replace(/<ac:rich-text-body>/g, '')
is indeed trying to parse XML/HTML using regular expressions, and it will fail for the reasons cited in that post: (a) XML is not a regular language in the computer-science sense of the term, and (b) more practically, your code depends on accidental and unreliable properties of the input, such as the choice of namespace prefixes, the absence of whitespace in places where it is allowed, etc. Using a proper XML parser will avoid these problems. – Michael Kay, Jun 14, 2017 at 8:24
-
(Of course, these criticisms might not apply if it's a one-off conversion of a single XML document and if you're planning to throw your code away once it's done its job.) – Michael Kay, Jun 14, 2017 at 8:24
-
Okay, more context: This is markup from Confluence (storage format). I am indeed trying to parse it (using the jQuery XML parser) and then convert it to another, very strict XML format. But before that I am doing two things: 1. Remove all unnecessary formatting done by span, em, u, and div tags to get pure XML out. 2. Replace <strong> and <code> tags temporarily, so that I don't need to worry about parsing inline tags and only parse block elements. I then replace the placeholders with other inline tags (specific to a customized DITA XML that I have). – Sejal Parikh, Jun 14, 2017 at 9:29
-
A perfect use case for XSLT. – Michael Kay, Jun 14, 2017 at 11:12
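The practical fragility raised in the comments (dependence on the namespace prefix and on the absence of optional whitespace) can be shown with a small sketch. The helper `stripBody` and the sample strings are invented for illustration, and the namespace URI `urn:example` is a stand-in, not the real Confluence one.

```javascript
// The exact pattern quoted in the comments above.
const stripBody = (s) => s.replace(/<ac:rich-text-body>/g, '');

const variants = [
  '<ac:rich-text-body>text',                      // the one spelling the regex expects
  '<ac:rich-text-body >text',                     // whitespace before '>' is legal XML
  '<x:rich-text-body xmlns:x="urn:example">text', // same element name, different prefix
];

// Only the first variant is changed; the other two pass through untouched.
const results = variants.map(stripBody);
console.log(results);
```

An XML parser treats the first two spellings as the same start tag, and a namespace-aware one resolves prefixes to their URIs rather than comparing the prefix text, which is why the commenters suggest parsing rather than pattern matching.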