Converting Confluence markup to a simpler XML format

Question 1

I have code that converts one XML structure to another. It removes unnecessary tags and temporarily replaces some others. I think I have far too many replace operations, and I am wondering whether there is a way to optimize this. Any suggestions?

Note: I am not using regex to convert XML to plain text. I use regex only to strip certain unwanted tags that the target XML format does not support, and to temporarily replace a few inline elements, which I restore after the conversion (so that I don't need to parse the inline elements). After sanitizing the string, I parse it with the jQuery XML parser.

Here's the code snippet:

 str = str.replace(/\r?\n|\r/g, ''); // Remove all newline characters
 // Replace <strong> tags with placeholders so they can be restored at the end
 str = str.replace(/<strong>/g, '(strong)');
 str = str.replace(/<\/strong>/g, '(/strong)');
 // Replace <code> tags with placeholders so they can be restored at the end
 str = str.replace(/<code>/g, '(code)');
 str = str.replace(/<\/code>/g, '(/code)');
 // Remove these tags as they aren't required
 str = str.replace(/<ac:rich-text-body>/g, '');
 str = str.replace(/<\/ac:rich-text-body>/g, '');
 // Remove the 'ac:' prefix from macro elements
 str = str.replace(/ac:/g, '');
 str = str.replace(/<\/*span.*?>/g, ''); // Remove all span tags
 str = str.replace(/<\/*div.*?>/g, ''); // Remove all div tags
 str = str.replace(/<br.*?>/g, ''); // Remove br tags
 str = str.replace(/&nbsp;/g, ''); // Remove non-breaking spaces
 str = str.replace(/<\/*a.*?>/g, ''); // Remove a tags
 str = str.replace(/<\/*u>/g, ''); // Remove u tags
 str = str.replace(/<\/*em>/g, ''); // Remove em tags

Comment 1

Could you provide more context about why you want to perform these transformations?

Comment 2

Replace multiple strings with multiple other strings
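
The technique that title refers to can be sketched roughly as follows: put the literal replacements from the question into one lookup table and apply them in a single pass. This is only a sketch; the names `replacements`, `escapeRegExp`, and `replaceMany` are hypothetical, and it covers only the fixed-string replacements, not the wildcard patterns such as the span and div removals.

```javascript
// Hypothetical lookup table: each literal token maps to its replacement.
// The entries mirror the fixed-string replacements in the question's snippet.
const replacements = {
  '<strong>': '(strong)',
  '</strong>': '(/strong)',
  '<code>': '(code)',
  '</code>': '(/code)',
  '<ac:rich-text-body>': '',
  '</ac:rich-text-body>': '',
  '&nbsp;': '',
};

// Escape regex metacharacters so that each key is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// One alternation over all keys, built once and reused.
const pattern = new RegExp(
  Object.keys(replacements).map(escapeRegExp).join('|'),
  'g'
);

// Single pass over the string: every match is looked up in the table.
function replaceMany(str) {
  return str.replace(pattern, (match) => replacements[match]);
}

const sample =
  '<ac:rich-text-body><strong>Hi</strong>&nbsp;<code>x</code></ac:rich-text-body>';
console.log(replaceMany(sample)); // (strong)Hi(/strong)(code)x(/code)
```

This reduces the number of passes over the string, but it is still plain string matching and does not make the approach any more aware of XML structure.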

Comment 3

This doesn't look like converting markup to simpler XML but rather like removing markup and converting it to plain text.

Comment 4

Okay, I realize now that using regex in XML parsing is a huge debate and a sensitive area! I have added a clarification and updated my question to avoid the confusion.


Answer 1 · Michael Kay · 2017-06-13 22:57:05Z

You are using the wrong tool for the job, as explained here:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Use XSLT for this job.

When I wrote the comment asking for more context, I was thinking that XSLT might be more appropriate. That said, I'm not a fan of SO question 1732348, whose top answer is just some silly unjustified rant. Anyway, the code in this question isn't actually trying to parse HTML.
– 200_success, Jun 13, 2017 at 23:09

Yes, the cited question is more of a rant than an explanation. But code that does str.replace(/<ac:rich-text-body>/g, '') is indeed trying to parse XML/HTML using regular expressions, and it will fail for the reasons cited in that post: (a) XML is not a regular language in the computer-science sense of the term, and (b) more practically, your code depends on accidental and non-reliable properties of the input, such as the choice of namespace prefixes, the absence of whitespace in places where it is allowed, etc. Using a proper XML parser will avoid these problems.
– Michael Kay, Jun 14, 2017 at 8:24

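Point (b) in the comment above can be illustrated with a small sketch. It applies the quoted pattern to three spellings of the same start tag; the prefix `conf` and the helper name `stripOpenTag` are hypothetical and chosen only for illustration.

```javascript
// The exact pattern quoted in the comment above.
const stripOpenTag = (xml) => xml.replace(/<ac:rich-text-body>/g, '');

// Three spellings of the same start tag. An XML parser would treat them alike,
// provided the prefixes are bound to the same namespace.
const variants = [
  '<ac:rich-text-body>text',   // the form the regex expects
  '<ac:rich-text-body >text',  // whitespace before ">" is legal XML
  '<conf:rich-text-body>text', // hypothetical: same namespace, different prefix
];

// Record whether the pattern removed anything from each variant.
const results = variants.map((xml) => ({
  xml,
  stripped: stripOpenTag(xml) !== xml,
}));

for (const r of results) {
  console.log(r.stripped, r.xml);
}
```

Only the first variant is stripped; the other two pass through unchanged even though they are equivalent markup.
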
(Of course, these criticisms might not apply if it's a one-off conversion of a single XML document and if you're planning to throw your code away once it's done its job.)
– Michael Kay, Jun 14, 2017 at 8:24

Okay, more context: This is markup from Confluence (storage format). I am indeed trying to parse it (using the jQuery XML parser) and then convert it to another, very strict XML format. But before that I do two things: 1. Remove all unnecessary formatting done by span, em, u, and div tags to get pure XML out. 2. Replace <strong> and <code> tags temporarily, so that I don't need to worry about parsing inline tags and can parse only block elements. I replace them back afterwards with other inline tags (specific to a customized DITA XML that I have).
– Sejal Parikh, Jun 14, 2017 at 9:29

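The two steps described in the comment above could be driven by small tables instead of one replace call per tag. The following is a minimal sketch under assumptions: the target names `b` and `codeph` are hypothetical stand-ins, since the thread does not name the customized DITA tags, and `sanitize` and `restore` are illustrative names. It is still regex-based, so the fragility concerns raised earlier in the thread apply to it as well.

```javascript
// Step 1: tags whose markup is dropped while their text content is kept.
const unwrapTags = ['span', 'div', 'u', 'em'];

// Step 2: inline tags to protect, mapped to hypothetical target tag names.
const inlineTags = { strong: 'b', code: 'codeph' };
const inlineNames = Object.keys(inlineTags).join('|');

// Matches opening and closing forms of the unwrap tags, with any attributes.
const unwrapPattern = new RegExp(`</?(?:${unwrapTags.join('|')})\\b[^>]*>`, 'g');
// Matches <strong>, </strong>, <code>, </code> (no attributes, as in the question).
const protectPattern = new RegExp(`<(/?)(${inlineNames})>`, 'g');
// Matches the placeholders (strong), (/strong), (code), (/code).
const restorePattern = new RegExp(`\\((/?)(${inlineNames})\\)`, 'g');

// Before parsing: drop formatting tags and hide inline tags as placeholders.
const sanitize = (str) =>
  str.replace(unwrapPattern, '').replace(protectPattern, '($1$2)');

// After the block-level conversion: turn placeholders into target tags.
const restore = (str) =>
  str.replace(restorePattern, (_, slash, name) => `<${slash}${inlineTags[name]}>`);

const input =
  '<div><p><span style="color: red">Use <strong>bold</strong> and <em><code>x</code></em></span></p></div>';
const sanitized = sanitize(input);
console.log(sanitized);          // <p>Use (strong)bold(/strong) and (code)x(/code)</p>
console.log(restore(sanitized)); // <p>Use <b>bold</b> and <codeph>x</codeph></p>
```

One caveat of the placeholder scheme: ordinary text that happens to contain "(code)" or "(strong)" would also be turned into a tag by `restore`.
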
A perfect use case for XSLT.
– Michael Kay, Jun 14, 2017 at 11:12
