Parse an HTML string with JS

Question 1

I want to parse a string which contains HTML text. I want to do it in JavaScript.

I tried the Pure JavaScript HTML Parser library, but it seems to parse the HTML of my current page rather than the string, because when I try the code below it changes the title of my page:

var parser = new HTMLtoDOM("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>", document);

My goal is to extract links from an external HTML page that I read as a string.

Do you know an API to do it?

Question 2

possible duplicate of JavaScript DOMParser access innerHTML and other properties

Question 3

The method on the linked duplicate creates a HTML document from a given string. Then, you can use doc.getElementsByTagName('a') to read the links (or even doc.links).
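
As a minimal sketch of what this comment describes, assuming a browser where DOMParser is available: the helper name extractLinks and its optional second parameter (present only so a different parser constructor can be swapped in) are illustrative and not part of the linked answer.

```javascript
// Hypothetical helper: parse an HTML string into a detached document and
// return the href attribute of every anchor that has one.
// ParserCtor defaults to the browser's DOMParser; the parameter is an
// assumption made here so another implementation can be substituted.
function extractLinks(html, ParserCtor = globalThis.DOMParser) {
  const doc = new ParserCtor().parseFromString(html, 'text/html');
  const anchors = doc.getElementsByTagName('a');
  // getAttribute returns the value as written in the markup (or null)
  return Array.from(anchors, (a) => a.getAttribute('href'))
    .filter((href) => href !== null);
}
```

In a browser, extractLinks("<a href='test0'>test01</a>") should give an array with the single string 'test0'.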

Question 4

It's worth mentioning that if you're using a framework like React.js then there may be ways of doing it that are specific to the framework such as: stackoverflow.com/questions/23616226/…

Question 5

Does this answer your question? Strip HTML from Text JavaScript

Question 6

Create a dummy DOM element and add the string to it. Then, you can manipulate it like any DOM element.

var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";
el.getElementsByTagName( 'a' ); // Live NodeList of your anchor elements

Edit: adding a jQuery answer to please the fans!

var el = $( '<div></div>' );
el.html("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>");
$('a', el) // All the anchor elements

Question 7

Just a note: With this solution, if I do a "alert(el.innerHTML)", I lose the <html>, <body> and <head> tag....

Question 8

@stage I'm a little bit late to the party, but you should be able to use document.createElement('html'); to preserve the <head> and <body> tags.

Question 9

it looks like you are putting an html element within an html element

Question 10

I'm concerned this is upvoted as the top answer. The parse() solution below is more reusable and elegant.

Question 11

Security note: this will execute any script in the input, and thus is unsuitable for untrusted input.

Question 12

It's quite simple:

const parser = new DOMParser();
const htmlDoc = parser.parseFromString(txt, 'text/html');
// do whatever you want with htmlDoc.getElementsByTagName('a');

According to MDN, to do this in chrome you need to parse as XML like so:

const parser = new DOMParser();
const htmlDoc = parser.parseFromString(txt, 'text/xml');
// do whatever you want with htmlDoc.getElementsByTagName('a');

~~It is currently unsupported by webkit and you'd have to follow Florian's answer, and it is unknown to work in most cases on mobile browsers.~~

Edit: Now widely supported

Question 13

Worth noting that in 2016 DOMParser is now widely supported. caniuse.com/#feat=xml-serializer

Question 14

Worth noting that all relative links in the created document are broken, because the document gets created by inheriting the documentURL of window, which most likely differs from the URL of the string.
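
One way to work around this, sketched under the assumption that you know the URL the HTML string was fetched from: read the href as written in the markup and resolve it yourself with the standard URL constructor. The helper name resolveHref and the example.com address are made up for illustration.

```javascript
// Hypothetical helper: resolve an href as written in the markup against the
// URL the markup came from, instead of the current window's URL.
function resolveHref(rawHref, pageUrl) {
  try {
    return new URL(rawHref, pageUrl).href;
  } catch (err) {
    return null; // could not be resolved to a valid URL
  }
}

// Example values (made up for illustration)
const pageUrl = 'https://example.com/docs/index.html';
console.log(resolveHref('test0', pageUrl));  // https://example.com/docs/test0
console.log(resolveHref('/about', pageUrl)); // https://example.com/about
```

In the browser, pass a.getAttribute('href') rather than a.href, because a.href has already been resolved against the parsed document's own base URL.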

Question 15

Worth noting that you should only call new DOMParser once and then reuse that same object throughout the rest of your script.
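
If you want to follow that advice, one possible shape is a lazily created parser shared by every call. This is only a sketch: getParser and parseWithSharedParser are hypothetical names, and the ParserCtor parameter exists solely so the constructor can be substituted.

```javascript
// Lazily create one parser and reuse it for every call.
// ParserCtor defaults to the browser's DOMParser (assumption: browser
// environment); it is only consulted the first time a parser is needed.
let sharedParser = null;

function getParser(ParserCtor = globalThis.DOMParser) {
  if (sharedParser === null) {
    sharedParser = new ParserCtor();
  }
  return sharedParser;
}

function parseWithSharedParser(html, ParserCtor) {
  return getParser(ParserCtor).parseFromString(html, 'text/html');
}
```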

Question 16

The parse() solution below is more reusable and specific to HTML. This is nice if you need an XML document, however.

Question 17

Security note: this will execute without any browser context, so no scripts will run. It should be suitable for untrusted input.

Question 18

EDIT: The solution below is only for HTML "fragments" since html,head and body are removed. I guess the solution for this question is DOMParser's parseFromString() method:

const parser = new DOMParser();
const document = parser.parseFromString(html, "text/html");

For HTML fragments, the solutions listed here work for most HTML, but not in certain cases.

For example, try parsing <td>Test</td>. This won't work with the div.innerHTML solution, DOMParser.prototype.parseFromString, or the range.createContextualFragment solution. The td tag goes missing and only the text remains.

Only jQuery handles that case well.

So the future solution (MS Edge 13+) is to use the template tag:

function parseHTML(html) {
 var t = document.createElement('template');
 t.innerHTML = html;
 return t.content;
}
var documentFragment = parseHTML('<td>Test</td>');

For older browsers I have extracted jQuery's parseHTML() method into an independent gist - https://gist.github.com/Munawwar/6e6362dbdf77c7865a99

Question 19

If you want to write forward-compatible code that also works on old browsers you can polyfill the <template> tag. It depends on custom elements which you may also need to polyfill. In fact you might just want to use webcomponents.js to polyfill custom elements, templates, shadow dom, promises, and a few other things all at one go.

Question 20

Wow. Extremely efficient!

Question 21

var doc = new DOMParser().parseFromString(html, "text/html");
var links = doc.querySelectorAll("a");

Question 22

Why are you prefixing $? Also, as mentioned in the linked duplicate, text/html is not supported very well, and has to be implemented using a polyfill.

Question 23

I copied this line from a project. I'm used to prefixing variables with $ in JavaScript applications (not in libraries). It's just to avoid conflicts with a library. That's not very useful, as almost every variable is scoped, but it used to be. It may also help to identify variables easily.

Question 24

Sadly, DOMParser doesn't work with text/html in Chrome either; this MDN page gives a workaround.

Question 25

Security note: this will execute without any browser context, so no scripts will run. It should be suitable for untrusted input.

Question 26

const parse = Range.prototype.createContextualFragment.bind(document.createRange());
document.body.appendChild( parse('<p><strong>Today is:</strong></p>') );
document.body.appendChild( parse(`<p style="background: #eee">${new Date()}</p>`) );

Only valid child Nodes within the parent Node (start of the Range) will be parsed. Otherwise, unexpected results may occur:

// <body> is "parent" Node, start of Range
const parseRange = document.createRange();
const parse = Range.prototype.createContextualFragment.bind(parseRange);
// Returns Text "1 2" because td, tr, tbody are not valid children of <body>
parse('<td>1</td> <td>2</td>');
parse('<tr><td>1</td> <td>2</td></tr>');
parse('<tbody><tr><td>1</td> <td>2</td></tr></tbody>');
// Returns <table>, which is a valid child of <body>
parse('<table> <td>1</td> <td>2</td> </table>');
parse('<table> <tr> <td>1</td> <td>2</td> </tr> </table>');
parse('<table> <tbody> <td>1</td> <td>2</td> </tbody> </table>');
// <tr> is parent Node, start of Range
parseRange.setStart(document.createElement('tr'), 0);
// Returns [<td>, <td>] element array
parse('<td>1</td> <td>2</td>');
parse('<tr> <td>1</td> <td>2</td> </tr>');
parse('<tbody> <td>1</td> <td>2</td> </tbody>');
parse('<table> <td>1</td> <td>2</td> </table>');

Question 27

Security note: this will execute any script in the input, and thus is unsuitable for untrusted input.

Question 28

The fastest way to parse HTML in Chrome and Firefox is Range#createContextualFragment:

var range = document.createRange();
range.selectNode(document.body); // required in Safari
var fragment = range.createContextualFragment('<h1>html...</h1>');
var firstNode = fragment.firstChild;

I would recommend creating a helper function that uses createContextualFragment if available and falls back to innerHTML otherwise.

Benchmark: http://jsperf.com/domparser-vs-createelement-innerhtml/3
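
The helper recommended above could look roughly like the following. It is a sketch rather than a tested cross-browser implementation: the name parseFragment and the optional doc parameter (defaulting to the page's document) are assumptions introduced here.

```javascript
// Hypothetical helper: use Range#createContextualFragment when available,
// otherwise fall back to assigning innerHTML on a throwaway container.
function parseFragment(html, doc = globalThis.document) {
  if (typeof doc.createRange === 'function') {
    const range = doc.createRange();
    if (typeof range.createContextualFragment === 'function') {
      range.selectNode(doc.body); // the answer above notes Safari needs this
      return range.createContextualFragment(html);
    }
  }
  // Fallback: let innerHTML do the parsing, then move the nodes to a fragment
  const container = doc.createElement('div');
  container.innerHTML = html;
  const fragment = doc.createDocumentFragment();
  while (container.firstChild) {
    fragment.appendChild(container.firstChild); // appendChild moves the node
  }
  return fragment;
}
```

Both paths parse in the context of the page's document, so the security notes in the comments on these answers apply to this helper as well.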

Question 29

Note that, like (the simple) innerHTML, this will execute an <img>’s onerror.

Question 30

An issue with this is that HTML like '<td>test</td>' would ignore the td in the document.body context (and only create a 'test' text node). OTOH, if it is used internally in a templating engine, then the right context would be available.

Question 31

Also BTW, IE 11 supports createContextualFragment.

Question 32

The question was how to parse with JS - not Chrome or Firefox

Question 33

Security note: this will execute any script in the input, and thus is unsuitable for untrusted input.

Question 34

The following function parseHTML will return either:

a Document when your file starts with a doctype, or
a DocumentFragment when your file doesn't start with a doctype.

The code :

function parseHTML(markup) {
  if (markup.toLowerCase().trim().indexOf('<!doctype') === 0) {
    // Full document: build a detached HTML document and fill it in
    var doc = document.implementation.createHTMLDocument("");
    doc.documentElement.innerHTML = markup;
    return doc;
  } else if ('content' in document.createElement('template')) {
    // Template tag exists!
    var el = document.createElement('template');
    el.innerHTML = markup;
    return el.content;
  } else {
    // Template tag doesn't exist!
    var docfrag = document.createDocumentFragment();
    var body = document.createElement('body');
    body.innerHTML = markup;
    // appendChild moves each node out of body, so keep taking the first child
    while (body.firstChild) {
      docfrag.appendChild(body.firstChild);
    }
    return docfrag;
  }
}

How to use :

var links = parseHTML('<!doctype html><html><head></head><body><a>Link 1</a><a>Link 2</a></body></html>').getElementsByTagName('a');

Question 35

I couldn't get this to work on IE8. I get the error "Object doesn't support this property or method" for the first line in the function. I don't think the createHTMLDocument function exists

Question 36

What exactly is your use case? If you just want to parse HTML and your HTML is intended for the body of your document, you could do the following : (1) var div=document.createElement("DIV"); (2) div.innerHTML = markup; (3) result = div.childNodes; --- This gives you a collection of childnodes and should work not just in IE8 but even in IE6-7.

Question 37

Thanks for the alternate option, I'll try it if I need to do this again. For now though I used the JQuery solution above.

Question 38

@SebastianCarroll Note that IE8 doesn't support the trim method on strings. See stackoverflow.com/q/2308134/3210837.

Question 39

@Toothbrush : Is IE8 support still relevant at the dawn of 2017?

Question 40

1 Way

Use document.cloneNode()

Performance is:

Call to document.cloneNode() took ~0.22499999977299012 milliseconds.

and may be higher.

var t0, t1, html;
t0 = performance.now();
 html = document.cloneNode(true);
t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<!DOCTYPE html><html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';
console.log(html.getElementById("test1"));

2 Way

Use document.implementation.createHTMLDocument()

Performance is:

Call to document.implementation.createHTMLDocument() took ~0.14000000010128133 milliseconds.

var t0, t1, html;
t0 = performance.now();
html = document.implementation.createHTMLDocument("test");
t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<!DOCTYPE html><html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';
console.log(html.getElementById("test1"));

3 Way

Use document.implementation.createDocument()

Performance is:

Call to document.implementation.createDocument() took ~0.14000000010128133 milliseconds.

var t0 = performance.now();
 html = document.implementation.createDocument('', 'html', 
 document.implementation.createDocumentType('html', '', '')
 );
var t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<html><head><title>Test</title></head><body><div id="test1">test</div></body></html>';
console.log(html.getElementById("test1"));

4 Way

Use new Document()

Performance is:

Call to new Document() took ~0.13499999840860255 milliseconds.

Note

ParentNode.append was an experimental technology as of 2020.

var t0, t1, html;
t0 = performance.now();
//---------------
html = new Document();
html.append(
 html.implementation.createDocumentType('html', '', '')
);
 
html.append(
 html.createElement('html')
);
//---------------
t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';
console.log(html.getElementById("test1"));

Question 41

To do this in node.js, you can use an HTML parser like node-html-parser. The syntax looks like this:

import { parse } from 'node-html-parser';
const root = parse('<ul id="list"><li>Hello World</li></ul>');
console.log(root.firstChild.structure);
// ul#list
// li
// #text
console.log(root.querySelector('#list'));
// { tagName: 'ul',
// rawAttrs: 'id="list"',
// childNodes:
// [ { tagName: 'li',
// rawAttrs: '',
// childNodes: [Object],
// classNames: [] } ],
// id: 'list',
// classNames: [] }
console.log(root.toString());
// <ul id="list"><li>Hello World</li></ul>
root.set_content('<li>Hello World</li>');
root.toString(); // <li>Hello World</li>

Question 42

This is the best solution even in the browser if you do not want to rely on the browser's implementation. This implementation will always behave the same no matter which browser you are on (not that it matters much nowadays), and the parsing is done in JavaScript itself instead of C/C++!

Question 43

Thanks @Rainb. How do you use the solution in the browser though?

Question 44

like this

(await import("https://cdn.skypack.dev/node-html-parser")).default('<ul id="list"><li>Hello World</li></ul>').firstChild.structure

Question 45

I never knew that was an option. Can you do that with any node library, or is it because this one doesn't use any node-only code?

Question 46

If it requires anything from Node like tls, http, net, or fs, then it probably won't work in the browser. But it won't work in Deno either, so just look for Deno-compatible packages.

Question 47

I think the best way is to use the DOMParser API like this:

//Table string in HTML format
const htmlString = '<table><tbody><tr><td>Cell 1</td><td>Cell 2</td></tr></tbody></table>';
//Parse using DOMParser native way
const parser = new DOMParser();
const $newTable = parser.parseFromString(htmlString, 'text/html');
//Here you can select parts of your parsed html and work with it
const $row = $newTable.querySelector('table > tbody > tr');
//Here i'm printing the number of columns (2)
const $containerHtml = document.getElementById('containerHtml');
$containerHtml.innerHTML = ['Your parsed table has', $row.cells.length, 'columns.'].join(' ');

<div id="containerHtml"></div>

Question 48

If you're open to using jQuery, it has some nice facilities for creating detached DOM elements from strings of HTML. These can then be queried through the usual means, E.g.:

var html = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";
var anchors = $('<div/>').append(html).find('a').get();

Edit - just saw @Florian's answer which is correct. This is basically exactly what he said, but with jQuery.

Question 49

const html =
`<script>
 alert('👋 there ! Wanna grab a 🍺'); 
</script>`;
const scriptEl = document.createRange().createContextualFragment(html);
// Append to an element of your choice; document.body is used here as an example
document.body.append(scriptEl);

I found this solution, and I think it's the best one: it parses the HTML and executes the script inside.

Question 50

I used the DOMParser class, which I learned about from this blog.

This returns an HTMLCollection whose items can be accessed as DOM elements. It is also easy to insert the result into the HTML document with document.body.append(...parseHTML(html_string));

const parseHTML = (htmlString) => {
 const parser = new DOMParser();
 const page = parser.parseFromString(htmlString, 'text/html');
 return page.body.children;
};

Thank me later.

Question 51

this worked for me, but I used page.body because I needed to query for a specific element and I had an issue doing that when also selecting children.

Question 52

Modern browsers support the new Document.parseHTMLUnsafe() static method:

let myDocument = Document.parseHTMLUnsafe("<p>Hello</p>");

Notes:

<script> elements are not evaluated (executed) during parsing
resulting document's encoding will always be UTF-8
resulting document's URL will be about:blank
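
Since older browsers lack this method, a feature check with a DOMParser fallback is one option. The sketch below assumes a browser environment; parseHtmlDocument and its env parameter (defaulting to the global object) are names introduced here for illustration.

```javascript
// Hypothetical helper: prefer Document.parseHTMLUnsafe when the environment
// provides it, otherwise fall back to DOMParser.
// `env` defaults to the global object and is a parameter only so the choice
// between the two paths can be exercised in isolation.
function parseHtmlDocument(html, env = globalThis) {
  const DocumentCtor = env.Document;
  if (DocumentCtor && typeof DocumentCtor.parseHTMLUnsafe === 'function') {
    return DocumentCtor.parseHTMLUnsafe(html);
  }
  return new env.DOMParser().parseFromString(html, 'text/html');
}
```

The two paths are not guaranteed to produce identical documents (for example, the resulting document URL can differ), so treat the fallback as an approximation.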

Question 53

I had to use the innerHTML of an element parsed in a popover of Angular NGX Bootstrap. This is the solution that worked for me.

public htmlContainer = document.createElement( 'html' );

in constructor

this.htmlContainer.innerHTML = ''; setTimeout(() => { this.convertToArray(); });

 convertToArray() {
 const shapesHC = document.getElementsByClassName('weekPopUpDummy');
 const shapesArrHCSpread = [...(shapesHC as any)];
 this.htmlContainer = shapesArrHCSpread[0];
 this.htmlContainer.innerHTML = shapesArrHCSpread[0].textContent;
 }

in html

<div class="weekPopUpDummy" [popover]="htmlContainer.innerHTML" [adaptivePosition]="false" placement="top" [outsideClick]="true" #popOverHide="bs-popover" [delay]="150" (onHidden)="onHidden(weekEvent)" (onShown)="onShown()">

Question 54

function parseElement(raw){
 let el = document.createElement('div');
 el.innerHTML = raw;
 let res = el.querySelector('*');
 res.remove();
 return res;
}

Note: the raw string should contain no more than one element.

Question 55

let content = "<center><h1>404 Not Found</h1></center>"
let result = $("<div/>").html(content).text()

content: <center><h1>404 Not Found</h1></center>,
result: "404 Not Found"

Question 56

This does not answer the question. The OP wants to extract links.

Florian Margaine 61.3k15 gold badges94 silver badges120 bronze badges · Accepted Answer · 2012-05-14 14:14:36Z

515


Answered May 14, 2012 at 14:14 by Florian Margaine; edited May 20, 2015 at 17:42 by omninonsense.

stage (2012-05-14T15:10:17.663Z): Just a note: With this solution, if I do a "alert(el.innerHTML)", I lose the <html>, <body> and <head> tag....

omninonsense (2015-05-20T17:21:21.667Z): @stage I'm a little bit late to the party, but you should be able to use document.createElement('html'); to preserve the <head> and <body> tags.

symbiont (2017-08-16T11:39:05.45Z): it looks like you are putting an html element within an html element

Justin (2019-03-07T17:36:47.217Z): I'm concerned this is upvoted as the top answer. The parse() solution below is more reusable and elegant.

Leif Arne Storset (2020-03-11T13:55:01.667Z): Security note: this will execute any script in the input, and thus is unsuitable for untrusted input.
