I want to parse a string that contains HTML, using JavaScript.
I tried the Pure JavaScript HTML Parser library, but it seems to parse the HTML of my current page rather than a string: when I try the code below, it changes the title of my page:
var parser = new HTMLtoDOM("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>", document);
My goal is to extract links from an external HTML page that I have read in as a string.
Do you know of an API that can do this?
Create a dummy DOM element and add the string to it. Then, you can manipulate it like any DOM element.
var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";
el.getElementsByTagName( 'a' ); // Live HTMLCollection of your anchor elements
Edit: adding a jQuery answer to please the fans!
var el = $( '<div></div>' );
el.html("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>");
$('a', el) // All the anchor elements
Comment: use document.createElement('html'); to preserve the <head> and <body> tags.
It's quite simple:
const parser = new DOMParser();
const htmlDoc = parser.parseFromString(txt, 'text/html');
// do whatever you want with htmlDoc.getElementsByTagName('a');
According to MDN at the time of writing, doing this in older versions of Chrome required parsing as XML, which only works when the markup is well-formed XML:
const parser = new DOMParser();
const htmlDoc = parser.parseFromString(txt, 'text/xml');
// do whatever you want with htmlDoc.getElementsByTagName('a');
(Struck out) It is currently unsupported by WebKit, so you'd have to follow Florian's answer, and it is not known to work in most cases on mobile browsers. (end of struck-out text)
Edit: Now widely supported
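Since the original goal was to extract links from an external page, here is a minimal sketch of that last step. It assumes the document was produced by DOMParser as above; the function name extractLinks and the baseUrl parameter (the address the HTML string was downloaded from) are my own illustration, not part of any library. It reads the raw href attribute and resolves it with the standard URL constructor, because anchor.href on a parsed document would resolve relative links against the current page instead.

```javascript
// Collect absolute link URLs from an already-parsed document.
// `doc` is anything with querySelectorAll (e.g. the result of
// DOMParser.parseFromString). `baseUrl` is an assumed parameter:
// the URL of the page the HTML string came from.
function extractLinks(doc, baseUrl) {
  const links = [];
  for (const anchor of doc.querySelectorAll('a[href]')) {
    const href = anchor.getAttribute('href');
    try {
      // Resolve relative hrefs against the source page, not the current one
      links.push(new URL(href, baseUrl).href);
    } catch (err) {
      // Skip hrefs that cannot be resolved to a URL
    }
  }
  return links;
}

// Browser-only usage; skipped where DOMParser is unavailable (e.g. plain Node.js).
if (typeof DOMParser !== 'undefined') {
  const txt = "<a href='test0'>test01</a><a href='/test1'>test02</a>";
  const htmlDoc = new DOMParser().parseFromString(txt, 'text/html');
  console.log(extractLinks(htmlDoc, 'https://example.com/dir/page.html'));
}
```

Taking the parsed document as a parameter keeps the link logic independent of how the string was parsed, so it should work equally with the other approaches in this thread.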
Comment: the parsed document takes the documentURL of window, which most likely differs from the URL of the string.
Comment: you can create a new DOMParser once and then reuse that same object throughout the rest of your script.
EDIT: The solution below is only for HTML "fragments", since html, head and body are removed. I guess the solution for this question is DOMParser's parseFromString() method:
const parser = new DOMParser();
const doc = parser.parseFromString(html, "text/html"); // named doc so it does not clash with the global document
For HTML fragments, the solutions listed here work for most HTML, but in certain cases they fail.
For example, try parsing <td>Test</td>. It won't work with the div.innerHTML solution, DOMParser.prototype.parseFromString, or the range.createContextualFragment solution: the td tag goes missing and only the text remains.
Only jQuery handles that case well.
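To make the idea concrete: as far as I understand it, the trick is to wrap the fragment in the ancestors its first tag requires, parse that, and then descend back down to the fragment. The sketch below is a simplified illustration of that technique, not jQuery's actual code; the names WRAPPERS, wrapForParsing and parseWithContext are made up for this example, and only table-related tags are covered.

```javascript
// Hypothetical wrapper table: the ancestors a tag needs so the HTML parser keeps it.
// Each entry is [depth, opening markup, closing markup].
const WRAPPERS = new Map([
  ['td', [3, '<table><tbody><tr>', '</tr></tbody></table>']],
  ['th', [3, '<table><tbody><tr>', '</tr></tbody></table>']],
  ['tr', [2, '<table><tbody>', '</tbody></table>']],
  ['tbody', [1, '<table>', '</table>']],
  ['thead', [1, '<table>', '</table>']],
  ['tfoot', [1, '<table>', '</table>']],
]);

// Pure string step: find the first tag name and wrap the markup if needed.
function wrapForParsing(html) {
  const match = /<([a-z][a-z0-9-]*)/i.exec(html);
  const tagName = match ? match[1].toLowerCase() : '';
  const [depth, open, close] = WRAPPERS.get(tagName) || [0, '', ''];
  return { depth, markup: open + html + close };
}

// DOM step (browser only): parse the wrapped markup, then walk down
// `depth` levels so only the nodes of the original fragment are returned.
function parseWithContext(html, doc = document) {
  const { depth, markup } = wrapForParsing(html);
  let node = doc.createElement('div');
  node.innerHTML = markup;
  for (let i = 0; i < depth; i++) {
    node = node.firstChild;
  }
  return Array.from(node.childNodes);
}
```

In a browser, parseWithContext('<td>Test</td>') should give back an array holding the td element itself rather than bare text.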
So the forward-looking solution (MS Edge 13+) is to use the template tag:
function parseHTML(html) {
    var t = document.createElement('template');
    t.innerHTML = html;
    return t.content;
}
var documentFragment = parseHTML('<td>Test</td>');
For older browsers I have extracted jQuery's parseHTML() method into an independent gist - https://gist.github.com/Munawwar/6e6362dbdf77c7865a99
Comment: [...] <template> tag. It depends on custom elements, which you may also need to polyfill. In fact, you might just want to use webcomponents.js to polyfill custom elements, templates, shadow DOM, promises, and a few other things all in one go.
var doc = new DOMParser().parseFromString(html, "text/html");
var links = doc.querySelectorAll("a");
Comment: [...] $? Also, as mentioned in the linked duplicate, text/html is not supported very well, and has to be implemented using a polyfill.
const parse = Range.prototype.createContextualFragment.bind(document.createRange());
document.body.appendChild( parse('<p><strong>Today is:</strong></p>') );
document.body.appendChild( parse(`<p style="background: #eee">${new Date()}</p>`) );
Only valid child Nodes within the parent Node (start of the Range) will be parsed. Otherwise, unexpected results may occur:
// <body> is "parent" Node, start of Range
const parseRange = document.createRange();
const parse = Range.prototype.createContextualFragment.bind(parseRange);
// Returns Text "1 2" because td, tr, tbody are not valid children of <body>
parse('<td>1</td> <td>2</td>');
parse('<tr><td>1</td> <td>2</td></tr>');
parse('<tbody><tr><td>1</td> <td>2</td></tr></tbody>');
// Returns <table>, which is a valid child of <body>
parse('<table> <td>1</td> <td>2</td> </table>');
parse('<table> <tr> <td>1</td> <td>2</td> </tr> </table>');
parse('<table> <tbody> <td>1</td> <td>2</td> </tbody> </table>');
// <tr> is parent Node, start of Range
parseRange.setStart(document.createElement('tr'), 0);
// Returns [<td>, <td>] element array
parse('<td>1</td> <td>2</td>');
parse('<tr> <td>1</td> <td>2</td> </tr>');
parse('<tbody> <td>1</td> <td>2</td> </tbody>');
parse('<table> <td>1</td> <td>2</td> </table>');
The fastest way to parse HTML in Chrome and Firefox is Range#createContextualFragment:
var range = document.createRange();
range.selectNode(document.body); // required in Safari
var fragment = range.createContextualFragment('<h1>html...</h1>');
var firstNode = fragment.firstChild;
I would recommend creating a helper function that uses createContextualFragment if available and falls back to innerHTML otherwise.
Benchmark: http://jsperf.com/domparser-vs-createelement-innerhtml/3
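A minimal sketch of such a helper might look like the following. The name parseFragment and the optional doc parameter are my own choices for illustration (the parameter defaults to the page's document and mainly exists so the function can be exercised outside a browser); the fallback simply moves the children of a temporary div into a DocumentFragment.

```javascript
// Parse an HTML string into a DocumentFragment.
// Prefers Range#createContextualFragment and falls back to innerHTML.
// `doc` is an assumed parameter; in a page it defaults to the global document.
function parseFragment(html, doc = document) {
  if (typeof doc.createRange === 'function') {
    const range = doc.createRange();
    if (typeof range.createContextualFragment === 'function') {
      range.selectNode(doc.body); // required in Safari, as noted above
      return range.createContextualFragment(html);
    }
  }
  // Fallback: let innerHTML parse, then move the nodes into a fragment
  const container = doc.createElement('div');
  container.innerHTML = html;
  const fragment = doc.createDocumentFragment();
  while (container.firstChild) {
    // appendChild removes the node from container, so the loop terminates
    fragment.appendChild(container.firstChild);
  }
  return fragment;
}
```

Usage would be something like var firstNode = parseFragment('<h1>html...</h1>').firstChild; in either branch the caller gets a DocumentFragment back.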
Comment: [...] innerHTML, this will execute an <img>'s onerror.
The following function parseHTML will return either:
- a Document when your file starts with a doctype.
- a DocumentFragment when your file doesn't start with a doctype.
The code:
function parseHTML(markup) {
    if (markup.toLowerCase().trim().indexOf('<!doctype') === 0) {
        // Full document: parse into a new, detached HTML document
        var doc = document.implementation.createHTMLDocument("");
        doc.documentElement.innerHTML = markup;
        return doc;
    } else if ('content' in document.createElement('template')) {
        // Template tag exists!
        var el = document.createElement('template');
        el.innerHTML = markup;
        return el.content;
    } else {
        // Template tag doesn't exist!
        var docfrag = document.createDocumentFragment();
        var body = document.createElement('body');
        body.innerHTML = markup;
        // appendChild moves each node out of body, so loop until it is empty
        while (body.firstChild) {
            docfrag.appendChild(body.firstChild);
        }
        return docfrag;
    }
}
How to use:
var links = parseHTML('<!doctype html><html><head></head><body><a>Link 1</a><a>Link 2</a></body></html>').getElementsByTagName('a');
Comment: [...] trim method on strings. See stackoverflow.com/q/2308134/3210837.
Way 1
Use document.cloneNode()
Performance:
The call to document.cloneNode() took ~0.22499999977299012 milliseconds, and may take longer.
var t0, t1, html;
t0 = performance.now();
html = document.cloneNode(true);
t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<!DOCTYPE html><html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';
console.log(html.getElementById("test1"));
Way 2
Use document.implementation.createHTMLDocument()
Performance:
The call to document.implementation.createHTMLDocument() took ~0.14000000010128133 milliseconds.
var t0, t1, html;
t0 = performance.now();
html = document.implementation.createHTMLDocument("test");
t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<!DOCTYPE html><html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';
console.log(html.getElementById("test1"));
Way 3
Use document.implementation.createDocument()
Performance:
The call to document.implementation.createDocument() took ~0.14000000010128133 milliseconds.
var t0, t1, html;
t0 = performance.now();
html = document.implementation.createDocument('', 'html',
    document.implementation.createDocumentType('html', '', '')
);
t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<html><head><title>Test</title></head><body><div id="test1">test</div></body></html>';
console.log(html.getElementById("test1"));
Way 4
Use new Document()
Performance:
The call to new Document() took ~0.13499999840860255 milliseconds.
Note: ParentNode.append was still an experimental technology as of 2020.
var t0, t1, html;
t0 = performance.now();
//---------------
html = new Document();
html.append(
html.implementation.createDocumentType('html', '', '')
);
html.append(
html.createElement('html')
);
//---------------
t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';
console.log(html.getElementById("test1"));
To do this in Node.js, you can use an HTML parser such as node-html-parser. The syntax looks like this:
import { parse } from 'node-html-parser';
const root = parse('<ul id="list"><li>Hello World</li></ul>');
console.log(root.firstChild.structure);
// ul#list
// li
// #text
console.log(root.querySelector('#list'));
// { tagName: 'ul',
// rawAttrs: 'id="list"',
// childNodes:
// [ { tagName: 'li',
// rawAttrs: '',
// childNodes: [Object],
// classNames: [] } ],
// id: 'list',
// classNames: [] }
console.log(root.toString());
// <ul id="list"><li>Hello World</li></ul>
root.set_content('<li>Hello World</li>');
root.toString(); // <li>Hello World</li>
Comment: the same can be tried with a dynamic import from a CDN:
(await import("https://cdn.skypack.dev/node-html-parser")).default('<ul id="list"><li>Hello World</li></ul>').firstChild.structure
I think the best way is to use the DOMParser API, like this:
//Table string in HTML format
const htmlString = '<table><tbody><tr><td>Cell 1</td><td>Cell 2</td></tr></tbody></table>';
//Parse using DOMParser native way
const parser = new DOMParser();
const $newTable = parser.parseFromString(htmlString, 'text/html');
//Here you can select parts of your parsed html and work with it
const $row = $newTable.querySelector('table > tbody > tr');
//Here i'm printing the number of columns (2)
const $containerHtml = document.getElementById('containerHtml');
$containerHtml.innerHTML = ['Your parsed table has', $row.cells.length, 'columns.'].join(' ');
<div id="containerHtml"></div>
If you're open to using jQuery, it has some nice facilities for creating detached DOM elements from strings of HTML. These can then be queried through the usual means, e.g.:
var html = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";
var anchors = $('<div/>').append(html).find('a').get();
Edit - just saw @Florian's answer which is correct. This is basically exactly what he said, but with jQuery.
const html =
`<script>
alert('👋 there ! Wanna grab a 🍺');
</script>`;
const scriptEl = document.createRange().createContextualFragment(html);
// `parent` is assumed to be the element that should receive the script, e.g. document.body
parent.append(scriptEl);
I found this solution, and I think it is the best one: it parses the HTML and executes the script inside it.
I used the DOMParser class, which I learned about from this blog.
This returns an HTMLCollection whose items can be accessed as ordinary DOM elements. It is also easy to insert the parsed nodes into the HTML document:
document.body.append(...parseHTML(html_string));
const parseHTML = (htmlString) => {
const parser = new DOMParser();
const page = parser.parseFromString(htmlString, 'text/html');
return page.body.children;
};
Thank me later.
Comment: [...] page.body because I needed to query for a specific element, and I had an issue doing that when also selecting children.
Modern browsers support the new Document.parseHTMLUnsafe() static function:
let myDocument = Document.parseHTMLUnsafe("<p>Hello</p>");
Notes:
- <script> elements are not evaluated (executed) during parsing
- the resulting document's encoding will always be UTF-8
- the resulting document's URL will be about:blank
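Because the method is recent, a feature check with a DOMParser fallback may be useful. This is only a sketch: the name parseDocument and the env parameter are assumptions of mine (env defaults to globalThis and just makes the check easy to exercise with stand-in objects). Keep in mind that, as I understand it, the "Unsafe" suffix signals that the input is not sanitized, so neither branch makes untrusted HTML safe to insert into a page.

```javascript
// Parse a full HTML string into a Document.
// Uses Document.parseHTMLUnsafe where available, DOMParser otherwise.
// `env` is an assumed parameter holding the global object to inspect.
function parseDocument(html, env = globalThis) {
  if (env.Document && typeof env.Document.parseHTMLUnsafe === 'function') {
    return env.Document.parseHTMLUnsafe(html);
  }
  // Older browsers: DOMParser with the text/html type
  return new env.DOMParser().parseFromString(html, 'text/html');
}

// Browser-only usage; skipped where neither API exists (e.g. plain Node.js).
if (typeof DOMParser !== 'undefined') {
  const myDocument = parseDocument("<p>Hello</p><a href='test0'>test01</a>");
  console.log(myDocument.links.length);
}
```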
I had to use the innerHTML of an element parsed in a popover from Angular NGX Bootstrap. This is the solution that worked for me.
public htmlContainer = document.createElement( 'html' );
In the constructor:
this.htmlContainer.innerHTML = ''; setTimeout(() => { this.convertToArray(); });
convertToArray() {
const shapesHC = document.getElementsByClassName('weekPopUpDummy');
const shapesArrHCSpread = [...(shapesHC as any)];
this.htmlContainer = shapesArrHCSpread[0];
this.htmlContainer.innerHTML = shapesArrHCSpread[0].textContent;
}
In the HTML:
<div class="weekPopUpDummy" [popover]="htmlContainer.innerHTML" [adaptivePosition]="false" placement="top" [outsideClick]="true" #popOverHide="bs-popover" [delay]="150" (onHidden)="onHidden(weekEvent)" (onShown)="onShown()">
function parseElement(raw){
let el = document.createElement('div');
el.innerHTML = raw;
let res = el.querySelector('*');
res.remove();
return res;
}
Note: the raw string should contain no more than one top-level element.
let content = "<center><h1>404 Not Found</h1></center>";
let result = $("<div/>").html(content).text();
// content: "<center><h1>404 Not Found</h1></center>"
// result: "404 Not Found"
[...] doc.getElementsByTagName('a') to read the links (or even doc.links).