Comparison of HTML parsers
From Wikipedia, the free encyclopedia
HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:
- HTML traversal: provide an interface through which programmers can access and modify the parsed HTML markup. Canonical example: DOM parsers.
- HTML cleaning: fix invalid HTML and improve the layout and indentation of the resulting markup. Canonical example: HTML Tidy.
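The traversal use case can be illustrated with a minimal sketch. It assumes Python's standard-library `html.parser` module, which is not one of the parsers compared in this article, and the names `LinkCollector` and `collect_links` are hypothetical helpers chosen for illustration. The sketch walks the markup as a stream of events and records the target of every link, rather than building a full DOM tree.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Return the href values found in the given HTML string, in order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

A DOM parser would instead return a tree that can be queried and modified after parsing; the event-based form shown here is only the simplest variant of traversal.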
Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
---|---|---|---|---|---|---|---|
HTML Tidy | W3C license | ANSI C | 2021-07-17[2] | Yes[3] | Yes | Yes[3] | Yes |
HtmlUnit | Apache License 2.0 | Java | 2023-10-31[4] | Yes | ? | No | No |
Beautiful Soup | MIT License | Python | 2023-04-07[5] | Yes | Yes | ? | No |
jsoup | MIT License | Java | 2025-08-25[6] | Yes | Yes | Yes | Yes |
- * Date of the latest release containing significant changes.
- ** Sanitizes (generates standards-compliant markup, reduces spam, etc.) and cleans (strips surplus presentational tags, removes XSS code, etc.) HTML code.
- *** Updates HTML 4.x to XHTML or HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
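The conversion described in the last note, replacing CENTER with a styled DIV, can be sketched as follows. This is an illustrative approximation built on Python's standard-library `html.parser`, not the method used by any parser in the table; the names `CenterRewriter` and `update_center` are hypothetical. The sketch assumes reasonably well-formed input, passes all other markup through, and discards any attributes present on the CENTER tag.

```python
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Hypothetical helper: rewrites <center> as a styled <div>."""

    def __init__(self):
        # Keep character references unconverted so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption: attributes on <center> are dropped in this sketch.
            self.out.append('<div style="text-align:center;">')
        else:
            # Re-emit the start tag exactly as it appeared in the source.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center(markup):
    """Return markup with CENTER elements replaced by centered DIVs."""
    rewriter = CenterRewriter()
    rewriter.feed(markup)
    rewriter.close()
    return "".join(rewriter.out)
```

Because end tags are re-emitted in lowercase, the output is not byte-identical to the input outside the rewritten elements. A dedicated tool such as HTML Tidy also repairs malformed nesting, which this sketch does not attempt.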
References
1. "HTML Standard". html.spec.whatwg.org. Archived from the original on January 16, 2013.
2. "Release 5.8.0 · htacg/tidy-html5". GitHub.
3. "HTML Tidy". www.html-tidy.org.
4. "Release HtmlUnit 3.7.0 · HtmlUnit/htmlunit". GitHub.
5. "Index of /software/BeautifulSoup/bs4/download/4.12". www.crummy.com.
6. "jsoup release 1.21.2 (2025-Aug-25)". jsoup.org.