This PR contains the following updates:
org.jsoup:jsoup: 1.18.1 → 1.22.2
Release Notes
jhy/jsoup (org.jsoup:jsoup)
v1.22.2
Improvements
- Expanded and clarified NodeTraversor support for in-place DOM rewrites during NodeVisitor.head(). Current-node edits such as remove, replace, and unwrap now recover more predictably, while traversal stays within the original root subtree. This makes single-pass tree cleanup and normalization visitors easier to write, for example when unwrapping presentational elements or replacing text nodes as you walk the DOM. #2472
- Documentation: clarified that a configured Cleaner may be reused across concurrent threads, and that shared Safelist instances should not be mutated while in use. #2473
- Updated the default HTML TagSet for current HTML elements: added dialog, search, picture, and slot; made ins, del, button, audio, video, and canvas inline by default (Tag#isInline(), aligned to phrasing content in the spec); and added readable Element.text() boundaries for controls and embedded objects via the new Tag.TextBoundary option. This improves pretty-printing and keeps normalized text from running adjacent words together. #2493
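As an illustration of the single-pass cleanup described in the NodeTraversor note, here is a minimal sketch that unwraps presentational font elements from inside head(). The class and method names are hypothetical, and the sketch assumes jsoup 1.22.2 is on the classpath; it is not taken from the jsoup documentation.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.NodeTraversor;

// Hypothetical example class, for illustration only.
public class UnwrapFontTags {

    /** Parses the HTML, unwraps every font element under body, and returns the body HTML. */
    static String stripFontTags(String html) {
        Document doc = Jsoup.parse(html);
        // The lambda is the head() callback; tail() keeps its default no-op.
        NodeTraversor.traverse((node, depth) -> {
            if (node instanceof Element && "font".equals(((Element) node).normalName())) {
                node.unwrap(); // drop the wrapper element, keep its children in place
            }
        }, doc.body());
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(stripFontTags(
            "<p>Hello <font color=\"red\"><b>big</b></font> world</p>"));
    }
}
```

Editing the current node inside head() is the pattern the release note says now recovers predictably; on older versions the same visitor may behave differently.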
Bug Fixes
- Android (R8/ProGuard): added a rule to ignore the optional re2j dependency when not present. #2459
- Fixed a NodeTraversor regression in 1.21.2 where removing or replacing the current node during head() could revisit the replacement node and loop indefinitely. The traversal docs now also clarify which inserted nodes are visited in the current pass. #2472
- Parsing during charset sniffing no longer fails if an advisory available() call throws IOException, as seen on JDK 8 HttpURLConnection. #2474
- Cleaner no longer makes relative URL attributes in the input document absolute when cleaning or validating a Document. URL normalization now applies only to the cleaned output, and Safelist.isSafeAttribute() is side effect free. #2475
- Cleaner no longer duplicates enforced attributes when the input Document preserves attribute case. A case-variant source attribute is now replaced by the enforced attribute in the cleaned output. #2476
- If a per-request SOCKS proxy is configured, jsoup now avoids using the JDK HttpClient, because the JDK would silently ignore that proxy and attempt to connect directly. Those requests now fall back to the legacy HttpURLConnection transport instead, which does support SOCKS. #2468
- Connection.Response.streamParser() and DataUtil.streamParser(Path, ...) could fail on small inputs without a declared charset, if the initial 5 KB charset sniff fully consumed the input and closed it before the stream parse began. #2483
- In XML mode, doctypes with an internal subset, such as <!DOCTYPE root [<!ENTITY name "value">]>, now round-trip correctly. The subset is preserved as raw text only; entities are not expanded and external DTDs are not loaded. #2486
Build Changes
- Migrated the integration test server from Jetty to Netty, which actively maintains support for our minimum JDK target (8). #2491
v1.22.1
Improvements
- Added support for using the re2j regular expression engine for regex-based CSS selectors (e.g. [attr~=regex], :matches(regex)), which ensures linear-time performance for regex evaluation. This allows safer handling of arbitrary user-supplied query regexes. To enable, add the com.google.re2j dependency to your classpath, e.g.:
<dependency>
<groupId>com.google.re2j</groupId>
<artifactId>re2j</artifactId>
<version>1.8</version>
</dependency>
(If you already have that dependency in your classpath, but you want to keep using the Java regex engine, you can disable re2j via System.setProperty("jsoup.useRe2j", "false").) You can confirm that the re2j engine has been enabled correctly by calling org.jsoup.helper.Regex.usingRe2j(). #2407
- Added an instance method Parser#unescape(String, boolean) that unescapes HTML entities using the parser's configuration (e.g. to support error tracking), complementing the existing static utility Parser.unescapeEntities(String, boolean). #2396
- Added a configurable maximum parser depth (to limit the number of open elements on the stack) to both HTML and XML parsers. The HTML parser now defaults to a depth of 512 to match browser behavior and to protect against unbounded stack growth, while the XML parser keeps unlimited depth by default but can opt into a limit via org.jsoup.parser.Parser#setMaxDepth. #2421
- Build: added CI coverage for JDK 25 #2403
- Build: added a CI fuzzer for contextual fragment parsing (in addition to existing full body HTML and XML fuzzers). oss-fuzz #14041
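To show how the XML parser could opt into the depth limit mentioned above, here is a minimal sketch. It assumes jsoup 1.22.1 or later; the class name, method name, and the chosen limit are illustrative, and what happens to content beyond the limit is not shown here because the release note does not describe it.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical example class, for illustration only.
public class MaxDepthExample {

    /** Parses XML with an explicit maximum element depth and returns the document text. */
    static String textWithDepthLimit(String xml, int maxDepth) {
        Parser parser = Parser.xmlParser(); // XML parsing is unlimited by default
        parser.setMaxDepth(maxDepth);       // opt into a limit, per the release note
        Document doc = Jsoup.parse(xml, "", parser);
        return doc.text();
    }

    public static void main(String[] args) {
        // The input nests three levels deep, well under the assumed limit of 100.
        System.out.println(textWithDepthLimit("<a><b><c>text</c></b></a>", 100));
    }
}
```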
Changes
- Set a removal schedule of jsoup 1.24.1 for previously deprecated APIs.
Bug Fixes
- Previously cached child Elements of an Element were not correctly invalidated in Node#replaceWith(Node), which could lead to incorrect results when subsequently calling Element#children(). #2391
- Attribute selector values are now compared literally without trimming. Previously, jsoup trimmed whitespace from selector values and from element attribute values, which could cause mismatches with browser behavior (e.g. [attr=" foo "]). Now matches align with the CSS specification and browser engines. #2380
- When using the JDK HttpClient, any system default proxy (ProxySelector.getDefault()) was ignored. Now, the system proxy is used if a per-request proxy is not set. #2388, #2390
- A ValidationException could be thrown in the adoption agency algorithm with particularly broken input. Now logged as a parse error. #2393
- Null characters in the HTML body were not consistently removed; and in foreign content were not correctly replaced. #2395
- An IndexOutOfBoundsException could be thrown when parsing a body fragment with crafted input. Now logged as a parse error. #2397, #2406
- When using StructuralEvaluators (e.g., a parent child selector) across many retained threads, their memoized results could also be retained, increasing memory use. These results are now cleared immediately after use, reducing overall memory consumption. #2411
- Cloning a Parser now preserves any custom TagSet applied to the parser. #2422, #2423
- Custom tags marked as Tag.Void now parse and serialize like the built-in void elements: they no longer consume following content, and the XML serializer emits the expected self-closing form. #2425
- The <br> element is once again classified as an inline tag (Tag.isBlock() == false), matching common developer expectations and its role as phrasing content in HTML, while pretty-printing and text extraction continue to treat it as a line break in the rendered output. #2387, #2439
- Fixed an intermittent truncation issue when fetching and parsing remote documents via Jsoup.connect(url).get(). On responses without a charset header, the initial charset sniff could sometimes (depending on buffering / available() behavior) be mistaken for end-of-stream and a partial parse reused, dropping trailing content. #2448
- TagSet copies no longer mutate their template during lazy lookups, preventing cross-thread ConcurrentModificationException when parsing with shared sessions. #2453
- Fixed parsing of <svg> foreignObject content nested within a <p>, which could incorrectly move the HTML subtree outside the SVG. #2452
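The attribute selector change above can be probed with a small sketch like the following. The helper name is hypothetical. The query that spells out the surrounding spaces should match the padded attribute value; whether the unpadded query [title=foo] still matches depends on the jsoup version in use, so no result is asserted for it here.

```java
import org.jsoup.Jsoup;

// Hypothetical example class, for illustration only.
public class AttributeSelectorExample {

    /** Returns how many elements in the HTML match the CSS query. */
    static int countMatches(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        String html = "<p title=\" foo \">padded</p>";
        // Quoted value including the spaces, compared literally in 1.22.1+.
        System.out.println(countMatches(html, "[title=\" foo \"]"));
        // Unpadded value: matched under the old trimming behavior, expected not to match now.
        System.out.println(countMatches(html, "[title=foo]"));
    }
}
```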
Internal Changes
- Deprecated internal helper org.jsoup.internal.Functions (for removal in v1.23.1). This was previously used to support older Android API levels without full java.util.function coverage; jsoup now requires core library desugaring so this indirection is no longer necessary. #2412
v1.21.2
Changes
- Deprecated internal (yet visible) methods Normalizer#normalize(String, bool) and Attribute#shouldCollapseAttribute(Document.OutputSettings). These will be removed in a future version.
- Deprecated Connection#sslSocketFactory(SSLSocketFactory) in favor of the new Connection#sslContext(SSLContext). Using sslSocketFactory will force the use of the legacy HttpURLConnection implementation, which does not support HTTP/2. #2370
Improvements
- When pretty-printing, if there are consecutive text nodes (via DOM manipulation), the non-significant whitespace between them will be collapsed. #2349.
- Updated Connection.Response#statusMessage() to return a simple loggable string message (e.g. "OK") when using the HttpClient implementation, which doesn't otherwise return any server-set status message. #2356
- Attributes#size() and Attributes#isEmpty() now exclude any internal attributes (such as user data) from their count. This aligns with the attributes' serialized output and iterator. #2369
- Added Connection#sslContext(SSLContext) to provide a custom SSL (TLS) context to requests, supporting both the HttpClient and the legacy HttpURLConnection implementations. #2370
- Performance optimizations for DOM manipulation methods, including when repeatedly removing an element's first child (element.child(0).remove()), and when using Parser#parseBodyFragment() to parse a large number of direct children. #2373.
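For the new Connection#sslContext(SSLContext) option, a minimal sketch of the migration away from sslSocketFactory might look like this. The class name, method name, and URL are placeholders, and the JVM default context stands in for whatever custom context an application would actually build. No request is executed.

```java
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Hypothetical example class, for illustration only.
public class SslContextExample {

    /** Builds (but does not execute) a connection that uses the JVM's default TLS context. */
    static Connection connectionWithDefaultTls(String url) {
        try {
            SSLContext context = SSLContext.getDefault(); // stand-in for a custom context
            Connection connection = Jsoup.connect(url);
            connection.sslContext(context); // replaces the deprecated sslSocketFactory(...)
            return connection;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    public static void main(String[] args) {
        Connection connection = connectionWithDefaultTls("https://example.com/");
        System.out.println(connection.request().url());
    }
}
```

Per the deprecation note, staying on sslSocketFactory forces the legacy transport, so switching to sslContext is what keeps HTTP/2 available.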
Bug Fixes
- When parsing from an InputStream and a multibyte character happened to straddle a buffer boundary, the stream would not be completely read. #2353.
- In NodeTraversor, if a last child element was removed during the head() call, the parent would be visited twice. #2355.
- Cloning an Element that has an Attributes object would add an empty internal user-data attribute to that clone, which would cause unexpected results for Attributes#size() and Attributes#isEmpty(). #2356
- In a multithreaded application where multiple threads are calling Element#children() on the same element concurrently, a race condition could happen when the method was generating the internal child element cache (a filtered view of its child nodes). Since concurrent reads of DOM objects should be threadsafe without external synchronization, this method has been updated to execute atomically. #2366
- When parsing HTML with svg:script elements in SVG elements, don't enter the Text insertion mode, but continue to parse as foreign content. Otherwise, misnested HTML could then cause an IndexOutOfBoundsException. #2374
- Malformed HTML could throw an IndexOutOfBoundsException during the adoption agency. #2377.
v1.21.1
Changes
- Removed previously deprecated methods. #2317
- Deprecated the :matchText pseudo-selector due to its side effects on the DOM; use the new ::textnode selector and the Element#selectNodes(String css, Class type) method instead. #2343
- Deprecated Connection.Response#bufferUp() in lieu of Connection.Response#readFully(), which can throw a checked IOException.
- Deprecated internal methods Validate#ensureNotNull (replaced by typed Validate#expectNotNull); protected HTML appenders from Attribute and Node.
- If you happen to be using any of the deprecated methods, please take the opportunity now to migrate away from them, as they will be removed in a future release.
Improvements
- Enhanced the Selector to support direct matching against nodes such as comments and text nodes. For example, you can now find an element that follows a specific comment: ::comment:contains(prices) + p will select p elements immediately after a <!-- prices: --> comment. Supported types include ::node, ::leafnode, ::comment, ::text, ::data, and ::cdata. Node contextual selectors like ::node:contains(text), :matches(regex), and :blank are also supported. Introduced Element#selectNodes(String css) and Element#selectNodes(String css, Class nodeType) for direct node selection. #2324
- Added TagSet#onNewTag(Consumer<Tag> customizer): register a callback that’s invoked for each new or cloned Tag when it’s inserted into the set. Enables dynamic tweaks of tag options (for example, marking all custom tags as self-closing, or everything in a given namespace as preserving whitespace).
- Made TokenQueue and CharacterReader autocloseable, to ensure that they will release their buffers back to the buffer pool, for later reuse.
- Added Selector#evaluatorOf(String css), as a clearer way to obtain an Evaluator from a CSS query. An alias of QueryParser.parse(String css).
- Custom tags (defined via the TagSet) in a foreign namespace (e.g. SVG) can be configured to parse as data tags.
- Added NodeVisitor#traverse(Node) to simplify node traversal calls (vs. importing NodeTraversor).
- Updated the default user-agent string to improve compatibility. #2341
- The HTML parser now allows the specific text-data type (Data, RcData) to be customized for known tags. (Previously, that was only supported on custom tags.) #2326.
- Added Connection.Response#readFully() as a replacement for Connection.Response#bufferUp() with an explicit IOException. Similarly, added Connection.Response#readBody() over Connection.Response#body(). Deprecated Connection.Response#bufferUp(). #2327
- When serializing HTML, the < and > characters are now escaped in attributes. This helps prevent a class of mutation XSS attacks. #2337
- Changed Connection to prefer using the JDK's HttpClient over HttpURLConnection, if available, to enable HTTP/2 support by default. Users can disable via -Djsoup.useHttpClient=false. #2340
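The node selectors introduced in this release can be tried with a short sketch such as the one below. It reuses the ::comment:contains(prices) + p query from the note and the Element#selectNodes(String, Class) overload; the class name, helper names, and sample HTML are hypothetical, and jsoup 1.21.1 or later is assumed.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical example class, for illustration only.
public class NodeSelectorExample {

    static final String SAMPLE = "<div><!-- prices: --><p>10.00</p><p>Other</p></div>";

    /** Text of the first p that immediately follows a comment containing "prices". */
    static String paragraphAfterPricesComment(String html) {
        Document doc = Jsoup.parse(html);
        Elements found = doc.select("::comment:contains(prices) + p");
        return found.isEmpty() ? "" : found.first().text();
    }

    /** Number of comment nodes in the document, selected directly as nodes. */
    static int commentCount(String html) {
        Document doc = Jsoup.parse(html);
        List<Comment> comments = doc.selectNodes("::comment", Comment.class);
        return comments.size();
    }

    public static void main(String[] args) {
        System.out.println(paragraphAfterPricesComment(SAMPLE));
        System.out.println(commentCount(SAMPLE));
    }
}
```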
Bug Fixes
- The contents of a script in a svg foreign context should be parsed as script data, not text. #2320
- Tag#isFormSubmittable() was updating the Tag's options. #2323
- The HTML pretty-printer would incorrectly trim whitespace when text followed an inline element in a block element. #2325
- Custom tags with hyphens or other non-letter characters in their names now work correctly as Data or RcData tags. Their closing tags are now tokenized properly. #2332
- When cloning an Element, the clone would retain the source's cached child Element list (if any), which could lead to incorrect results when modifying the clone's child elements. #2334
v1.20.1
Changes
- To better follow the HTML5 spec and current browsers, the HTML parser no longer allows self-closing tags (<foo />) to close HTML elements by default. Foreign content (SVG, MathML), and content parsed with the XML parser, still supports self-closing tags. If you need specific HTML tags to support self-closing, you can register a custom tag via the TagSet configured in Parser.tagSet(), using Tag#set(Tag.SelfClose). Standard void tags (such as <img>, <br>, etc.) continue to behave as usual and are not affected by this change. #2300.
- The following internal components have been deprecated. If you do happen to be using any of these, please take the opportunity now to migrate away from them, as they will be removed in jsoup 1.21.1: ChangeNotifyingArrayList, Document.updateMetaCharsetElement(), Document.updateMetaCharsetElement(boolean), HtmlTreeBuilder.isContentForTagData(String), Parser.isContentForTagData(String), Parser.setTreeBuilder(TreeBuilder), Tag.formatAsBlock(), Tag.isFormListed(), TokenQueue.addFirst(String), TokenQueue.chompTo(String), TokenQueue.chompToIgnoreCase(String), TokenQueue.consumeToIgnoreCase(String), TokenQueue.consumeWord(), TokenQueue.matchesAny(String...)
Functional Improvements
- Rebuilt the HTML pretty-printer, to simplify and consolidate the implementation, improve consistency, support custom Tags, and provide a cleaner path for ongoing improvements. The specific HTML produced by the pretty-printer may be different from previous versions. #2286.
- Added the ability to define custom tags, and to modify properties of known tags, via the TagSet tag collection. Their properties can impact both the parse and how content is serialized (output as HTML or XML). #2285.
- Element.cssSelector() will prefer to return shorter selectors by using ancestor IDs when available and unique. E.g. #id > div > p instead of html > body > div > div > p #2283.
- Added Elements.deselect(int index), Elements.deselect(Object o), and Elements.deselectAll() methods to remove elements from the Elements list without removing them from the underlying DOM. Also added Elements.asList() method to get a modifiable list of elements without affecting the DOM. (Individual Elements remain linked to the DOM.) #2100.
- Added support for sending a request body from an InputStream with Connection.requestBodyStream(InputStream stream). #1122.
- The XML parser now supports scoped xmlns: prefix namespace declarations, and applies the correct namespace to Tags and Attributes. Also, added Tag#prefix(), Tag#localName(), Attribute#prefix(), Attribute#localName(), and Attribute#namespace() to retrieve these. #2299.
- CSS identifiers are now escaped and unescaped correctly to the CSS spec. Element#cssSelector() will emit appropriately escaped selectors, and the QueryParser supports those. Added Selector.escapeCssIdentifier() and Selector.unescapeCssIdentifier(). #2297, #2305
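To make the CSS identifier escaping concrete, the following sketch round-trips an ID that contains characters with special meaning in selectors, then uses the escaped form in a query. The class and helper names are hypothetical; only Selector.escapeCssIdentifier() and Selector.unescapeCssIdentifier() come from the release note, and they are assumed here to take and return a String.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Selector;

// Hypothetical example class, for illustration only.
public class CssEscapeExample {

    /** True if escaping and then unescaping the identifier returns the original value. */
    static boolean roundTrips(String identifier) {
        String escaped = Selector.escapeCssIdentifier(identifier);
        return Selector.unescapeCssIdentifier(escaped).equals(identifier);
    }

    /** Counts elements whose id equals the given raw (unescaped) value. */
    static int countById(String html, String rawId) {
        String query = "#" + Selector.escapeCssIdentifier(rawId);
        return Jsoup.parse(html).select(query).size();
    }

    public static void main(String[] args) {
        String rawId = "a.b:c"; // '.' and ':' would otherwise start a class or pseudo selector
        System.out.println(Selector.escapeCssIdentifier(rawId));
        System.out.println(roundTrips(rawId));
        System.out.println(countById("<div id=\"a.b:c\">x</div>", rawId));
    }
}
```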
Structure and Performance Improvements
- Refactored the CSS QueryParser into a clearer recursive descent parser. #2310.
- CSS selectors with consecutive combinators (e.g. div >> p) will throw an explicit parse exception. #2311.
- Performance: reduced the shallow size of an Element from 40 to 32 bytes, and the NodeList from 32 to 24. #2307.
- Performance: reduced GC load of new StringBuilders when tokenizing input HTML. #2304.
- Made Parser instances threadsafe, so that inadvertent use of the same instance across threads will not lead to errors. For actual concurrency, use Parser#newInstance() per thread. #2314.
Bug Fixes
- Element names containing characters invalid in XML are now normalized to valid XML names when serializing. #1496.
- When serializing to XML, characters that are invalid in XML 1.0 should be removed (not encoded). #1743.
- When converting a Document to the W3C DOM in W3CDom, elements with an attribute in an undeclared namespace now get a declaration of xmlns:prefix="undefined". This allows subsequent serialization to XML via W3CDom.asString() to succeed. #2087.
- The StreamParser could emit the final elements of a document twice, due to how onNodeCompleted was fired when closing out the stack. #2295.
- When parsing with the XML parser and error tracking enabled, the trailing ? in <?xml version="1.0"?> would incorrectly emit an error. #2298.
- Calling Element#cssSelector() on an element with combining characters in the class or ID now produces the correct output. #1984.
v1.19.1
Changes
- Added support for http/2 requests in Jsoup.connect(), when running on Java 11+, via the Java HttpClient implementation. #2257.
- In this version of jsoup, the default is to make requests via the HttpURLConnection implementation: use System.setProperty("jsoup.useHttpClient", "true"); to enable making requests via the HttpClient instead, which will enable http/2 support, if available. This will become the default in a later version of jsoup, so now is a good time to validate it.
- If you are repackaging the jsoup jar in your deployment (i.e. creating a shaded- or a fat-jar), make sure to specify that as a Multi-Release JAR.
- If the HttpClient impl is not available in your JRE, requests will continue to be made via HttpURLConnection (in http/1.1 mode).
- Updated the minimum Android API Level validation from 10 to 21. As with previous jsoup versions, Android developers need to enable core library desugaring. The minimum Java version remains Java 8. #2173
- Removed previously deprecated class: org.jsoup.UncheckedIOException (replace with java.io.UncheckedIOException); moved previously deprecated method Element Element#forEach(Consumer) to void Element#forEach(Consumer()). #2246
- Deprecated the methods Document#updateMetaCharsetElement(boolean) and Document#updateMetaCharsetElement(), as the setting had no effect. When Document#charset(Charset) is called, the document's meta charset or XML encoding instruction is always set. #2247
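The note on Document#charset(Charset) can be checked with a small sketch: setting the charset on an HTML document should leave a meta charset element in place without any call to updateMetaCharsetElement. The class and helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example class, for illustration only.
public class CharsetExample {

    /** Sets the document charset and returns the value of the resulting meta charset attribute. */
    static String metaCharsetAfterSet(String html) {
        Document doc = Jsoup.parse(html);
        doc.charset(StandardCharsets.ISO_8859_1); // also updates the meta charset element
        return doc.select("meta[charset]").attr("charset");
    }

    public static void main(String[] args) {
        System.out.println(metaCharsetAfterSet("<html><head><title>t</title></head><body></body></html>"));
    }
}
```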
Improvements
- When cleaning HTML with a Safelist that preserves relative links, the isValid() method will now consider these links valid. Additionally, the enforced attribute rel=nofollow will only be added to external links when configured in the safelist. #2245
- Added Element#selectStream(String query) and Element#selectStream(Evaluator) methods, that return a Stream of matching elements. Elements are evaluated and returned as they are found, and the stream can be terminated early. #2092
- Element objects now implement Iterable, enabling them to be used in enhanced for loops.
- Added support for fragment parsing from a Reader via Parser#parseFragmentInput(Reader, Element, String). #1177
- Reintroduced CLI executable examples, in jsoup-examples.jar. #1702
- Optimized performance of selectors like #id .class (and other similar descendant queries) by around 4.6x, by better balancing the Ancestor evaluator's cost function in the query planner. #2254
- Removed the legacy parsing rules for <isindex> tags, which would autovivify a form element with labels. This is no longer in the spec.
- Added Elements.selectFirst(String cssQuery) and Elements.expectFirst(String cssQuery), to select the first matching element from an Elements list. #2263
- When parsing with the XML parser, XML Declarations and Processing Instructions are directly handled, vs bouncing through the HTML parser's bogus comment handler. Serialization for non-doctype declarations no longer ends with a spurious !. #2275
- When converting parsed HTML to XML or the W3C DOM, element names containing < are normalized to _ to ensure valid XML. For example, <foo<bar> becomes <foo_bar>, as XML does not allow < in element names, but HTML5 does. #2276
- Reimplemented the HTML5 Adoption Agency Algorithm to the current spec. This handles mis-nested formatting / structural elements. #2278
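As a sketch of the early-terminating stream described for Element#selectStream(String), the helper below stops at the first match instead of collecting every element. The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical example class, for illustration only.
public class SelectStreamExample {

    /** Returns the text of the first element matching the query, or "" if none match. */
    static String firstMatchingText(String html, String cssQuery) {
        return Jsoup.parse(html)
            .selectStream(cssQuery)   // elements are produced lazily, as they are found
            .map(Element::text)
            .findFirst()              // short-circuits the remaining traversal
            .orElse("");
    }

    public static void main(String[] args) {
        System.out.println(firstMatchingText("<p>One</p><p>Two</p>", "p")); // prints "One"
    }
}
```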
Bug Fixes
- If an element has an ; in an attribute name, it could not be converted to a W3C DOM element, and so subsequent XPath queries could miss that element. Now, the attribute name is more completely normalized. #2244
- For backwards compatibility, reverted the internal attribute key for doctype names to "name". #2241
- In Connection, skip cookies that have no name, rather than throwing a validation exception. #2242
- When running on JDK 1.8, the error java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer; could be thrown when calling Response#body() after parsing from a URL and the buffer size was exceeded. #2250
- For backwards compatibility, allow null InputStream inputs to Jsoup.parse(InputStream stream, ...), by returning an empty Document. #2252
- A template tag containing an li within an open li would be parsed incorrectly, as it was not recognized as a "special" tag (which have additional processing rules). Also, added the SVG and MathML namespace tags to the list of special tags. #2258
- A template tag containing a button within an open button would be parsed incorrectly, as the "in button scope" check was not aware of the template element. Corrected other instances including MathML and SVG elements, also. #2271
- An :nth-child selector with a negative digit-less step, such as :nth-child(-n+2), would be parsed incorrectly as a positive step, and so would not match as expected. #1147
- Calling doc.charset(charset) on an empty XML document would throw an IndexOutOfBoundsException. #2266
- Fixed a memory leak when reusing a nested StructuralEvaluator (e.g., a selector ancestor chain like A B C) by ensuring cache reset calls cascade to inner members. #2277
- Concurrent calls to doc.clone().append(html) were not supported. When a document was cloned, its Parser was not cloned but was a shallow copy of the original parser. #2281
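The :nth-child(-n+2) fix can be illustrated with a short sketch: with the negative step parsed correctly, the selector should pick the first two children only. The class and helper names are hypothetical.

```java
import org.jsoup.Jsoup;

// Hypothetical example class, for illustration only.
public class NthChildExample {

    /** Returns the comma-joined text of the li elements matched by :nth-child(-n+2). */
    static String firstTwoItems(String html) {
        // -n+2 yields positions 2 and 1, i.e. the first two children
        return String.join(",", Jsoup.parse(html).select("li:nth-child(-n+2)").eachText());
    }

    public static void main(String[] args) {
        System.out.println(firstTwoItems("<ul><li>a</li><li>b</li><li>c</li></ul>"));
    }
}
```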
v1.18.3
Bug Fixes
- When serializing to XML, attribute names containing -, ., or digits were incorrectly marked as invalid and removed. #2235
v1.18.2
Improvements
- Optimized the throughput and memory use throughout the input read and parse flows, with heap allocations and GC reduced by between 6% and 89%, and throughput improved by up to 143% for small inputs. Most input sizes will see throughput increases of around 20%. These performance improvements come through recycling the backing byte[] and char[] arrays used to read and parse the input. #2186
- Speed optimized html() and Entities.escape() when the input contains UTF characters in a supplementary plane, by around 49%. #2183
- The form associated elements returned by FormElement.elements() now reflect changes made to the DOM, subsequent to the original parse. #2140
- In the TreeBuilder, the onNodeInserted() and onNodeClosed() events are now also fired for the outermost / root Document node. This enables source position tracking on the Document node (which was previously unset). It also enables the node traversor to see the outer Document node. #2182
- Selected Elements can now be position swapped inline using Elements#set(). #2212
Bug Fixes
- Element.cssSelector() would fail if the element's class contained a * character. #2169
- When tracking source ranges, a text node following an invalid self-closing element may be left untracked. #2175
- When a document has no doctype, or a doctype not named html, it should be parsed in Quirks Mode. #2197
- With a selector like div:has(span + a), the has() component was not working correctly, as the inner combining query caused the evaluator to match those against the outer's siblings, not children. #2187
- A selector query that included multiple :has() components in a nested :has() might incorrectly execute. #2131
- When cookie names in a response are duplicated, the simple view of cookies available via Connection.Response#cookies() will provide the last one set. Generally it is better to use the Jsoup.newSession method to maintain a cookie jar, as that applies appropriate path selection on cookies when making requests. #1831
- When parsing named HTML entities, base entities should resolve if they are a prefix of the input token (and not in an attribute). #2207
- Fixed incorrect tracking of source ranges for attributes merged from late-occurring elements that were implicitly created (html or body). #2204
- Follow the current HTML specification in the tokenizer to allow < as part of a tag name, instead of emitting it as a character node. #2230
- Similarly, allow a < as the start of an attribute name, vs creating a new element. The previous behavior was intended to parse closer to what we anticipated the author's intent to be, but that does not align to the spec or to how browsers behave. #1483
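For the div:has(span + a) fix, the following sketch shows the intended matching: the sibling combinator inside :has() is evaluated against the div's own descendants. The class name, helper name, and sample markup are hypothetical.

```java
import org.jsoup.Jsoup;

// Hypothetical example class, for illustration only.
public class HasCombinatorExample {

    static final String SAMPLE =
        "<div id=\"one\"><span>s</span><a href=\"#\">a</a></div>"
      + "<div id=\"two\"><a href=\"#\">a</a><span>s</span></div>";

    /** Returns the comma-joined ids of the elements matching the query. */
    static String idsMatching(String html, String cssQuery) {
        return String.join(",", Jsoup.parse(html).select(cssQuery).eachAttr("id"));
    }

    public static void main(String[] args) {
        // Only the first div contains an a immediately preceded by a span.
        System.out.println(idsMatching(SAMPLE, "div:has(span + a)"));
    }
}
```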
Configuration
📅 Schedule: (UTC)
- Branch creation
- At any time (no schedule defined)
- Automerge
- At any time (no schedule defined)
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.
Uh oh!
There was an error while loading. Please reload this page.
This PR contains the following updates:
1.18.1→1.22.2Release Notes
jhy/jsoup (org.jsoup:jsoup)
v1.22.2Improvements
NodeTraversorsupport for in-place DOM rewrites duringNodeVisitor.head(). Current-node edits such asremove,replace, andunwrapnow recover more predictably, while traversal stays within the original root subtree. This makes single-pass tree cleanup and normalization visitors easier to write, for example when unwrapping presentational elements or replacing text nodes as you walk the DOM. #2472Cleanermay be reused across concurrent threads, and that sharedSafelistinstances should not be mutated while in use. #2473TagSetfor current HTML elements: addeddialog,search,picture, andslot; madeins,del,button,audio,video, andcanvasinline by default (Tag#isInline(), aligned to phrasing content in the spec); and added readableElement.text()boundaries for controls and embedded objects via the newTag.TextBoundaryoption. This improves pretty-printing and keeps normalized text from running adjacent words together. #2493Bug Fixes
re2jdependency when not present. #2459NodeTraversorregression in 1.21.2 where removing or replacing the current node duringhead()could revisit the replacement node and loop indefinitely. The traversal docs now also clarify which inserted nodes are visited in the current pass. #2472available()call throwsIOException, as seen on JDK 8HttpURLConnection. #2474Cleanerno longer makes relative URL attributes in the input document absolute when cleaning or validating aDocument. URL normalization now applies only to the cleaned output, andSafelist.isSafeAttribute()is side effect free. #2475Cleanerno longer duplicates enforced attributes when the inputDocumentpreserves attribute case. A case-variant source attribute is now replaced by the enforced attribute in the cleaned output. #2476HttpClient, because the JDK would silently ignore that proxy and attempt to connect directly. Those requests now fall back to the legacyHttpURLConnectiontransport instead, which does support SOCKS. #2468Connection.Response.streamParser()andDataUtil.streamParser(Path, ...)could fail on small inputs without a declared charset, if the initial 5 KB charset sniff fully consumed the input and closed it before the stream parse began. #2483<!DOCTYPE root [<!ENTITY name "value">]>, now round-trip correctly. The subset is preserved as raw text only; entities are not expanded and external DTDs are not loaded. #2486Build Changes
v1.22.1Improvements
re2jregular expression engine for regex-based CSS selectors (e.g.[attr~=regex],:matches(regex)), which ensures linear-time performance for regex evaluation. This allows safer handling of arbitrary user-supplied query regexes. To enable, add thecom.google.re2jdependency to your classpath, e.g.:(If you already have that dependency in your classpath, but you want to keep using the Java regex engine, you can disable re2j via
System.setProperty("jsoup.useRe2j", "false").) You can confirm that the re2j engine has been enabled correctly by callingorg.jsoup.helper.Regex.usingRe2j(). #2407Parser#unescape(String, boolean)that unescapes HTML entities using the parser's configuration (e.g. to support error tracking), complementing the existing static utilityParser.unescapeEntities(String, boolean). #2396org.jsoup.parser.Parser#setMaxDepth. #2421Changes
Bug Fixes
Elementsof anElementwere not correctly invalidated inNode#replaceWith(Node), which could lead to incorrect results when subsequently callingElement#children(). #2391[attr=" foo "]). Now matches align with the CSS specification and browser engines. #2380ProxySelector.getDefault()) was ignored. Now, the system proxy is used if a per-request proxy is not set. #2388, #2390ValidationExceptioncould be thrown in the adoption agency algorithm with particularly broken input. Now logged as a parse error. #2393IndexOutOfBoundsExceptioncould be thrown when parsing a body fragment with crafted input. Now logged as a parse error. #2397, #2406parent childselector) across many retained threads, their memoized results could also be retained, increasing memory use. These results are now cleared immediately after use, reducing overall memory consumption. #2411Parsernow preserves any customTagSetapplied to the parser. #2422, #2423Tag.Voidnow parse and serialize like the built-in void elements: they no longer consume following content, and the XML serializer emits the expected self-closing form. #2425<br>element is once again classified as an inline tag (Tag.isBlock() == false), matching common developer expectations and its role as phrasing content in HTML, while pretty-printing and text extraction continue to treat it as a line break in the rendered output. #2387, #2439Jsoup.connect(url).get(). On responses without a charset header, the initial charset sniff could sometimes (depending on buffering /available()behavior) be mistaken for end-of-stream and a partial parse reused, dropping trailing content. #2448TagSetcopies no longer mutate their template during lazy lookups, preventing cross-threadConcurrentModificationExceptionwhen parsing with shared sessions. #2453<svg>foreignObjectcontent nested within a<p>, which could incorrectly move the HTML subtree outside the SVG. #2452Internal Changes
org.jsoup.internal.Functions(for removal in v1.23.1). This was previously used to support older Android API levels without fulljava.util.functioncoverage; jsoup now requires core library desugaring so this indirection is no longer necessary. #2412v1.21.2Changes
- Normalizer#normalize(String, bool) and Attribute#shouldCollapseAttribute(Document.OutputSettings). These will be removed in a future version.
- Connection#sslSocketFactory(SSLSocketFactory) in favor of the new Connection#sslContext(SSLContext). Using sslSocketFactory will force the use of the legacy HttpURLConnection implementation, which does not support HTTP/2. #2370
Improvements
- Connection.Response#statusMessage() to return a simple loggable string message (e.g. "OK") when using the HttpClient implementation, which doesn't otherwise return any server-set status message. #2356
- Attributes#size() and Attributes#isEmpty() now exclude any internal attributes (such as user data) from their count. This aligns with the attributes' serialized output and iterator. #2369
- Connection#sslContext(SSLContext) to provide a custom SSL (TLS) context to requests, supporting both the HttpClient and the legacy HttpURLConnection implementations. #2370
- element.child(0).remove(), and when using Parser#parseBodyFragment() to parse a large number of direct children. #2373.
Bug Fixes
- NodeTraversor, if a last child element was removed during the head() call, the parent would be visited twice. #2355.
- Attributes#size() and Attributes#isEmpty(). #2356
- Element#children() on the same element concurrently, a race condition could happen when the method was generating the internal child element cache (a filtered view of its child nodes). Since concurrent reads of DOM objects should be threadsafe without external synchronization, this method has been updated to execute atomically. #2366
v1.21.1
Changes
- :matchText pseudo-selector due to its side effects on the DOM; use the new ::textnode selector and the Element#selectNodes(String css, Class type) method instead. #2343
- Connection.Response#bufferUp() in lieu of Connection.Response#readFully() which can throw a checked IOException.
- Validate#ensureNotNull (replaced by typed Validate#expectNotNull); protected HTML appenders from Attribute and Node.
Improvements
- Selector to support direct matching against nodes such as comments and text nodes. For example, you can now find an element that follows a specific comment: ::comment:contains(prices) + p will select p elements immediately after a <!-- prices: --> comment. Supported types include ::node, ::leafnode, ::comment, ::text, ::data, and ::cdata. Node contextual selectors like ::node:contains(text), :matches(regex), and :blank are also supported. Introduced Element#selectNodes(String css) and Element#selectNodes(String css, Class nodeType) for direct node selection. #2324
- TagSet#onNewTag(Consumer<Tag> customizer): register a callback that’s invoked for each new or cloned Tag when it’s inserted into the set. Enables dynamic tweaks of tag options (for example, marking all custom tags as self-closing, or everything in a given namespace as preserving whitespace).
- TokenQueue and CharacterReader autocloseable, to ensure that they will release their buffers back to the buffer pool, for later reuse.
- Selector#evaluatorOf(String css), as a clearer way to obtain an Evaluator from a CSS query. An alias of QueryParser.parse(String css).
- TagSet) in a foreign namespace (e.g. SVG) can be configured to parse as data tags.
- NodeVisitor#traverse(Node) to simplify node traversal calls (vs. importing NodeTraversor).
- Connection#readFully() as a replacement for Connection#bufferUp() with an explicit IOException. Similarly, added Connection#readBody() over Connection#body(). Deprecated Connection#bufferUp(). #2327
- < and > characters are now escaped in attributes. This helps prevent a class of mutation XSS attacks. #2337
- Connection to prefer using the JDK's HttpClient over HttpURLConnection, if available, to enable HTTP/2 support by default. Users can disable via -Djsoup.useHttpClient=false. #2340
Bug Fixes
- script in a svg foreign context should be parsed as script data, not text. #2320
- Tag#isFormSubmittable() was updating the Tag's options. #2323
v1.20.1
Changes
- <foo />) to close HTML elements by default. Foreign content (SVG, MathML), and content parsed with the XML parser, still support self-closing tags. If you need specific HTML tags to support self-closing, you can register a custom tag via the TagSet configured in Parser.tagSet(), using Tag#set(Tag.SelfClose). Standard void tags (such as <img>, <br>, etc.) continue to behave as usual and are not affected by this change. #2300.
- ChangeNotifyingArrayList, Document.updateMetaCharsetElement(), Document.updateMetaCharsetElement(boolean), HtmlTreeBuilder.isContentForTagData(String), Parser.isContentForTagData(String), Parser.setTreeBuilder(TreeBuilder), Tag.formatAsBlock(), Tag.isFormListed(), TokenQueue.addFirst(String), TokenQueue.chompTo(String), TokenQueue.chompToIgnoreCase(String), TokenQueue.consumeToIgnoreCase(String), TokenQueue.consumeWord(), TokenQueue.matchesAny(String...)
Functional Improvements
- Tags, and provide a cleaner path for ongoing improvements. The specific HTML produced by the pretty-printer may be different from previous versions. #2286.
- TagSet tag collection. Their properties can impact both the parse and how content is serialized (output as HTML or XML). #2285.
- Element.cssSelector() will prefer to return shorter selectors by using ancestor IDs when available and unique. E.g. #id > div > p instead of html > body > div > div > p #2283.
- Elements.deselect(int index), Elements.deselect(Object o), and Elements.deselectAll() methods to remove elements from the Elements list without removing them from the underlying DOM. Also added Elements.asList() method to get a modifiable list of elements without affecting the DOM. (Individual Elements remain linked to the DOM.) #2100.
- Connection.requestBodyStream(InputStream stream). #1122.
- Attributes. Also, added Tag#prefix(), Tag#localName(), Attribute#prefix(), Attribute#localName(), and Attribute#namespace() to retrieve these. #2299.
- Element#cssSelector() will emit appropriately escaped selectors, and the QueryParser supports those. Added Selector.escapeCssIdentifier() and Selector.unescapeCssIdentifier(). #2297, #2305
Structure and Performance Improvements
- QueryParser into a clearer recursive descent parser. #2310.
- div >> p) will throw an explicit parse exception. #2311.
- #2307.
- HTML. #2304.
- Parser instances threadsafe, so that inadvertent use of the same instance across threads will not lead to errors. For actual concurrency, use Parser#newInstance() per thread. #2314.
Bug Fixes
- serializing. #1496.
- encoded). #1743.
- Document to the W3C DOM in W3CDom, elements with an attribute in an undeclared namespace now get a declaration of xmlns:prefix="undefined". This allows subsequent serialization to XML via W3CDom.asString() to succeed. #2087.
- StreamParser could emit the final elements of a document twice, due to how onNodeCompleted was fired when closing out the stack. #2295.
- ? in <?xml version="1.0"?> would incorrectly emit an error. #2298.
- Element#cssSelector() on an element with combining characters in the class or ID now produces the correct output. #1984.
v1.19.1
Changes
- Jsoup.connect(), when running on Java 11+, via the Java HttpClient implementation. #2257.
- System.setProperty("jsoup.useHttpClient", "true"); to enable making requests via the HttpClient instead, which will enable HTTP/2 support, if available. This will become the default in a later version of jsoup, so now is a good time to validate it.
- that as a Multi-Release JAR.
- HttpClient impl is not available in your JRE, requests will continue to be made via HttpURLConnection (in HTTP/1.1 mode).
- developers need to enable core library desugaring. The minimum Java version remains Java 8. #2173
- org.jsoup.UncheckedIOException (replace with java.io.UncheckedIOException); moved previously deprecated method Element Element#forEach(Consumer) to void Element#forEach(Consumer()). #2246
- Document#updateMetaCharsetElement(boolean) and Document#updateMetaCharsetElement(), as the setting had no effect. When Document#charset(Charset) is called, the document's meta charset or XML encoding instruction is always set. #2247
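The HttpClient opt-in mentioned in the changes above comes down to a single system property. The following is a minimal sketch that assumes only the property name jsoup.useHttpClient quoted in these notes; the class name HttpClientOptIn and the availability probe are hypothetical illustrations, not part of jsoup, and jsoup's own detection logic may differ.

```java
public class HttpClientOptIn {
    // Property name as quoted in the release notes above.
    static final String PROPERTY = "jsoup.useHttpClient";

    /** Sets the opt-in flag; doing this at startup, before any request, is the cautious choice. */
    static void enable() {
        System.setProperty(PROPERTY, "true");
    }

    /**
     * Hypothetical probe (not jsoup's own check): reports whether the JDK
     * HttpClient class can be loaded in this runtime, which requires Java 11+.
     */
    static boolean httpClientAvailable() {
        try {
            Class.forName("java.net.http.HttpClient");
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        enable();
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
        System.out.println("JDK HttpClient loadable: " + httpClientAvailable());
    }
}
```

The same flag can be passed on the command line as -Djsoup.useHttpClient=true, which avoids touching application code while validating the new transport.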
Improvements
- Safelist that preserves relative links, the isValid() method will now consider these links valid. Additionally, the enforced attribute rel=nofollow will only be added to external links when configured in the safelist. #2245
- Element#selectStream(String query) and Element#selectStream(Evaluator) methods, that return a Stream of matching elements. Elements are evaluated and returned as they are found, and the stream can be terminated early. #2092
- Element objects now implement Iterable, enabling them to be used in enhanced for loops.
- Reader via Parser#parseFragmentInput(Reader, Element, String). #1177
- jsoup-examples.jar. #1702
- #id .class (and other similar descendant queries) by around 4.6x, by better balancing the Ancestor evaluator's cost function in the query planner. #2254
- <isindex> tags, which would autovivify a form element with labels. This is no longer in the spec.
- Elements.selectFirst(String cssQuery) and Elements.expectFirst(String cssQuery), to select the first matching element from an Elements list. #2263
- through the HTML parser's bogus comment handler. Serialization for non-doctype declarations no longer ends with a spurious !. #2275
- < are normalized to _ to ensure valid XML. For example, <foo<bar> becomes <foo_bar>, as XML does not allow < in element names, but HTML5 does. #2276
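To illustrate the tag-name normalization in the last item above, here is a minimal sketch in plain Java. It assumes only the behavior stated in these notes (a < in an element name becomes _ for XML output); the class and method names are hypothetical, and jsoup's actual implementation may cover more cases than this.

```java
public class XmlTagNameSketch {
    /**
     * Hypothetical helper: replaces each '<' in an HTML tag name with '_',
     * mirroring the <foo<bar> to <foo_bar> example in the notes.
     */
    static String normalizeTagName(String htmlTagName) {
        return htmlTagName.replace('<', '_');
    }

    public static void main(String[] args) {
        System.out.println(normalizeTagName("foo<bar")); // prints foo_bar
    }
}
```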
Bug Fixes
- ; in an attribute name, it could not be converted to a W3C DOM element, and so subsequent XPath queries could miss that element. Now, the attribute name is more completely normalized. #2244
- "name". #2241
- Connection, skip cookies that have no name, rather than throwing a validation exception. #2242
- java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer; could be thrown when calling Response#body() after parsing from a URL and the buffer size was exceeded. #2250
- null InputStream inputs to Jsoup.parse(InputStream stream, ...), by returning an empty Document. #2252
- template tag containing an li within an open li would be parsed incorrectly, as it was not recognized as a "special" tag (which have additional processing rules). Also, added the SVG and MathML namespace tags to the list of special tags. #2258
- template tag containing a button within an open button would be parsed incorrectly, as the "in button scope" check was not aware of the template element. Corrected other instances including MathML and SVG elements, also. #2271
- :nth-child selector with a negative digit-less step, such as :nth-child(-n+2), would be parsed incorrectly as a positive step, and so would not match as expected. #1147
- doc.charset(charset) on an empty XML document would throw an IndexOutOfBoundsException. #2266
- StructuralEvaluator (e.g., a selector ancestor chain like A B C) by ensuring cache reset calls cascade to inner members. #2277
- doc.clone().append(html) were not supported. When a document was cloned, its Parser was not cloned but was a shallow copy of the original parser. #2281
v1.18.3
Bug Fixes
- -, ., or digits were incorrectly marked as invalid and removed. 2235
v1.18.2
Improvements
- down between -6% and -89%, and throughput improved up to +143% for small inputs. Most input sizes will see throughput increases of ~20%. These performance improvements come through recycling the backing byte[] and char[] arrays used to read and parse the input. 2186
- html() and Entities.escape() when the input contains UTF characters in a supplementary plane, by around 49%. 2183
- FormElement.elements() now reflect changes made to the DOM, subsequently to the original parse. 2140
- TreeBuilder, the onNodeInserted() and onNodeClosed() events are now also fired for the outermost / root Document node. This enables source position tracking on the Document node (which was previously unset). And it also enables the node traversor to see the outer Document node. 2182
- Elements#set(). 2212
Bug Fixes
- Element.cssSelector() would fail if the element's class contained a * character. 2169
- untracked. 2175
- html, it should be parsed in QuirksMode. 2197
- div:has(span + a), the has() component was not working correctly, as the inner combining query caused the evaluator to match those against the outer's siblings, not children. 2187
- :has() components in a nested :has() might incorrectly execute. 2131
- Connection.Response#cookies() will provide the last one set. Generally it is better to use the Jsoup.newSession method to maintain a cookie jar, as that applies appropriate path selection on cookies when making requests. 1831
- attribute). 2207
- created (html or body). 2204
- < as part of a tag name, instead of emitting it as a character node. 2230
- < as the start of an attribute name, vs creating a new element. The previous behavior was intended to parse closer to what we anticipated the author's intent to be, but that does not align to the spec or to how browsers behave. 1483
Configuration
📅 Schedule: (UTC)
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.