The Commons Text library provides additions to the standard JDK's text handling. Our goal is to provide a consistent set of tools for processing text, from computing distances between Strings to efficiently escaping Strings in various formats.
Originally the text package was added in Commons Lang 2.2. However, its new home is here. It provides, amongst other classes, a replacement for StringBuffer named StrBuilder, a class for substituting variables within a String named StrSubstitutor, and a replacement for StringTokenizer named StrTokenizer. While somewhat ungainly, the Str prefix has been used to ensure we don't clash with any current or future standard Java classes.
Beyond the text utilities ported over from Commons Lang, we have also included various string similarity and distance functions. Lastly, there are utilities for computing the differences between bodies of text so that those differences can be viewed.
From Lang 3.5, StringEscapeUtils and StrTokenizer have moved into Text. The library also provides ways to generate pieces of text, such as might be used for default passwords. StringEscapeUtils contains methods to escape and unescape Java, JavaScript, HTML and XML. It is worth noting that the package org.apache.commons.text.translate holds the functionality underpinning StringEscapeUtils, with mappings and translations between such mappings for the sake of String escaping. StrTokenizer is an improved alternative to java.util.StringTokenizer.
The simplest example is to use StringSubstitutor to replace Java System properties. For example:
StringSubstitutor.replaceSystemProperties(
"You are running with java.version = ${java.version} and os.name = ${os.name}.");
For details see StringSubstitutor.
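To illustrate the idea behind variable substitution, here is a minimal sketch that uses only the JDK. It is not how StringSubstitutor is implemented; the class name SubstitutionSketch and its replace method are hypothetical, and the sketch assumes the ${name} syntax with no escaping, defaults, or nested variables.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Commons Text.
class SubstitutionSketch {
    // Matches ${name}; group 1 captures the variable name.
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with its value from the map; unknown names are left untouched. */
    static String replace(String template, Map<String, String> values) {
        Matcher matcher = VARIABLE.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            String replacement = value != null ? value : matcher.group();
            // quoteReplacement keeps '$' and '\' in values from being treated as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = Map.of("animal", "fox", "target", "dog");
        System.out.println(replace("The ${animal} jumps over the ${target}.", values));
    }
}
```

Leaving unknown variables unchanged is a choice made for this sketch; a real substitutor may offer other policies.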
Use a StringSubstitutorReader to avoid reading a whole file into memory as a String to perform string substitution, for example, when a Servlet filters a file to a client.
To build a default full-featured substitutor, use StringSubstitutor.createInterpolator().
The available substitutions are defined in org.apache.commons.text.lookup.StringLookupFactory.
The org.apache.commons.text.similarity package contains various mechanisms for calculating "similarity scores" as well as "edit distances" between Strings. Note that the difference between a "similarity score" and a "distance function" is that a distance function meets the following qualifications:
d(x,y) >= 0, non-negativity or separation axiom
d(x,y) == 0, if and only if, x == y
d(x,y) == d(y,x), symmetry, and
d(x,z) <= d(x,y) + d(y,z), the triangle inequality
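The four qualifications above can be checked mechanically on sample data. The sketch below uses a hand-written Hamming distance (defined only for equal-length strings) rather than any Commons Text class; MetricAxiomsSketch and its methods are hypothetical names. Passing the check on a few words does not prove that a function is a metric, it only fails to find a counterexample.

```java
import java.util.List;

// Hypothetical illustration class; not part of Commons Text.
class MetricAxiomsSketch {
    /** Hamming distance: number of positions at which two equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    /** Checks the four axioms for every pair and triple drawn from the sample words. */
    static boolean satisfiesMetricAxioms(List<String> words) {
        for (String x : words) {
            for (String y : words) {
                int dxy = hamming(x, y);
                if (dxy < 0) return false;                    // non-negativity
                if ((dxy == 0) != x.equals(y)) return false;  // d(x,y) == 0 iff x == y
                if (dxy != hamming(y, x)) return false;       // symmetry
                for (String z : words) {
                    // triangle inequality: d(x,z) <= d(x,y) + d(y,z)
                    if (hamming(x, z) > dxy + hamming(y, z)) return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(satisfiesMetricAxioms(List.of("karolin", "kathrin", "kerstin")));
    }
}
```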
The "edit distances" that we currently support are listed in the documentation of the org.apache.commons.text.similarity package.
The org.apache.commons.text.diff package contains code for computing the diff between strings. The initial implementation of the Myers algorithm was adapted from the commons-collections sequence package.
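To show what a character-level diff produces, here is a small JDK-only sketch. Note that it deliberately swaps in a simpler technique: it builds a longest-common-subsequence table by dynamic programming instead of using the Myers algorithm, and it does not use the classes of the diff package. DiffSketch and the "=", "-", "+" script notation are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; uses an LCS table, not the Myers algorithm.
class DiffSketch {
    /** Returns an edit script: "=c" keeps c, "-c" deletes c from left, "+c" inserts c from right. */
    static List<String> diff(String left, String right) {
        int n = left.length();
        int m = right.length();
        // lcs[i][j] = length of the longest common subsequence of left[i..] and right[j..].
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = left.charAt(i) == right.charAt(j)
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        List<String> script = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (left.charAt(i) == right.charAt(j)) {
                script.add("=" + left.charAt(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                script.add("-" + left.charAt(i));   // dropping from left loses nothing
                i++;
            } else {
                script.add("+" + right.charAt(j));  // otherwise take the char from right
                j++;
            }
        }
        while (i < n) script.add("-" + left.charAt(i++));
        while (j < m) script.add("+" + right.charAt(j++));
        return script;
    }

    public static void main(String[] args) {
        System.out.println(diff("house", "hose")); // [=h, =o, -u, =s, =e]
    }
}
```

Reading only the "=" and "-" entries reproduces the left string, and reading only the "=" and "+" entries reproduces the right string.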
Provides algorithms for looking up strings used by a StringSubstitutor. Standard lookups are defined in StringLookupFactory and the associated DefaultStringLookup enum.
The example below demonstrates the use of the default lookups for StringSubstitutor to construct a complex string.
NOTE: The list of lookups available by default changed in version 1.10.0. See the documentation for StringLookupFactory for details and instructions on how to reproduce the previous behavior.
// Create a substitutor preconfigured with the default lookups.
final StringSubstitutor interpolator = StringSubstitutor.createInterpolator();
final String text = interpolator.replace(
    "Base64 Decoder: ${base64Decoder:SGVsbG9Xb3JsZCE=}\n" +
    "Base64 Encoder: ${base64Encoder:HelloWorld!}\n" +
    "Java Constant: ${const:java.awt.event.KeyEvent.VK_ESCAPE}\n" +
    "Date: ${date:yyyy-MM-dd}\n" +
    "Environment Variable: ${env:USERNAME}\n" +
    "File Content: ${file:UTF-8:src/test/resources/document.properties}\n" +
    "Java: ${java:version}\n" +
    "Local host: ${localhost:canonical-name}\n" +
    "Loopback address: ${loopbackAddress:canonical-name}\n" +
    "Properties File: ${properties:src/test/resources/document.properties::mykey}\n" +
    "Resource Bundle: ${resourceBundle:org.apache.commons.text.example.testResourceBundleLookup:mykey}\n" +
    "System Property: ${sys:user.dir}\n" +
    "URL Decoder: ${urlDecoder:Hello%20World%21}\n" +
    "URL Encoder: ${urlEncoder:Hello World!}\n" +
    "XML Decoder: ${xmlDecoder:&lt;element&gt;}\n" +
    "XML Encoder: ${xmlEncoder:<element>}\n" +
    "XML XPath: ${xml:src/test/resources/document.xml:/root/path/to/node}\n"
);
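The variables in the example share a "prefix:key" shape, where the prefix selects a lookup and the rest is handed to it. The following JDK-only sketch shows one way such a dispatch could work. It is an assumption-laden illustration, not the StringLookupFactory implementation: PrefixLookupSketch, its registry, and the resolve method are hypothetical, and only four of the prefixes are imitated.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration class; not part of Commons Text.
class PrefixLookupSketch {
    // Hypothetical registry: prefix -> function resolving the text after the first colon.
    private static final Map<String, Function<String, String>> LOOKUPS = new HashMap<>();

    static {
        LOOKUPS.put("sys", System::getProperty);
        LOOKUPS.put("env", System::getenv);
        LOOKUPS.put("base64Encoder",
                s -> Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8)));
        LOOKUPS.put("urlEncoder", s -> URLEncoder.encode(s, StandardCharsets.UTF_8));
    }

    /** Resolves "prefix:key"; returns null when the prefix is missing or unknown. */
    static String resolve(String variable) {
        int colon = variable.indexOf(':');
        if (colon < 0) {
            return null;
        }
        Function<String, String> lookup = LOOKUPS.get(variable.substring(0, colon));
        return lookup == null ? null : lookup.apply(variable.substring(colon + 1));
    }

    public static void main(String[] args) {
        System.out.println(resolve("base64Encoder:HelloWorld!")); // SGVsbG9Xb3JsZCE=
        System.out.println(resolve("sys:user.dir"));
    }
}
```

Keeping the registry explicit makes it easy to see which lookups are enabled, which matters when some lookups read files, environment variables, or the network.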
Provides algorithms for string similarity.
The algorithms that implement the EditDistance interface follow the same simple principle: the more similar (closer) two strings are, the lower the distance. For example, the words "house" and "hose" are closer than "house" and "trousers".
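The house, hose and trousers comparison can be reproduced with a textbook Levenshtein distance. The sketch below is a plain dynamic-programming version written against the JDK only; it is not the library's LevenshteinDistance class, and EditDistanceSketch is a hypothetical name.

```java
// Hypothetical illustration class; not part of Commons Text.
class EditDistanceSketch {
    /** Classic dynamic-programming Levenshtein distance using two rolling rows. */
    static int levenshtein(CharSequence left, CharSequence right) {
        int[] previous = new int[right.length() + 1];
        int[] current = new int[right.length() + 1];
        // Turning the empty prefix of left into right[0..j) takes j insertions.
        for (int j = 0; j <= right.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= left.length(); i++) {
            current[0] = i; // deleting i characters yields the empty string
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1), // insert, delete
                        previous[j - 1] + cost);                       // match or substitute
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[right.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("house", "hose"));     // 1
        System.out.println(levenshtein("house", "trousers")); // 4
    }
}
```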
The following algorithms are available at the moment:
CosineDistance
CosineSimilarity
FuzzyScore
HammingDistance
JaroWinklerDistance
JaroWinklerSimilarity
LevenshteinDistance
LongestCommonSubsequenceDistance
The CosineDistance utilises a RegexTokenizer, a regular expression tokenizer (\w+). The behavior of LevenshteinDistance can be changed to take a maximum threshold into consideration.
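As a rough illustration of a cosine distance over \w+ tokens, the sketch below tokenizes with a regular expression, counts term frequencies, and returns one minus the cosine similarity. It is written against the JDK only and is not the library's CosineDistance; CosineSketch is a hypothetical name, and returning 1.0 when either text has no tokens is an assumption made here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Commons Text.
class CosineSketch {
    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Counts how often each \w+ token occurs in the text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    private static double sumOfSquares(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return sum;
    }

    /** Cosine distance = 1 - cosine similarity of the two term-frequency vectors. */
    static double cosineDistance(String left, String right) {
        Map<String, Integer> a = termFrequencies(left);
        Map<String, Integer> b = termFrequencies(right);
        double dot = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        double norm = Math.sqrt(sumOfSquares(a)) * Math.sqrt(sumOfSquares(b));
        // Assumption of this sketch: texts without tokens are treated as maximally distant.
        return norm == 0 ? 1.0 : 1.0 - dot / norm;
    }

    public static void main(String[] args) {
        System.out.println(cosineDistance("hello world", "hello there"));
    }
}
```

Because the result is computed in floating point, compare it against expected values with a small tolerance rather than with ==.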
An API for creating text translation routines from a set of smaller building blocks. Initially created to make it possible for the user to customize the rules in the StringEscapeUtils class.
These classes are immutable, and therefore thread-safe.
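The building-block idea can be sketched with a tiny translator interface and two combinators. This is a JDK-only illustration under stated assumptions: TranslatorSketch, Translator, lookup, aggregate and apply are hypothetical names and do not reproduce the API of the translate package. The translators hold no mutable state, which mirrors the immutability noted above.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Commons Text.
class TranslatorSketch {
    /** Building block: writes a translation to out and returns the number of chars consumed (0 = none). */
    interface Translator {
        int translate(CharSequence input, int index, StringBuilder out);
    }

    /** Replaces single characters found in a fixed lookup table. */
    static Translator lookup(Map<Character, String> table) {
        return (input, index, out) -> {
            String replacement = table.get(input.charAt(index));
            if (replacement == null) {
                return 0;
            }
            out.append(replacement);
            return 1;
        };
    }

    /** Tries each translator in order; the first one that consumes input wins. */
    static Translator aggregate(List<Translator> translators) {
        return (input, index, out) -> {
            for (Translator translator : translators) {
                int consumed = translator.translate(input, index, out);
                if (consumed > 0) {
                    return consumed;
                }
            }
            return 0;
        };
    }

    /** Runs a translator over the whole input, copying characters nobody translated. */
    static String apply(Translator translator, CharSequence input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int consumed = translator.translate(input, i, out);
            if (consumed == 0) {
                out.append(input.charAt(i));
                consumed = 1;
            }
            i += consumed;
        }
        return out.toString();
    }

    // Two small rule sets composed into one escaper.
    static final Translator ESCAPE_MARKUP = aggregate(List.of(
            lookup(Map.of('&', "&amp;")),
            lookup(Map.of('<', "&lt;", '>', "&gt;"))));

    public static void main(String[] args) {
        System.out.println(apply(ESCAPE_MARKUP, "<a & b>")); // &lt;a &amp; b&gt;
    }
}
```

Customizing the rules then amounts to composing a different list of small translators rather than editing one large escaping routine.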
Copyright © 2014-2025 The Apache Software Foundation. All Rights Reserved.