text.Splitter

An abstract base class for splitting text.

text.Splitter(
 name=None
)

A Splitter is a module that splits strings into pieces. Generally, the pieces returned by a splitter correspond to substrings of the original string, and can be encoded using either strings or integer ids (where integer ids could be created by hashing strings or by looking them up in a fixed vocabulary table that maps strings to ids).

Each Splitter subclass must implement a split method, which subdivides each string in an input Tensor into pieces. E.g.:

import tensorflow as tf
import tensorflow_text as tf_text

class SimpleSplitter(tf_text.Splitter):
  def split(self, input):
    return tf.strings.split(input)
print(SimpleSplitter().split(["hello world", "this is a test"]))
<tf.RaggedTensor [[b'hello', b'world'], [b'this', b'is', b'a', b'test']]>

Methods

split

@abc.abstractmethod
split(
 input
)

Splits the input tensor into pieces.

Generally, the pieces returned by a splitter correspond to substrings of the original string, and can be encoded using either strings or integer ids.

Example:

print(tf_text.WhitespaceTokenizer().split("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)

Args
input: An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.

Returns
An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. For each string from the input tensor, the final, extra dimension contains the pieces that string was split into.

Last updated 2025-04-11 UTC.