Module: text


Various TensorFlow ops related to text processing.

Modules

metrics module: TensorFlow text-processing metrics.

tflite_registrar module: A module with a Python wrapper for TFLite TFText ops.

Classes

class BertTokenizer: Tokenizer used for BERT.

class ByteSplitter: Splits a string tensor into bytes.

class Detokenizer: Base class for detokenizer implementations.

class FastBertNormalizer: Normalizes a tensor of UTF-8 strings.

class FastBertTokenizer: Tokenizer used for BERT, a faster version with TFLite support.

class FastSentencepieceTokenizer: Sentencepiece tokenizer with tf.text interface.

class FastWordpieceTokenizer: Tokenizes a tensor of UTF-8 string tokens into subword pieces.

class FirstNItemSelector: An ItemSelector that selects the first n items in the batch.

class HubModuleSplitter: Splitter that uses a Hub module.

class HubModuleTokenizer: Tokenizer that uses a Hub module.

class LastNItemSelector: An ItemSelector that selects the last n items in the batch.

class MaskValuesChooser: Assigns values to the items chosen for masking.

class PhraseTokenizer: Tokenizes a tensor of UTF-8 string tokens into phrases.

class RandomItemSelector: An ItemSelector implementation that randomly selects items in a batch.

class Reduction: Type of reduction to be done by the n-gram op.

class RegexSplitter: RegexSplitter splits text on the given regular expression.

class RoundRobinTrimmer: A Trimmer that allocates a length budget to segments via round robin.

class SentencepieceTokenizer: Tokenizes a tensor of UTF-8 strings.

class ShrinkLongestTrimmer: A Trimmer that truncates the longest segment.

class SplitMergeFromLogitsTokenizer: Tokenizes a tensor of UTF-8 string into words according to logits.

class SplitMergeTokenizer: Tokenizes a tensor of UTF-8 string into words according to labels.

class Splitter: An abstract base class for splitting text.

class SplitterWithOffsets: An abstract base class for splitters that return offsets.

class StateBasedSentenceBreaker: A Splitter that uses a state machine to determine sentence breaks.

class Tokenizer: Base class for tokenizer implementations.

class TokenizerWithOffsets: Base class for tokenizer implementations that return offsets.

class Trimmer: Truncates a list of segments using a pre-determined truncation strategy.

class UnicodeCharTokenizer: Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.

class UnicodeScriptTokenizer: Tokenizes UTF-8 by splitting when there is a change in Unicode script.

class WaterfallTrimmer: A Trimmer that allocates a length budget to segments in order.

class WhitespaceTokenizer: Tokenizes a tensor of UTF-8 strings on whitespaces.

class WordShape: Values for the 'pattern' arg of the wordshape op.

class WordpieceTokenizer: Tokenizes a tensor of UTF-8 string tokens into subword pieces.

Functions

boise_tags_to_offsets(...): Converts the token offsets and BOISE tags into span offsets and span type.

build_fast_bert_normalizer_model(...): build_fast_bert_normalizer_model(arg0: bool) -> bytes

build_fast_wordpiece_model(...): build_fast_wordpiece_model(arg0: list[str], arg1: int, arg2: str, arg3: str, arg4: bool, arg5: bool) -> bytes

case_fold_utf8(...): Applies case folding to every UTF-8 string in the input.

coerce_to_structurally_valid_utf8(...): Coerce UTF-8 input strings to structurally valid UTF-8.

combine_segments(...): Combine one or more input segments for a model's input sequence.

concatenate_segments(...): Concatenate input segments for a model's input sequence.

find_source_offsets(...): Maps the input post-normalized string offsets to pre-normalized offsets.

gather_with_default(...): Gather slices with indices=-1 mapped to default.

greedy_constrained_sequence(...): Performs greedy constrained sequence on a batch of examples.

mask_language_model(...): Applies dynamic language model masking.

max_spanning_tree(...): Finds the maximum directed spanning tree of a digraph.

max_spanning_tree_gradient(...): Returns a subgradient of the MaximumSpanningTree op.

ngrams(...): Create a tensor of n-grams based on the input data.

normalize_utf8(...): Normalizes each UTF-8 string in the input tensor using the specified rule.

normalize_utf8_with_offsets_map(...): Normalizes each UTF-8 string in the input tensor using the specified rule.

offsets_to_boise_tags(...): Converts the given tokens and spans in offsets format into BOISE tags.

pad_along_dimension(...): Add padding to the beginning and end of data in a specific dimension.

pad_model_inputs(...): Pad model input and generate corresponding input masks.

regex_split(...): Split input by delimiters that match a regex pattern.

regex_split_with_offsets(...): Split input by delimiters that match a regex pattern; returns offsets.

sentence_fragments(...): Find the sentence fragments in a given text. (deprecated)

sliding_window(...): Builds a sliding window for data with a specified width.

span_alignment(...): Return an alignment from a set of source spans to a set of target spans.

span_overlaps(...): Returns a boolean tensor indicating which source and target spans overlap.

utf8_binarize(...): Decode UTF-8 tokens into code points and return their bits.

viterbi_constrained_sequence(...): Performs Viterbi constrained sequence on a batch of examples.

wordshape(...): Determine wordshape features for each input string.

Other Members

version '2.19.0'
