TensorFlow text processing guide

The TensorFlow text processing guide documents libraries and workflows for natural language processing (NLP) and introduces important concepts for working with text.

KerasNLP

KerasNLP is a high-level natural language processing (NLP) library that includes modern Transformer-based models as well as lower-level tokenization utilities. It's the recommended solution for most NLP use cases.

  • Getting Started with KerasNLP: Learn KerasNLP by performing sentiment analysis at progressive levels of complexity, from using a pre-trained model to building your own Transformer from scratch.

tf.strings

The tf.strings module provides operations for working with string Tensors.

  • Unicode strings: Represent Unicode strings in TensorFlow and manipulate them using Unicode equivalents of standard string ops.
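
As a minimal sketch of what working with Unicode in tf.strings can look like, the example below compares byte length with character length, decodes strings into code points, and encodes them back. The sample strings and variable names are made up for illustration, and the example assumes TensorFlow 2.x is installed.

```python
import tensorflow as tf

# Illustrative sample strings (not from the guide); Python str literals are
# stored as UTF-8 encoded bytes once placed in a string Tensor.
samples = ["Héllo", "语言处理"]
texts = tf.constant(samples)

# By default tf.strings.length counts bytes; unit="UTF8_CHAR" counts
# Unicode code points instead.
byte_lengths = tf.strings.length(texts)
char_lengths = tf.strings.length(texts, unit="UTF8_CHAR")

# Decode each string into its code points. Rows can differ in length,
# so the result for a batch is a tf.RaggedTensor.
code_points = tf.strings.unicode_decode(texts, input_encoding="UTF-8")

# Encode the code points back into UTF-8 strings (a round trip).
round_trip = tf.strings.unicode_encode(code_points, output_encoding="UTF-8")

# Lowercasing non-ASCII characters needs the encoding argument;
# without it, only ASCII letters are affected.
lowered = tf.strings.lower(tf.constant("ÉCOLE"), encoding="utf-8")

print(byte_lengths.numpy(), char_lengths.numpy())
print(code_points.to_list())
print(lowered.numpy().decode("utf-8"))
```

The byte and character counts differ for any string that contains non-ASCII characters, which is why the unit argument matters when truncating or padding text.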

TensorFlow Text

If you need access to lower-level text processing tools, use TensorFlow Text. It provides a collection of ops and libraries for working with text input, such as raw text strings or documents.

Preprocessing

  • BERT Preprocessing with TF Text: Use TensorFlow Text preprocessing ops to transform text data into inputs for BERT.
  • Tokenizing with TF Text: Understand the tokenization options provided by TensorFlow Text. Learn when you might want to use one option over another, and how these tokenizers are called from within your model.
  • Subword tokenizers: Generate a subword vocabulary from a dataset, and use it to build a text.BertTokenizer from the vocabulary.
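
To make the idea of subword tokenization concrete, here is a plain-Python sketch of the greedy longest-match-first lookup used by WordPiece-style tokenizers. It is not TensorFlow Text's implementation: the function name wordpiece_tokenize, the toy vocabulary, and the "##" suffix marker default are assumptions chosen for illustration, and a real vocabulary would be generated from a dataset.

```python
from typing import List, Set


def wordpiece_tokenize(
    word: str,
    vocab: Set[str],
    unk_token: str = "[UNK]",
    suffix_marker: str = "##",
) -> List[str]:
    """Split one word into subwords by greedy longest-match-first lookup.

    Hypothetical helper for illustration only.
    """
    tokens: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the suffix marker.
                piece = suffix_marker + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece of the vocabulary fits here: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary, invented for this example.
toy_vocab = {"token", "##izer", "##s", "un", "##related", "[UNK]"}

tokens_a = wordpiece_tokenize("tokenizers", toy_vocab)
tokens_b = wordpiece_tokenize("unrelated", toy_vocab)
tokens_c = wordpiece_tokenize("xyz", toy_vocab)
print(tokens_a, tokens_b, tokens_c)
```

Because unseen words are broken into known pieces where possible, a fixed-size subword vocabulary can cover far more text than a word-level vocabulary of the same size.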

TensorFlow Models - NLP

The TensorFlow Models - NLP library provides Keras primitives that can be assembled into Transformer-based models, and scaffold classes that enable easy experimentation with novel architectures.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies.

Last updated 2023-07-27 UTC.