Tokenizes a tensor of UTF-8 strings.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer
text.SentencepieceTokenizer(
    model=None,
    out_type=dtypes.int32,
    nbest_size=0,
    alpha=1.0,
    reverse=False,
    add_bos=False,
    add_eos=False,
    return_nbest=False,
    name=None
)
SentencePiece is an unsupervised text tokenizer and detokenizer. It is used mainly for neural-network-based text generation systems where the vocabulary size is fixed before model training. SentencePiece implements subword units and can be trained directly from raw sentences.
Before using the tokenizer, you will need to train a vocabulary and build a model configuration for it. See the SentencePiece repository for the most up-to-date instructions on this process.
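For illustration, here is a minimal sketch of constructing a tokenizer from a serialized SentencePiece model proto. The file path sp_model.model is a placeholder for a model you have trained separately:

```
import tensorflow as tf
import tensorflow_text as text

# Load the serialized SentencePiece model proto produced by training.
# "sp_model.model" is a placeholder path; substitute your own model file.
with tf.io.gfile.GFile("sp_model.model", "rb") as f:
    sp_model = f.read()

tokenizer = text.SentencepieceTokenizer(model=sp_model, out_type=tf.int32)
```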
Methods
detokenize
detokenize(
input, name=None
)
Detokenizes tokens into preprocessed text.
This function accepts tokenized text and reassembles it into sentences.
| Args | |
| --- | --- |
| input | A RaggedTensor or Tensor of UTF-8 string tokens with a rank of at least 1. |
| name | The name argument that is passed to the op function. |

| Returns |
| --- |
| An N-1 dimensional string Tensor or RaggedTensor of the detokenized text. |
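As a rough sketch (assuming the tokenizer instance constructed in the example above), detokenize reverses tokenize:

```
# Round-trip: tokenize a batch of sentences, then reassemble them.
token_ids = tokenizer.tokenize(["Hello world.", "Pieces fit together."])
sentences = tokenizer.detokenize(token_ids)
# sentences is a rank-1 string Tensor, one element per input sentence.
```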
id_to_string
id_to_string(
input, name=None
)
Converts vocabulary IDs into tokens.
| Args | |
| --- | --- |
| input | An arbitrary tensor of int32 representing the token IDs. |
| name | The name argument that is passed to the op function. |

| Returns |
| --- |
| A tensor of strings with the same shape as the input. |
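A small sketch, again assuming the tokenizer built earlier; the IDs shown are arbitrary examples, and which pieces they map to depends entirely on the trained model:

```
import tensorflow as tf

ids = tf.constant([0, 1, 2, 100], dtype=tf.int32)  # arbitrary example IDs
pieces = tokenizer.id_to_string(ids)  # string tensor with the same shape as ids
```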
split
split(
input
)
Alias for Tokenizer.tokenize.
split_with_offsets
split_with_offsets(
input
)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
string_to_id
string_to_id(
input, name=None
)
Converts tokens into vocabulary IDs.
This function is particularly helpful for determining the IDs of special tokens whose IDs cannot be determined through normal tokenization.
| Args | |
| --- | --- |
| input | An arbitrary tensor of string tokens. |
| name | The name argument that is passed to the op function. |

| Returns |
| --- |
| A tensor of int32 representing the IDs, with the same shape as the input. |
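For instance, assuming the tokenizer from the earlier example and a model trained with SentencePiece's default control symbols (an assumption; your model may use different pieces):

```
import tensorflow as tf

# Look up IDs of special pieces directly rather than by tokenizing text.
special = tf.constant(["<s>", "</s>", "<unk>"])
special_ids = tokenizer.string_to_id(special)  # int32, same shape as input
```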
tokenize
tokenize(
input, name=None
)
Tokenizes a tensor of UTF-8 strings.
| Args | |
| --- | --- |
| input | A RaggedTensor or Tensor of UTF-8 strings with any shape. |
| name | The name argument that is passed to the op function. |

| Returns |
| --- |
| A RaggedTensor of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for the tokens of each string. |
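Assuming the tokenizer constructed above, a rank-1 input of two strings yields a RaggedTensor with one added ragged dimension:

```
ids = tokenizer.tokenize(["Hello world.", "Subwords, not words."])
# ids has shape [2, None]: each row holds the token IDs of one sentence
# (or string pieces, if the tokenizer was built with out_type=tf.string).
```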
tokenize_with_offsets
tokenize_with_offsets(
input, name=None
)
Tokenizes a tensor of UTF-8 strings.
This function returns a tuple containing the tokens along with start and end byte offsets that mark where in the original string each token was located.
| Args | |
| --- | --- |
| input | A RaggedTensor or Tensor of UTF-8 strings with any shape. |
| name | The name argument that is passed to the op function. |

| Returns | |
| --- | --- |
| A tuple (tokens, start_offsets, end_offsets) where: | |
| tokens | An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. |
| start_offsets | An N+1-dimensional integer Tensor or RaggedTensor containing the starting indices of each token (byte indices for input strings). |
| end_offsets | An N+1-dimensional integer Tensor or RaggedTensor containing the exclusive ending indices of each token (byte indices for input strings). |
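A brief sketch, assuming the tokenizer from above; the offsets let you slice each token's bytes out of the original string:

```
sentences = ["Hello world."]
tokens, starts, ends = tokenizer.tokenize_with_offsets(sentences)
# For each token i of sentence 0, the byte range
# [starts[0][i], ends[0][i]) of sentences[0] (as UTF-8 bytes) is the span
# of the original string that produced tokens[0][i].
```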
vocab_size
vocab_size(
name=None
)
Returns the vocabulary size.
This is the number of tokens in the SentencePiece vocabulary that was provided at the time of initialization.
| Args | |
| --- | --- |
| name | The name argument that is passed to the op function. |

| Returns |
| --- |
| A scalar representing the vocabulary size. |
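For example, with the tokenizer built earlier (the printed value is illustrative; it depends on the vocabulary size chosen when the SentencePiece model was trained):

```
size = tokenizer.vocab_size()  # scalar int32 Tensor
print(int(size))  # e.g. 8000 for a model trained with a vocabulary of 8000
```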