Tokenizes a tensor of UTF-8 strings.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer
text.SentencepieceTokenizer(
    model=None,
    out_type=dtypes.int32,
    nbest_size=0,
    alpha=1.0,
    reverse=False,
    add_bos=False,
    add_eos=False,
    return_nbest=False,
    name=None
)
SentencePiece is an unsupervised text tokenizer and detokenizer. It is used mainly for neural-network-based text generation systems where the vocabulary size is fixed before model training. SentencePiece implements subword units and can be trained directly from raw sentences.
Before using the tokenizer, you will need to train a vocabulary and build a model configuration for it. See the SentencePiece repository for the most up-to-date instructions on this process.
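For illustration, here is a minimal sketch of constructing a tokenizer from a serialized SentencePiece model proto. The file path sp_model.model is a placeholder for a model you have trained separately:

```
import tensorflow as tf
import tensorflow_text as text

# Load the serialized SentencePiece model proto produced by training.
# "sp_model.model" is a placeholder path; substitute your own model file.
with tf.io.gfile.GFile("sp_model.model", "rb") as f:
    sp_model = f.read()

tokenizer = text.SentencepieceTokenizer(model=sp_model, out_type=tf.int32)
```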
Methods
detokenize
detokenize(
input, name=None
)
Detokenizes tokens into preprocessed text.
This function accepts tokenized text and reassembles it into sentences.
| Args | |
| --- | --- |
| input | A RaggedTensor or Tensor of UTF-8 string tokens with a rank of at least 1. |
| name | The name argument that is passed to the op function. |

| Returns |
| --- |
| An N-1 dimensional string Tensor or RaggedTensor of the detokenized text. |
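As a rough sketch (assuming the tokenizer instance constructed in the example above), detokenize reverses tokenize:

```
# Round-trip: tokenize a batch of sentences, then reassemble them.
token_ids = tokenizer.tokenize(["Hello world.", "Pieces fit together."])
sentences = tokenizer.detokenize(token_ids)
# sentences is a rank-1 string Tensor, one element per input sentence.
```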
id_to_string
id_to_string(
input, name=None
)
Converts vocabulary IDs into tokens.
| Args | |
| --- | --- |
| input | An arbitrary tensor of int32 representing the token IDs. |
| name | The name argument that is passed to the op function. |

| Returns |
| --- |
| A tensor of strings with the same shape as the input. |
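A small sketch, again assuming the tokenizer built earlier; the IDs shown are arbitrary examples, and which pieces they map to depends entirely on the trained model:

```
import tensorflow as tf

ids = tf.constant([0, 1, 2, 100], dtype=tf.int32)  # arbitrary example IDs
pieces = tokenizer.id_to_string(ids)  # string tensor with the same shape as ids
```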
split
split(
input
)
Alias for Tokenizer.tokenize.
split_with_offsets
split_with_offsets(
input
)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
string_to_id
string_to_id(
input, name=None
)
Converts tokens into vocabulary IDs.
This function is particularly helpful for determining the IDs of special tokens whose IDs cannot be determined through normal tokenization.
| Args | |
| --- | --- |
| input | An arbitrary tensor of string tokens. |
| name | The name argument that is passed to the op function. |

| Returns |
| --- |
| A tensor of int32 representing the IDs, with the same shape as the input. |
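For instance, assuming the tokenizer from the earlier example and a model trained with SentencePiece's default control symbols (an assumption; your model may use different pieces):

```
import tensorflow as tf

# Look up IDs of special pieces directly rather than by tokenizing text.
special = tf.constant(["<s>", "</s>", "<unk>"])
special_ids = tokenizer.string_to_id(special)  # int32, same shape as input
```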
tokenize
tokenize(
input, name=None
)
Tokenizes a tensor of UTF-8 strings.
| Args | |
| --- | --- |
| input | A RaggedTensor or Tensor of UTF-8 strings with any shape. |
| name | The name argument that is passed to the op function. |

| Returns |
| --- |
| A RaggedTensor of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for the tokens of each string. |
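Assuming the tokenizer constructed above, a rank-1 input of two strings yields a RaggedTensor with one added ragged dimension:

```
ids = tokenizer.tokenize(["Hello world.", "Subwords, not words."])
# ids has shape [2, None]: each row holds the token IDs of one sentence
# (or string pieces, if the tokenizer was built with out_type=tf.string).
```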
tokenize_with_offsets
tokenize_with_offsets(
input, name=None
)
Tokenizes a tensor of UTF-8 strings.
This function returns a tuple containing the tokens along with start and end byte offsets that mark where in the original string each token was located.
| Args | |
| --- | --- |
| input | A RaggedTensor or Tensor of UTF-8 strings with any shape. |
| name | The name argument that is passed to the op function. |

| Returns | |
| --- | --- |
| A tuple (tokens, start_offsets, end_offsets) where: | |
| tokens | An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. |
| start_offsets | An N+1-dimensional integer Tensor or RaggedTensor containing the starting indices of each token (byte indices for input strings). |
| end_offsets | An N+1-dimensional integer Tensor or RaggedTensor containing the exclusive ending indices of each token (byte indices for input strings). |
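A brief sketch, assuming the tokenizer from above; the offsets let you slice each token's bytes out of the original string:

```
sentences = ["Hello world."]
tokens, starts, ends = tokenizer.tokenize_with_offsets(sentences)
# For each token i of sentence 0, the byte range
# [starts[0][i], ends[0][i]) of sentences[0] (as UTF-8 bytes) is the span
# of the original string that produced tokens[0][i].
```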
vocab_size
vocab_size(
name=None
)
Returns the vocabulary size.
This is the number of tokens in the SentencePiece vocabulary that was provided at the time of initialization.
| Args | |
| --- | --- |
| name | The name argument that is passed to the op function. |

| Returns |
| --- |
| A scalar representing the vocabulary size. |
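For example, with the tokenizer built earlier (the printed value is illustrative; it depends on the vocabulary size chosen when the SentencePiece model was trained):

```
size = tokenizer.vocab_size()  # scalar int32 Tensor
print(int(size))  # e.g. 8000 for a model trained with a vocabulary of 8000
```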