text.utf8_binarize

Decode UTF-8 tokens into code points and return their bits.

See the RetVec paper for details.

Example:

import tensorflow_text as text

code_points = text.utf8_binarize("hello", word_length=3, bits_per_char=4)
print(code_points.numpy())
[0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1.]

The code points are encoded bitwise in little-endian order (least significant bit first). The inner dimension of the output is always word_length * bits_per_char: extra characters are truncated, missing characters are zero-padded, and only the bits_per_char lowest bits of each code point are stored.
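The bit layout described above can be reproduced in plain Python. This is only an illustrative sketch: binarize_reference is a hypothetical helper, not part of the library, and it assumes the input is an already-decoded Python string.

```python
from typing import List


def binarize_reference(token: str, word_length: int, bits_per_char: int) -> List[float]:
    """Hypothetical pure-Python illustration of the bit layout (not the library op)."""
    # Keep at most word_length characters; the rest are dropped.
    code_points = [ord(ch) for ch in token[:word_length]]
    # Pad short tokens with zero code points so the output length is fixed.
    code_points += [0] * (word_length - len(code_points))

    bits = []
    for cp in code_points:
        # Emit the bits_per_char lowest bits, least significant bit first.
        for i in range(bits_per_char):
            bits.append(float((cp >> i) & 1))
    return bits


print(binarize_reference("hello", word_length=3, bits_per_char=4))
# [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
```

For "hello", only "h" (104), "e" (101) and "l" (108) are kept; their four lowest bits are 1000, 0101 and 1100, which read least significant bit first give the twelve values shown in the example output.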

Decoding errors (which applications often replace with the character U+FFFD "REPLACEMENT CHARACTER", decimal 65533) are represented with the bits_per_char lowest bits of replacement_char.
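A rough plain-Python sketch of this substitution follows. binarize_bytes_reference is a hypothetical helper, and it relies on Python's own UTF-8 decoder to detect errors; how many replacements the actual op emits for a longer invalid byte sequence is an assumption here and may differ.

```python
from typing import List


def binarize_bytes_reference(raw: bytes, word_length: int, bits_per_char: int,
                             replacement_char: int = 0xFFFD) -> List[float]:
    """Hypothetical illustration of decoding-error handling (not the library op)."""
    # Python marks each invalid byte sequence with U+FFFD when errors="replace".
    decoded = raw.decode("utf-8", errors="replace")
    # Swap the marker for the caller's replacement_char. Note this sketch also
    # remaps a genuine U+FFFD that was present in the input.
    code_points = [replacement_char if ch == "\ufffd" else ord(ch)
                   for ch in decoded[:word_length]]
    code_points += [0] * (word_length - len(code_points))
    # Lowest bits_per_char bits of each code point, least significant bit first.
    return [float((cp >> i) & 1) for cp in code_points for i in range(bits_per_char)]


# b"\xff" is never valid in UTF-8, so the second character is a decoding error.
print(binarize_bytes_reference(b"a\xff", word_length=2, bits_per_char=4))
# [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
```

With the replacement code point 0xFFFD, the four lowest bits are 1101, so the invalid byte contributes 1, 0, 1, 1 in least-significant-first order.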

tokens A Tensor of tokens (strings) with any shape.
word_length Number of Unicode characters to process per word (the rest are silently ignored; the output is zero-padded).
bits_per_char The number of lowest bits of the Unicode codepoint to encode.
replacement_char The Unicode codepoint to use on decoding errors.
name The op name (optional).

A tensor of floating-point zero and one values corresponding to the bits of the token characters' Unicode code points.
Shape [<shape of tokens>, word_length * bits_per_char].