Decodes each string in `input` into a sequence of Unicode code points.
The character codepoints for all strings are returned using a single vector `char_values`, with strings expanded to characters in row-major order. Similarly, the character start byte offsets are returned using a single vector `char_to_byte_starts`, with strings expanded in row-major order.
The `row_splits` tensor indicates where the codepoints and start offsets for each input string begin and end within the `char_values` and `char_to_byte_starts` tensors. In particular, the values for the `i`th string (in row-major order) are stored in the slice `[row_splits[i]:row_splits[i+1]]`. Thus:
- `char_values[row_splits[i]+j]` is the Unicode codepoint for the `j`th character in the `i`th string (in row-major order).
- `char_to_byte_starts[row_splits[i]+j]` is the start byte offset for the `j`th character in the `i`th string (in row-major order).
- `row_splits[i+1] - row_splits[i]` is the number of characters in the `i`th string (in row-major order).
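The layout described above can be sketched in plain Java. This is only an illustration of how the three flattened outputs relate to each other; it does not call the TensorFlow op, it assumes well-formed input that is measured as UTF-8, and the names `RowSplitsSketch`, `decode`, and `utf8Length` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of the flattened output layout; not the TensorFlow op. */
final class RowSplitsSketch {
    final int[] charValues;        // code points of all strings, concatenated
    final long[] charToByteStarts; // UTF-8 start offset of each code point within its own string
    final long[] rowSplits;        // string i owns indices [rowSplits[i], rowSplits[i + 1])

    private RowSplitsSketch(int[] charValues, long[] charToByteStarts, long[] rowSplits) {
        this.charValues = charValues;
        this.charToByteStarts = charToByteStarts;
        this.rowSplits = rowSplits;
    }

    /** Number of bytes a code point occupies when encoded as UTF-8. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /** Flattens the input strings (assumed well-formed) into the three vectors. */
    static RowSplitsSketch decode(String[] input) {
        List<Integer> values = new ArrayList<>();
        List<Long> starts = new ArrayList<>();
        long[] splits = new long[input.length + 1];
        for (int i = 0; i < input.length; i++) {
            String s = input[i];
            long byteOffset = 0; // offsets restart at 0 for every string
            int k = 0;
            while (k < s.length()) {
                int cp = s.codePointAt(k);
                values.add(cp);
                starts.add(byteOffset);
                byteOffset += utf8Length(cp);
                k += Character.charCount(cp);
            }
            splits[i + 1] = values.size(); // end of string i, start of string i + 1
        }
        return new RowSplitsSketch(
            values.stream().mapToInt(Integer::intValue).toArray(),
            starts.stream().mapToLong(Long::longValue).toArray(),
            splits);
    }

    public static void main(String[] args) {
        RowSplitsSketch r = decode(new String[] {"a\u00e9", "", "\ud83d\ude00b"});
        System.out.println(Arrays.toString(r.charValues));
        System.out.println(Arrays.toString(r.charToByteStarts));
        System.out.println(Arrays.toString(r.rowSplits));
    }
}
```

In this sketch an empty string contributes no characters, so its two neighbouring `rowSplits` entries are equal, and `rowSplits[i+1] - rowSplits[i]` gives the character count of string `i`.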
Nested Classes
class | UnicodeDecodeWithOffsets.Options | Optional attributes for UnicodeDecodeWithOffsets |
Constants
String | OP_NAME | The name of this op, as known by the TensorFlow core engine |
Public Methods
Output<TInt64> | charToByteStarts() | A 1D int64 Tensor containing the byte index in the input string where each character in `char_values` starts. |
Output<TInt32> | charValues() | A 1D int32 Tensor containing the decoded codepoints. |
static UnicodeDecodeWithOffsets<TInt64> | create(Scope scope, Operand<TString> input, String inputEncoding, Options... options) | Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation using default output types. |
static <T extends TNumber> UnicodeDecodeWithOffsets<T> | create(Scope scope, Operand<TString> input, String inputEncoding, Class<T> Tsplits, Options... options) | Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation. |
static UnicodeDecodeWithOffsets.Options | errors(String errors) |
static UnicodeDecodeWithOffsets.Options | replaceControlCharacters(Boolean replaceControlCharacters) |
static UnicodeDecodeWithOffsets.Options | replacementChar(Long replacementChar) |
Output<T> | rowSplits() | A 1D tensor of type `T` containing the row splits. |
Constants
public static final String OP_NAME
The name of this op, as known by the TensorFlow core engine
Public Methods
public Output<TInt64> charToByteStarts ()
A 1D int64 Tensor containing the byte index in the input string where each character in `char_values` starts.
public static UnicodeDecodeWithOffsets<TInt64> create (Scope scope, Operand<TString> input, String inputEncoding, Options... options)
Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation using default output types.
Parameters
scope | current scope |
---|---|
input | The text to be decoded. Can have any shape. Note that the output is flattened to a vector of char values. |
inputEncoding | Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: `"UTF-16", "US ASCII", "UTF-8"`. |
options | carries optional attributes values |
Returns
- a new instance of UnicodeDecodeWithOffsets
public static UnicodeDecodeWithOffsets<T> create (Scope scope, Operand<TString> input, String inputEncoding, Class<T> Tsplits, Options... options)
Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation.
Parameters
scope | current scope |
---|---|
input | The text to be decoded. Can have any shape. Note that the output is flattened to a vector of char values. |
inputEncoding | Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: `"UTF-16", "US ASCII", "UTF-8"`. |
Tsplits | The element type of the `row_splits` output tensor. |
options | carries optional attributes values |
Returns
- a new instance of UnicodeDecodeWithOffsets
public static UnicodeDecodeWithOffsets.Options errors (String errors)
Parameters
errors | Error handling policy when invalid formatting is found in the input. The value 'strict' causes the operation to produce an InvalidArgument error on any invalid input formatting. The value 'replace' (the default) causes the operation to replace any invalid formatting in the input with the `replacement_char` codepoint. The value 'ignore' causes the operation to skip any invalid formatting in the input and produce no corresponding output character. |
---|
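The three policies have a rough analogue in the JDK's `CharsetDecoder`, which can help build intuition. The sketch below is not how the op is implemented (the op relies on ICU converters); it assumes UTF-8 input and maps 'strict', 'replace', and 'ignore' onto `CodingErrorAction.REPORT`, `REPLACE`, and `IGNORE`. The names `ErrorPolicySketch` and `decodeUtf8` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** JDK-only analogy for the 'strict' / 'replace' / 'ignore' policies; not the TensorFlow op. */
final class ErrorPolicySketch {
    static String decodeUtf8(byte[] bytes, String errors) {
        CodingErrorAction action;
        switch (errors) {
            case "strict":  action = CodingErrorAction.REPORT;  break; // fail on invalid input
            case "replace": action = CodingErrorAction.REPLACE; break; // substitute U+FFFD
            case "ignore":  action = CodingErrorAction.IGNORE;  break; // drop invalid bytes
            default: throw new IllegalArgumentException("unknown policy: " + errors);
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(action)
            .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Stand-in for the op's InvalidArgument error under 'strict'.
            throw new IllegalArgumentException("invalid input formatting", e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = {0x61, (byte) 0xFF, 0x62}; // 'a', an invalid UTF-8 byte, 'b'
        System.out.println(decodeUtf8(bad, "replace"));
        System.out.println(decodeUtf8(bad, "ignore"));
    }
}
```

With 'replace', the JDK decoder substitutes its own default replacement (U+FFFD); the op's `replacement_char` option plays the corresponding role there.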
public static UnicodeDecodeWithOffsets.Options replaceControlCharacters (Boolean replaceControlCharacters)
Parameters
replaceControlCharacters | Whether to replace the C0 control characters (00-1F) with the `replacement_char`. Default is false. |
---|
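As a rough picture of what this option does to the decoded output, the following plain-Java sketch replaces every code point in the C0 range with a replacement code point. It assumes the replacement applies uniformly to the whole range 00-1F (tab and newline included), which is an assumption drawn from the stated range rather than from the op's implementation. `ControlCharSketch` and `replaceC0` are hypothetical names.

```java
import java.util.Arrays;

/** Illustration of C0 control-character replacement on decoded code points; not the TensorFlow op. */
final class ControlCharSketch {
    /** Returns a copy in which every code point in 0x00..0x1F is replaced. */
    static int[] replaceC0(int[] codePoints, int replacementChar) {
        int[] out = codePoints.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] >= 0x00 && out[i] <= 0x1F) {
                out[i] = replacementChar;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] decoded = {0x41, 0x09, 0x1F, 0x20}; // 'A', TAB, UNIT SEPARATOR, SPACE
        System.out.println(Arrays.toString(replaceC0(decoded, 0xFFFD)));
    }
}
```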
public static UnicodeDecodeWithOffsets.Options replacementChar (Long replacementChar)
Parameters
replacementChar | The replacement character codepoint to be used in place of any invalid formatting in the input when `errors='replace'`. Any valid Unicode codepoint may be used. The default value is the Unicode replacement character, U+FFFD (0xFFFD, decimal 65533). |
---|