UnicodeDecodeWithOffsets

public final class UnicodeDecodeWithOffsets

Decodes each string in `input` into a sequence of Unicode code points.

The character codepoints for all strings are returned using a single vector `char_values`, with strings expanded to characters in row-major order. Similarly, the character start byte offsets are returned using a single vector `char_to_byte_starts`, with strings expanded in row-major order.

The `row_splits` tensor indicates where the codepoints and start offsets for each input string begin and end within the `char_values` and `char_to_byte_starts` tensors. In particular, the values for the `i`th string (in row-major order) are stored in the slice `[row_splits[i]:row_splits[i+1]]`. Thus:

`char_values[row_splits[i]+j]` is the Unicode codepoint for the `j`th character in the `i`th string (in row-major order).
`char_to_byte_starts[row_splits[i]+j]` is the start byte offset for the `j`th character in the `i`th string (in row-major order).
`row_splits[i+1] - row_splits[i]` is the number of characters in the `i`th string (in row-major order).
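
The layout described above can be illustrated without TensorFlow. The following is a minimal plain-Java sketch, not the op's implementation: the class `RowSplitsSketch` and its methods are hypothetical names, the input is assumed to be an already flattened (row-major) array of strings, and byte offsets are computed for UTF-8 only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of the flattened output layout; not the TensorFlow kernel. */
final class RowSplitsSketch {
    final int[] charValues;        // code points of all strings, concatenated
    final long[] charToByteStarts; // start byte offset of each code point within its own string
    final long[] rowSplits;        // rowSplits[i]..rowSplits[i+1] delimits string i

    private RowSplitsSketch(int[] charValues, long[] charToByteStarts, long[] rowSplits) {
        this.charValues = charValues;
        this.charToByteStarts = charToByteStarts;
        this.rowSplits = rowSplits;
    }

    /** Number of bytes a code point occupies when encoded as UTF-8. */
    static int utf8Length(int cp) {
        return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }

    /** Builds the three vectors for an array of strings (assumed UTF-8 byte offsets). */
    static RowSplitsSketch decode(String[] inputs) {
        List<Integer> values = new ArrayList<>();
        List<Long> starts = new ArrayList<>();
        long[] splits = new long[inputs.length + 1];
        for (int i = 0; i < inputs.length; i++) {
            long byteOffset = 0; // offsets restart at 0 for every string
            for (int k = 0; k < inputs[i].length(); ) {
                int cp = inputs[i].codePointAt(k);
                values.add(cp);
                starts.add(byteOffset);
                byteOffset += utf8Length(cp);
                k += Character.charCount(cp);
            }
            splits[i + 1] = values.size();
        }
        return new RowSplitsSketch(
            values.stream().mapToInt(Integer::intValue).toArray(),
            starts.stream().mapToLong(Long::longValue).toArray(),
            splits);
    }

    /** Number of characters in the i-th string. */
    long length(int i) {
        return rowSplits[i + 1] - rowSplits[i];
    }

    /** Code point of the j-th character in the i-th string. */
    int charValue(int i, int j) {
        return charValues[(int) rowSplits[i] + j];
    }

    /** Start byte offset of the j-th character in the i-th string. */
    long byteStart(int i, int j) {
        return charToByteStarts[(int) rowSplits[i] + j];
    }

    public static void main(String[] args) {
        // "héllo" and "日本", written with escapes to avoid source-encoding issues.
        RowSplitsSketch r = decode(new String[] {"h\u00e9llo", "\u65e5\u672c"});
        System.out.println(Arrays.toString(r.rowSplits));
        System.out.println(Arrays.toString(r.charToByteStarts));
        System.out.println(Arrays.toString(r.charValues));
    }
}
```

For the two strings in `main`, the first string has five characters and the second has two, so the row splits are `[0, 5, 7]`. The second character of the first string is two bytes long in UTF-8, which is why the byte offsets for that string jump from 1 to 3.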

Nested Classes

class UnicodeDecodeWithOffsets.Options Optional attributes for UnicodeDecodeWithOffsets

Constants

String OP_NAME The name of this op, as known by the TensorFlow core engine

Public Methods

Output<TInt64>	charToByteStarts() A 1D int64 Tensor containing the byte index in the input string where each character in `char_values` starts.
Output<TInt32>	charValues() A 1D int32 Tensor containing the decoded codepoints.
static UnicodeDecodeWithOffsets<TInt64>	create(Scope scope, Operand<TString> input, String inputEncoding, Options... options) Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation using default output types.
static <T extends TNumber> UnicodeDecodeWithOffsets<T>	create(Scope scope, Operand<TString> input, String inputEncoding, Class<T> Tsplits, Options... options) Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation.
static UnicodeDecodeWithOffsets.Options	errors(String errors)
static UnicodeDecodeWithOffsets.Options	replaceControlCharacters(Boolean replaceControlCharacters)
static UnicodeDecodeWithOffsets.Options	replacementChar(Long replacementChar)
Output<T>	rowSplits() A 1D tensor of type `T` (int32 or int64) containing the row splits.

Inherited Methods

From class org.tensorflow.op.RawOp

final boolean	equals(Object obj)
final int	hashCode()
Operation	op() Return this unit of computation as a single `Operation`.
final String	toString()

From class java.lang.Object

boolean	equals(Object arg0)
final Class<?>	getClass()
int	hashCode()
final void	notify()
final void	notifyAll()
String	toString()
final void	wait(long arg0, int arg1)
final void	wait(long arg0)
final void	wait()

From interface org.tensorflow.op.Op

abstract ExecutionEnvironment	env() Return the execution environment this op was created in.
abstract Operation	op() Return this unit of computation as a single `Operation`.

Constants

public static final String OP_NAME

The name of this op, as known by the TensorFlow core engine

Constant Value: "UnicodeDecodeWithOffsets"

Public Methods

public Output<TInt64> charToByteStarts ()

A 1D int64 Tensor containing the byte index in the input string where each character in `char_values` starts.

public Output<TInt32> charValues ()

A 1D int32 Tensor containing the decoded codepoints.

public static UnicodeDecodeWithOffsets<TInt64> create (Scope scope, Operand<TString> input, String inputEncoding, Options... options)

Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation using default output types.

Parameters

scope	current scope
input	The text to be decoded. Can have any shape. Note that the output is flattened to a vector of char values.
inputEncoding	Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: `"UTF-16", "US ASCII", "UTF-8"`.
options	carries optional attribute values

Returns

a new instance of UnicodeDecodeWithOffsets

public static <T extends TNumber> UnicodeDecodeWithOffsets<T> create (Scope scope, Operand<TString> input, String inputEncoding, Class<T> Tsplits, Options... options)

Factory method to create a class wrapping a new UnicodeDecodeWithOffsets operation.

Parameters

scope	current scope
input	The text to be decoded. Can have any shape. Note that the output is flattened to a vector of char values.
inputEncoding	Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: `"UTF-16", "US ASCII", "UTF-8"`.
Tsplits	The type of the `row_splits` output (int32 or int64).
options	carries optional attribute values

Returns

a new instance of UnicodeDecodeWithOffsets

public static UnicodeDecodeWithOffsets.Options errors (String errors)

Parameters

errors	Error handling policy used when invalid formatting is found in the input. A value of 'strict' causes the operation to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' (the default) causes the operation to replace any invalid formatting in the input with the `replacement_char` codepoint. A value of 'ignore' causes the operation to skip any invalid formatting in the input and produce no corresponding output character.
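
The three policies have close analogues in the JDK's own `CharsetDecoder`, which can make their effect easier to see. The sketch below is only an analogy under stated assumptions: `ErrorPolicySketch` is a hypothetical helper, it handles UTF-8 only, and the JDK decoder may group invalid bytes differently from the ICU converters the op uses. An unchecked `IllegalArgumentException` stands in for the op's InvalidArgument error.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Analogy for the 'strict' / 'replace' / 'ignore' policies using the JDK decoder. */
final class ErrorPolicySketch {
    static String decode(byte[] bytes, String errors) {
        CodingErrorAction action;
        switch (errors) {
            case "strict":  action = CodingErrorAction.REPORT;  break; // fail on invalid input
            case "replace": action = CodingErrorAction.REPLACE; break; // substitute U+FFFD
            case "ignore":  action = CodingErrorAction.IGNORE;  break; // drop invalid input
            default: throw new IllegalArgumentException("unknown policy: " + errors);
        }
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            // Only reachable with REPORT, i.e. the 'strict' policy.
            throw new IllegalArgumentException("invalid input formatting", e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF is never valid in UTF-8
        System.out.println(decode(bad, "replace").length());
        System.out.println(decode(bad, "ignore").length());
    }
}
```

With the input in `main`, 'replace' keeps three characters (the middle one being U+FFFD), 'ignore' keeps two, and 'strict' throws.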

public static UnicodeDecodeWithOffsets.Options replaceControlCharacters (Boolean replaceControlCharacters)

Parameters

replaceControlCharacters	Whether to replace the C0 control characters (00-1F) with the `replacement_char`. Default is false.
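
As a rough illustration of what this option does to already decoded code points, the following plain-Java sketch replaces every value in the C0 range with a replacement code point. `ControlCharSketch` and its method are hypothetical names for illustration; this is not the op's implementation.

```java
/** Sketch of C0 control character replacement on decoded code points. */
final class ControlCharSketch {
    // Unicode replacement character U+FFFD (decimal 65533).
    static final int DEFAULT_REPLACEMENT_CHAR = 0xFFFD;

    /** Returns a copy in which every code point in 0x00..0x1F is replaced. */
    static int[] replaceControlCharacters(int[] codePoints, int replacementChar) {
        int[] out = codePoints.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] >= 0x00 && out[i] <= 0x1F) {
                out[i] = replacementChar;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 'A', TAB, unit separator (0x1F), space: only the two middle values change.
        int[] decoded = {0x41, 0x09, 0x1F, 0x20};
        int[] cleaned = replaceControlCharacters(decoded, DEFAULT_REPLACEMENT_CHAR);
        System.out.println(java.util.Arrays.toString(cleaned));
    }
}
```

Note that the space character (0x20) lies just outside the C0 range and is left untouched.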

public static UnicodeDecodeWithOffsets.Options replacementChar (Long replacementChar)

Parameters

replacementChar	The replacement character codepoint to be used in place of any invalid formatting in the input when `errors='replace'`. Any valid Unicode codepoint may be used. The default value is the Unicode replacement character, U+FFFD (decimal 65533).

public Output<T> rowSplits ()

A 1D tensor of type `T` (int32 or int64) containing the row splits.