Computes the unique values of `x` over the whole dataset.
```python
tft.vocabulary(
    x: common_types.TensorType,
    *,
    top_k: Optional[int] = None,
    frequency_threshold: Optional[int] = None,
    vocab_filename: Optional[str] = None,
    store_frequency: Optional[bool] = False,
    reserved_tokens: Optional[Union[Sequence[str], tf.Tensor]] = None,
    weights: Optional[tf.Tensor] = None,
    labels: Optional[Union[tf.Tensor, tf.SparseTensor]] = None,
    use_adjusted_mutual_info: bool = False,
    min_diff_from_avg: Optional[int] = None,
    coverage_top_k: Optional[int] = None,
    coverage_frequency_threshold: Optional[int] = None,
    key_fn: Optional[Callable[[Any], Any]] = None,
    fingerprint_shuffle: Optional[bool] = False,
    file_format: common_types.VocabularyFileFormatType = DEFAULT_VOCABULARY_FILE_FORMAT,
    name: Optional[str] = None
) -> common_types.TemporaryAnalyzerOutputType
```
Computes the unique values taken by `x`, which can be a `Tensor`, `SparseTensor`, or `RaggedTensor` of any size. The unique values are aggregated over all dimensions of `x` and all instances.

If `file_format` is 'text' and a token is empty or contains the '\n' or '\r' characters, it is discarded.
If an integer `Tensor` is provided, its semantic type should be categorical, not continuous/numeric, since computing a vocabulary over a continuous feature is not appropriate.

The unique values are sorted by decreasing frequency and then by reverse lexicographical order (e.g. [('a', 5), ('c', 3), ('b', 3)]). This is true even if `x` has a numerical dtype (e.g. [('3', 5), ('2', 3), ('111', 3)]).
For large datasets it is highly recommended to set either `frequency_threshold` or `top_k`, to control both the size of the output vocabulary and the run time of this operation.
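The ordering and the `top_k`/`frequency_threshold` filters can be sketched in plain Python. This is only an illustration of the documented semantics on in-memory data (the function name is hypothetical); the real analyzer computes this over the full dataset in a pipeline:

```python
from collections import Counter
from typing import List, Optional, Tuple

def vocabulary_sketch(tokens: List[str],
                      top_k: Optional[int] = None,
                      frequency_threshold: Optional[int] = None
                      ) -> List[Tuple[str, int]]:
    """Order unique tokens by decreasing frequency, breaking ties by
    reverse lexicographical order, then apply the two size filters."""
    counts = Counter(tokens)
    # reverse=True on the (frequency, token) pair yields decreasing
    # frequency with reverse-lexicographic tie-breaking.
    entries = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    if frequency_threshold is not None:
        entries = [(t, c) for t, c in entries if c >= frequency_threshold]
    if top_k is not None:
        entries = entries[:top_k]
    return entries

# Ties between 'c' and 'b' resolve in reverse lexicographic order.
print(vocabulary_sketch(['a'] * 5 + ['b'] * 3 + ['c'] * 3))
# [('a', 5), ('c', 3), ('b', 3)]
```

Note that this ordering is applied to the string representation even for numeric dtypes, so '2' sorts after '111'.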
When labels are provided, the vocabulary is filtered based on the relationship between each token's presence in a record and that record's label, using (possibly adjusted) mutual information. Note: if labels are provided, the `x` input must contain a unique set of values per record, as the semantics of the mutual information calculation depend on a multi-hot representation of the input. Having unique input tokens per row is advisable but not required for a frequency-based vocabulary.
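A minimal sketch of the unadjusted mutual-information score between a token's multi-hot presence and a discrete label. The function name is hypothetical, and this omits the adjustments, regularization, and distributed computation the real analyzer applies:

```python
import math
from collections import Counter

def token_label_mi(records, labels):
    """Mutual information I(presence; label) per token, where presence
    is the multi-hot indicator of the token appearing in a record."""
    n = len(records)
    vocab = {t for r in records for t in r}
    label_counts = Counter(labels)
    scores = {}
    for tok in vocab:
        presence = [tok in set(r) for r in records]
        joint = Counter(zip(presence, labels))
        presence_counts = Counter(presence)
        mi = 0.0
        for (x, y), count in joint.items():
            p_xy = count / n
            p_x = presence_counts[x] / n
            p_y = label_counts[y] / n
            mi += p_xy * math.log(p_xy / (p_x * p_y))
        scores[tok] = mi
    return scores

# 'hot' and 'cold' perfectly predict the label; 'the' carries none.
scores = token_label_mi(
    [['hot', 'the'], ['hot', 'the'], ['cold', 'the'], ['cold', 'the']],
    [1, 1, 0, 0])
```

A token present in every record ('the' above) gets a score of zero, which is why mutual information filters out uninformative high-frequency tokens that a plain frequency threshold would keep.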
Supply `key_fn` if you would like to generate a vocabulary with coverage over specific keys.

A "coverage vocabulary" is the union of two vocabulary "arms". The "standard arm" of the vocabulary is equivalent to the one generated by the same function call with no coverage arguments. Adding coverage only appends additional entries to the end of the standard vocabulary.

The "coverage arm" of the vocabulary is determined by taking the `coverage_top_k` most frequent unique terms per key. A term's key is obtained by applying `key_fn` to the term. Use `coverage_frequency_threshold` to lower-bound the frequency of entries in the coverage arm of the vocabulary.

Note: this is currently implemented only for the case where the key is contained within each vocabulary entry (b/117796748).
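The union of the two arms can be sketched as follows. This is an illustration of the documented semantics on in-memory data, with a hypothetical function name; it ignores `coverage_frequency_threshold` and the other filters for brevity:

```python
from collections import Counter, defaultdict

def coverage_vocabulary(tokens, top_k, coverage_top_k, key_fn):
    """Standard arm (global top_k by frequency) plus a coverage arm:
    the coverage_top_k most frequent terms per key, appended at the end."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda t: (counts[t], t), reverse=True)
    standard = ordered[:top_k]
    per_key = defaultdict(list)
    for term in ordered:  # already in global frequency order
        per_key[key_fn(term)].append(term)
    coverage = [t for terms in per_key.values() for t in terms[:coverage_top_k]]
    # Coverage only appends entries missing from the standard arm.
    return standard + [t for t in coverage if t not in standard]

# Keys are the first character; 'b1' is rare but its key stays covered.
vocab = coverage_vocabulary(
    ['a1'] * 5 + ['a2'] * 4 + ['a3'] * 3 + ['b1'] * 1,
    top_k=2, coverage_top_k=1, key_fn=lambda t: t[0])
```

Here the standard arm is ['a1', 'a2'], and the coverage arm appends 'b1' so that the 'b' key is represented even though 'b1' falls below the global `top_k` cutoff.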
| Args | |
|---|---|
| `x` | A categorical/discrete input `Tensor`, `SparseTensor`, or `RaggedTensor` with dtype tf.string or tf.int8/16/32/64. The inputs should generally be unique per row (i.e. a bag-of-words/ngrams representation). |
| `top_k` | Limit the generated vocabulary to the first `top_k` elements. If set to None, the full vocabulary is generated. |
| `frequency_threshold` | Limit the generated vocabulary to elements whose absolute frequency is >= the supplied threshold. If set to None, the full vocabulary is generated. Absolute frequency is the number of occurrences of the element in the dataset, as opposed to the proportion of instances that contain it. |
| `vocab_filename` | The file name for the vocabulary file. If None, a file name is chosen based on the current scope. If not None, it should be unique within a given preprocessing function. NOTE: to make your pipelines resilient to implementation details, set `vocab_filename` explicitly whenever the vocabulary file is consumed by a downstream component. |
| `store_frequency` | If True, the frequency of each word is stored in the vocabulary file. If labels are provided, the mutual information is stored instead. Each line in the file has the form 'frequency word'. NOTE: if this is True, the computed vocabulary cannot be used with `tft.apply_vocabulary` directly, since frequencies are prepended to each row of the vocabulary and the mapper will not ignore them. |
| `reserved_tokens` | (Optional) A list of tokens that should appear in the vocabulary regardless of whether they appear in the input. These tokens keep their order and occupy reserved slots at the beginning of the vocabulary. Note: this field has no effect on caching. |
| `weights` | (Optional) Weights `Tensor` for the vocabulary. It must have the same shape as `x`. |
| `labels` | (Optional) A dense labels `Tensor` for the vocabulary. If provided, the vocabulary is computed based on mutual information with the label rather than on frequency. The labels must have the same batch dimension as `x`. If `x` is sparse, labels should be a 1-D tensor of row-wise labels. If `x` is dense, labels can be either a 1-D tensor of row-wise labels or a dense tensor of the same shape as `x` (element-wise labels). Labels should be a discrete, integerized tensor (bucketize numeric labels first; apply an integer vocabulary to string labels first). Note: `CompositeTensor` labels are not yet supported (b/134931826). WARNING: when labels are provided, the `frequency_threshold` argument functions as a mutual information threshold, which is a float. |
| `use_adjusted_mutual_info` | If True and labels are provided, compute the vocabulary using adjusted rather than raw mutual information. |
| `min_diff_from_avg` | The MI (or AMI) of a feature/label pair is adjusted to zero whenever the difference between the observed count and the expected (average) count is lower than `min_diff_from_avg`. This can be thought of as a regularizing parameter that pushes small MI/AMI values to zero. If None, a default is selected based on the size of the dataset (see `calculate_recommended_min_diff_from_avg`). |
| `coverage_top_k` | (Optional, experimental) The minimum number of elements per key to be included in the vocabulary. |
| `coverage_frequency_threshold` | (Optional, experimental) Limit the coverage arm of the vocabulary to elements whose absolute frequency is >= this threshold for a given key. |
| `key_fn` | (Optional, experimental) A function that takes a single entry of `x` and returns the corresponding key for coverage computation. If None, no coverage arm is added to the vocabulary. |
| `fingerprint_shuffle` | (Optional, experimental) Whether to sort the vocabulary by fingerprint instead of by count. This is useful for load balancing on the training parameter servers. The shuffle happens only while writing the files, so all the filters above (`top_k`, `frequency_threshold`, etc.) still take effect. |
| `file_format` | (Optional) A str; the format of the resulting vocabulary file. Accepted formats are 'tfrecord_gzip' and 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'. |
| `name` | (Optional) A name for this operation. |
| Returns | |
|---|---|
| The path name of the vocabulary file containing the unique values of `x`. |