tfa.text.skip_gram_sample

Generates skip-gram token and label paired Tensors from the input tensor.

Generates skip-gram ("token", "label") pairs using each element in the rank-1 input_tensor as a token. The window size used for each token will be randomly selected from the range specified by [min_skips, max_skips], inclusive. See https://arxiv.org/abs/1301.3781 for more details about skip-gram.

For example, given input_tensor = ["the", "quick", "brown", "fox", "jumps"], min_skips = 1, max_skips = 2, emit_self_as_target = False, the output (tokens, labels) pairs for the token "quick" will be one of two results, chosen at random: (tokens=["quick", "quick"], labels=["the", "brown"]) for 1 skip, or (tokens=["quick", "quick", "quick"], labels=["the", "brown", "fox"]) for 2 skips. The 2-skip case yields only three labels because "quick" has a single neighbor to its left, so the window is truncated at the start of the tensor.

If emit_self_as_target = True, each token will also be emitted as a label for itself. Continuing the previous example, the output for "quick" will be either (tokens=["quick", "quick", "quick"], labels=["the", "quick", "brown"]) for 1 skip, or (tokens=["quick", "quick", "quick", "quick"], labels=["the", "quick", "brown", "fox"]) for 2 skips.
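The two examples above can be reproduced with a small pure-Python sketch. It is not the library's implementation: `pairs_for_token` is a hypothetical helper that takes the window size as a fixed argument instead of sampling it, and it only illustrates how the window is clipped at the tensor boundaries and how emit_self_as_target changes the labels.

```python
def pairs_for_token(tokens, index, skips, emit_self_as_target=False):
    """Return (tokens, labels) lists for one position and a fixed window size.

    Hypothetical helper for illustration; the real op samples `skips`
    from [min_skips, max_skips] for every token.
    """
    lo = max(0, index - skips)                # clip window at the start
    hi = min(len(tokens) - 1, index + skips)  # clip window at the end
    out_tokens, out_labels = [], []
    for j in range(lo, hi + 1):
        if j == index and not emit_self_as_target:
            continue  # skip the token itself unless explicitly requested
        out_tokens.append(tokens[index])
        out_labels.append(tokens[j])
    return out_tokens, out_labels


sentence = ["the", "quick", "brown", "fox", "jumps"]
print(pairs_for_token(sentence, 1, skips=1))
# (['quick', 'quick'], ['the', 'brown'])
print(pairs_for_token(sentence, 1, skips=2))
# (['quick', 'quick', 'quick'], ['the', 'brown', 'fox'])
print(pairs_for_token(sentence, 1, skips=2, emit_self_as_target=True))
# (['quick', 'quick', 'quick', 'quick'], ['the', 'quick', 'brown', 'fox'])
```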

The same process is repeated for each element of input_tensor and concatenated together into the two output rank-1 Tensors (one for all the tokens, another for all the labels).
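As a rough illustration of the per-element repetition and concatenation, the sketch below walks a Python list, draws one window size per token, and appends the resulting pairs to two flat output lists. The function name `skip_gram_pairs` and the use of Python's `random` module are assumptions made for this example; the random stream will not match the one used by the TensorFlow op.

```python
import random


def skip_gram_pairs(tokens, min_skips, max_skips,
                    emit_self_as_target=False, seed=None):
    """Pure-Python sketch of the documented behavior (not the real op)."""
    rng = random.Random(seed)
    out_tokens, out_labels = [], []
    for i, token in enumerate(tokens):
        # One window size per token, drawn from the inclusive range.
        skips = rng.randint(min_skips, max_skips)
        lo = max(0, i - skips)
        hi = min(len(tokens) - 1, i + skips)
        for j in range(lo, hi + 1):
            if j == i and not emit_self_as_target:
                continue
            out_tokens.append(token)
            out_labels.append(tokens[j])
    # Two flat, equally long lists: one of tokens, one of labels.
    return out_tokens, out_labels


# With min_skips == max_skips the result is deterministic.
toks, labs = skip_gram_pairs(["the", "quick", "brown"], min_skips=1, max_skips=1)
print(list(zip(toks, labs)))
```

The two lists always have the same length, and position k of one corresponds to position k of the other, mirroring the paired rank-1 output Tensors.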

If vocab_freq_table is specified, tokens in input_tensor that are not present in the vocabulary are discarded. Tokens whose frequency counts are below vocab_min_count are also discarded. Tokens whose frequency proportions in the corpus exceed vocab_subsampling may be randomly down-sampled. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details about subsampling.
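The filtering rules can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the library's code: a `dict` stands in for the lookup table, `filter_tokens` and `discard_probability` are hypothetical names, and the discard probability is Eq. 5 as written in the paper, P = 1 - sqrt(t / f). The exact probability computed by the op may differ from this form.

```python
import math
import random


def discard_probability(count, corpus_size, threshold):
    """Eq. 5 of the cited paper: P(discard) = 1 - sqrt(t / f), floored at 0."""
    freq = count / corpus_size  # frequency proportion of the token
    return max(0.0, 1.0 - math.sqrt(threshold / freq))


def filter_tokens(tokens, vocab_counts, min_count=None,
                  subsampling=None, corpus_size=None, seed=None):
    """Apply the three documented filters to a list of tokens (sketch only)."""
    rng = random.Random(seed)
    kept = []
    for tok in tokens:
        count = vocab_counts.get(tok)
        if count is None:
            continue  # not in the vocabulary
        if min_count is not None and count < min_count:
            continue  # too rare
        if subsampling is not None:
            p = discard_probability(count, corpus_size, subsampling)
            if rng.random() < p:
                continue  # frequent token, randomly down-sampled
        kept.append(tok)
    return kept


counts = {"the": 50, "fox": 5, "rare": 1}
print(filter_tokens(["the", "zzz", "fox", "rare"], counts, min_count=2))
# ['the', 'fox']
```

Tokens at or below the threshold frequency get a discard probability of 0 and are always kept; only tokens above it are thinned out.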

input_tensor: A rank-1 Tensor from which to generate skip-gram candidates.
min_skips: int or scalar Tensor specifying the minimum window size to randomly use for each token. Must be >= 0 and <= max_skips. If min_skips and max_skips are both 0, the only label output will be the token itself when emit_self_as_target = True; otherwise there is no output.
max_skips: int or scalar Tensor specifying the maximum window size to randomly use for each token. Must be >= 0.
start: int or scalar Tensor specifying the position in input_tensor from which to start generating skip-gram candidates.
limit: int or scalar Tensor specifying the maximum number of elements in input_tensor to use in generating skip-gram candidates. -1 means to use the rest of the Tensor after start.
emit_self_as_target: bool or scalar Tensor specifying whether to emit each token as a label for itself.
vocab_freq_table: (Optional) A lookup table (subclass of lookup.InitializableLookupTableBase) that maps tokens to their raw frequency counts. If specified, any token in input_tensor that is not found in vocab_freq_table will be filtered out before generating skip-gram candidates. While this will typically map to integer raw frequency counts, it could also map to float frequency proportions. vocab_min_count and corpus_size should be in the same units as this.
vocab_min_count: (Optional) int, float, or scalar Tensor specifying the minimum frequency threshold (from vocab_freq_table) for a token to be kept in input_tensor. If this is specified, vocab_freq_table must also be specified, and both should be in the same units.
vocab_subsampling: (Optional) float specifying the frequency proportion threshold for tokens from input_tensor. Tokens that occur more frequently (based on the ratio of the token's vocab_freq_table value to the corpus_size) will be randomly down-sampled. Reasonable starting values may be around 1e-3 or 1e-5. If this is specified, both vocab_freq_table and corpus_size must also be specified. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details.
corpus_size: (Optional) int, float, or scalar Tensor specifying the total number of tokens in the corpus (e.g., sum of all the frequency counts of vocab_freq_table). Used with vocab_subsampling for down-sampling frequently occurring tokens. If this is specified, vocab_freq_table and vocab_subsampling must also be specified.
seed: (Optional) int used to create a random seed for window size and subsampling. See set_random_seed docs for behavior.
name: (Optional) A string name or a name scope for the operations.
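The interaction of start and limit can be pictured as a slice over the input. The sketch below uses a hypothetical helper, `candidate_slice`, on a plain Python list, and assumes the slice is taken before any windows are formed, so labels would also come only from the selected range. The default values shown are chosen for the example.

```python
def candidate_slice(tokens, start=0, limit=-1):
    """Select the elements used for skip-gram generation (illustrative only)."""
    if limit == -1:
        return tokens[start:]           # everything from `start` onward
    return tokens[start:start + limit]  # at most `limit` elements


sentence = ["the", "quick", "brown", "fox", "jumps"]
print(candidate_slice(sentence, start=1))           # ['quick', 'brown', 'fox', 'jumps']
print(candidate_slice(sentence, start=1, limit=2))  # ['quick', 'brown']
```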

Returns: A tuple containing (token, label) Tensors. Each output Tensor is rank-1 and has the same type as input_tensor.

Raises ValueError: If vocab_freq_table is not provided but vocab_min_count, vocab_subsampling, or corpus_size is specified, or if vocab_subsampling and corpus_size are not both present or both absent.
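The two error conditions amount to a small consistency check on the optional vocabulary arguments. The following sketch restates them in plain Python; `check_vocab_args` is a hypothetical function written for this illustration, and the error messages are invented, not those of the library.

```python
def check_vocab_args(vocab_freq_table=None, vocab_min_count=None,
                     vocab_subsampling=None, corpus_size=None):
    """Raise ValueError for the argument combinations described above."""
    dependent = (vocab_min_count, vocab_subsampling, corpus_size)
    if vocab_freq_table is None and any(arg is not None for arg in dependent):
        raise ValueError(
            "vocab_freq_table is required when vocab_min_count, "
            "vocab_subsampling, or corpus_size is given")
    # vocab_subsampling and corpus_size must be given together or not at all.
    if (vocab_subsampling is None) != (corpus_size is None):
        raise ValueError(
            "vocab_subsampling and corpus_size must both be present "
            "or both be absent")


# Accepted: table plus a minimum count, no subsampling.
check_vocab_args(vocab_freq_table={"the": 50}, vocab_min_count=2)

# Rejected: subsampling threshold without a corpus size.
try:
    check_vocab_args(vocab_freq_table={"the": 50}, vocab_subsampling=1e-3)
except ValueError as err:
    print("rejected:", err)
```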