Maps the terms in x to their term frequency * inverse document frequency.
tft.tfidf(
    x: tf.SparseTensor,
    vocab_size: int,
    smooth: bool = True,
    name: Optional[str] = None
) -> Tuple[tf.SparseTensor, tf.SparseTensor]
The term frequency of a term in a document is calculated as (count of term in document) / (document size).
The inverse document frequency of a term is, by default (smooth=True), calculated as 1 + log((corpus size + 1) / (count of documents containing term + 1)).
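To make the two formulas concrete, here is a minimal pure-Python sketch that applies them to two small documents. It is not how tft.tfidf is implemented: the helper name tfidf_by_hand is hypothetical, each inner list is assumed to be one document, and log is assumed to be the natural logarithm.

```python
import math
from collections import Counter


def tfidf_by_hand(documents):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    corpus_size = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        doc_weights = {}
        for term, count in counts.items():
            # Term frequency: (count of term in document) / (document size).
            tf = count / len(doc)
            # Smoothed inverse document frequency, as in the default formula.
            idf = 1 + math.log((corpus_size + 1) / (doc_freq[term] + 1))
            doc_weights[term] = tf * idf
        weights.append(doc_weights)
    return weights


docs = [["I", "like", "pie", "pie", "pie"], ["yum", "yum", "pie"]]
for doc_weights in tfidf_by_hand(docs):
    print({term: round(w, 4) for term, w in doc_weights.items()})
```

For these two documents, "pie" appears in both, so its inverse document frequency is 1 + log(3/3) = 1 and its weight equals its term frequency (3/5 and 1/3). The other terms each appear in one document and get the factor 1 + log(3/2).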
Example usage:
import tempfile

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam

def preprocessing_fn(inputs):
  # Map each string token to an integer vocabulary id.
  integerized = tft.compute_and_apply_vocabulary(inputs['x'])
  vocab_size = tft.get_num_buckets_for_transformed_feature(integerized)
  vocab_index, tfidf_weight = tft.tfidf(integerized, vocab_size)
  return {
      'index': vocab_index,
      'tf_idf': tfidf_weight,
      'integerized': integerized,
  }

raw_data = [dict(x=["I", "like", "pie", "pie", "pie"]),
            dict(x=["yum", "yum", "pie"])]
feature_spec = dict(x=tf.io.VarLenFeature(tf.string))
raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset
transformed_data
# Value of transformed_data:
# [{'index': array([0, 2, 3]), 'integerized': array([3, 2, 0, 0, 0]),
#   'tf_idf': array([0.6, 0.28109303, 0.28109303], dtype=float32)},
#  {'index': array([0, 1]), 'integerized': array([1, 1, 0]),
#   'tf_idf': array([0.33333334, 0.9369768 ], dtype=float32)}]
The same computation shown schematically, with a different vocabulary assignment (pie=0, I=1, like=2, yum=3):
example strings: [["I", "like", "pie", "pie", "pie"], ["yum", "yum", "pie"]]
in: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                          [1, 0], [1, 1], [1, 2]],
                 values=[1, 2, 0, 0, 0, 3, 3, 0])
out: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]],
                  values=[1, 2, 0, 3, 0])
     SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]],
                  values=[(1/5)*(log(3/2)+1), (1/5)*(log(3/2)+1), (3/5),
                          (2/3)*(log(3/2)+1), (1/3)])
Returns | |
---|---|
Two SparseTensors with indices [index_in_batch, index_in_bag_of_words]. The first has values vocab_index, which is taken from input x. The second has values tfidf_weight. |
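In the example on this page the two returned tensors carry identical indices, so their values can be paired position by position. The sketch below illustrates this with plain Python lists standing in for the indices and values of the two SparseTensors (weights rounded from that example). The helper pair_outputs is hypothetical and not part of tft.

```python
from collections import defaultdict


def pair_outputs(indices, vocab_index_values, tfidf_values):
    """Group (vocab_index, tfidf_weight) pairs by batch row (hypothetical helper)."""
    per_row = defaultdict(dict)
    # Each index is [index_in_batch, index_in_bag_of_words]; only the row is needed here.
    for (row, _), vocab_id, weight in zip(indices, vocab_index_values, tfidf_values):
        per_row[row][vocab_id] = weight
    return dict(per_row)


# Stand-ins for the indices shared by both outputs and for their values.
indices = [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]]
vocab_index_values = [1, 2, 0, 3, 0]
tfidf_values = [0.2811, 0.2811, 0.6, 0.9370, 0.3333]

print(pair_outputs(indices, vocab_index_values, tfidf_values))
```

Each key of the result is a position in the batch, and each inner dict maps a vocabulary id to its weight in that document.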
Raises | |
---|---|
ValueError if x does not have 2 dimensions. |