This Colab demonstrates the Universal Sentence Encoder CMLM model using the SentEval toolkit, a library for measuring the quality of sentence embeddings. The SentEval toolkit includes a diverse set of downstream tasks that evaluate the generalization power of an embedding model and the linguistic properties it encodes.
Run the first two code blocks to set up the environment; in the third code block you can pick a SentEval task to evaluate the model. A GPU runtime is recommended for running this Colab.
To learn more about the Universal Sentence Encoder CMLM model, see https://openreview.net/forum?id=WDVD4lUCTzU
Install dependencies
pip install --quiet "tensorflow-text==2.11.*"
pip install --quiet torch==1.8.1
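As an optional sanity check (not part of the original notebook), you can verify that the pinned dependencies import correctly before moving on:
# Optional sanity check: confirm the pinned dependencies import cleanly.
import tensorflow as tf
import tensorflow_text  # registers the custom ops used by the TF Hub preprocessor
import torch

print('TensorFlow:', tf.__version__)  # expected 2.11.x to match tensorflow-text==2.11.*
print('PyTorch:', torch.__version__)  # expected 1.8.1; SentEval uses PyTorch for its classifiers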
Download SentEval and task data
This step downloads SentEval from GitHub and executes the data script to download the task data. It may take up to 5 minutes to complete.
Install SentEval and download task data
rm -rf ./SentEval
git clone https://github.com/facebookresearch/SentEval.git
cd $PWD/SentEval/data/downstream && bash get_transfer_data.bash > /dev/null 2>&1
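Because the data script runs silently, a quick optional check (not in the original notebook) is to list the downstream data directory; the exact folder names depend on the SentEval version:
# Optional check: the downstream directory should contain per-task folders (e.g. CR, MR, SUBJ, ...).
import os
downstream_dir = os.path.join(os.getcwd(), 'SentEval', 'data', 'downstream')
print(sorted(os.listdir(downstream_dir)))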
Execute a SentEval evaluation task
The following code block executes a SentEval task and outputs the results. Choose one of the following tasks to evaluate the USE CMLM model:
MR CR SUBJ MPQA SST TREC MRPC SICK-E
Select a model, params and task to run. The 'rapid prototyping' params can be used to reduce computation time and get results faster.
It typically takes 5-15 mins to complete a task with the 'rapid prototyping' params and up to an hour with the 'slower, best performance' params.
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                        'tenacity': 3, 'epoch_size': 2}
For better results, use the 'slower, best performance' params; computation may take up to 1 hour:
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 16,
                        'tenacity': 5, 'epoch_size': 6}
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import sys
sys.path.append(f'{os.getcwd()}/SentEval')
import tensorflow as tf
# Prevent TF from claiming all GPU memory so there is some left for pytorch.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Memory growth needs to be the same across GPUs.
  for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
import tensorflow_hub as hub
import tensorflow_text
import senteval
import time
PATH_TO_DATA = f'{os.getcwd()}/SentEval/data'
MODEL = 'https://tfhub.dev/google/universal-sentence-encoder-cmlm/en-base/1'
PARAMS = 'rapid prototyping'
TASK = 'CR'
params_prototyping = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params_prototyping['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                                    'tenacity': 3, 'epoch_size': 2}
params_best = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params_best['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 16,
                             'tenacity': 5, 'epoch_size': 6}
params = params_best if PARAMS == 'slower, best performance' else params_prototyping
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/en-base/1")
# Scalar string input per example; the preprocessor handles tokenization.
inputs = tf.keras.Input(shape=tf.shape(''), dtype=tf.string)
outputs = encoder(preprocessor(inputs))
model = tf.keras.Model(inputs=inputs, outputs=outputs)
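# Optional sanity check (not in the original notebook): embed one sentence and inspect
# the output. The "default" key is the same one the batcher below relies on; the
# expected width of 768 assumes the en-base encoder.
sample_embedding = model.predict(tf.constant(['SentEval evaluates sentence embeddings.']))['default']
print('Sample embedding shape:', sample_embedding.shape)  # e.g. (1, 768) for en-base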
def prepare(params, samples):
  # No task-specific preparation is needed for this model.
  return

def batcher(_, batch):
  # SentEval passes each sentence as a list of tokens; join them back into strings.
  batch = [' '.join(sent) if sent else '.' for sent in batch]
  return model.predict(tf.constant(batch))["default"]
se = senteval.engine.SE(params, batcher, prepare)
print("Evaluating task %s with %s parameters" % (TASK, PARAMS))
start = time.time()
results = se.eval(TASK)
end = time.time()
print('Time taken on task %s: %.1f seconds' % (TASK, end - start))
print(results)
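The contents of results vary by task. For the classification tasks listed above, SentEval typically reports dev and test accuracy; a minimal sketch of pulling those fields out, assuming the usual 'devacc' and 'acc' keys, is:
# Minimal sketch, assuming the standard SentEval result keys for classification tasks.
if isinstance(results, dict):
    if 'devacc' in results:
        print('Dev accuracy: %.2f' % results['devacc'])
    if 'acc' in results:
        print('Test accuracy: %.2f' % results['acc'])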
Learn More
- Find more text embedding models on TensorFlow Hub
- See also the Multilingual Universal Sentence Encoder CMLM model
- Check out other Universal Sentence Encoder models
Reference
- Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve. Universal Sentence Representations Learning with Conditional Masked Language Model. November 2020