
Multilingual Universal Sentence Encoder Q&A Retrieval


This is a demo of using the Universal Encoder Multilingual Q&A model for question-answer retrieval over text, illustrating the use of the model's question_encoder and response_encoder. Sentences from SQuAD paragraphs serve as the demo dataset: each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built with the simpleneighbors library for question-answer retrieval.

At retrieval time, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder. That embedding is used to query the simpleneighbors index, which returns a list of approximate nearest neighbors in semantic space.
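To make the idea of "nearest neighbors in semantic space" concrete, here is a minimal sketch that ranks a few hand-made vectors by cosine similarity to a query vector. It is an exact brute-force search in plain Python, not the approximate search that simpleneighbors performs, and the three-dimensional vectors, labels, and function names are hypothetical stand-ins for real model embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_by_cosine(query_vec, items, n=2):
    """Return the labels of the n items most similar to query_vec.

    items is a list of (label, vector) pairs.
    """
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True)
    return [label for label, _ in ranked[:n]]

# Hypothetical toy "embeddings"; real embeddings come from the model.
toy_index = [
    ('sentence A', [1.0, 0.0, 0.0]),
    ('sentence B', [0.0, 1.0, 0.0]),
    ('sentence C', [0.9, 0.1, 0.0]),
]
toy_query = [1.0, 0.05, 0.0]

top_labels = nearest_by_cosine(toy_query, toy_index, n=2)
print(top_labels)
```

An approximate index trades a little accuracy for speed, which matters once the corpus grows beyond what a full scan can handle comfortably.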

More models

You can find all currently hosted text embedding models here, and all models that have also been trained on SQuAD here.

Setup

Setup Environment

%%capture
# Install tensorflow_text and the helper libraries used below.
!pip install -q tensorflow_text
!pip install -q simpleneighbors[annoy]
!pip install -q nltk
!pip install -q tqdm

Setup common imports and functions

import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib.request  # download_squad() calls urllib.request.urlopen
from IPython.display import HTML, display
from tqdm.notebook import tqdm

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')


def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences)) # remove duplicates

def extract_questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>'+text[i:i+len(highlight)]+'</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    <p>Answer:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % (query_text , answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences :
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))

[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

Run the following code block to download the SQuAD dataset and extract it into two lists:

sentences is a list of (text, context) tuples: each paragraph from the SQuAD dataset is split into sentences using the nltk library, and each sentence together with its paragraph text forms one (text, context) tuple.
questions is a list of (question, answer) tuples.
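As a rough illustration of the two shapes described above, the sketch below builds both lists from a tiny hand-written dictionary laid out like the SQuAD JSON. The toy data and the naive_sentence_split helper are assumptions made for this example; the helper is only a simple regex stand-in for nltk.tokenize.sent_tokenize, so the snippet runs without any downloads.

```python
import re

# Hypothetical miniature dataset with the same nesting as the SQuAD JSON.
toy_squad = {
    'data': [{
        'paragraphs': [{
            'context': 'Cats sleep a lot. Dogs like to play.',
            'qas': [
                {'question': 'What do dogs like?',
                 'answers': [{'text': 'to play'}]},
                {'question': 'Unanswered question?', 'answers': []},
            ],
        }]
    }]
}

def naive_sentence_split(text):
    # Simplified stand-in for nltk.tokenize.sent_tokenize: split after
    # sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

toy_sentences = []
toy_questions = []
for article in toy_squad['data']:
    for paragraph in article['paragraphs']:
        context = paragraph['context']
        # Pair every sentence with the full paragraph it came from.
        for sentence in naive_sentence_split(context):
            toy_sentences.append((sentence, context))
        # Keep only questions that have at least one answer.
        for qa in paragraph['qas']:
            if qa['answers']:
                toy_questions.append(
                    (qa['question'], qa['answers'][0]['text']))

print(toy_sentences)
print(toy_questions)
```

Carrying the whole paragraph alongside each sentence is what lets the response encoder see the surrounding context as well as the sentence itself.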

Download and extract SQuAD data

squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'

squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print("%s sentences, %s questions extracted from SQuAD %s" % (len(sentences), len(questions), squad_url))

print("\nExample sentence and context:\n")
sentence = random.choice(sentences)
print("sentence:\n")
pprint.pprint(sentence[0])
print("\ncontext:\n")
pprint.pprint(sentence[1])
print()

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('The Mongol Emperors had built large palaces and pavilions, but some still '
 'continued to live as nomads at times.')

context:

("Since its invention in 1269, the 'Phags-pa script, a unified script for "
 'spelling Mongolian, Tibetan, and Chinese languages, was preserved in the '
 'court until the end of the dynasty. Most of the Emperors could not master '
 'written Chinese, but they could generally converse well in the language. The '
 'Mongol custom of long standing quda/marriage alliance with Mongol clans, the '
 'Onggirat, and the Ikeres, kept the imperial blood purely Mongol until the '
 'reign of Tugh Temur, whose mother was a Tangut concubine. The Mongol '
 'Emperors had built large palaces and pavilions, but some still continued to '
 'live as nomads at times. Nevertheless, a few other Yuan emperors actively '
 'sponsored cultural activities; an example is Tugh Temur (Emperor Wenzong), '
 'who wrote poetry, painted, read Chinese classical texts, and ordered the '
 'compilation of books.')

The following code block loads the Universal Encoder Multilingual Q&A model from TensorFlow Hub, which provides the question_encoder and response_encoder signatures.

Load model from TensorFlow Hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3"
model = hub.load(module_url)

The following code block computes the embeddings for all the (text, context) tuples using the response_encoder and stores them in a simpleneighbors index.

Compute embeddings and build simpleneighbors index

batch_size = 100

# Encode a single (sentence, context) pair first, only to discover the
# embedding dimension that the index needs.
encodings = model.signatures['response_encoder'](
  input=tf.constant([sentences[0][0]]),
  context=tf.constant([sentences[0][1]]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for %s sentences' % len(sentences))
# Group sentences into tuples of exactly batch_size items. Note that this
# zip idiom silently drops a trailing partial batch.
slices = zip(*(iter(sentences),) * batch_size)
num_batches = int(len(sentences) / batch_size)
for s in tqdm(slices, total=num_batches):
  response_batch = [r for r, c in s]
  context_batch = [c for r, c in s]
  encodings = model.signatures['response_encoder'](
    input=tf.constant(response_batch),
    context=tf.constant(context_batch)
  )
  # Store each sentence under its embedding so it can be retrieved later.
  for batch_index, batch in enumerate(response_batch):
    index.add_one(batch, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for %s sentences built.' % len(sentences))

Computing embeddings for 10455 sentences
0%|          | 0/104 [00:00<?, ?it/s]
simpleneighbors index for 10455 sentences built.
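The batching idiom in the code above, zip(*(iter(sentences),) * batch_size), yields only full batches. With 10455 sentences and a batch size of 100 that is 104 batches, so the last 55 sentences would not be encoded. The sketch below shows that behavior on a small list and, as one possible alternative rather than the notebook's own method, a slice-based version that keeps the remainder. The function names are hypothetical.

```python
def full_batches(items, batch_size):
    # Same idiom as the notebook: one shared iterator repeated batch_size
    # times, so zip pulls consecutive items and stops at the first shortfall.
    return list(zip(*(iter(items),) * batch_size))

def all_batches(items, batch_size):
    # Slice-based alternative that also returns the trailing partial batch.
    return [items[i:i + batch_size]
            for i in range(0, len(items), batch_size)]

toy_items = list(range(7))
print(full_batches(toy_items, 3))
print(all_batches(toy_items, 3))
```

If every sentence should be searchable, the slice-based form is the safer choice; the progress-bar total would then need to count the extra partial batch as well.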

At retrieval time, the question is encoded using the question_encoder, and the resulting question embedding is used to query the simpleneighbors index.

Retrieve nearest neighbors for a random question from SQuAD

num_results = 25

query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])