Multilingual Universal Sentence Encoder Q&A Retrieval

This is a demo of the Universal Sentence Encoder Multilingual Q&A model for question-answer retrieval of text, illustrating the use of the model's question_encoder and response_encoder. We use sentences from SQuAD paragraphs as the demo dataset. Each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built with the simpleneighbors library for question-answer retrieval.

On retrieval, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder. That embedding is used to query the simpleneighbors index, which returns a list of approximate nearest neighbors in semantic space.
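To make the idea of nearest neighbors in semantic space concrete, here is a minimal sketch that ranks a few sentences by cosine similarity to a question vector. It does not use the model or simpleneighbors: the three-dimensional vectors, the sentences, and the helper names `cosine_similarity` and `nearest` are all made up for illustration, and the search is exact brute force rather than the approximate search an index performs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query_vec, indexed, n=2):
    """Return the n texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:n]]


# Hypothetical toy "embeddings"; real embeddings come from response_encoder.
toy_index = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("Mitochondria produce energy for the cell.", [0.0, 0.2, 0.9]),
    ("France borders Spain.", [0.7, 0.6, 0.1]),
]

# Hypothetical embedding of a question such as "What is the capital of France?"
toy_question_vec = [1.0, 0.2, 0.0]

print(nearest(toy_question_vec, toy_index, n=2))
```

With real embeddings the ranking principle is the same; an approximate index trades a little accuracy for much faster lookups over large collections.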

More models

You can find all currently hosted text embedding models, as well as all models trained on SQuAD, on TensorFlow Hub.

Setup

Setup Environment

%%capture
# Install tensorflow_text, which pulls in a compatible TensorFlow version.
!pip install -q tensorflow_text
!pip install -q simpleneighbors[annoy]
!pip install -q nltk
!pip install -q tqdm

Setup common imports and functions

import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib.request
from IPython.display import HTML, display
from tqdm.notebook import tqdm

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
# Importing tensorflow_text registers the custom ops the model depends on.
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')


def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences)) # remove duplicates

def extract_questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>'+text[i:i+len(highlight)]+'</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    <p>Answer:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % (query_text , answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences :
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))

[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

Run the following code block to download and extract the SQuAD dataset into:

sentences is a list of (text, context) tuples. Each paragraph from the SQuAD dataset is split into sentences using the nltk library, and each sentence together with its paragraph text forms a (text, context) tuple.
questions is a list of (question, answer) tuples.
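As a rough illustration of the (text, context) structure, the sketch below pairs every sentence of a paragraph with the full paragraph. It is an assumption-laden stand-in: `split_sentences` is a naive regex splitter used here only so the example runs without nltk, `to_sentence_context_pairs` is a hypothetical helper name, and the paragraph is a toy example rather than SQuAD data.

```python
import re


def split_sentences(paragraph):
    """Naive stand-in for nltk.tokenize.sent_tokenize.

    Splits after '.', '!' or '?' when followed by whitespace. Real sentence
    tokenizers handle abbreviations and other edge cases that this does not.
    """
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]


def to_sentence_context_pairs(paragraph):
    """Pair each sentence with the whole paragraph it came from."""
    return [(sentence, paragraph) for sentence in split_sentences(paragraph)]


toy_paragraph = (
    "SQuAD is a reading comprehension dataset. "
    "Its questions were written by crowdworkers."
)

pairs = to_sentence_context_pairs(toy_paragraph)
for sentence, context in pairs:
    print(repr(sentence), "->", len(context), "context characters")
```

Every tuple from the same paragraph shares the same context string, which is what lets the response encoder see the text surrounding each sentence.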

Download and extract SQuAD data

squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'

squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print("%s sentences, %s questions extracted from SQuAD %s" % (len(sentences), len(questions), squad_url))

print("\nExample sentence and context:\n")
sentence = random.choice(sentences)
print("sentence:\n")
pprint.pprint(sentence[0])
print("\ncontext:\n")
pprint.pprint(sentence[1])
print()

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('The Mongol Emperors had built large palaces and pavilions, but some still '
 'continued to live as nomads at times.')

context:

("Since its invention in 1269, the 'Phags-pa script, a unified script for "
 'spelling Mongolian, Tibetan, and Chinese languages, was preserved in the '
 'court until the end of the dynasty. Most of the Emperors could not master '
 'written Chinese, but they could generally converse well in the language. The '
 'Mongol custom of long standing quda/marriage alliance with Mongol clans, the '
 'Onggirat, and the Ikeres, kept the imperial blood purely Mongol until the '
 'reign of Tugh Temur, whose mother was a Tangut concubine. The Mongol '
 'Emperors had built large palaces and pavilions, but some still continued to '
 'live as nomads at times. Nevertheless, a few other Yuan emperors actively '
 'sponsored cultural activities; an example is Tugh Temur (Emperor Wenzong), '
 'who wrote poetry, painted, read Chinese classical texts, and ordered the '
 'compilation of books.')

The following code block loads the Universal Sentence Encoder Multilingual Q&A model, which exposes the question_encoder and response_encoder signatures.

Load model from TensorFlow Hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3"
model = hub.load(module_url)

The following code block computes the embeddings for all the (text, context) tuples using the response_encoder and stores them in a simpleneighbors index.

Compute embeddings and build simpleneighbors index

batch_size = 100

# Encode one example first to discover the embedding dimensionality.
encodings = model.signatures['response_encoder'](
  input=tf.constant([sentences[0][0]]),
  context=tf.constant([sentences[0][1]]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for %s sentences' % len(sentences))
# Group sentences into tuples of exactly batch_size items. This zip idiom
# yields full batches only, so a final partial batch is not indexed.
slices = zip(*(iter(sentences),) * batch_size)
num_batches = len(sentences) // batch_size
for s in tqdm(slices, total=num_batches):
  response_batch = [r for r, c in s]
  context_batch = [c for r, c in s]
  encodings = model.signatures['response_encoder'](
    input=tf.constant(response_batch),
    context=tf.constant(context_batch)
  )
  # Key each embedding by its sentence text so lookups return sentences.
  for batch_index, batch in enumerate(response_batch):
    index.add_one(batch, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for %s sentences built.' % len(sentences))

Computing embeddings for 10455 sentences
0%|          | 0/104 [00:00<?, ?it/s]
simpleneighbors index for 10455 sentences built.
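The `zip(*(iter(sentences),) * batch_size)` idiom in the indexing cell produces only complete batches, so when the number of sentences is not a multiple of `batch_size` the trailing sentences are skipped. The sketch below compares that idiom with a slicing-based alternative that keeps the final partial batch. The `batched` helper and the toy data are illustrative assumptions, not part of the original notebook.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, including a shorter final slice."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy stand-in for the list of (text, context) tuples.
toy_sentences = [("sentence %d" % i, "context %d" % i) for i in range(250)]
toy_batch_size = 100

# The idiom from the indexing cell: every batch has exactly toy_batch_size items.
full_only = list(zip(*(iter(toy_sentences),) * toy_batch_size))

# Slicing keeps the remainder as a smaller last batch.
with_tail = list(batched(toy_sentences, toy_batch_size))

print(len(full_only), sum(len(b) for b in full_only))  # 2 batches, 200 items
print(len(with_tail), sum(len(b) for b in with_tail))  # 3 batches, 250 items
```

If every sentence should be searchable, iterating over something like `batched(sentences, batch_size)` and setting the progress bar total to the rounded-up batch count would cover the remainder as well.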

On retrieval, the question is encoded using the question_encoder, and the resulting question embedding is used to query the simpleneighbors index.

Retrieve nearest neighbors for a random question from SQuAD

num_results = 25

query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])