This tutorial demonstrates two ways to load and preprocess text.
- First, you will use Keras utilities and preprocessing layers. These include tf.keras.utils.text_dataset_from_directory to turn data into a tf.data.Dataset and tf.keras.layers.TextVectorization for data standardization, tokenization, and vectorization. If you are new to TensorFlow, you should start with these.
- Next, you will use lower-level utilities like tf.data.TextLineDataset to load text files, and TensorFlow Text APIs, such as text.UnicodeScriptTokenizer and text.case_fold_utf8, to preprocess the data for finer-grained control.
# Be sure you're using the stable versions of both `tensorflow` and
# `tensorflow-text`, for binary compatibility.
pip uninstall -y tf-nightly keras-nightly
pip install tensorflow
pip install tensorflow-text
import collections
import pathlib
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization
import tensorflow_datasets as tfds
import tensorflow_text as tf_text
Example 1: Predict the tag for a Stack Overflow question
As a first example, you will download a dataset of programming questions from Stack Overflow. Each question ("How do I sort a dictionary by value?") is labeled with exactly one tag (Python, CSharp, JavaScript, or Java). Your task is to develop a model that predicts the tag for a question. This is an example of multi-class classification, an important and widely applicable kind of machine learning problem.
Download and explore the dataset
Begin by downloading the Stack Overflow dataset using tf.keras.utils.get_file and exploring the directory structure:
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
dataset_dir = utils.get_file(
origin=data_url,
untar=True,
cache_dir='stack_overflow',
cache_subdir='')
dataset_dir = pathlib.Path(dataset_dir).parent
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz 6053888/6053168 [==============================] - 0s 0us/step 6062080/6053168 [==============================] - 0s 0us/step
list(dataset_dir.iterdir())
[PosixPath('/tmp/.keras/train'), PosixPath('/tmp/.keras/README.md'), PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz'), PosixPath('/tmp/.keras/test')]
train_dir = dataset_dir/'train'
list(train_dir.iterdir())
[PosixPath('/tmp/.keras/train/java'), PosixPath('/tmp/.keras/train/csharp'), PosixPath('/tmp/.keras/train/javascript'), PosixPath('/tmp/.keras/train/python')]
The train/csharp, train/java, train/python, and train/javascript directories contain many text files, each of which is a Stack Overflow question.
Print an example file and inspect the data:
sample_file = train_dir/'python/1755.txt'
with open(sample_file) as f:
  print(f.read())
why does this blank program print true x=true.def stupid():. x=false.stupid().print x
Load the dataset
Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the tf.keras.utils.text_dataset_from_directory utility to create a labeled tf.data.Dataset. If you are new to tf.data, it is a powerful collection of tools for building input pipelines. (Learn more in the tf.data: Build TensorFlow input pipelines guide.)
The tf.keras.utils.text_dataset_from_directory API expects a directory structure as follows:
train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt
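To make the layout above concrete, here is a minimal plain-Python sketch of how labels can be inferred from such a tree. It is not the Keras implementation; `infer_labels` and the tiny temporary directory are hypothetical, and the only behavior borrowed from Keras is that class names are taken from the subdirectory names in sorted order.

```python
import pathlib
import tempfile

def infer_labels(root):
    """Return (class_names, samples) for a root directory with one subfolder per class."""
    root = pathlib.Path(root)
    # Sort subdirectory names so that label indices are deterministic.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for label, name in enumerate(class_names):
        for path in sorted((root / name).glob("*.txt")):
            samples.append((path.read_text(), label))
    return class_names, samples

# Build a tiny, made-up directory tree to try the function on.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["python", "csharp", "javascript", "java"]:
        class_dir = pathlib.Path(tmp) / name
        class_dir.mkdir()
        (class_dir / "1.txt").write_text(f"a question about {name}")
    class_names, samples = infer_labels(tmp)

print(class_names)  # ['csharp', 'java', 'javascript', 'python']
```

Each subdirectory becomes one class, and its position in the sorted list becomes the integer label.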
When running a machine learning experiment, it is a best practice to divide your dataset into three splits: training, validation, and test.
The Stack Overflow dataset has already been divided into training and test sets, but it lacks a validation set.
Create a validation set using an 80:20 split of the training data by calling tf.keras.utils.text_dataset_from_directory with validation_split set to 0.2 (that is, 20%):
batch_size = 32
seed = 42
raw_train_ds = utils.text_dataset_from_directory(
train_dir,
batch_size=batch_size,
validation_split=0.2,
subset='training',
seed=seed)
Found 8000 files belonging to 4 classes. Using 6400 files for training.
As the previous cell output suggests, there are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. You will learn in a moment that you can train a model by passing a tf.data.Dataset directly to Model.fit.
First, iterate over the dataset and print out a few examples to get a feel for the data.
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])
Question: b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon(). {. mynumsides = 5;. mysidelength = 30;. }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength). {. mynumsides = numsides;. mysidelength = sidelength;. }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);. shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default constructor, which therefor ruins the whole rest of the program. can somebody help me?..for those of you who want to see more of my code: here you go..public double vertexangle(). {. system.out.println(""the vertex angle method: "" + mynumsides);// prints out 5. system.out.println(""the vertex angle method: "" + mysidelength); // prints out 30.. double vertexangle;. vertexangle = ((mynumsides - 2.0) / mynumsides) * 180.0;. return vertexangle;. }//end method vertexangle..public void menu().{. system.out.println(mynumsides); // prints out what the user puts in. system.out.println(mysidelength); // prints out what the user puts in. gotographic();. calcr(mynumsides, mysidelength);. calcr(mynumsides, mysidelength);. print(); .}// end menu...this is my entire tester class:..public static void main(string[] arg).{. int numsides;. double sidelength;. scanner keyboard = new scanner(system.in);.. system.out.println(""welcome to the regular polygon program!"");. system.out.println();.. 
system.out.print(""enter the number of sides of the polygon ==> "");. numsides = keyboard.nextint();. system.out.println();.. system.out.print(""enter the side length of each side ==> "");. sidelength = keyboard.nextdouble();. system.out.println();.. regularpolygon shape = new regularpolygon(numsides, sidelength);. shape.menu();.}//end main...for testing it i sent it numsides 4 and sidelength 100."\n' Label: 1 Question: b'"blank code slow skin detection this code changes the color space to lab and using a threshold finds the skin area of an image. but it\'s ridiculously slow. i don\'t know how to make it faster ? ..from colormath.color_objects import *..def skindetection(img, treshold=80, color=[255,20,147]):.. print img.shape. res=img.copy(). for x in range(img.shape[0]):. for y in range(img.shape[1]):. rgbimg=rgbcolor(img[x,y,0],img[x,y,1],img[x,y,2]). labimg=rgbimg.convert_to(\'lab\', debug=false). if (labimg.lab_l > treshold):. res[x,y,:]=color. else: . res[x,y,:]=img[x,y,:].. return res"\n' Label: 3 Question: b'"option and validation in blank i want to add a new option on my system where i want to add two text files, both rental.txt and customer.txt. inside each text are id numbers of the customer, the videotape they need and the price...i want to place it as an option on my code. right now i have:...add customer.rent return.view list.search.exit...i want to add this as my sixth option. say for example i ordered a video, it would display the price and would let me confirm the price and if i am going to buy it or not...here is my current code:.. import blank.io.*;. import blank.util.arraylist;. import static blank.lang.system.out;.. public class rentalsystem{. static bufferedreader input = new bufferedreader(new inputstreamreader(system.in));. static file file = new file(""file.txt"");. static arraylist<string> list = new arraylist<string>();. static int rows;.. public static void main(string[] args) throws exception{. introduction();. 
system.out.print(""nn"");. login();. system.out.print(""nnnnnnnnnnnnnnnnnnnnnn"");. introduction();. string repeat;. do{. loadfile();. system.out.print(""nwhat do you want to do?nn"");. system.out.print(""n - - - - - - - - - - - - - - - - - - - - - - -"");. system.out.print(""nn | 1. add customer | 2. rent return |n"");. system.out.print(""n - - - - - - - - - - - - - - - - - - - - - - -"");. system.out.print(""nn | 3. view list | 4. search |n"");. system.out.print(""n - - - - - - - - - - - - - - - - - - - - - - -"");. system.out.print(""nn | 5. exit |n"");. system.out.print(""n - - - - - - - - - -"");. system.out.print(""nnchoice:"");. int choice = integer.parseint(input.readline());. switch(choice){. case 1:. writedata();. break;. case 2:. rentdata();. break;. case 3:. viewlist();. break;. case 4:. search();. break;. case 5:. system.out.println(""goodbye!"");. system.exit(0);. default:. system.out.print(""invalid choice: "");. break;. }. system.out.print(""ndo another task? [y/n] "");. repeat = input.readline();. }while(repeat.equals(""y""));.. if(repeat!=""y"") system.out.println(""ngoodbye!"");.. }.. public static void writedata() throws exception{. system.out.print(""nname: "");. string cname = input.readline();. system.out.print(""address: "");. string add = input.readline();. system.out.print(""phone no.: "");. string pno = input.readline();. system.out.print(""rental amount: "");. string ramount = input.readline();. system.out.print(""tapenumber: "");. string tno = input.readline();. system.out.print(""title: "");. string title = input.readline();. system.out.print(""date borrowed: "");. string dborrowed = input.readline();. system.out.print(""due date: "");. string ddate = input.readline();. createline(cname, add, pno, ramount,tno, title, dborrowed, ddate);. rentdata();. }.. public static void createline(string name, string address, string phone , string rental, string tapenumber, string title, string borrowed, string due) throws exception{. 
filewriter fw = new filewriter(file, true);. fw.write(""nname: ""+name + ""naddress: "" + address +""nphone no.: ""+ phone+""nrentalamount: ""+rental+""ntape no.: ""+ tapenumber+""ntitle: ""+ title+""ndate borrowed: ""+borrowed +""ndue date: ""+ due+"":rn"");. fw.close();. }.. public static void loadfile() throws exception{. try{. list.clear();. fileinputstream fstream = new fileinputstream(file);. bufferedreader br = new bufferedreader(new inputstreamreader(fstream));. rows = 0;. while( br.ready()). {. list.add(br.readline());. rows++;. }. br.close();. } catch(exception e){. system.out.println(""list not yet loaded."");. }. }.. public static void viewlist(){. system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. system.out.print("" |list of all costumers|"");. system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. for(int i = 0; i <rows; i++){. system.out.println(list.get(i));. }. }. public static void rentdata()throws exception. { system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. system.out.print("" |rent data list|"");. system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. system.out.print(""nenter customer name: "");. string cname = input.readline();. system.out.print(""date borrowed: "");. string dborrowed = input.readline();. system.out.print(""due date: "");. string ddate = input.readline();. system.out.print(""return date: "");. string rdate = input.readline();. system.out.print(""rent amount: "");. string ramount = input.readline();.. system.out.print(""you pay:""+ramount);... }. public static void search()throws exception. { system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. system.out.print("" |search costumers|"");. system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. system.out.print(""nenter costumer name: "");. string cname = input.readline();. boolean found = false;.. for(int i=0; i < rows; i++){. string temp[] = list.get(i).split("","");.. if(cname.equals(temp[0])){. 
system.out.println(""search result:nyou are "" + temp[0] + "" from "" + temp[1] + "".""+ temp[2] + "".""+ temp[3] + "".""+ temp[4] + "".""+ temp[5] + "" is "" + temp[6] + "".""+ temp[7] + "" is "" + temp[8] + ""."");. found = true;. }. }.. if(!found){. system.out.print(""no results."");. }.. }.. public static boolean evaluate(string uname, string pass){. if (uname.equals(""admin"")&&pass.equals(""12345"")) return true;. else return false;. }.. public static string login()throws exception{. bufferedreader input=new bufferedreader(new inputstreamreader(system.in));. int counter=0;. do{. system.out.print(""username:"");. string uname =input.readline();. system.out.print(""password:"");. string pass =input.readline();.. boolean accept= evaluate(uname,pass);.. if(accept){. break;. }else{. system.out.println(""incorrect username or password!"");. counter ++;. }. }while(counter<3);.. if(counter !=3) return ""login successful"";. else return ""login failed"";. }. public static void introduction() throws exception{.. system.out.println("" - - - - - - - - - - - - - - - - - - - - - - - - -"");. system.out.println("" ! r e n t a l !"");. system.out.println("" ! ~ ~ ~ ~ ~ ! ================= ! ~ ~ ~ ~ ~ !"");. system.out.println("" ! s y s t e m !"");. system.out.println("" - - - - - - - - - - - - - - - - - - - - - - - - -"");. }..}"\n' Label: 1 Question: b'"exception: dynamic sql generation for the updatecommand is not supported against a selectcommand that does not return any key i dont know what is the problem this my code : ..string nomtable;..datatable listeetablissementtable = new datatable();.datatable listeinteretstable = new datatable();.dataset ds = new dataset();.sqldataadapter da;.sqlcommandbuilder cmdb;..private void listeinterets_click(object sender, eventargs e).{. nomtable = ""listeinteretstable"";. d.cnx.open();. da = new sqldataadapter(""select nome from offices"", d.cnx);. ds = new dataset();. da.fill(ds, nomtable);. 
datagridview1.datasource = ds.tables[nomtable];.}..private void sauvgarder_click(object sender, eventargs e).{. d.cnx.open();. cmdb = new sqlcommandbuilder(da);. da.update(ds, nomtable);. d.cnx.close();.}"\n' Label: 0 Question: b'"parameter with question mark and super in blank, i\'ve come across a method that is formatted like this:..public final subscription subscribe(final action1<? super t> onnext, final action1<throwable> onerror) {.}...in the first parameter, what does the question mark and super mean?"\n' Label: 1 Question: b'call two objects wsdl the first time i got a very strange wsdl. ..i would like to call the object (interface - invoicecheck_out) do you know how?....i would like to call the object (variable) do you know how?..try to call (it`s ok)....try to call (how call this?)\n' Label: 0 Question: b"how to correctly make the icon for systemtray in blank using icon sizes of any dimension for systemtray doesn't look good overall. .what is the correct way of making icons for windows system tray?..screenshots: http://imgur.com/zsibwn9..icon: http://imgur.com/vsh4zo8\n" Label: 0 Question: b'"is there a way to check a variable that exists in a different script than the original one? i\'m trying to check if a variable, which was previously set to true in 2.py in 1.py, as 1.py is only supposed to continue if the variable is true...2.py..import os..completed = false..#some stuff here..completed = true...1.py..import 2 ..if completed == true. #do things...however i get a syntax error at ..if completed == true"\n' Label: 3 Question: b'"blank control flow i made a number which asks for 2 numbers with blank and responds with the corresponding message for the case. how come it doesnt work for the second number ? .regardless what i enter for the second number , i am getting the message ""your number is in the range 0-10""...using system;.using system.collections.generic;.using system.linq;.using system.text;..namespace consoleapplication1.{. class program. {. 
static void main(string[] args). {. string myinput; // declaring the type of the variables. int myint;.. string number1;. int number;... console.writeline(""enter a number"");. myinput = console.readline(); //muyinput is a string which is entry input. myint = int32.parse(myinput); // myint converts the string into an integer.. if (myint > 0). console.writeline(""your number {0} is greater than zero."", myint);. else if (myint < 0). console.writeline(""your number {0} is less than zero."", myint);. else. console.writeline(""your number {0} is equal zero."", myint);.. console.writeline(""enter another number"");. number1 = console.readline(); . number = int32.parse(myinput); .. if (number < 0 || number == 0). console.writeline(""your number {0} is less than zero or equal zero."", number);. else if (number > 0 && number <= 10). console.writeline(""your number {0} is in the range from 0 to 10."", number);. else. console.writeline(""your number {0} is greater than 10."", number);.. console.writeline(""enter another number"");.. }. } .}"\n' Label: 0 Question: b'"credentials cannot be used for ntlm authentication i am getting org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials cannot be used for ntlm authentication: exception in eclipse..whether it is possible mention eclipse to take system proxy settings directly?..public class httpgetproxy {. private static final string proxy_host = ""proxy.****.com"";. private static final int proxy_port = 6050;.. public static void main(string[] args) {. httpclient client = new httpclient();. httpmethod method = new getmethod(""https://kodeblank.org"");.. hostconfiguration config = client.gethostconfiguration();. config.setproxy(proxy_host, proxy_port);.. string username = ""*****"";. string password = ""*****"";. credentials credentials = new usernamepasswordcredentials(username, password);. authscope authscope = new authscope(proxy_host, proxy_port);.. 
client.getstate().setproxycredentials(authscope, credentials);.. try {. client.executemethod(method);.. if (method.getstatuscode() == httpstatus.sc_ok) {. string response = method.getresponsebodyasstring();. system.out.println(""response = "" + response);. }. } catch (ioexception e) {. e.printstacktrace();. } finally {. method.releaseconnection();. }. }.}...exception:... dec 08, 2017 1:41:39 pm . org.apache.commons.httpclient.auth.authchallengeprocessor selectauthscheme. info: ntlm authentication scheme selected. dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector executeconnect. severe: credentials cannot be used for ntlm authentication: . org.apache.commons.httpclient.usernamepasswordcredentials. org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials . cannot be used for ntlm authentication: . enter code here . org.apache.commons.httpclient.usernamepasswordcredentials. at org.apache.commons.httpclient.auth.ntlmscheme.authenticate(ntlmscheme.blank:332). at org.apache.commons.httpclient.httpmethoddirector.authenticateproxy(httpmethoddirector.blank:320). at org.apache.commons.httpclient.httpmethoddirector.executeconnect(httpmethoddirector.blank:491). at org.apache.commons.httpclient.httpmethoddirector.executewithretry(httpmethoddirector.blank:391). at org.apache.commons.httpclient.httpmethoddirector.executemethod(httpmethoddirector.blank:171). at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:397). at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:323). at httpgetproxy.main(httpgetproxy.blank:31). dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector processproxyauthchallenge. info: failure authenticating with ntlm @proxy.****.com:6050"\n' Label: 1
The labels are 0, 1, 2, or 3. To check which of these corresponds to which string label, inspect the class_names property on the dataset:
for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)
Label 0 corresponds to csharp Label 1 corresponds to java Label 2 corresponds to javascript Label 3 corresponds to python
Next, you will create a validation set and a test set using tf.keras.utils.text_dataset_from_directory. You will use the remaining 1,600 examples from the training set for validation.
# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(
train_dir,
batch_size=batch_size,
validation_split=0.2,
subset='validation',
seed=seed)
Found 8000 files belonging to 4 classes. Using 1600 files for validation.
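The training and validation calls use the same seed and the same validation_split, which is what keeps the two subsets from overlapping. The sketch below illustrates that idea in plain Python under simplifying assumptions: `split_indices` is a hypothetical helper, and the exact shuffling Keras performs internally is not reproduced here.

```python
import random

def split_indices(num_examples, validation_split, seed):
    """Shuffle indices with a fixed seed, then cut them into training and validation parts."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_val = int(num_examples * validation_split)
    cut = num_examples - num_val
    return indices[:cut], indices[cut:]

train_idx, val_idx = split_indices(8000, validation_split=0.2, seed=42)
print(len(train_idx), len(val_idx))  # 6400 1600

# Because the shuffle is seeded, a second call reproduces the same partition,
# so the "training" and "validation" subsets never share an example.
train_again, val_again = split_indices(8000, validation_split=0.2, seed=42)
print(set(train_idx).isdisjoint(val_again))  # True
```

With a different seed in the second call, the shuffled order would differ and some examples could end up in both subsets.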
test_dir = dataset_dir/'test'
# Create a test set.
raw_test_ds = utils.text_dataset_from_directory(
test_dir,
batch_size=batch_size)
Found 8000 files belonging to 4 classes.
Prepare the dataset for training
Next, you will standardize, tokenize, and vectorize the data using the tf.keras.layers.TextVectorization layer.
- Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.
- Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).
- Vectorization refers to converting tokens into numbers so they can be fed into a neural network.
All of these tasks can be accomplished with this layer. (You can learn more about each of these in the tf.keras.layers.TextVectorization API docs.)
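The three steps can be sketched in plain Python as follows. This is only an illustration of the concepts, not how TextVectorization is implemented: the helper names and the toy vocabulary are hypothetical, and the punctuation set only approximates the layer's default.

```python
import re
import string

def standardize(text):
    """Lowercase the text and remove ASCII punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def tokenize(text):
    """Split on runs of whitespace."""
    return text.split()

def vectorize(tokens, vocabulary):
    """Map each token to its index; unknown tokens map to index 1."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [index.get(token, 1) for token in tokens]

# Hypothetical toy vocabulary: index 0 stands for padding, index 1 for unknown tokens.
vocabulary = ["", "[UNK]", "how", "do", "i", "sort", "a", "dictionary"]

tokens = tokenize(standardize("How do I sort a dictionary by value?"))
print(tokens)
# ['how', 'do', 'i', 'sort', 'a', 'dictionary', 'by', 'value']
print(vectorize(tokens, vocabulary))
# [2, 3, 4, 5, 6, 7, 1, 1]
```

The words "by" and "value" are not in the toy vocabulary, so both collapse to the unknown-token index.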
Note that:
- The default standardization converts text to lowercase and removes punctuation (standardize='lower_and_strip_punctuation').
- The default tokenizer splits on whitespace (split='whitespace').
- The default vectorization mode is 'int' (output_mode='int'). This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, like 'binary', to build bag-of-words models.
You will build two models to learn more about standardization, tokenization, and vectorization with TextVectorization:
- First, you will use the 'binary' vectorization mode to build a bag-of-words model.
- Then, you will use the 'int' mode with a 1D ConvNet.
VOCAB_SIZE = 10000
binary_vectorize_layer = TextVectorization(
max_tokens=VOCAB_SIZE,
output_mode='binary')
For the 'int' mode, in addition to the maximum vocabulary size, you need to set an explicit maximum sequence length (MAX_SEQUENCE_LENGTH), which will cause the layer to pad or truncate sequences to exactly output_sequence_length values:
MAX_SEQUENCE_LENGTH = 250
int_vectorize_layer = TextVectorization(
max_tokens=VOCAB_SIZE,
output_mode='int',
output_sequence_length=MAX_SEQUENCE_LENGTH)
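What "pad or truncate to exactly output_sequence_length values" means can be shown with a short plain-Python sketch. `pad_or_truncate` is a hypothetical helper, not part of Keras; padding on the right with 0 is assumed here because it matches the trailing zeros in the 'int' output printed in this tutorial.

```python
def pad_or_truncate(token_ids, length, pad_value=0):
    """Return exactly `length` ids: cut long sequences, right-pad short ones."""
    clipped = token_ids[:length]
    return clipped + [pad_value] * (length - len(clipped))

print(pad_or_truncate([55, 6, 2], length=5))
# [55, 6, 2, 0, 0]
print(pad_or_truncate([55, 6, 2, 410, 211, 229], length=5))
# [55, 6, 2, 410, 211]
```

A fixed length lets every example in a batch share the same shape, at the cost of discarding tokens beyond the limit.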
Next, call TextVectorization.adapt to fit the state of the preprocessing layer to the dataset. This will cause the layer to build an index of strings to integers.
# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)
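Conceptually, adapting amounts to counting tokens across the training text and keeping the most frequent ones. The sketch below shows that idea in plain Python; `build_vocabulary` is a hypothetical helper, and the assumption that two slots are reserved (padding at index 0, unknown tokens at index 1) reflects the 'int' mode only. Tie-breaking between equally frequent tokens may differ from the real layer.

```python
import collections

def build_vocabulary(token_lists, max_tokens):
    """Keep the most frequent tokens, after two reserved entries."""
    counts = collections.Counter(t for tokens in token_lists for t in tokens)
    # Two slots are reserved: '' for padding and '[UNK]' for unknown tokens.
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

# Made-up, already tokenized corpus.
corpus = [["print", "x"], ["print", "list"], ["print", "x", "y"]]
vocabulary = build_vocabulary(corpus, max_tokens=4)
print(vocabulary)  # ['', '[UNK]', 'print', 'x']
```

Tokens that do not make the cut ("list" and "y" here) would later be mapped to the unknown-token index.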
Print the result of using these layers to preprocess the data:
def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label
def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label
# Retrieve a batch (of 32 questions and labels) from the dataset.
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)
Question tf.Tensor(b'"what is the difference between these two ways to create an element? var a = document.createelement(\'div\');..a.id = ""mydiv"";...and..var a = document.createelement(\'div\').id = ""mydiv"";...what is the difference between them such that the first one works and the second one doesn\'t?"\n', shape=(), dtype=string) Label tf.Tensor(2, shape=(), dtype=int32)
print("'binary' vectorized question:",
binary_vectorize_text(first_question, first_label)[0])
'binary' vectorized question: tf.Tensor([[1. 1. 0. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)
print("'int' vectorized question:",
int_vectorize_text(first_question, first_label)[0])
'int' vectorized question: tf.Tensor( [[ 55 6 2 410 211 229 121 895 4 124 32 245 43 5 1 1 5 1 1 6 2 410 211 191 318 14 2 98 71 188 8 2 199 71 178 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]], shape=(1, 250), dtype=int64)
As shown above, the 'binary' mode of TextVectorization returns an array denoting which tokens exist at least once in the input, while the 'int' mode replaces each token with an integer, thus preserving their order.
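The difference between the two modes can be illustrated with a small plain-Python sketch. `multi_hot` is a hypothetical helper and the token ids are made up; the point is only that a presence vector discards order while a sequence of ids keeps it.

```python
def multi_hot(token_ids, vocab_size):
    """Return a vector with 1.0 at every index that occurs at least once."""
    vector = [0.0] * vocab_size
    for i in token_ids:
        vector[i] = 1.0
    return vector

ids_a = [2, 3, 4]  # made-up ids for three tokens
ids_b = [4, 3, 2]  # the same tokens in reverse order

print(multi_hot(ids_a, vocab_size=6))
# [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
print(multi_hot(ids_a, 6) == multi_hot(ids_b, 6))  # True: order is lost
print(ids_a == ids_b)                              # False: order is kept
```

This is why the bag-of-words representation suits a simple linear model, while the integer sequence suits models such as a ConvNet that can exploit token order.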
You can look up the token (string) that each integer corresponds to by calling TextVectorization.get_vocabulary on the layer:
print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))
1289 ---> roman 313 ---> source Vocabulary size: 10000
You are nearly ready to train your model.
As a final preprocessing step, you will apply the TextVectorization layers you created earlier to the training, validation, and test sets:
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)
int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)
Configure the dataset for performance
These are two important methods you should use when loading data to make sure that I/O does not become blocking.
- Dataset.cache keeps data in memory after it is loaded off disk. This helps ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
- Dataset.prefetch overlaps data preprocessing and model execution while training.
You can learn more about both methods, as well as how to cache data to disk, in the Prefetching section of the Better performance with the tf.data API guide.
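The idea behind prefetching, preparing the next items while the current one is being consumed, can be sketched with a background thread and a bounded queue. This is a simplified plain-Python illustration, not how tf.data works internally; `prefetch` is a hypothetical helper and, for brevity, errors raised by the producer are not forwarded to the consumer.

```python
import queue
import threading

def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, produced up to `buffer_size` ahead in a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel that marks the end of the input

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks while the buffer is empty
        if item is done:
            return
        yield item

batches = list(prefetch(range(5)))
print(batches)  # [0, 1, 2, 3, 4]
```

The order of items is unchanged; only the timing differs, since production and consumption can overlap.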
AUTOTUNE = tf.data.AUTOTUNE
def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)
int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)
Train the model
It is time to create your neural network.
For the 'binary' vectorized data, define a simple bag-of-words linear model, then configure and train it:
binary_model = tf.keras.Sequential([layers.Dense(4)])
binary_model.compile(
loss=losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer='adam',
metrics=['accuracy'])
history = binary_model.fit(
binary_train_ds, validation_data=binary_val_ds, epochs=10)
Epoch 1/10 200/200 [==============================] - 2s 4ms/step - loss: 1.1170 - accuracy: 0.6509 - val_loss: 0.9165 - val_accuracy: 0.7844 Epoch 2/10 200/200 [==============================] - 1s 3ms/step - loss: 0.7781 - accuracy: 0.8169 - val_loss: 0.7522 - val_accuracy: 0.8050 Epoch 3/10 200/200 [==============================] - 1s 3ms/step - loss: 0.6274 - accuracy: 0.8591 - val_loss: 0.6664 - val_accuracy: 0.8163 Epoch 4/10 200/200 [==============================] - 1s 3ms/step - loss: 0.5342 - accuracy: 0.8866 - val_loss: 0.6129 - val_accuracy: 0.8188 Epoch 5/10 200/200 [==============================] - 1s 3ms/step - loss: 0.4683 - accuracy: 0.9038 - val_loss: 0.5761 - val_accuracy: 0.8281 Epoch 6/10 200/200 [==============================] - 1s 3ms/step - loss: 0.4181 - accuracy: 0.9181 - val_loss: 0.5494 - val_accuracy: 0.8331 Epoch 7/10 200/200 [==============================] - 1s 3ms/step - loss: 0.3779 - accuracy: 0.9287 - val_loss: 0.5293 - val_accuracy: 0.8388 Epoch 8/10 200/200 [==============================] - 1s 3ms/step - loss: 0.3446 - accuracy: 0.9361 - val_loss: 0.5137 - val_accuracy: 0.8400 Epoch 9/10 200/200 [==============================] - 1s 3ms/step - loss: 0.3164 - accuracy: 0.9430 - val_loss: 0.5014 - val_accuracy: 0.8381 Epoch 10/10 200/200 [==============================] - 1s 3ms/step - loss: 0.2920 - accuracy: 0.9495 - val_loss: 0.4916 - val_accuracy: 0.8388
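The loss used above, SparseCategoricalCrossentropy(from_logits=True), takes an integer label and the raw scores (logits) produced by the final Dense layer. The following plain-Python sketch computes the same quantity for a single example; the helper functions and the logits are made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, logits):
    """Negative log-probability assigned to the true integer label."""
    return -math.log(softmax(logits)[label])

logits = [0.5, 2.0, -1.0, 0.0]  # made-up scores for csharp, java, javascript, python
probs = softmax(logits)
loss_correct = sparse_categorical_crossentropy(1, logits)  # true label has the top score
loss_wrong = sparse_categorical_crossentropy(2, logits)    # true label has the lowest score
print(loss_correct < loss_wrong)  # True

# A model that scores all four classes equally has a loss of ln(4), about 1.386.
uniform_loss = sparse_categorical_crossentropy(0, [0.0, 0.0, 0.0, 0.0])
```

The value ln(4) is a useful reference point when reading the training log: a loss below it means the model is doing better than uniform guessing over four classes.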
Next, you will use the 'int' vectorized layer to build a 1D ConvNet:
def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model
# `vocab_size` is `VOCAB_SIZE + 1` since `0` is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
loss=losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer='adam',
metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)
Epoch 1/5 200/200 [==============================] - 9s 5ms/step - loss: 1.1471 - accuracy: 0.5016 - val_loss: 0.7856 - val_accuracy: 0.6913 Epoch 2/5 200/200 [==============================] - 1s 3ms/step - loss: 0.6378 - accuracy: 0.7550 - val_loss: 0.5494 - val_accuracy: 0.8056 Epoch 3/5 200/200 [==============================] - 1s 3ms/step - loss: 0.3900 - accuracy: 0.8764 - val_loss: 0.4845 - val_accuracy: 0.8206 Epoch 4/5 200/200 [==============================] - 1s 3ms/step - loss: 0.2234 - accuracy: 0.9447 - val_loss: 0.4819 - val_accuracy: 0.8188 Epoch 5/5 200/200 [==============================] - 1s 3ms/step - loss: 0.1146 - accuracy: 0.9809 - val_loss: 0.5038 - val_accuracy: 0.8150
Compare the two models:
print("Linear model on binary vectorized data:")
print(binary_model.summary())
Linear model on binary vectorized data: Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 4) 40004 ================================================================= Total params: 40,004 Trainable params: 40,004 Non-trainable params: 0 _________________________________________________________________ None
print("ConvNet model on int vectorized data:")
print(int_model.summary())
ConvNet model on int vectorized data:
Model: "sequential_1"
_____________________________________________________________________________
 Layer (type)                               Output Shape          Param #
=============================================================================
 embedding (Embedding)                      (None, None, 64)      640064
 conv1d (Conv1D)                            (None, None, 64)      20544
 global_max_pooling1d (GlobalMaxPooling1D)  (None, 64)            0
 dense_1 (Dense)                            (None, 4)             260
=============================================================================
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_____________________________________________________________________________
None
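The parameter counts in the two summaries can be checked by hand. The sketch below redoes the arithmetic in plain Python; it assumes VOCAB_SIZE is 10000 (inferred from the summaries, since the constant is defined earlier in the tutorial) and uses the layer sizes passed to create_model.

```python
# Assumption: VOCAB_SIZE = 10000, inferred from the summaries above.
VOCAB_SIZE = 10000
EMBED_DIM = 64     # Embedding output size in create_model
FILTERS = 64       # Conv1D filters in create_model
KERNEL_SIZE = 5    # Conv1D kernel size in create_model
NUM_LABELS = 4

# Linear model: one Dense layer on a VOCAB_SIZE-wide input vector.
binary_params = VOCAB_SIZE * NUM_LABELS + NUM_LABELS

# ConvNet: the embedding has VOCAB_SIZE + 1 rows (one extra for padding id 0).
embedding_params = (VOCAB_SIZE + 1) * EMBED_DIM
conv_params = KERNEL_SIZE * EMBED_DIM * FILTERS + FILTERS  # weights + biases
dense_params = FILTERS * NUM_LABELS + NUM_LABELS
int_params = embedding_params + conv_params + dense_params

print(binary_params)                                  # 40004
print(embedding_params, conv_params, dense_params)    # 640064 20544 260
print(int_params)                                     # 660868
```

Almost all of the ConvNet's parameters sit in the embedding table, which is why it is so much larger than the linear model.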
Evaluate both models on the test data:
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)
print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))
250/250 [==============================] - 1s 3ms/step - loss: 0.5178 - accuracy: 0.8151
250/250 [==============================] - 1s 2ms/step - loss: 0.5262 - accuracy: 0.8073
Binary model accuracy: 81.51%
Int model accuracy: 80.73%
Export the model
In the code above, you applied tf.keras.layers.TextVectorization to the dataset before feeding text to the model. If you want your model to be able to process raw strings (for example, to simplify deploying it), you can include the TextVectorization layer inside your model.
To do so, you can create a new model using the weights you have just trained:
export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')])
export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])
# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
# Report the accuracy of the exported model that was just evaluated.
print("Accuracy: {:2.2%}".format(accuracy))
250/250 [==============================] - 1s 4ms/step - loss: 0.5178 - accuracy: 0.8151
Accuracy: 81.51%
Now your model can take raw strings as input and predict a score for each label using Model.predict. Define a function to find the label with the maximum score:
def get_string_labels(predicted_scores_batch):
    predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
    predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
    return predicted_labels
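As a rough illustration of what the argmax-then-gather step does, here is a plain-Python equivalent that runs without TensorFlow. The class order in CLASS_NAMES is an assumption (alphanumerical order of the four tag directories), the scores are invented, and get_string_labels_py is a hypothetical helper name.

```python
# Assumption: class names in alphanumerical order, as a stand-in for
# `raw_train_ds.class_names`.
CLASS_NAMES = ["csharp", "java", "javascript", "python"]

def get_string_labels_py(predicted_scores_batch):
    """Return the class name with the highest score for each row."""
    labels = []
    for scores in predicted_scores_batch:
        # argmax: index of the largest score in this row.
        best_index = max(range(len(scores)), key=lambda i: scores[i])
        # gather: look the index up in the list of class names.
        labels.append(CLASS_NAMES[best_index])
    return labels

# Made-up scores for two questions (one row per question).
scores_batch = [
    [0.10, 0.20, 0.05, 0.90],
    [0.30, 0.80, 0.40, 0.10],
]
labels = get_string_labels_py(scores_batch)
print(labels)  # ['python', 'java']
```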
Run inference on new data
inputs = [
    "how do I extract keys from a dict into a list?",  # 'python'
    "debug public static void main(string[] args) {...}",  # 'java'
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
    print("Question: ", input)
    print("Predicted label: ", label.numpy())
Question: how do I extract keys from a dict into a list?
Predicted label: b'python'
Question: debug public static void main(string[] args) {...}
Predicted label: b'java'
Including the text preprocessing logic inside your model lets you export a model for production that simplifies deployment and reduces the potential for train/test skew.
There is a performance difference to keep in mind when choosing where to apply tf.keras.layers.TextVectorization. Using it outside of your model lets you do asynchronous CPU processing and buffering of your data when training on GPU. So, if you are training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the TextVectorization layer inside your model when you are ready to prepare for deployment.
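The idea of preprocessing on the CPU while the training loop consumes ready batches can be sketched without TensorFlow. The following plain-Python illustration uses a background thread and a bounded queue as a stand-in for that kind of asynchronous buffering; prefetch and vectorize are hypothetical helpers, not TensorFlow APIs, and the "vectorization" is a toy.

```python
import queue
import threading

def prefetch(iterable, buffer_size):
    """Yield items from `iterable`, produced ahead of time by a worker thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the input

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item

def vectorize(text):
    # Toy preprocessing step: replace each word by its length.
    return [len(word) for word in text.split()]

raw_texts = ["a bb ccc", "dd e", "ffff"]
# The generator expression runs inside the worker thread, so preprocessing
# overlaps with whatever the consumer does with each item.
processed = list(prefetch((vectorize(t) for t in raw_texts), buffer_size=2))
print(processed)  # [[1, 2, 3], [2, 1], [4]]
```

When the preprocessing lives inside the model instead, it runs as part of each forward pass and cannot be overlapped this way.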
Visit the Save and load models tutorial to learn more about saving models.
Example 2: Predict the author of Iliad translations
The following provides an example of using tf.data.TextLineDataset to load examples from text files, and TensorFlow Text to preprocess the data. You will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.
Download and explore the dataset
The texts of the three translations are by William Cowper, Edward, Earl of Derby, and Samuel Butler.
The text files used in this tutorial have undergone some typical preprocessing tasks, such as removing the document header and footer, line numbers, and chapter titles.
Download these lightly munged files locally:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']
for name in FILE_NAMES:
    text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
819200/815980 [==============================] - 0s 0us/step
827392/815980 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
811008/809730 [==============================] - 0s 0us/step
819200/809730 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt
811008/807992 [==============================] - 0s 0us/step
819200/807992 [==============================] - 0s 0us/step
[PosixPath('/home/kbuilder/.keras/datasets/derby.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/butler.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/cowper.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/fashion-mnist'),
 PosixPath('/home/kbuilder/.keras/datasets/mnist.npz')]
Load the dataset
Previously, with tf.keras.utils.text_dataset_from_directory, all contents of a file were treated as a single example. Here, you will use tf.data.TextLineDataset, which is designed to create a tf.data.Dataset from a text file where each example is a line of text from the original file. TextLineDataset is useful for text data that is primarily line-based (for example, poetry or error logs).
Iterate through these files, loading each one into its own dataset. Each example needs to be individually labeled, so use Dataset.map to apply a labeler function to each one. This will iterate over every example in the dataset, returning (example, label) pairs.
def labeler(example, index):
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)
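The effect of this loop can be pictured with a small plain-Python stand-in that runs without TensorFlow: every line of a file becomes one example, and its label is the position of that file in the list of file names. The in-memory "files" below are hypothetical, each holding only a line or two quoted from the sample output later in this tutorial.

```python
# Hypothetical in-memory stand-ins for the three downloaded files.
fake_files = [
    ("cowper.txt", "\"From morn to eve I fell, a summer's day,\"\n"
                   "To that distinction, Nestor's son, whom yet\n"),
    ("derby.txt", "Beneath the yoke the flying coursers led.\n"),
    ("butler.txt", "defence of their ships. Thus would any seer who was expert in these\n"),
]

def labeler(example, index):
    # Pair each line with the index of the file it came from.
    return example, index

labeled_lines = []
for index, (file_name, contents) in enumerate(fake_files):
    # One example per line, like a line-based dataset.
    for line in contents.splitlines():
        labeled_lines.append(labeler(line, index))

for text, label in labeled_lines:
    print(label, text)
```

Because the label is just the file index, label 0 means Cowper, 1 means Derby, and 2 means Butler, following the order of FILE_NAMES.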
Next, you will combine these labeled datasets into a single dataset using Dataset.concatenate, and shuffle it with Dataset.shuffle:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
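The buffer size matters here because the three files are concatenated one after another. The plain-Python sketch below follows the buffer-based scheme that the tf.data documentation describes for Dataset.shuffle (fill a buffer, emit a random element, refill the freed slot); it is an illustration under that assumption, not TensorFlow's implementation, and buffered_shuffle is a hypothetical name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        # Emit a random buffered element and reuse its slot for the new item.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Input exhausted: emit the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer

stream = list(range(10))
shuffled = list(buffered_shuffle(stream, buffer_size=4))
unshuffled = list(buffered_shuffle(stream, buffer_size=1))

print(shuffled)
print(unshuffled)  # a buffer of one element cannot reorder anything
```

An element can only be emitted once it has entered the buffer, so a buffer much smaller than the dataset leaves lines from the first file clustered at the start. A buffer at least as large as the dataset gives a full shuffle.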
Print out a few examples as before. The dataset has not been batched yet, hence each entry in all_labeled_data corresponds to one data point:
for text, label in all_labeled_data.take(10):
    print("Sentence: ", text.numpy())
    print("Label:", label.numpy())
Sentence: b'Beneath the yoke the flying coursers led.'
Label: 1
Sentence: b'Too free a range, and watchest all I do;'
Label: 1
Sentence: b'defence of their ships. Thus would any seer who was expert in these'
Label: 2
Sentence: b'"From morn to eve I fell, a summer\'s day,"'
Label: 0
Sentence: b'went to the city bearing a message of peace to the Cadmeians; on his'
Label: 2
Sentence: b'darkness of the flying night, and tell it to Agamemnon. This might'
Label: 2
Sentence: b"To that distinction, Nestor's son, whom yet"
Label: 0
Sentence: b'A sounder judge of honour and disgrace:'
Label: 1
Sentence: b'He wept as he spoke, and the elders sighed in concert as each thought'
Label: 2
Sentence: b'to gather his bones for the silt in which I shall have hidden him, and'
Label: 2
Prepare the dataset for training
Instead of using tf.keras.layers.TextVectorization to preprocess the text dataset, you will now use the TensorFlow Text APIs to standardize and tokenize the data, build a vocabulary, and use tf.lookup.StaticVocabularyTable to map tokens to integers to feed to the model. (Learn more about TensorFlow Text.)
Define a function to convert the text to lower-case and tokenize it:
- TensorFlow Text provides various tokenizers. In this example, you will use text.UnicodeScriptTokenizer to tokenize the dataset.
- You will use Dataset.map to apply the tokenization to the dataset.
tokenizer = tf_text.UnicodeScriptTokenizer()

def tokenize(text, unused_label):
    lower_case = tf_text.case_fold_utf8(text)
    return tokenizer.tokenize(lower_case)

tokenized_ds = all_labeled_data.map(tokenize)
You can iterate over the dataset and print out a few tokenized examples:
for text_batch in tokenized_ds.take(5):
    print("Tokens: ", text_batch.numpy())
Tokens: [b'beneath' b'the' b'yoke' b'the' b'flying' b'coursers' b'led' b'.']
Tokens: [b'too' b'free' b'a' b'range' b',' b'and' b'watchest' b'all' b'i' b'do' b';']
Tokens: [b'defence' b'of' b'their' b'ships' b'.' b'thus' b'would' b'any' b'seer' b'who' b'was' b'expert' b'in' b'these']
Tokens: [b'"' b'from' b'morn' b'to' b'eve' b'i' b'fell' b',' b'a' b'summer' b"'" b's' b'day' b',"']
Tokens: [b'went' b'to' b'the' b'city' b'bearing' b'a' b'message' b'of' b'peace' b'to' b'the' b'cadmeians' b';' b'on' b'his']
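To get a feel for how these tokens come about, the sketch below approximates the case folding and splitting with the standard library only. It is an ASCII-only approximation and not the UnicodeScriptTokenizer algorithm: it treats runs of ASCII letters as one kind of token and runs of any other non-space characters as another. simple_tokenize is a hypothetical helper.

```python
import re

# ASCII-only approximation: a token is either a run of letters or a run of
# other non-whitespace characters (punctuation, digits).
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|[^A-Za-z\s]+")

def simple_tokenize(text):
    """Case-fold the text, then split it into letter and non-letter runs."""
    return TOKEN_PATTERN.findall(text.casefold())

first = simple_tokenize("Beneath the yoke the flying coursers led.")
second = simple_tokenize("\"From morn to eve I fell, a summer's day,\"")

print(first)
print(second)
```

On these two lines the approximation yields the same tokens as the output shown above, including the combined `,"` token at the end of the quoted line. Text in other scripts would need the real tokenizer.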
Next, you will build a vocabulary by sorting tokens by frequency and keeping the top VOCAB_SIZE tokens:
tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
    for tok in toks:
        vocab_dict[tok] += 1

vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])
Vocab size: 10000
First five vocab entries: [b',', b'the', b'and', b"'", b'of']
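The same frequency-ranked vocabulary can be built on a toy scale with collections.Counter, which may make the counting and truncation steps easier to see. The token lists and the tiny VOCAB_SIZE below are invented for illustration.

```python
from collections import Counter

# Hypothetical, already tokenized lines.
tokenized_lines = [
    ["the", "flying", "coursers", "."],
    ["the", "ships", ",", "the", "city", "."],
    ["of", "the", "ships", "."],
]
VOCAB_SIZE = 3  # tiny on purpose; the tutorial keeps far more tokens

# Count every token across all lines.
counts = Counter(tok for line in tokenized_lines for tok in line)

# Keep the VOCAB_SIZE most frequent tokens, most frequent first.
vocab = [tok for tok, _ in counts.most_common(VOCAB_SIZE)]

print(counts["the"], counts["."], counts["ships"])  # 4 3 2
print(vocab)  # ['the', '.', 'ships']
```

Every token that does not make the cut, such as "city" here, is later treated as out-of-vocabulary.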
To convert the tokens into integers, use the vocab set to create a tf.lookup.StaticVocabularyTable. You will map tokens to integers in the half-open range [2, vocab_size + 2), that is, from 2 up to and including vocab_size + 1. As with the TextVectorization layer, 0 is reserved to denote padding and 1 is reserved to denote an out-of-vocabulary (OOV) token.
keys = vocab
values = range(2, len(vocab) + 2)  # Reserve `0` for padding, `1` for OOV tokens.

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
Finally, define a function to standardize, tokenize, and vectorize the dataset using the tokenizer and lookup table:
def preprocess_text(text, label):
    standardized = tf_text.case_fold_utf8(text)
    tokenized = tokenizer.tokenize(standardized)
    vectorized = vocab_table.lookup(tokenized)
    return vectorized, label
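The id scheme described above (0 for padding, 1 for OOV, vocabulary tokens from 2 upward) can be written down as an ordinary dictionary lookup. This is a plain-Python sketch of the scheme as the text describes it; it makes no claim about how tf.lookup.StaticVocabularyTable assigns bucket ids internally, and the vocabulary is a toy one.

```python
PAD_ID = 0  # reserved for padding
OOV_ID = 1  # reserved for out-of-vocabulary tokens

# Hypothetical vocabulary, most frequent token first.
vocab = ["the", ".", "ships"]

# Vocabulary tokens receive ids 2, 3, 4, ... in frequency order.
token_to_id = {tok: idx for idx, tok in enumerate(vocab, start=2)}

def vectorize(tokens):
    """Map each token to its id, falling back to OOV_ID for unknown tokens."""
    return [token_to_id.get(tok, OOV_ID) for tok in tokens]

ids = vectorize(["the", "ships", "sailed", "."])
print(token_to_id)  # {'the': 2, '.': 3, 'ships': 4}
print(ids)          # [2, 4, 1, 3]
```

The largest id is len(vocab) + 1, which is why the embedding layer later needs room for two ids more than the vocabulary holds.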
You can try this on a single example to print the output:
example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())
Sentence: b'Beneath the yoke the flying coursers led.'
Vectorized sentence: [234 3 811 3 446 749 248 7]
Now run the preprocess function on the dataset using Dataset.map:
all_encoded_data = all_labeled_data.map(preprocess_text)
Split the dataset into training and test sets
The Keras TextVectorization layer also batches and pads the vectorized data. Padding is required because the examples inside of a batch need to be the same size and shape, but the examples in these datasets are not all the same size: each line of text has a different number of words.
tf.data.Dataset supports splitting and padded-batching datasets:
train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)
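What padded batching does to variable-length sequences can be shown in a few lines of plain Python. The sketch pads every sequence in a batch with zeros up to the longest sequence in that same batch; padded_batches is a hypothetical helper and the id lists are toy data.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its longest member."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (longest - len(seq)) for seq in batch]

# Toy vectorized lines of different lengths.
encoded = [[234, 3, 811], [7], [3, 446, 749, 248, 7], [2, 2]]
batches = list(padded_batches(encoded, batch_size=2))

for batch in batches:
    print(batch)
# [[234, 3, 811], [7, 0, 0]]
# [[3, 446, 749, 248, 7], [2, 2, 0, 0, 0]]
```

Each batch is rectangular, but different batches may have different widths, since padding only goes up to the longest line within a batch.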
Now, validation_data and train_data are not collections of (example, label) pairs, but collections of batches. Each batch is a pair of (many examples, many labels) represented as arrays.
To illustrate this:
sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])
Text batch shape: (64, 18)
Label batch shape: (64,)
First text example: tf.Tensor([234 3 811 3 446 749 248 7 0 0 0 0 0 0 0 0 0 0], shape=(18,), dtype=int64)
First label example: tf.Tensor(1, shape=(), dtype=int64)
Since you use 0 for padding and 1 for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two:
vocab_size += 2
Configure the datasets for better performance as before:
train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)
Train the model
You can train a model on this dataset as before:
model = create_model(vocab_size=vocab_size, num_labels=3)
model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
history = model.fit(train_data, validation_data=validation_data, epochs=3)
Epoch 1/3
697/697 [==============================] - 27s 9ms/step - loss: 0.5238 - accuracy: 0.7658 - val_loss: 0.3814 - val_accuracy: 0.8306
Epoch 2/3
697/697 [==============================] - 3s 4ms/step - loss: 0.2852 - accuracy: 0.8847 - val_loss: 0.3697 - val_accuracy: 0.8428
Epoch 3/3
697/697 [==============================] - 3s 4ms/step - loss: 0.1924 - accuracy: 0.9279 - val_loss: 0.3917 - val_accuracy: 0.8424
loss, accuracy = model.evaluate(validation_data)
print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
79/79 [==============================] - 1s 2ms/step - loss: 0.3917 - accuracy: 0.8424
Loss: 0.391705721616745
Accuracy: 84.24%
Export the model
To make the model capable of taking raw strings as input, you will create a Keras TextVectorization layer that performs the same steps as your custom preprocessing function. Since you have already trained a vocabulary, you can use TextVectorization.set_vocabulary (instead of TextVectorization.adapt, which trains a new vocabulary).
preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
preprocess_layer.set_vocabulary(vocab)
export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])
export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])
# Create a test dataset of raw strings.
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)
loss, accuracy = export_model.evaluate(test_ds)
print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
2022-02-05 02:26:40.203675: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: sequential_4/text_vectorization_2/UnicodeScriptTokenize/Assert_1/AssertGuard/branch_executed/_185
79/79 [==============================] - 6s 8ms/step - loss: 0.4955 - accuracy: 0.7964
Loss: 0.4955357015132904
Accuracy: 79.64%
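The loss and accuracy for the model on the encoded validation set and for the exported model on the raw validation set are expected to be the same. In the run shown above, however, they differ: the exported model reports a loss of 0.4955 and an accuracy of 79.64%, compared with 0.3917 and 84.24% for the model on the encoded validation set.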
Run inference on new data
inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]
predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)
for input, label in zip(inputs, predicted_labels):
    print("Question: ", input)
    print("Predicted label: ", label.numpy())
2022-02-05 02:26:43.328949: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: sequential_4/text_vectorization_2/UnicodeScriptTokenize/Assert_1/AssertGuard/branch_executed/_185
Question: Join'd to th' Ionians with their flowing robes,
Predicted label: 1
Question: the allies, and his armour flashed about him so that he seemed to all
Predicted label: 2
Question: And with loud clangor of his arms he fell.
Predicted label: 0
Download more datasets using TensorFlow Datasets (TFDS)
You can download many more datasets from TensorFlow Datasets.
In this example, you will use the IMDB Large Movie Review dataset to train a model for sentiment classification:
# Training set.
train_ds = tfds.load(
    'imdb_reviews',
    split='train[:80%]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)
# Validation set.
val_ds = tfds.load(
    'imdb_reviews',
    split='train[80%:]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)
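The two split strings carve the original training split into an 80% training part and a 20% validation part. As a quick sanity check, the arithmetic below works out how many examples and batches that gives, assuming the imdb_reviews training split holds 25,000 reviews and using the BATCH_SIZE of 64 defined earlier.

```python
import math

NUM_TRAIN_EXAMPLES = 25000  # assumed size of the imdb_reviews 'train' split
BATCH_SIZE = 64             # as defined earlier in this tutorial

# 'train[:80%]' keeps the first 80%; 'train[80%:]' keeps the rest.
train_examples = NUM_TRAIN_EXAMPLES * 80 // 100
val_examples = NUM_TRAIN_EXAMPLES - train_examples

# The last batch may be smaller than BATCH_SIZE, hence the ceiling.
train_steps = math.ceil(train_examples / BATCH_SIZE)
val_steps = math.ceil(val_examples / BATCH_SIZE)

print(train_examples, val_examples)  # 20000 5000
print(train_steps, val_steps)        # 313 79
```

Under that assumption, a training epoch takes 313 steps and a validation pass takes 79.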
Print a few examples:
for review_batch, label_batch in val_ds.take(1):
    for i in range(5):
        print("Review: ", review_batch[i].numpy())
        print("Label: ", label_batch[i].numpy())
Review: b"Instead, go to the zoo, buy some peanuts and feed 'em to the monkeys. Monkeys are funny. People with amnesia who don't say much, just sit there with vacant eyes are not all that funny.<br /><br />Black comedy? There isn't a black person in it, and there isn't one funny thing in it either.<br /><br />Walmart buys these things up somehow and puts them on their dollar rack. It's labeled Unrated. I think they took out the topless scene. They may have taken out other stuff too, who knows? All we know is that whatever they took out, isn't there any more.<br /><br />The acting seemed OK to me. There's a lot of unfathomables tho. It's supposed to be a city? It's supposed to be a big lake? If it's so hot in the church people are fanning themselves, why are they all wearing coats?" Label: 0 Review: b'Well, was Morgan Freeman any more unusual as God than George Burns? This film sure was better than that bore, "Oh, God". I was totally engrossed and LMAO all the way through. Carrey was perfect as the out of sorts anchorman wannabe, and Aniston carried off her part as the frustrated girlfriend in her usual well played performance. I, for one, don\'t consider her to be either ugly or untalented. I think my favorite scene was when Carrey opened up the file cabinet thinking it could never hold his life history. See if you can spot the file in the cabinet that holds the events of his bathroom humor: I was rolling over this one. Well written and even better played out, this comedy will go down as one of this funnyman\'s best.' Label: 1 Review: b'I remember stumbling upon this special while channel-surfing in 1965. I had never heard of Barbra before. When the show was over, I thought "This is probably the best thing on TV I will ever see in my life." 42 years later, that has held true. There is still nothing so amazing, so honestly astonishing as the talent that was displayed here. 
You can talk about all the super-stars you want to, this is the most superlative of them all!<br /><br />You name it, she can do it. Comedy, pathos, sultry seduction, ballads, Barbra is truly a story-teller. Her ability to pull off anything she attempts is legendary. But this special was made in the beginning, and helped to create the legend that she quickly became. In spite of rising so far in such a short time, she has fulfilled the promise, revealing more of her talents as she went along. But they are all here from the very beginning. You will not be disappointed in viewing this.' Label: 1 Review: b"Firstly, I would like to point out that people who have criticised this film have made some glaring errors. Anything that has a rating below 6/10 is clearly utter nonsense.<br /><br />Creep is an absolutely fantastic film with amazing film effects. The actors are highly believable, the narrative thought provoking and the horror and graphical content extremely disturbing. <br /><br />There is much mystique in this film. Many questions arise as the audience are revealed to the strange and freakish creature that makes habitat in the dark rat ridden tunnels. How was 'Craig' created and what happened to him?<br /><br />A fantastic film with a large chill factor. A film with so many unanswered questions and a film that needs to be appreciated along with others like 28 Days Later, The Bunker, Dog Soldiers and Deathwatch.<br /><br />Look forward to more of these fantastic films!!" Label: 1 Review: b"I'm sorry but I didn't like this doc very much. I can think of a million ways it could have been better. The people who made it obviously don't have much imagination. The interviews aren't very interesting and no real insight is offered. The footage isn't assembled in a very informative way, either. It's too bad because this is a movie that really deserves spellbinding special features. One thing I'll say is that Isabella Rosselini gets more beautiful the older she gets. 
All considered, this only gets a '4.'" Label: 0
You can now preprocess the data and train a model as before.
Prepare the dataset for training
vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)

def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label

train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)
# Configure datasets for performance as before.
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)
Create, configure, and train the model
model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()
Model: "sequential_5"
_______________________________________________________________________________
 Layer (type)                                 Output Shape          Param #
===============================================================================
 embedding_2 (Embedding)                      (None, None, 64)      640064
 conv1d_2 (Conv1D)                            (None, None, 64)      20544
 global_max_pooling1d_2 (GlobalMaxPooling1D)  (None, 64)            0
 dense_3 (Dense)                              (None, 1)             65
===============================================================================
Total params: 660,673
Trainable params: 660,673
Non-trainable params: 0
_______________________________________________________________________________
model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = model.fit(train_ds, validation_data=val_ds, epochs=3)
Epoch 1/3
313/313 [==============================] - 3s 7ms/step - loss: 0.5417 - accuracy: 0.6618 - val_loss: 0.3752 - val_accuracy: 0.8244
Epoch 2/3
313/313 [==============================] - 1s 4ms/step - loss: 0.2996 - accuracy: 0.8680 - val_loss: 0.3165 - val_accuracy: 0.8632
Epoch 3/3
313/313 [==============================] - 1s 4ms/step - loss: 0.1845 - accuracy: 0.9276 - val_loss: 0.3217 - val_accuracy: 0.8674
loss, accuracy = model.evaluate(val_ds)
print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
79/79 [==============================] - 0s 2ms/step - loss: 0.3217 - accuracy: 0.8674
Loss: 0.32172858715057373
Accuracy: 86.74%
Export the model
export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')])
# The model has a single sigmoid output, so use a binary loss here.
export_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])
# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]
predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]
for input, label in zip(inputs, predicted_labels):
    print("Question: ", input)
    print("Predicted label: ", label)
Question: This is a fantastic movie.
Predicted label: 1
Question: This is a bad movie.
Predicted label: 0
Question: This movie was so bad that it was good.
Predicted label: 0
Question: I will never say yes to watching this movie.
Predicted label: 0
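The step from a raw model output to a 0 or 1 label can be traced by hand. The plain-Python sketch below applies a sigmoid to a few hypothetical logits and thresholds the resulting probability at 0.5; the logit values are invented and predict_label is a hypothetical helper.

```python
import math

def sigmoid(logit):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

def predict_label(logit, threshold=0.5):
    """Return 1 (positive review) if the probability exceeds the threshold."""
    return int(sigmoid(logit) > threshold)

# Hypothetical raw outputs of the model before the sigmoid activation.
hypothetical_logits = [3.2, -2.1, -0.4, 0.0]
probabilities = [sigmoid(z) for z in hypothetical_logits]
labels = [predict_label(z) for z in hypothetical_logits]

print([round(p, 3) for p in probabilities])
print(labels)  # [1, 0, 0, 0]
```

A logit of exactly 0 gives a probability of 0.5, which this sketch counts as negative. Python's built-in round(0.5) also returns 0, so rounding the probability gives the same label in that edge case.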
Conclusion
This tutorial demonstrated several ways to load and preprocess text. As a next step, you can explore additional TensorFlow Text tutorials on text preprocessing.
You can also find new datasets on TensorFlow Datasets. And, to learn more about tf.data, check out the guide on building input pipelines.