This tutorial provides an example of how to load a pandas DataFrame into a tf.data.Dataset.
This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV contains several hundred rows. Each row describes a patient and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task.
Read data using pandas
from __future__ import absolute_import, division, print_function, unicode_literals

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

import pandas as pd
import tensorflow as tf
Download the CSV file containing the heart dataset.
csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
Downloading data from https://storage.googleapis.com/applied-dl/heart.csv 16384/13273 [=====================================] - 0s 0us/step
Read the CSV file using pandas.
df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object
Convert the thal column, which has the object dtype in the dataframe, to a discrete numerical value.
df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
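The conversion above can be illustrated on a toy column. This is a minimal sketch: the string values are hypothetical and are not taken from the heart dataset. It shows that pd.Categorical sorts the distinct values into categories and that the codes are the positions in that sorted list, with missing values encoded as -1.

```python
import pandas as pd

# Hypothetical values; the real thal column may contain other strings.
thal = pd.Series(["normal", "fixed", "reversible", "normal"])

categorical = pd.Categorical(thal)
print(list(categorical.categories))  # distinct values, sorted
print(categorical.codes.tolist())    # position of each row's value in that list

# Missing values have no category, so they receive the code -1.
with_missing = pd.Categorical(["normal", None, "fixed"])
print(with_missing.codes.tolist())
```

Because the codes follow sort order rather than any medical meaning, the resulting integers are labels and carry no natural ordering.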
Load data using tf.data.Dataset
Use tf.data.Dataset.from_tensor_slices to read the values from a pandas dataframe.
One of the advantages of using tf.data.Dataset is that it allows you to write simple, highly efficient data pipelines. Read the loading data guide to find out more.
target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
    print('Features: {}, Target: {}'.format(feat, targ))
Features: [ 63. 1. 1. 145. 233. 1. 2. 150. 0. 2.3 3. 0. 2. ], Target: 0
Features: [ 67. 1. 4. 160. 286. 0. 2. 108. 1. 1.5 2. 3. 3. ], Target: 1
Features: [ 67. 1. 4. 120. 229. 0. 2. 129. 1. 2.6 2. 2. 4. ], Target: 0
Features: [ 37. 1. 3. 130. 250. 0. 0. 187. 0. 3.5 3. 0. 3. ], Target: 0
Features: [ 41. 0. 2. 130. 204. 0. 2. 172. 0. 1.4 1. 0. 3. ], Target: 0
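A smaller version of the same slicing may make the element structure easier to see. The sketch below uses a hypothetical three-row frame, not the heart data. It shows that the values attribute yields a single homogeneous array, so integer columns are upcast to float64 when a float column is present (which is why the features above print with trailing decimal points), and that each dataset element is one (row, label) pair.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical three-row frame with one float column and two integer columns.
toy = pd.DataFrame({
    "age": [63, 67, 41],
    "oldpeak": [2.3, 1.5, 1.4],
    "target": [0, 1, 0],
})
labels = toy.pop("target")

# .values returns one homogeneous NumPy array, so the integers become float64.
features = toy.values
toy_dataset = tf.data.Dataset.from_tensor_slices((features, labels.values))

# Slicing happens along the first axis: one element per row.
feature_spec, label_spec = toy_dataset.element_spec
print(feature_spec.shape, feature_spec.dtype)
print(len(list(toy_dataset)))
```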
Since a pd.Series implements the __array__ protocol, it can be used transparently nearly anywhere you would use a np.array or a tf.Tensor.
tf.constant(df['thal'])
<tf.Tensor: shape=(303,), dtype=int8, numpy= array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4, 2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4, 3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2, 4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3, 3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2, 4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4, 3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int8)>
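The same protocol can be observed with NumPy alone. The following sketch, on a hypothetical three-element series, converts it with np.asarray, which relies on the object's __array__ method.

```python
import numpy as np
import pandas as pd

# Hypothetical short series standing in for df['thal'].
series = pd.Series([2, 3, 4], name="thal")

# The Series exposes __array__, which array-consuming libraries call.
print(hasattr(series, "__array__"))

# np.asarray uses that method to obtain a plain ndarray.
as_array = np.asarray(series)
print(type(as_array).__name__, as_array.tolist())
```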
Shuffle and batch the dataset.
train_dataset = dataset.shuffle(len(df)).batch(1)
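The line above shuffles with a buffer as large as the dataframe and then uses a batch size of 1. As a rough illustration of what the two calls do, the sketch below applies them to ten integers (a hypothetical stand-in for patient rows) with a batch size of 4, so that the grouping is visible.

```python
import tensorflow as tf

# Hypothetical stand-in: ten integers instead of patient rows.
numbers = tf.data.Dataset.range(10)

# A buffer as large as the dataset lets every element land in any position;
# batch(4) then groups consecutive shuffled elements.
batched = numbers.shuffle(buffer_size=10).batch(4)

batch_sizes = [int(batch.shape[0]) for batch in batched]
print(batch_sizes)  # the last batch holds the remainder

# Shuffling reorders elements but never drops or duplicates them.
seen = sorted(int(v) for batch in batched for v in batch.numpy())
print(seen)
```

With a buffer smaller than the dataset, shuffling is only local, which is why the tutorial passes len(df).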
Create and train a model
def get_compiled_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(1)
    ])

    model.compile(optimizer='adam',
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    return model
model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Epoch 1/15
WARNING:tensorflow:Layer dense is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2. The layer has dtype float32 because its dtype defaults to floatx. If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2. To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.
303/303 [==============================] - 1s 2ms/step - loss: 3.3850 - accuracy: 0.6964
Epoch 2/15
303/303 [==============================] - 1s 2ms/step - loss: 1.8797 - accuracy: 0.6931
Epoch 3/15
303/303 [==============================] - 1s 2ms/step - loss: 1.3348 - accuracy: 0.7063
Epoch 4/15
303/303 [==============================] - 1s 2ms/step - loss: 1.5040 - accuracy: 0.6997
Epoch 5/15
303/303 [==============================] - 1s 2ms/step - loss: 1.0072 - accuracy: 0.7393
Epoch 6/15
303/303 [==============================] - 1s 2ms/step - loss: 0.8372 - accuracy: 0.7822
Epoch 7/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7832 - accuracy: 0.7888
Epoch 8/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7457 - accuracy: 0.7921
Epoch 9/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6368 - accuracy: 0.7789
Epoch 10/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7353 - accuracy: 0.7756
Epoch 11/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6158 - accuracy: 0.8218
Epoch 12/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5253 - accuracy: 0.7954
Epoch 13/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7066 - accuracy: 0.7921
Epoch 14/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6731 - accuracy: 0.7921
Epoch 15/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7600 - accuracy: 0.7756
<tensorflow.python.keras.callbacks.History at 0x7f3f5f32c710>
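Because the final Dense(1) layer has no activation and the loss is built with from_logits=True, the model outputs raw logits rather than probabilities. The sketch below shows one way to turn such outputs into probabilities and class labels with a sigmoid. The logit values are hypothetical, and the 0.5 threshold is an assumption for illustration, not something the tutorial prescribes.

```python
import numpy as np

def sigmoid(logits):
    """Map raw logits to probabilities in the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical logits, shaped like model outputs for three patients.
logits = np.array([-2.0, 0.0, 3.0])
probabilities = sigmoid(logits)

# A logit of 0 maps to probability 0.5, so thresholding at 0.5
# is equivalent to checking the sign of the logit.
predicted_labels = (probabilities > 0.5).astype(int)
print(probabilities.round(3), predicted_labels)
```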
Alternative to feature columns
Passing a dictionary as an input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any pre-processing, and stacking them up using the functional API. You can use this as an alternative to feature columns.
inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)
x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1)(x)
model_func = tf.keras.Model(inputs=inputs, outputs=output)
model_func.compile(optimizer='adam',
                   loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                   metrics=['accuracy'])
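The stacking step in the model above may be easier to follow on concrete tensors. This sketch uses two hypothetical feature vectors for a batch of four patients and shows that tf.stack with axis=-1 places each per-feature vector into its own column of a single matrix.

```python
import tensorflow as tf

# Hypothetical batch of four patients with two scalar features each.
columns = {
    "age": tf.constant([63.0, 67.0, 67.0, 37.0]),
    "chol": tf.constant([233.0, 286.0, 229.0, 250.0]),
}

# Stacking on the last axis turns two (4,) vectors into one (4, 2) matrix,
# with one row per patient and one column per feature.
stacked = tf.stack(list(columns.values()), axis=-1)
print(stacked.shape)
print(stacked[0].numpy())
```

The column order follows the dictionary's insertion order, so the same key order has to be used wherever the stacked tensor is interpreted.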
The easiest way to preserve the column structure of a pd.DataFrame when used with tf.data is to convert the pd.DataFrame to a dict and slice that dictionary.
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
    print(dict_slice)
({'age': <tf.Tensor: shape=(16,), dtype=int32, numpy= array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57], dtype=int32)>, 'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: shape=(16,), dtype=int32, numpy= array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130, 120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: shape=(16,), dtype=int32, numpy= array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256, 263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'thalach': <tf.Tensor: shape=(16,), dtype=int32, numpy= array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142, 173, 162, 174], dtype=int32)>, 'exang': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'oldpeak': <tf.Tensor: shape=(16,), dtype=float32, numpy= array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6, 0. , 0.5, 1.6], dtype=float32)>, 'slope': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'ca': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'thal': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>}, <tf.Tensor: shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)
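To see what is being sliced, it can help to look at the dictionary itself. The sketch below, on a hypothetical two-row frame, shows that to_dict('list') maps each column name to a plain Python list. TensorFlow converts Python integers and floats to int32 and float32 by default, which is consistent with the dtypes in the batch printed above.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the heart dataframe.
toy = pd.DataFrame({"age": [63, 67], "oldpeak": [2.3, 1.5]})

# orient='list' maps each column name to a plain Python list of its values.
as_dict = toy.to_dict("list")
print(as_dict)
```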
model_func.fit(dict_slices, epochs=15)
Epoch 1/15
19/19 [==============================] - 0s 2ms/step - loss: 2.8664 - accuracy: 0.6799
Epoch 2/15
19/19 [==============================] - 0s 2ms/step - loss: 1.2796 - accuracy: 0.5842
Epoch 3/15
19/19 [==============================] - 0s 2ms/step - loss: 0.8998 - accuracy: 0.6766
Epoch 4/15
19/19 [==============================] - 0s 3ms/step - loss: 0.8758 - accuracy: 0.6931
Epoch 5/15
19/19 [==============================] - 0s 2ms/step - loss: 0.8052 - accuracy: 0.6964
Epoch 6/15
19/19 [==============================] - 0s 2ms/step - loss: 0.7569 - accuracy: 0.6898
Epoch 7/15
19/19 [==============================] - 0s 2ms/step - loss: 0.7212 - accuracy: 0.6931
Epoch 8/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6975 - accuracy: 0.7063
Epoch 9/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6805 - accuracy: 0.6997
Epoch 10/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6660 - accuracy: 0.7030
Epoch 11/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6535 - accuracy: 0.7096
Epoch 12/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6415 - accuracy: 0.7096
Epoch 13/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6296 - accuracy: 0.7096
Epoch 14/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6207 - accuracy: 0.7129
Epoch 15/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6114 - accuracy: 0.7162
<tensorflow.python.keras.callbacks.History at 0x7f3f8d4789b0>