Overview
Apache ORC is a popular columnar storage format. The tensorflow-io package provides a default implementation for reading Apache ORC files.
Setup
Install the required packages, then restart the runtime:
pip install tensorflow-io
import tensorflow as tf
import tensorflow_io as tfio
Download a sample dataset file in ORC
The dataset you will use here is the Iris Data Set from UCI. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. Each instance has 4 attributes: (1) sepal length, (2) sepal width, (3) petal length, and (4) petal width. The last column contains the class label.
curl -OL https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc
ls -l iris.orc
-rw-rw-r-- 1 kbuilder kokoro 3328 Jul 30 12:26 iris.orc
Create a dataset from the file
dataset = tfio.IODataset.from_orc("iris.orc", capacity=15).batch(1)
2021-07-30 12:26:37.984832: I tensorflow_io/core/kernels/orc/orc_kernels.cc:49] ORC file schema:struct<sepal_length:float,sepal_width:float,petal_length:float,petal_width:float,species:string>
Examine the dataset:
for item in dataset.take(1):
    print(item)
(<tf.Tensor: shape=(1,), dtype=float32, numpy=array([5.1], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([3.5], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.4], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.2], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'setosa'], dtype=object)>)
Let's walk through an end-to-end example of training a tf.keras model on an ORC dataset, using the iris data.
Data preprocessing
Specify which columns are features and which column is the label:
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
label_cols = ["species"]
# select feature columns
feature_dataset = tfio.IODataset.from_orc("iris.orc", columns=feature_cols)
# select label columns
label_dataset = tfio.IODataset.from_orc("iris.orc", columns=label_cols)
2021-07-30 12:26:38.222712: I tensorflow_io/core/kernels/orc/orc_kernels.cc:49] ORC file schema:struct<sepal_length:float,sepal_width:float,petal_length:float,petal_width:float,species:string>
2021-07-30 12:26:38.286470: I tensorflow_io/core/kernels/orc/orc_kernels.cc:49] ORC file schema:struct<sepal_length:float,sepal_width:float,petal_length:float,petal_width:float,species:string>
A lookup table that maps species names to integer class IDs for model training:
vocab_init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["virginica", "versicolor", "setosa"]),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
vocab_table = tf.lookup.StaticVocabularyTable(
    vocab_init,
    num_oov_buckets=4)
label_dataset = label_dataset.map(vocab_table.lookup)
dataset = tf.data.Dataset.zip((feature_dataset, label_dataset))
dataset = dataset.batch(1)
def pack_features_vector(features, labels):
    """Pack the features into a single array."""
    features = tf.stack(list(features), axis=1)
    return features, labels
dataset = dataset.map(pack_features_vector)
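The two preprocessing steps above (label lookup and feature packing) can be sketched in plain Python to show what they produce. This is only an illustration under stated assumptions, not how TensorFlow implements them: the names `lookup_species` and `pack_features` are hypothetical, and `zlib.crc32` stands in for TensorFlow's own hash function, so the exact out-of-vocabulary bucket will differ from what `vocab_table.lookup` returns.

```python
import zlib

# Same mapping as the KeyValueTensorInitializer in the tutorial.
VOCAB = {"virginica": 0, "versicolor": 1, "setosa": 2}
NUM_OOV_BUCKETS = 4


def lookup_species(name):
    """Return the class ID of a known species, or an out-of-vocabulary bucket ID."""
    if name in VOCAB:
        return VOCAB[name]
    # Assumption: crc32 is a stand-in hash chosen for this sketch only.
    # Unknown names land in IDs len(VOCAB) .. len(VOCAB) + NUM_OOV_BUCKETS - 1.
    bucket = zlib.crc32(name.encode("utf-8")) % NUM_OOV_BUCKETS
    return len(VOCAB) + bucket


def pack_features(columns):
    """Turn per-column batches into per-row vectors, like tf.stack(..., axis=1)."""
    return [list(row) for row in zip(*columns)]


# One batch of size 1, using the first row printed earlier in the tutorial.
columns = ([5.1], [3.5], [1.4], [0.2])
print(pack_features(columns))    # [[5.1, 3.5, 1.4, 0.2]]
print(lookup_species("setosa"))  # 2
```

Note that out-of-vocabulary IDs start at 3, which is outside the range of a three-class output layer. The iris data described above has only the three listed species, so this should not arise here, but it is worth checking when reusing the pattern on other data.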
Build, compile and train the model
Finally, you are ready to build the model and train it! You will build a three-layer Keras model to predict the class of the iris plant from the dataset you just processed.
model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(
            10, activation=tf.nn.relu, input_shape=(4,)
        ),
        tf.keras.layers.Dense(10, activation=tf.nn.relu),
        tf.keras.layers.Dense(3),
    ]
)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(dataset, epochs=5)
Epoch 1/5
150/150 [==============================] - 0s 1ms/step - loss: 1.3479 - accuracy: 0.4800
Epoch 2/5
150/150 [==============================] - 0s 920us/step - loss: 0.8355 - accuracy: 0.6000
Epoch 3/5
150/150 [==============================] - 0s 951us/step - loss: 0.6370 - accuracy: 0.7733
Epoch 4/5
150/150 [==============================] - 0s 954us/step - loss: 0.5276 - accuracy: 0.7933
Epoch 5/5
150/150 [==============================] - 0s 940us/step - loss: 0.4766 - accuracy: 0.7933
<tensorflow.python.keras.callbacks.History at 0x7f263b830850>
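Because the last Dense layer has no activation and the loss is configured with from_logits=True, the model outputs raw scores (logits) rather than probabilities. The following dependency-free sketch shows how such a loss can be computed for a single example and how a predicted class is read off the logits. The helper names and the logit values are made up for illustration; this is not the Keras implementation.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities; subtract the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability of the true class, given an integer label."""
    return -math.log(softmax(logits)[label])


# Hypothetical logits for one flower (illustrative values only).
logits = [0.2, -1.0, 2.5]
probs = softmax(logits)

# The predicted class is the index of the largest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted)  # 2

# The loss is smaller when the true label matches the highest-scoring class.
print(sparse_categorical_crossentropy(logits, 2))
print(sparse_categorical_crossentropy(logits, 0))
```

As a reference point, logits that are equal across all three classes give a loss of ln 3 (about 1.0986) regardless of the label.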