Apache ORC Reader

View on TensorFlow.org Run in Google Colab View on GitHub Download notebook

Overview

Apache ORC is a popular columnar storage format. tensorflow-io package provides a default implementation of reading Apache ORC files.

Setup

Install required packages, and restart runtime

pipinstalltensorflow-io
importtensorflowastf
importtensorflow_ioastfio
2021年07月30日 12:26:35.624072: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0

Download a sample dataset file in ORC

The dataset you will use here is the Iris Data Set from UCI. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. It has 4 attributes: (1) sepal length, (2) sepal width, (3) petal length, (4) petal width, and the last column contains the class label.

curl-OLhttps://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc
ls-liris.orc
% Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 144 100 144 0 0 1180 0 --:--:-- --:--:-- --:--:-- 1180
100 3328 100 3328 0 0 13419 0 --:--:-- --:--:-- --:--:-- 0
-rw-rw-r-- 1 kbuilder kokoro 3328 Jul 30 12:26 iris.orc

Create a dataset from the file

dataset = tfio.IODataset.from_orc("iris.orc", capacity=15).batch(1)
2021年07月30日 12:26:37.779732: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 AVX512F FMA
2021年07月30日 12:26:37.887808: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021年07月30日 12:26:37.979733: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021年07月30日 12:26:37.979781: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (kokoro-gcp-ubuntu-prod-1874323723): /proc/driver/nvidia/version does not exist
2021年07月30日 12:26:37.980766: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021年07月30日 12:26:37.984832: I tensorflow_io/core/kernels/orc/orc_kernels.cc:49] ORC file schema:struct<sepal_length:float,sepal_width:float,petal_length:float,petal_width:float,species:string>

Examine the dataset:

for item in dataset.take(1):
 print(item)
(<tf.Tensor: shape=(1,), dtype=float32, numpy=array([5.1], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=float32, numpy=array([3.5], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.4], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.2], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'setosa'], dtype=object)>)
2021年07月30日 12:26:38.167628: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021年07月30日 12:26:38.168103: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2000170000 Hz

Let's walk through an end-to-end example of tf.keras model training with ORC dataset based on iris dataset.

Data preprocessing

Configure which columns are features, and which column is label:

feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
label_cols = ["species"]
# select feature columns
feature_dataset = tfio.IODataset.from_orc("iris.orc", columns=feature_cols)
# select label columns
label_dataset = tfio.IODataset.from_orc("iris.orc", columns=label_cols)
2021年07月30日 12:26:38.222712: I tensorflow_io/core/kernels/orc/orc_kernels.cc:49] ORC file schema:struct<sepal_length:float,sepal_width:float,petal_length:float,petal_width:float,species:string>
2021年07月30日 12:26:38.286470: I tensorflow_io/core/kernels/orc/orc_kernels.cc:49] ORC file schema:struct<sepal_length:float,sepal_width:float,petal_length:float,petal_width:float,species:string>

A util function to map species to float numbers for model training:

vocab_init = tf.lookup.KeyValueTensorInitializer(
 keys=tf.constant(["virginica", "versicolor", "setosa"]),
 values=tf.constant([0, 1, 2], dtype=tf.int64))
vocab_table = tf.lookup.StaticVocabularyTable(
 vocab_init,
 num_oov_buckets=4)
label_dataset = label_dataset.map(vocab_table.lookup)
dataset = tf.data.Dataset.zip((feature_dataset, label_dataset))
dataset = dataset.batch(1)
defpack_features_vector(features, labels):
"""Pack the features into a single array."""
 features = tf.stack(list(features), axis=1)
 return features, labels
dataset = dataset.map(pack_features_vector)

Build, compile and train the model

Finally, you are ready to build the model and train it! You will build a 3 layer keras model to predict the class of the iris plant from the dataset you just processed.

model = tf.keras.Sequential(
 [
 tf.keras.layers.Dense(
 10, activation=tf.nn.relu, input_shape=(4,)
 ),
 tf.keras.layers.Dense(10, activation=tf.nn.relu),
 tf.keras.layers.Dense(3),
 ]
)
model.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=["accuracy"])
model.fit(dataset, epochs=5)
Epoch 1/5
150/150 [==============================] - 0s 1ms/step - loss: 1.3479 - accuracy: 0.4800
Epoch 2/5
150/150 [==============================] - 0s 920us/step - loss: 0.8355 - accuracy: 0.6000
Epoch 3/5
150/150 [==============================] - 0s 951us/step - loss: 0.6370 - accuracy: 0.7733
Epoch 4/5
150/150 [==============================] - 0s 954us/step - loss: 0.5276 - accuracy: 0.7933
Epoch 5/5
150/150 [==============================] - 0s 940us/step - loss: 0.4766 - accuracy: 0.7933
<tensorflow.python.keras.callbacks.History at 0x7f263b830850>

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2022年01月10日 UTC.