R interface to TensorFlow Dataset API

Overview

The TensorFlow Dataset API provides various facilities for creating scalable input pipelines for TensorFlow models, including:

  • Reading data from a variety of formats including CSV files and TFRecords files (the standard binary format for TensorFlow training data).

  • Transforming datasets in a variety of ways including mapping arbitrary functions against them.

  • Shuffling, batching, and repeating datasets over a number of epochs.

  • Streaming interface to data for reading arbitrarily large datasets.

  • Reading and transforming data are TensorFlow graph operations, so they are executed in C++ and in parallel with model training.

The R interface to TensorFlow datasets provides access to the Dataset API, including high-level convenience functions for easy integration with the keras and tfestimators R packages.

Installation

To use tfdatasets you need to install both the R package and TensorFlow itself.

First, install the tfdatasets R package from CRAN as follows:
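
install.packages("tfdatasets")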

Then, use the install_tensorflow() function to install TensorFlow:
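
library(tensorflow)
install_tensorflow()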

Creating a Dataset

To create a dataset, use one of the dataset creation functions. Datasets can be created from delimited text files, TFRecords files, and in-memory data.

Text Files

For example, to create a dataset from a text file, first create a specification for how records will be decoded from the file, then call text_line_dataset() with the file to be read and the specification:
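
# illustrative sketch (assumes an iris.csv file in the working directory,
# with the Species column already encoded as an integer)
library(tfdatasets)

# create a specification from an example file (column names/types are detected)
iris_spec <- csv_record_spec("iris.csv")

# read the dataset using the specification
dataset <- text_line_dataset("iris.csv", record_spec = iris_spec)
dataset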

<MapDataset shapes: {Sepal.Length: (), Sepal.Width: (), Petal.Length: (), Petal.Width: (),
Species: ()}, types: {Sepal.Length: tf.float32, Sepal.Width: tf.float32, Petal.Length:
tf.float32, Petal.Width: tf.float32, Species: tf.int32}>

In the example above, the csv_record_spec() function is passed an example file which is used to automatically detect column names and types (done by reading up to the first 1,000 lines of the file). You can also provide explicit column names and/or data types using the names and types parameters (note that in this case we don’t pass an example file):
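
# illustrative sketch: explicit column names and types (no example file passed)
iris_spec <- csv_record_spec(
  names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"),
  types = c("double", "double", "double", "double", "integer"),
  skip = 1
)

dataset <- text_line_dataset("iris.csv", record_spec = iris_spec)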

Note that we’ve also specified skip = 1 to indicate that the first row of the CSV that contains column names should be skipped.

Supported column types are integer, double, and character. You can also provide types in a more compact form using single-letter abbreviations (e.g. types = "dddi"). For example:

mtcars_spec <- csv_record_spec("mtcars.csv", types = "dididddiiii")

Parallel Decoding

Decoding lines of text into a record can be computationally expensive. You can parallelize these computations using the parallel_records parameter. For example:

dataset <- text_line_dataset("iris.csv", record_spec = iris_spec, parallel_records = 4)

You can also parallelize the reading of data from storage by requesting that a buffer of records be prefetched. You do this with the dataset_prefetch() function. For example:

dataset <- text_line_dataset("iris.csv", record_spec = iris_spec, parallel_records = 4) %>% 
  dataset_prefetch(1000)

If you have multiple input files, you can also parallelize reading of these files both across multiple machines (sharding) and/or on multiple threads per-machine (parallel reads with interleaving). See the section on Reading Multiple Files below for additional details.

TFRecords Files

You can read datasets from TFRecords files using the tfrecord_dataset() function.

In many cases you’ll want to map the records in the dataset into a set of named columns. You can do this using the dataset_map() function along with the tf$parse_single_example() function. For example:
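
# illustrative sketch (file names and feature names are assumptions)
filenames <- c("data/file1.tfrecord", "data/file2.tfrecord")
dataset <- tfrecord_dataset(filenames) %>%
  dataset_map(function(example_proto) {
    features <- list(
      image = tf$FixedLenFeature(shape(), tf$string),
      label = tf$FixedLenFeature(shape(), tf$int64)
    )
    tf$parse_single_example(example_proto, features)
  })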

SQLite Databases

You can read datasets from SQLite databases using the sqlite_dataset() function. Note that support for SQLite databases requires TensorFlow v1.7 as well as the development version of the tfdatasets package, which you can install via devtools::install_github("rstudio/tfdatasets").

To use sqlite_dataset() you provide the filename of the database, a SQL query to execute, and sql_record_spec() that describes the names and TensorFlow types of columns within the query. For example:
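
# illustrative sketch (the database file name and query are assumptions; the
# column names and types simply mirror the output shown below)
record_spec <- sql_record_spec(
  names = c("disp", "drat", "vs", "gear", "mpg", "qsec", "hp", "am", "wt", "carb", "cyl"),
  types = c(tf$float64, tf$int32, tf$float64, tf$int32, tf$float64, tf$float64, tf$float64,
            tf$int32, tf$int32, tf$int32, tf$int32)
)

dataset <- sqlite_dataset("data/mtcars.sqlite3", "select * from mtcars", record_spec)

dataset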

<MapDataset shapes: {disp: (), drat: (), vs: (), gear: (), mpg: (), qsec: (), hp: (), am: (),
wt: (), carb: (), cyl: ()}, types: {disp: tf.float64, drat: tf.int32, vs: tf.float64, gear:
tf.int32, mpg: tf.float64, qsec: tf.float64, hp: tf.float64, am: tf.int32, wt: tf.int32, carb:
tf.int32, cyl: tf.int32}>

Note that for floating point data you must use tf$float64 (reading tf$float32 is not supported for SQLite databases).

Transformations

Mapping

You can map arbitrary transformation functions onto dataset records using the dataset_map() function. For example, to transform the “Species” column into a one-hot encoded vector you would do this:
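
# sketch: one-hot encode the (integer) Species column
dataset <- dataset %>%
  dataset_map(function(record) {
    record$Species <- tf$one_hot(record$Species, 3L)
    record
  })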

Note that while dataset_map() is defined using an R function, there are some special constraints on this function which allow it to execute not within R but rather within the TensorFlow graph.

For a dataset created with the csv_dataset() function, the passed record will be a named list of tensors (one for each column of the dataset). The return value should be another set of tensors created from TensorFlow functions (e.g. tf$one_hot as illustrated above). This function will be converted to a TensorFlow graph operation that performs the transformation within native code.

Parallel Mapping

If these transformations are computationally expensive they can be executed on multiple threads using the num_parallel_calls parameter. For example:
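
dataset <- dataset %>%
  dataset_map(num_parallel_calls = 4, function(record) {
    record$Species <- tf$one_hot(record$Species, 3L)
    record
  })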

You can control the maximum number of processed elements that will be buffered when processing in parallel using the dataset_prefetch() transformation. For example:
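
dataset <- dataset %>%
  dataset_map(num_parallel_calls = 4, function(record) {
    record$Species <- tf$one_hot(record$Species, 3L)
    record
  }) %>%
  dataset_prefetch(1)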

Filtering

You can filter the elements of a dataset using the dataset_filter() function, which takes a predicate function that returns a boolean tensor for records that should be included. For example:
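
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>%
  dataset_filter(function(record) {
    record$mpg >= 20 & record$cyl >= 6L
  })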

Note that the functions used inside the predicate must be tensor operations (e.g. tf$not_equal, tf$less, etc.). R generic methods for relational operators (e.g. <, >, <=, etc.) and logical operators (e.g. !, &, |, etc.) are provided so you can use shorthand syntax for most common comparisons (as illustrated above).

Features and Response

A common transformation is taking a column oriented dataset (e.g. one created by csv_dataset() or tfrecord_dataset()) and transforming it into a two-element list with features (“x”) and response (“y”). You can use the dataset_prepare() function to do this type of transformation. For example:

mtcars_dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>% 
  dataset_prepare(x = c(mpg, disp), y = cyl)

iris_dataset <- text_line_dataset("iris.csv", record_spec = iris_spec) %>% 
  dataset_prepare(x = -Species, y = Species)

The dataset_prepare() function also accepts standard R formula syntax for defining features and response:

mtcars_dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>% 
  dataset_prepare(cyl ~ mpg + disp)

Shuffling and Batching

There are several functions which control how batches are drawn from the dataset. For example, the following specifies that data will be drawn in batches of 128 from a shuffled window of 1000 records, and that the dataset will be repeated for 10 epochs:
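
dataset <- dataset %>%
  dataset_shuffle(buffer_size = 1000) %>%
  dataset_batch(128) %>%
  dataset_repeat(10)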

Complete Example

Here’s a complete example of using the various dataset transformation functions together. We’ll read the mtcars dataset from a CSV, filter it on some threshold values, map it into x and y components for modeling, and specify desired shuffling and batch iteration behavior:

dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>%
  dataset_prefetch(1000) %>% 
  dataset_filter(function(record) {
    record$mpg >= 20 & record$cyl >= 6L
  }) %>% 
  dataset_prepare(cyl ~ mpg + disp) %>% 
  dataset_shuffle(1000) %>% 
  dataset_batch(128) %>% 
  dataset_repeat(10)

Reading Datasets

The method for reading data from a TensorFlow Dataset varies depending upon which API you are using to build your models. If you are using the keras or tfestimators packages, then TensorFlow Datasets can be used much like in-memory R matrices and arrays. If you are using the lower-level tensorflow core API then you’ll use explicit dataset iteration functions.

The sections below provide additional details and examples for each of the supported APIs.

keras package

IMPORTANT NOTE: Using TensorFlow Datasets with Keras requires that you are running the very latest versions of Keras (v2.2), TensorFlow (v1.8), and the development versions of the keras and tfdatasets R packages. You can install these packages with:

devtools::install_github(c("rstudio/keras", "rstudio/tfdatasets"))

You can ensure that you have the latest version of the core Keras library with:
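
library(keras)
install_keras()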

Keras models are often trained by passing in-memory arrays directly to the fit function. For example:
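
# x_train, y_train, x_val, and y_val here are placeholder in-memory arrays
model %>% fit(
  x_train, y_train,
  epochs = 30,
  batch_size = 128,
  validation_data = list(x_val, y_val)
)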

However, this requires loading data into an R data frame or matrix before calling fit. You can use the train_on_batch() function to stream data one batch at a time; however, the reading and processing of the input data is still done serially and outside of native code.

Alternatively, Keras enables you to pass a dataset directly as the x argument to fit() and evaluate(). Here’s a complete example that uses datasets to read from TFRecord files containing MNIST digits:
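
A sketch of what this might look like (the file locations, parsed feature names, and network architecture are illustrative assumptions):

library(keras)
library(tfdatasets)

batch_size <- 128

# dataset that reads MNIST digits from a TFRecord file and preprocesses them
mnist_dataset <- function(filename) {
  tfrecord_dataset(filename) %>%
    dataset_map(function(example_proto) {
      # parse the serialized example (feature names are assumptions about the file)
      features <- tf$parse_single_example(
        example_proto,
        features = list(
          image_raw = tf$FixedLenFeature(shape(), tf$string),
          label = tf$FixedLenFeature(shape(), tf$int64)
        )
      )
      # decode and scale the image pixels
      image <- tf$decode_raw(features$image_raw, tf$uint8)
      image <- tf$cast(image, tf$float32) / 255
      # one-hot encode the response
      label <- tf$one_hot(tf$cast(features$label, tf$int32), 10L)
      list(image, label)
    }) %>%
    dataset_shuffle(1000) %>%
    dataset_batch(batch_size) %>%
    dataset_repeat()
}

# simple fully-connected network
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics = c("accuracy")
)

# pass datasets directly as the x argument to fit()
model %>% fit(
  mnist_dataset("mnist/train.tfrecords"),
  steps_per_epoch = 500,
  epochs = 20,
  validation_data = mnist_dataset("mnist/validation.tfrecords"),
  validation_steps = 100
)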

Note that all data preprocessing (e.g. one-hot encoding of the response variable) is done within the dataset_map() operation.

tfestimators package

Models created with tfestimators use an input function to consume data for training, evaluation, and prediction. For example, here is an input function that feeds data from an in-memory R data frame to an estimators model:

model %>% train(
  input_fn(mtcars, features = c(mpg, disp), response = cyl,
           batch_size = 128, epochs = 3)
)

If you are using tfdatasets with the tfestimators package, you can create an estimators input function directly from a dataset as follows:

library(tfestimators)
library(tfdatasets)

mtcars_spec <- csv_record_spec("mtcars.csv")
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>% 
  dataset_batch(128) %>% 
  dataset_repeat(3)

cols <- feature_columns(
  column_numeric("mpg"),
  column_numeric("disp")
)

model <- linear_regressor(feature_columns = cols)

model %>% train(
  input_fn(dataset, features = c(mpg, disp), response = cyl)
)

Note that we don’t use the dataset_prepare() function in this example. Rather, this function is used under the hood to provide the input_fn interface expected by tfestimators models.

As with dataset_prepare(), you can alternatively use an R formula to specify features and response:

model %>% train(
  input_fn(dataset, cyl ~ mpg + disp)
)

tensorflow package

You read batches of data from a dataset by using a tensor that yields the next batch. You can obtain this tensor from a dataset via the make_iterator_one_shot() and iterator_get_next() functions. For example:
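
# illustrative sketch: draw batches of (mpg, disp) features and the cyl response
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>%
  dataset_prepare(x = c(mpg, disp), y = cyl) %>%
  dataset_shuffle(1000) %>%
  dataset_batch(5)

iter <- make_iterator_one_shot(dataset)
next_batch <- iterator_get_next(iter)
next_batch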

$x
Tensor("IteratorGetNext_13:0", shape=(?, 2), dtype=float32)

$y
Tensor("IteratorGetNext_13:1", shape=(?,), dtype=int32)

As you can see, next_batch isn’t the data itself but rather a tensor that will yield the next batch of data when it is evaluated:
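
# evaluate the tensor to pull a batch of data (assumes TensorFlow 1.x graph mode)
sess <- tf$Session()
sess$run(next_batch)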

$x
     [,1] [,2]
[1,] 21.0  160
[2,] 21.0  160
[3,] 22.8  108
[4,] 21.4  258
[5,] 18.7  360

$y
[1] 6 6 4 6 8

If you are iterating over a dataset using these functions, you will need to determine at what point to stop iteration. One approach to this is to use the dataset_repeat() function to create a dataset that yields values infinitely. For example:
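
# illustrative sketch: repeat the dataset indefinitely, then stop after a
# fixed number of steps
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>%
  dataset_shuffle(1000) %>%
  dataset_batch(128) %>%
  dataset_repeat() %>%   # no count: repeat infinitely
  dataset_prepare(x = c(mpg, disp), y = cyl)

iter <- make_iterator_one_shot(dataset)
next_batch <- iterator_get_next(iter)

sess <- tf$Session()

steps <- 200
for (i in 1:steps) {
  # training code that evaluates next_batch goes here, e.g.:
  batch <- sess$run(next_batch)
}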

In this case the steps variable is used to determine when to stop drawing new batches of training data (we could have equally included code to detect a learning plateau or any other custom method of determining when to stop training).

Another approach is to detect when all batches have been yielded from the dataset. When a dataset iterator reaches the end, an out of range runtime error will occur. You can catch and ignore the error when it occurs by using out_of_range_handler as the error argument to tryCatch(). For example:
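
# illustrative sketch: iterate until the dataset is exhausted
dataset <- text_line_dataset("mtcars.csv", record_spec = mtcars_spec) %>%
  dataset_batch(128) %>%
  dataset_repeat(10) %>%
  dataset_prepare(x = c(mpg, disp), y = cyl)

iter <- make_iterator_one_shot(dataset)
next_batch <- iterator_get_next(iter)

sess <- tf$Session()

tryCatch({
  while (TRUE) {
    batch <- sess$run(next_batch)
    # do something with the batch
  }
}, error = out_of_range_handler)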

Reading Multiple Files

If you have multiple input files you can process them in parallel both across machines (sharding) and/or on multiple threads per-machine (parallel reads with interleaving). The read_files() function provides a high-level interface to parallel file reading.

The read_files() function takes a set of files and a read function along with various options to orchestrate parallel reading. For example, the following function reads all CSV files in a directory using the text_line_dataset() function:

dataset <- read_files("data/*.csv", text_line_dataset, record_spec = mtcars_spec,
                      parallel_files = 4, parallel_interleave = 16) %>% 
  dataset_prefetch(5000) %>% 
  dataset_shuffle(1000) %>% 
  dataset_batch(128) %>% 
  dataset_repeat(3)

The parallel_files argument requests that 4 files be processed in parallel and the parallel_interleave argument requests that blocks of 16 consecutive records from each file be interleaved in the resulting dataset.

Note that because we are processing files in parallel we do not pass the parallel_records argument to text_line_dataset(), since we are already parallelizing at the file level.

Multiple Machines

If you are training on multiple machines and the training supervisor passes a shard index to your training script, you can also parallelize reading by sharding the file list. For example:
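
A sketch of what this might look like, assuming the shard count and index are supplied to the script by the training supervisor (the num_shards and shard_index arguments to read_files() shown here are illustrative):

# num_shards and shard_index would typically arrive via command line flags
# set by the training supervisor
dataset <- read_files("data/*.csv", text_line_dataset, record_spec = mtcars_spec,
                      parallel_files = 4, parallel_interleave = 16,
                      num_shards = num_shards, shard_index = shard_index) %>%
  dataset_shuffle(1000) %>%
  dataset_batch(128) %>%
  dataset_repeat(3)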