Data Input

Machine learning is driven by data. Data loading and preprocessing require both efficiency and scalability. OneFlow supports two methods to load data:

  • One way to do this is to pass a Numpy ndarray object as a parameter to the job function directly.

  • Another approach is to use DataLoader of OneFlow and its related operators. It can load and pre-process datasets of a particular format from the file system.

Working directly with Numpy data is easy and convenient, but only for small amounts of data. When the amount of data is too large, preparing the Numpy data can become a bottleneck. This approach is therefore better suited to the initial stages of a project, where the goal is to quickly validate and improve the algorithm.

The DataLoader of OneFlow uses techniques such as multi-threading and data pipelining, which make data loading and preprocessing more efficient. However, you need to prepare a dataset in a format that OneFlow already supports, or develop your own DataLoader for formats that OneFlow does not support. We therefore recommend using it in mature projects.
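To make the idea of multi-threading and data pipelining concrete, here is a minimal sketch in plain Python. It is not how OneFlow implements its DataLoader; `prefetch` and `load_batches` are hypothetical names introduced only for illustration. A background thread prepares upcoming batches while the consumer works on the current one.

```python
import queue
import threading


def prefetch(batches, capacity=2):
    """Yield items from `batches`, loading them in a background thread."""
    buffer = queue.Queue(maxsize=capacity)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def load_batches(num_batches, batch_size):
    # Hypothetical stand-in for reading and decoding data from disk.
    for i in range(num_batches):
        yield [i * batch_size + j for j in range(batch_size)]


if __name__ == "__main__":
    for batch in prefetch(load_batches(3, 4)):
        print(batch)
```

The bounded queue keeps memory use predictable: the producer stays at most `capacity` batches ahead of the consumer.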

Use Numpy as Data Input

Example

We can directly use Numpy ndarray as data input during training or predicting with OneFlow:

# feed_numpy.py
import numpy as np
import oneflow as flow
import oneflow.typing as tp
from typing import Tuple


@flow.global_function(type="predict")
def test_job(
    images: tp.Numpy.Placeholder((32, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((32,), dtype=flow.int32),
) -> Tuple[tp.Numpy, tp.Numpy]:
    # do something with images or labels
    return (images, labels)


if __name__ == "__main__":
    images_in = np.random.uniform(-10, 10, (32, 1, 28, 28)).astype(np.float32)
    labels_in = np.random.randint(-10, 10, (32,)).astype(np.int32)
    images, labels = test_job(images_in, labels_in)
    print(images.shape, labels.shape)

You can download the code from feed_numpy.py and run it with:

python3 feed_numpy.py

The following output is expected:

(32, 1, 28, 28) (32,)

Code Explanation

In the above code, we define a job function test_job() with images and labels as inputs, and use type annotations (note that each formal parameter is followed by ":", not "=") to specify the shape and data type of the data.

Accordingly, the example randomly generates Numpy data (images_in and labels_in) that matches the shape and data type required by the job function.

    images_in = np.random.uniform(-10, 10, (32, 1, 28, 28)).astype(np.float32)
    labels_in = np.random.randint(-10, 10, (32,)).astype(np.int32)

It then passes the Numpy data images_in and labels_in directly as parameters when the job function is called.

    images, labels = test_job(images_in, labels_in)

oneflow.typing.Numpy.Placeholder is a placeholder for a Numpy ndarray. OneFlow also provides other placeholders that can represent more complex forms of Numpy data. For more details, please refer to The Definition and Call of Job Function.
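When feeding a larger array to a job function like test_job(), it can help to cut it into batches and check each batch against the declared shape and data type first. The sketch below uses only Numpy and assumes that the job function expects arrays matching the placeholder exactly; `check_batch` and `iter_batches` are hypothetical helpers, not part of OneFlow.

```python
import numpy as np


def check_batch(array, shape, dtype):
    """Raise ValueError if `array` does not match the declared shape/dtype."""
    if array.shape != tuple(shape):
        raise ValueError(f"expected shape {tuple(shape)}, got {array.shape}")
    if array.dtype != np.dtype(dtype):
        raise ValueError(f"expected dtype {np.dtype(dtype)}, got {array.dtype}")
    return array


def iter_batches(images, labels, batch_size=32):
    """Yield full batches only; a trailing partial batch is dropped."""
    for start in range(0, len(images) - batch_size + 1, batch_size):
        stop = start + batch_size
        yield images[start:stop], labels[start:stop]


if __name__ == "__main__":
    images = np.zeros((100, 1, 28, 28), dtype=np.float32)
    labels = np.zeros((100,), dtype=np.int32)
    for image_batch, label_batch in iter_batches(images, labels):
        check_batch(image_batch, (32, 1, 28, 28), np.float32)
        check_batch(label_batch, (32,), np.int32)
        # Each pair could now be passed to test_job(image_batch, label_batch).
        print(image_batch.shape, label_batch.shape)
```

Dropping the last partial batch is a design choice made here because the placeholder in the example has a fixed batch dimension of 32.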

Use DataLoader and Related Operators

Under the oneflow.data module, there are DataLoader operators for loading datasets, along with associated data preprocessing operators. A DataLoader is usually named data.xxx_reader, such as the existing data.ofrecord_reader and data.coco_reader, which support OneFlow's native OFRecord format and the COCO dataset respectively.

In addition, there are data preprocessing operators that process the data after the DataLoader has loaded it. The code explained below uses data.OFRecordImageDecoderRandomCrop for decoding and randomly cropping images, and data.OFRecordRawDecoder for decoding the raw label field. You can refer to the API documentation for more details.

Examples

The following example reads files in the OFRecord data format and processes images from the ImageNet dataset. The complete code can be downloaded here: of_data_pipeline.py.

This script requires an OFRecord dataset. You can make your own by following [this article](./extended_topics/how_to_make_of_dataset.md).

Alternatively, you can download the part-00000 file that we have prepared, which contains 64 images. Then replace path/to/ImageNet/ofrecord in the script with the directory where the part-00000 file is located and run the script.

The following commands run the script with our prepared dataset:

wget https://oneflow-public.oss-cn-beijing.aliyuncs.com/online_document/docs/basics_topics/part-00000
sed -i "s:path/to/ImageNet/ofrecord:./:" of_data_pipeline.py
python3 of_data_pipeline.py

The following output is expected:

(64, 3, 224, 224) (64,)

Code Explanation

There are generally two stages in using OneFlow DataLoader: Load Data and Preprocessing Data.

flow.data.ofrecord_reader in the script is responsible for loading data from the file system into memory.

    ofrecord = flow.data.ofrecord_reader(
        "path/to/ImageNet/ofrecord",
        batch_size=batch_size,
        data_part_num=1,
        part_name_suffix_length=5,
        random_shuffle=True,
        shuffle_after_epoch=True,
    )

The first argument specifies the directory where the OFRecord files are located. For the other parameters, please refer to data.ofrecord_reader.
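Judging from the file name part-00000 used earlier, data_part_num and part_name_suffix_length appear to describe how many part files there are and how many digits their numeric suffix has. The sketch below lists the file names one would expect under that assumption. `expected_part_files` is a hypothetical helper, and the "part-" prefix is inferred from the example file name, so check the data.ofrecord_reader documentation before relying on it.

```python
import os


def expected_part_files(data_dir, data_part_num, part_name_suffix_length,
                        part_name_prefix="part-"):
    """List the part files assumed to be read from `data_dir`.

    Assumption: files are named <prefix><zero-padded index>, as in part-00000.
    """
    return [
        os.path.join(
            data_dir, f"{part_name_prefix}{i:0{part_name_suffix_length}d}"
        )
        for i in range(data_part_num)
    ]


if __name__ == "__main__":
    for path in expected_part_files("./", data_part_num=1,
                                    part_name_suffix_length=5):
        # Report whether each expected file is actually present.
        print(path, "found" if os.path.exists(path) else "missing")
```

A check like this can be run before starting the script, to confirm that the directory passed to the reader contains the files it is expected to read.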

If the return value of the DataLoader is a basic data type, it can be used directly as an input to the downstream operator. Otherwise, a data preprocessing operator must be called to process it further.

For example, in the script:

    image = flow.data.OFRecordImageDecoderRandomCrop(
        ofrecord, "encoded", color_space=color_space
    )
    label = flow.data.OFRecordRawDecoder(
        ofrecord, "class/label", shape=(), dtype=flow.int32
    )
    rsz = flow.image.Resize(
        image, resize_x=224, resize_y=224, color_space=color_space
    )
    rng = flow.random.CoinFlip(batch_size=batch_size)
    normal = flow.image.CropMirrorNormalize(
        rsz,
        mirror_blob=rng,
        color_space=color_space,
        mean=[123.68, 116.779, 103.939],
        std=[58.393, 57.12, 57.375],
        output_dtype=flow.float,
    )

OFRecordImageDecoderRandomCrop randomly crops the image, and OFRecordRawDecoder decodes the label directly from the ofrecord object. image.Resize resizes the cropped image to 224x224, and CropMirrorNormalize normalizes the image, taking the CoinFlip result as its mirror_blob input.
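As a rough illustration of the last step, the Numpy sketch below mirrors selected images and then normalizes each channel with the mean and std values from the script. It is not OneFlow's implementation. It assumes that the input is laid out as (N, H, W, C), that a mirror flag of 1 means "flip horizontally", and that the output is reordered to (N, C, H, W), which would match the (64, 3, 224, 224) shape printed by the script. `mirror_normalize` is a hypothetical name.

```python
import numpy as np

# Per-channel statistics taken from the script above.
MEAN = np.array([123.68, 116.779, 103.939], dtype=np.float32)
STD = np.array([58.393, 57.12, 57.375], dtype=np.float32)


def mirror_normalize(images, mirror_flags, mean=MEAN, std=STD):
    """images: (N, H, W, C) array; mirror_flags: (N,) array of 0/1."""
    images = images.astype(np.float32)  # astype returns a copy
    flags = np.asarray(mirror_flags).astype(bool)
    # Flip the width axis of the samples whose flag is set.
    images[flags] = images[flags][:, :, ::-1, :]
    # Per-channel normalization; mean and std broadcast over the last axis.
    images = (images - mean) / std
    # Reorder from NHWC to NCHW (assumed output layout).
    return images.transpose(0, 3, 1, 2)


if __name__ == "__main__":
    batch = np.random.randint(0, 256, (4, 224, 224, 3)).astype(np.uint8)
    coin = np.random.randint(0, 2, (4,))  # stand-in for the coin flip
    out = mirror_normalize(batch, coin)
    print(out.shape, out.dtype)
```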

More Formats Supported by DataLoader

OneFlow provides a number of DataLoaders and preprocessing operators; refer to oneflow.data for details. These operators will be enriched and optimized in the future, but users can also refer to this article to customize a DataLoader to meet specific needs.