Machine learning is driven by data, so data loading and preprocessing must be both efficient and scalable. OneFlow supports two ways to load data:
- Pass a Numpy ndarray object directly to the job function as a parameter.
- Use OneFlow's DataLoader and its related operators, which load and preprocess datasets of a particular format from the file system.
Working directly with Numpy data is easy and convenient, but it suits only small amounts of data: when the dataset is too large, preparing the Numpy data can itself become a bottleneck. This approach is therefore best suited to the early stages of a project, when the goal is to validate and improve the algorithm quickly.
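To make the size limit concrete, here is a rough back-of-the-envelope estimate of how much memory a dense float32 array needs. The helper name ndarray_bytes and the one-million-image dataset are illustrative assumptions, not values taken from OneFlow.

```python
from math import prod


def ndarray_bytes(shape, itemsize=4):
    """Bytes needed to hold a dense array of `shape`; itemsize=4 matches float32."""
    return prod(shape) * itemsize


# One batch with the shape used later in this article: 32 grayscale 28x28 images.
small = ndarray_bytes((32, 1, 28, 28))

# A hypothetical dataset of one million 3x224x224 images, fully decoded in memory.
large = ndarray_bytes((1_000_000, 3, 224, 224))

print(f"one batch:    {small / 1024:.0f} KiB")
print(f"full dataset: {large / 1024**3:.0f} GiB")
```

A single batch fits in memory comfortably, while the hypothetical full dataset needs hundreds of GiB, which is why feeding Numpy arrays directly does not scale.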
OneFlow's DataLoader uses techniques such as multi-threading and data pipelining to make data loading and preprocessing more efficient. However, you need to prepare a dataset in a format that OneFlow already supports, or develop your own DataLoader for data types that OneFlow does not support. We therefore recommend it for mature projects.
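The general idea behind multi-threaded, pipelined loading can be sketched in plain Python. This is only an illustration of the technique, not OneFlow's implementation; the functions load_batches, prefetch, and preprocess are hypothetical, and file I/O is simulated with a short sleep.

```python
import queue
import threading
import time


def load_batches(num_batches):
    """Stand-in for reading batches from disk (hypothetical, simulated with sleep)."""
    for i in range(num_batches):
        time.sleep(0.01)  # pretend this is file I/O
        yield list(range(i * 4, i * 4 + 4))


def prefetch(source, capacity=2):
    """Run `source` in a background thread and buffer up to `capacity` batches."""
    buffer = queue.Queue(maxsize=capacity)
    done = object()  # sentinel marking the end of the stream

    def worker():
        for item in source:
            buffer.put(item)
        buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def preprocess(batch):
    """Toy preprocessing step: scale every value."""
    return [x / 10 for x in batch]


# Loading runs in the background while the main thread preprocesses.
results = [preprocess(batch) for batch in prefetch(load_batches(3))]
print(results)
```

The bounded queue lets loading run ahead of preprocessing by a few batches without using unbounded memory.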
Use Numpy as Data Input¶
We can use a Numpy ndarray directly as data input during training or prediction with OneFlow:
    # feed_numpy.py
    import numpy as np
    import oneflow as flow
    import oneflow.typing as tp
    from typing import Tuple


    @flow.global_function(type="predict")
    def test_job(
        images: tp.Numpy.Placeholder((32, 1, 28, 28), dtype=flow.float),
        labels: tp.Numpy.Placeholder((32,), dtype=flow.int32),
    ) -> Tuple[tp.Numpy, tp.Numpy]:
        # do something with images or labels
        return (images, labels)


    if __name__ == "__main__":
        images_in = np.random.uniform(-10, 10, (32, 1, 28, 28)).astype(np.float32)
        labels_in = np.random.randint(-10, 10, (32,)).astype(np.int32)
        images, labels = test_job(images_in, labels_in)
        print(images.shape, labels.shape)
You can download the code from feed_numpy.py and run it with:
    python3 feed_numpy.py
The expected output is:
    (32, 1, 28, 28) (32,)
In the above code, we defined a job function test_job with images and labels as inputs, and used annotations (note that the formal parameter is followed by ":", not "=") to specify the shape and data type of the data.
Accordingly, the example generates random Numpy data (images_in and labels_in) that matches the shape and data type required by the job function.
    images_in = np.random.uniform(-10, 10, (32, 1, 28, 28)).astype(np.float32)
    labels_in = np.random.randint(-10, 10, (32,)).astype(np.int32)
Then images_in and labels_in are passed as parameters when the job function is called.
    images, labels = test_job(images_in, labels_in)
oneflow.typing.Numpy.Placeholder is the placeholder for a Numpy ndarray. OneFlow also provides other placeholders that can represent more complex forms of Numpy data. For more details, please refer to The Definition and Call of Job Function.
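Because the Placeholder declares a fixed shape and data type, a larger in-memory dataset has to be cut into batches that match it. The following is a minimal NumPy-only sketch under the assumption that inputs must match the declared shape exactly; the helper iter_batches is hypothetical and not part of OneFlow.

```python
import numpy as np

BATCH_SIZE = 32  # must match the first dimension of the Placeholder


def iter_batches(images, labels, batch_size=BATCH_SIZE):
    """Yield (images, labels) batches with the shape and dtype the job expects.

    The trailing incomplete batch is dropped because the Placeholder shape is fixed.
    """
    num_full = len(images) // batch_size
    for i in range(num_full):
        start = i * batch_size
        stop = start + batch_size
        yield (
            images[start:stop].astype(np.float32),
            labels[start:stop].astype(np.int32),
        )


# 100 random samples stand in for a small in-memory dataset.
all_images = np.random.uniform(-10, 10, (100, 1, 28, 28))
all_labels = np.random.randint(-10, 10, (100,))

batches = list(iter_batches(all_images, all_labels))
for images_in, labels_in in batches:
    # Each pair could now be passed to test_job(images_in, labels_in).
    assert images_in.shape == (BATCH_SIZE, 1, 28, 28)
    assert labels_in.shape == (BATCH_SIZE,)
print(len(batches), "full batches")
```

Dropping the last partial batch is one simple choice; padding it to the full batch size would be an alternative.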
Using DataLoader and Related Operators¶
Under the oneflow.data module, there are DataLoader operators for loading datasets and associated data preprocessing operators. A DataLoader is usually named data.xxx_reader, such as the existing data.ofrecord_reader and data.coco_reader, which support OneFlow's native OFRecord format and the COCO dataset, respectively.
In addition, there are data preprocessing operators that process the data after the DataLoader has loaded it. The following code uses data.OFRecordImageDecoderRandomCrop for random image cropping and data.OFRecordRawDecoder for decoding the label. You can refer to the API documentation for more details.
The following example reads a file in the OFRecord data format and processes images from the ImageNet dataset. The complete code can be downloaded here: of_data_pipeline.py.
This script requires an OFRecord dataset. You can make your own by following [this article](./extended_topics/how_to_make_of_dataset.md), or download the part-00000 file that we have prepared, which contains 64 images. Then replace path/to/ImageNet/ofrecord in the script with the directory where the part-00000 file is located and run the script.
The following commands run the script with our prepared dataset:
    wget https://oneflow-public.oss-cn-beijing.aliyuncs.com/online_document/docs/basics_topics/part-00000
    sed -i "s:path/to/ImageNet/ofrecord:./:" of_data_pipeline.py
    python3 of_data_pipeline.py
The following output is expected:
    (64, 3, 224, 224) (64,)
Using the OneFlow DataLoader generally involves two stages: loading data and preprocessing data.
flow.data.ofrecord_reader in the script is responsible for loading data from the file system into memory.
    ofrecord = flow.data.ofrecord_reader(
        "path/to/ImageNet/ofrecord",
        batch_size=batch_size,
        data_part_num=1,
        part_name_suffix_length=5,
        random_shuffle=True,
        shuffle_after_epoch=True,
    )
The first argument specifies the directory where the OFRecord files are located. For the other parameters, please refer to data.ofrecord_reader.
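The arguments data_part_num=1 and part_name_suffix_length=5 line up with the single part-00000 file used in this example. The sketch below assumes that part files are named with the prefix "part-" followed by a zero-padded index; the helper expected_part_files is hypothetical and only illustrates that assumed naming scheme, so check the data.ofrecord_reader documentation for the authoritative behavior.

```python
import os


def expected_part_files(data_dir, data_part_num, part_name_suffix_length,
                        part_name_prefix="part-"):
    """List the part-file paths implied by the reader arguments (assumed naming)."""
    return [
        os.path.join(data_dir, f"{part_name_prefix}{i:0{part_name_suffix_length}d}")
        for i in range(data_part_num)
    ]


files = expected_part_files("path/to/ImageNet/ofrecord", 1, 5)
for path in files:
    # Report whether each expected part file is actually present on disk.
    status = "found" if os.path.isfile(path) else "missing"
    print(path, status)
```

A quick check like this before starting a job can catch a wrong directory or a mismatched part count early.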
If the return value of the DataLoader is a basic data type, it can be used directly as input to a downstream operator. Otherwise, a data preprocessing operator must be called to process it further.
For example, in the script:
    image = flow.data.OFRecordImageDecoderRandomCrop(
        ofrecord, "encoded", color_space=color_space
    )
    label = flow.data.OFRecordRawDecoder(
        ofrecord, "class/label", shape=(), dtype=flow.int32
    )
    rsz = flow.image.Resize(
        image, resize_x=224, resize_y=224, color_space=color_space
    )

    rng = flow.random.CoinFlip(batch_size=batch_size)
    normal = flow.image.CropMirrorNormalize(
        rsz,
        mirror_blob=rng,
        color_space=color_space,
        mean=[123.68, 116.779, 103.939],
        std=[58.393, 57.12, 57.375],
        output_dtype=flow.float,
    )
OFRecordImageDecoderRandomCrop randomly crops the image, and OFRecordRawDecoder decodes the label directly from the ofrecord object. image.Resize resizes the cropped image to 224x224, and CropMirrorNormalize normalizes the image, mirroring it according to the result of random.CoinFlip.
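As a rough intuition for the last step, the following NumPy sketch mimics mirroring and per-channel normalization on an already resized batch. It is an approximation written for illustration, not OneFlow's implementation: the function mirror_normalize is hypothetical, and it assumes the input is laid out as (N, H, W, C) and the output as (N, C, H, W), which matches the (64, 3, 224, 224) shape printed by the script.

```python
import numpy as np

# Per-channel mean and standard deviation taken from the script above.
MEAN = np.array([123.68, 116.779, 103.939], dtype=np.float32)
STD = np.array([58.393, 57.12, 57.375], dtype=np.float32)


def mirror_normalize(batch_hwc, mirror_flags, mean=MEAN, std=STD):
    """Flip selected images horizontally, normalize per channel, return NCHW float32.

    batch_hwc:    uint8 array of shape (N, H, W, C)
    mirror_flags: array of N zeros/ones, one coin flip per image
    """
    out = batch_hwc.astype(np.float32)
    flip = np.asarray(mirror_flags, dtype=bool)
    out[flip] = out[flip][:, :, ::-1, :]  # reverse the width axis
    out = (out - mean) / std              # broadcast over the channel axis
    return out.transpose(0, 3, 1, 2)      # NHWC -> NCHW


rng = np.random.default_rng(0)
resized = rng.integers(0, 256, size=(4, 224, 224, 3), dtype=np.uint8)
flags = rng.integers(0, 2, size=4)  # plays the role of the coin flip

normal = mirror_normalize(resized, flags)
print(normal.shape, normal.dtype)
```

Random mirroring is a common data augmentation, and normalizing with the dataset mean and standard deviation keeps the input values in a range that is easier for a network to train on.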
More Formats Supported by DataLoader¶
OneFlow provides a number of DataLoaders and preprocessing operators; refer to oneflow.data for details. These operators will be enriched and optimized in the future, and users can also refer to this article to customize a DataLoader to meet specific needs.