Recognition of MNIST Handwritten Digits
This article covers:

- Configuring the hardware and software environment using the OneFlow interface
- Defining a model using OneFlow's interface
- Training a model with a train-type job function
- How to save/load a model
- Evaluating a model with a predict-type job function
- Recognizing images with a predict-type job function
This article demonstrates the key steps of training a LeNet model on the MNIST dataset using OneFlow. The full example code is attached at the end of the article.
You can see the effect of each script by running the following commands (the scripts rely on GPU No.0 of your machine by default; if you installed the CPU version of OneFlow, they will automatically use the CPU for training/evaluation).
First of all, clone the documentation repository and switch to the corresponding path:
git clone https://github.com/Oneflow-Inc/oneflow-documentation.git
cd oneflow-documentation/en/docs/code/quick_start/
- Training model
The command below trains a model on the MNIST dataset and saves it.
python lenet_train.py
Output:
File mnist.npz already exist, path: ./mnist.npz
5.9947124
1.0865117
0.5317516
0.20937675
0.26428983
0.21764673
0.23443426
...
A trained model is a prerequisite for lenet_eval.py and lenet_test.py. You can download a pretrained model to skip the training process:
#change directory to: en/docs/code/quick_start/
wget https://oneflow-public.oss-cn-beijing.aliyuncs.com/online_document/docs/quick_start/lenet_models_1.zip
unzip lenet_models_1.zip
- Evaluation
The command below evaluates the trained model on the MNIST test set and prints the accuracy.
python lenet_eval.py
Output:
File mnist.npz already exist, path: ./mnist.npz
accuracy: 99.4%
- Image recognition
python lenet_test.py ./9.png
# Output: prediction: 9
Introduction to the MNIST Dataset

MNIST is a database of handwritten digits that includes a training set and a test set. The training set contains 60,000 images and their corresponding labels. Yann LeCun et al. have normalized the images and packed them into binary files for downloading: http://yann.lecun.com/exdb/mnist/
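OneFlow can download and load this dataset as numpy arrays via flow.data.load_mnist, which the full scripts at the end of this article use. A minimal sketch (BATCH_SIZE = 100 follows the full scripts; the exact array shapes are assumptions based on the placeholders used later in this article):

import oneflow as flow

BATCH_SIZE = 100

# Returns already-batched numpy arrays; each batch of images is expected to
# match the (BATCH_SIZE, 1, 28, 28) placeholders used by the job functions below
(train_images, train_labels), (test_images, test_labels) = flow.data.load_mnist(
    BATCH_SIZE, BATCH_SIZE
)
print(train_images.shape, train_labels.shape)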
Define Training Model
Modules oneflow.nn and oneflow.layers provide the operators to construct the model.
def lenet(data, train=False):
    initializer = flow.truncated_normal(0.1)
    # First convolution + pooling stage
    conv1 = flow.layers.conv2d(
        data,
        32,
        5,
        padding="SAME",
        activation=flow.nn.relu,
        name="conv1",
        kernel_initializer=initializer,
    )
    pool1 = flow.nn.max_pool2d(
        conv1, ksize=2, strides=2, padding="SAME", name="pool1", data_format="NCHW"
    )
    # Second convolution + pooling stage
    conv2 = flow.layers.conv2d(
        pool1,
        64,
        5,
        padding="SAME",
        activation=flow.nn.relu,
        name="conv2",
        kernel_initializer=initializer,
    )
    pool2 = flow.nn.max_pool2d(
        conv2, ksize=2, strides=2, padding="SAME", name="pool2", data_format="NCHW"
    )
    # Flatten, then classify with two fully connected layers
    reshape = flow.reshape(pool2, [pool2.shape[0], -1])
    hidden = flow.layers.dense(
        reshape,
        512,
        activation=flow.nn.relu,
        kernel_initializer=initializer,
        name="dense1",
    )
    if train:
        hidden = flow.nn.dropout(hidden, rate=0.5, name="dropout")
    return flow.layers.dense(hidden, 10, kernel_initializer=initializer, name="dense2")
As the code above shows, we build a LeNet network model.
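The snippets throughout this article assume the following imports and constants, consistent with the full scripts attached at the end (BATCH_SIZE is 100 there):

import numpy as np
import oneflow as flow
import oneflow.typing as tp
from typing import Tuple

BATCH_SIZE = 100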
Implementation of the Job Function for Training
OneFlow provides a decorator named oneflow.global_function, by which we can convert a Python function into a OneFlow job function.
The global_function Decorator
The oneflow.global_function decorator takes a type parameter that specifies the type of the job function: type="train" means the job function is for training, and type="predict" means it is for prediction or evaluation.

The decorator also takes a function_config parameter, which contains additional configuration for the job function, such as training settings.
@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    # Implementation of network ...
tp.Numpy.Placeholder is a placeholder describing the shape and dtype of an input. The tp.Numpy annotation on the return type means that the job function returns a numpy array.
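In other words, the job function is called with numpy arrays matching the placeholders and returns a numpy array. A minimal sketch with random inputs (the random data here is purely for illustration):

# Shapes and dtypes must match the placeholders declared above
images = np.random.uniform(size=(BATCH_SIZE, 1, 28, 28)).astype(np.float32)
labels = np.random.randint(0, 10, size=(BATCH_SIZE,)).astype(np.int32)

loss = train_job(images, labels)  # a numpy.ndarray, because of "-> tp.Numpy"
print(type(loss), loss.shape)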
Set Up the Optimizer
We can use oneflow.optimizer to specify the optimizer and its parameters. This way, on each iteration during training, OneFlow takes the specified object as the optimization target.
@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.1])
    flow.optimizer.SGD(lr_scheduler, momentum=0).minimize(loss)

    return loss
Above, we use flow.nn.sparse_softmax_cross_entropy_with_logits to calculate the loss and specify it as the optimization target.

- lr_scheduler sets the learning rate schedule; [0.1] means a constant learning rate of 0.1 (see the sketch after this list for a non-constant schedule).
- flow.optimizer.SGD specifies SGD as the optimizer; loss is passed to minimize as the minimization target. Returning the loss from the job function is not required.
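For illustration, a sketch of a non-constant schedule. This assumes PiecewiseConstantScheduler takes a list of step boundaries and a list of learning rate values, as the ([], [0.1]) call above suggests:

# Assumed semantics: lr = 0.1 for steps before 1000, then 0.01 afterwards
lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([1000], [0.1, 0.01])
flow.optimizer.SGD(lr_scheduler, momentum=0).minimize(loss)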
Calling the Job Function and Getting Results
We can start training by invoking the job function. The value returned by a call to the job function is determined by its return-type annotation, and a single call may yield one or multiple results.
Example of a Single Return Value
The job function in lenet_train.py:
@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.1])
    flow.optimizer.SGD(lr_scheduler, momentum=0).minimize(loss)

    return loss
Its return type is tp.Numpy. When we call the job function, we get a single numpy array:
for epoch in range(20):
    for i, (images, labels) in enumerate(zip(train_images, train_labels)):
        loss = train_job(images, labels)
        if i % 20 == 0:
            print(loss.mean())
We call train_job and print the loss once every 20 iterations.
Example of Multiple Return Values
In the script lenet_eval.py, we define the job function below:
@flow.global_function(type="predict")
def eval_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> Tuple[tp.Numpy, tp.Numpy]:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=False)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    return (labels, logits)
The return type of this job function is Tuple[tp.Numpy, tp.Numpy]. When we call the job function, we get a tuple containing two numpy arrays:
for i, (images, labels) in enumerate(zip(test_images, test_labels)):
    labels, logits = eval_job(images, labels)
    acc(labels, logits)
We unpack labels and logits and then use them to evaluate the model.
Synchronous and Asynchronous Calls
All the code in this article calls job functions synchronously to get their results. In fact, OneFlow can also call job functions asynchronously; for details, please refer to Obtain Results from Job Function.
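As a rough illustration of the asynchronous style (a sketch only; the tp.Callback annotation and the callback-registration call are taken from that article, so consult it for the authoritative API):

# Sketch: annotating the return type with tp.Callback makes the call asynchronous
@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Callback[tp.Numpy]:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.1])
    flow.optimizer.SGD(lr_scheduler, momentum=0).minimize(loss)
    return loss


def cb_print_loss(loss: tp.Numpy):
    # Invoked later, once the result is ready
    print(loss.mean())


# The call returns immediately; the callback receives the loss asynchronously
train_job(images, labels)(cb_print_loss)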
Model Initialization, Saving and Loading

Model Initialization and Saving
An example of saving the model with flow.checkpoint.save:
if __name__ == "__main__":
    # data loading and training ...
    flow.checkpoint.save("./lenet_models_1")
When the model is saved, we get a folder called lenet_models_1. This folder contains directories and files corresponding to the model parameters.
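To peek at what was saved, you can load the snapshot and list its variables. A sketch; this assumes flow.checkpoint.get returns a dict-like mapping of variable names to values, with names derived from the name= arguments in the model definition:

snapshot = flow.checkpoint.get("./lenet_models_1")
for name in snapshot:
    print(name)  # e.g. names derived from "conv1", "dense2", ... (assumed)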
Model Loading
During prediction, we can load the parameters from disk into memory with flow.checkpoint.get and then assign them to the model with flow.load_variables. For example:
if __name__ == "__main__":
    flow.load_variables(flow.checkpoint.get("./lenet_models_1"))
    # evaluation process ...
Evaluation of the Model
The job function for evaluation is basically the same as the one for training. The difference is that evaluation uses an already saved model, so no model initialization or parameter updates are needed during iteration.
Job Function for Evaluation
@flow.global_function(type="predict")
def eval_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> Tuple[tp.Numpy, tp.Numpy]:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=False)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    return (labels, logits)
The code above implements the job function for evaluation; its return type is declared as Tuple[tp.Numpy, tp.Numpy], a tuple of two numpy arrays. We call the job function and calculate the accuracy from the returned values.
Process of Evaluation
The acc function counts the total number of samples and the number of correct predictions. We call the job function to get labels and logits:
g_total = 0
g_correct = 0


def acc(labels, logits):
    global g_total
    global g_correct
    predictions = np.argmax(logits, 1)
    right_count = np.sum(predictions == labels)
    g_total += labels.shape[0]
    g_correct += right_count


if __name__ == "__main__":
    flow.load_variables(flow.checkpoint.get("./lenet_models_1"))
    (train_images, train_labels), (test_images, test_labels) = flow.data.load_mnist(
        BATCH_SIZE, BATCH_SIZE
    )
    for epoch in range(1):
        for i, (images, labels) in enumerate(zip(test_images, test_labels)):
            labels, logits = eval_job(images, labels)
            acc(labels, logits)

    print("accuracy: {0:.1f}%".format(g_correct * 100 / g_total))
Here we call the evaluation job function in a loop and print the accuracy of the trained model on the MNIST test set.
Image Prediction
With a few changes to the code above, the input can come from raw image files instead of the prepared dataset, giving us a model that predicts the digit in an arbitrary image.
def load_image(file):
    # Convert to grayscale, resize to 28x28, and normalize like the training data
    im = Image.open(file).convert("L")
    im = im.resize((28, 28), Image.ANTIALIAS)
    im = np.array(im).reshape(1, 1, 28, 28).astype(np.float32)
    im = (im - 128.0) / 255.0
    return im


def main():
    if len(sys.argv) != 2:
        usage()
        return

    flow.load_variables(flow.checkpoint.get("./lenet_models_1"))
    image = load_image(sys.argv[1])
    logits = test_job(image)

    prediction = np.argmax(logits, 1)
    print("prediction: {}".format(prediction[0]))


if __name__ == "__main__":
    main()
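main() calls a test_job job function defined in lenet_test.py. A minimal sketch consistent with the job functions above (a single-image batch, no labels; see the full script for the exact definition):

@flow.global_function(type="predict")
def test_job(
    images: tp.Numpy.Placeholder((1, 1, 28, 28), dtype=flow.float),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=False)
    return logits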
Code

Model training
Script: lenet_train.py

Model evaluation
Script: lenet_eval.py
Saved model: lenet_models_1.zip

Digit prediction
Script: lenet_test.py
Saved model: lenet_models_1.zip
MNIST image dataset: mnist_raw_images.zip