Created by: Armand Mousavi (amousavi@cs), Vivek Patel (vivekp@cs), and Albert Zhong (azhong@cs)
UW student project for CSE455 22sp
Video
Abstract and Background
FGIC (Fine-Grained Image Classification) is a core problem in modern machine learning research, and in computer vision in particular. Using labeled image data to predict a categorical attribute seems straightforward at first, but it becomes a substantial challenge once you consider the number of possible labels that can be assigned, as well as the distribution of data available to both train and test approaches on.
Neural networks (and more specifically, convolutional neural networks) are a key tool for tackling fine-grained image classification problems. A typical architecture for image classification takes a preprocessed input (common transformations include square-cropping, rotating, and zooming image data to prevent overfitting) and then convolves, activates, and pools the result, transforming the input into a different shape so that the network can learn higher-order features present in the data. This block is often repeated some number of times before one or more fully connected layers with a softmax-style activation function produce what can be interpreted as output probabilities for each class of the dependent variable. An example image is presented below.
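To make that block structure concrete, here is a minimal PyTorch sketch of the convolve/activate/pool pattern described above; the layer sizes, input resolution, and number of classes are illustrative assumptions rather than the architectures we actually used.

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Convolve -> activate -> pool, repeated, then a fully connected classifier."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolve
            nn.ReLU(),                                     # activate
            nn.MaxPool2d(2),                               # pool: 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # repeat the block
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        # A softmax over the class scores yields the output "probabilities".
        return torch.softmax(self.classifier(x), dim=1)
```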
Problem Statement
Our goal is to evaluate several common, general-purpose image classification networks on their ability to perform dog breed identification on the Stanford Dogs Dataset.
Data Source
As mentioned above, we utilized the Stanford Dogs Dataset. It features 20,580 images in total across 120 dog breeds. There are roughly 150 images per dog breed, a fairly even distribution with some variation around that number.
Methodology
In general, the workflow we described in the abstract and background section carries over to the approach we took here:
- Gather Dataset
- Preprocess Training/Validation/Test Datasets (see the transform sketch after this list)
- Cropping
- Flipping
- Rotation
- Train
- Evaluate Performance
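For the preprocessing step, a minimal sketch of what such a pipeline could look like with torchvision is below; the exact crop size, rotation range, and normalization constants are illustrative assumptions rather than our precise settings.

```python
import torchvision.transforms as T

# Training-time augmentation: crop, flip, and rotate, as listed above.
train_transform = T.Compose([
    T.RandomResizedCrop(224),        # square crop at a ResNet-style input size
    T.RandomHorizontalFlip(),        # random flip
    T.RandomRotation(15),            # small random rotation
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Validation/test data only gets deterministic resizing, no augmentation.
eval_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```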
We compared a few common, general-purpose models for image classification:
- ResNet-34
- ResNet-50
- Inception-v3
We leveraged pretrained weights made available by PyTorch, but had to modify the networks to support predictions across 120 labels (the number of different breeds in the dataset). We removed the fully connected layer at the end of each network and replaced it with a layer that has 120 outputs. From there, our training code takes the argmax over the output layer and treats that as the prediction made by the network in question.
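As a concrete illustration, the snippet below shows how that head swap and the argmax prediction might look for one of the ResNets; it is a sketch rather than our exact training code, and the batch of images is a dummy tensor standing in for real preprocessed inputs.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_BREEDS = 120

# Load a pretrained backbone and swap its final fully connected layer for one
# with 120 outputs (shown for ResNet-50; ResNet-34 works the same way, and
# Inception-v3 additionally has an auxiliary head, covered under Challenges).
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_BREEDS)

# The predicted breed is the argmax over the 120 outputs.
images = torch.randn(4, 3, 224, 224)   # dummy batch for illustration
model.eval()
with torch.no_grad():
    predictions = model(images).argmax(dim=1)
```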
Experimental Setup and Results
The full training and testing code we used to compare the models can be viewed in our Colab notebook here.
For each network we trained against the dataset, we generated plots for Training Loss vs. Epoch and Validation Loss vs. Epoch. We utilized a 70-15-15 split for training, validation, and testing. All models were trained for 15 epochs of stochastic gradient descent with a learning rate of 0.01, momentum of 0, and weight decay of 0.0001.
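A sketch of that setup is below. It reuses the 120-way `model` and the transforms from the earlier sketches; the dataset path and batch size are placeholder assumptions, while the split ratios and optimizer settings match the values listed above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets

# 70-15-15 split over the whole dataset. The path is a placeholder; in practice
# the validation/test subsets would use eval_transform rather than the
# augmenting train_transform.
full_dataset = datasets.ImageFolder("stanford_dogs/Images", transform=train_transform)
n = len(full_dataset)
n_train, n_val = int(0.70 * n), int(0.15 * n)
train_set, val_set, test_set = random_split(
    full_dataset, [n_train, n_val, n - n_train - n_val])

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)  # batch size assumed

# 15 epochs of SGD with lr=0.01, momentum=0, and weight decay=1e-4, as above.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.0, weight_decay=1e-4)

for epoch in range(15):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```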
The models we tested performed as follows:
ResNet-34
The resulting accuracy for the network on the test set was roughly 75.6%.
ResNet-50
The resulting accuracy for the network on the test set was roughly 79.68%.
Inception-v3
The resulting accuracy for the network on the test set was roughly 70.88%.
Challenges
The main challenges came down to working with PyTorch to write clean, modularized code for plotting, training, and reshaping the output layer(s) of each network we intended to test against the dataset in question.
In particular, Inception-v3 required a fair bit of debugging because it has an auxiliary output layer that we had failed to account for when selecting networks to train. Our usual flow for resizing the output layer to match the number of labels neglected that auxiliary layer, so we had to change our training code to grab those outputs as well and keep the shapes fed to the loss consistent with the labels. Moreover, a separate set of preprocessing transforms needed to be defined, since Inception-v3 takes 299x299 inputs as opposed to the 224x224 that both ResNet models take in.
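The sketch below shows the Inception-v3 specific changes: resizing both heads and unpacking both outputs during training. The 0.4 weight on the auxiliary loss follows the common fine-tuning recipe and is an assumption, not necessarily the weight we used.

```python
import torch.nn as nn
from torchvision import models

NUM_BREEDS = 120

# Inception-v3 expects 299x299 inputs and has an auxiliary classifier alongside
# its main head, so both heads need to be resized to 120 outputs.
model = models.inception_v3(pretrained=True, aux_logits=True)
model.fc = nn.Linear(model.fc.in_features, NUM_BREEDS)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, NUM_BREEDS)

def training_loss(model, images, labels, criterion):
    # In train() mode the forward pass returns both the main and auxiliary
    # outputs, so both must be unpacked and fed to the loss. The 0.4 weight on
    # the auxiliary term is the commonly used value, assumed here.
    outputs, aux_outputs = model(images)
    return criterion(outputs, labels) + 0.4 * criterion(aux_outputs, labels)
```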
Final Thoughts
In general, it’s no surprise that pretrained weights for general image recognition were able to perform well in a generic classification setting. Under controlled training hyperparameters, deeper architectures such as ResNet-50, with more weights contributing to the classification, appear to yield better results. If we were to repeat this experiment in the future, we’d definitely look into resources for longer training, where we unfreeze the pretrained weights and let gradient descent adjust weights across the entirety of each network.
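For reference, switching between those two regimes is a small change in PyTorch; the sketch below assumes a torchvision ResNet whose `fc` layer has already been replaced with the 120-way head, as in the earlier snippets.

```python
# Two fine-tuning regimes, sketched for a ResNet with a replaced `fc` head.
def freeze_backbone(model):
    # Train only the new classification head; leave pretrained weights fixed.
    for param in model.parameters():
        param.requires_grad = False
    for param in model.fc.parameters():
        param.requires_grad = True

def unfreeze_all(model):
    # Let gradient descent update every weight in the network.
    for param in model.parameters():
        param.requires_grad = True
```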