Hotdog / ! Hotdog image classification with CNN,
using Keras model

Cosmin Linca machinelearning tech articles
August 25, 2020 - 45 min read


"What would you say if I told you there is a app on the market that tell you if you have a hotdog or not a hotdog?" - Jian-Yang, Silicon Valley, season 4 episode 4.

This question is one of the best known in the artificial intelligence industry. It was introduced by Silicon Valley, a TV series widely known and appreciated by technology lovers.

Even if it sounds trivial, this question hides basic and extremely important concepts of artificial intelligence, with a focus on the problem of classification. The process that recognizes photos containing a hotdog, as opposed to any other photo (not hotdog), can be extrapolated and customized to classify a large number of objects. In the following, we will describe in detail the stages of creating a neural model specialized in classifying images into these two categories.

The objective of this article is simple. Given an image, does it contain a hotdog or not?

Table of Contents

  1. Introduction
  2. Requirements
    2.1 Python and the most useful external libraries
    2.2 Working environment
  3. Dataset - food, food and...more food!
    3.1 Data augmentation
    3.2 Preparing the data
  4. Convolutional neural networks
    4.1 What are convolutional neural networks (CNNs)?
    4.2 How do CNNs work?
      4.2.1 The convolution layer
      4.2.2 The pooling layer
      4.2.3 Fully connected component
      4.2.4 ReLU activation function
      4.2.5 CNN's architecture, seen as a whole
  5. Keras model
    5.1 Building the model
    5.2 Compile the model
    5.3 Training (fit) the model
  6. Results
    6.1 Plots
  7. Conclusion

Check the project on Github: click here


2.1 Python and the most useful external libraries

Python will be used for this project, together with specialized libraries such as TensorFlow, Keras or Matplotlib.

Why Python? What is so special and sought-after about Python?

For many developers in the field, experienced in established programming languages such as Java, C# or C++, languages that give you more control, switching to Python may feel a little suspicious. The simplicity of the language can leave you with a sense of insecurity, because everything seems too easy.

Let's remember how a simple print looks in Java:

public class Main {
    public static void main(String[] args) {
        System.out.println("Hello world!");
    }
}

It looks good and familiar, a Main class, along with a main function that marks the starting point.

Let's see now how a print looks in Python:

print("Hello world!")

That's it? Wait a minute, and where does it run? Surely you need a setup. The answer is no, nothing is needed. Python is modular, and .py files can be run individually very easily. In general, applications developed in Python also have a starting point, but this modularity property can be very useful in many moments, especially in the development of some structures related to artificial intelligence.

Machine learning and deep learning are two subdomains well served by external Python libraries, including NumPy, used for scientific calculations, SciPy for advanced calculations, and scikit-learn for data mining and data analysis, working with external frameworks such as TensorFlow, Keras, CNTK or Apache Spark.

2.2 Working environment

Regarding the development environment for the project, there are two viable options, namely:

  1. Creating a notebook that runs in the cloud, which allows Python code to run online with the commonly used libraries already available. Notebooks of this kind are offered by web applications such as Jupyter Notebook or Google Colab;
  2. Creating a local project, using a specialized IDE such as PyCharm. After initializing the project, .py modules can be created and run individually. However, you need to install locally:
    • Python 3 (recent TensorFlow releases no longer support Python 2.7)
    • Matplotlib (optional, recommended for exploratory analysis)
    • Libraries like TensorFlow, Keras, NumPy or scikit-learn (latest version).

To install an external library, you can use the pip install library_name command. To find out exact name and version of a library, check

3. Dataset - food, food and...more food!

The first step, and probably the most important, is to obtain and process an appropriate dataset. We call this step the most important because a neural network can only be as good as the dataset it learned from. The dataset is the foundation of a correct recognition algorithm (regardless of the domain).

In this respect, there are not too many difficulties. Classifying images of food is already a vast and frequently explored field, so finding a dataset for our experiments was not difficult. We used a dataset available online, which contains images of 101 types of food, each type having 1000 images. It can be found here:

Let's take a look at what categories/types of food are found in the dataset:

!ls "/content/drive/My Drive/Colab Notebooks/datasets/food-101/images/"
apple_pie eggs_benedict onion_rings baby_back_ribs escargots oysters ... chocolate_cake hot_dog shrimp_and_grits chocolate_mousse huevos_rancheros spaghetti_bolognese ... dumplings nachos waffles edamame omelette

A lot of food!

As you can see, there is a hot_dog category. Therefore, the data set will be divided as follows:

  1. "Hotdog" class, with class_label = 0, composed of images representing hot dogs;
  2. "Not hotdog" class, having class_label = 1, composed of the rest of the images.

Aren't 1000 images of hot dogs too few compared to 100000 images of everything else? The answer is: it depends!

In reality, there are many types of food, the hot dog being only one of them, so a dataset dominated by other foods is a realistic one. If we had more images with hot dogs, and fewer with other types of food, the neural model would start to see a hot dog in all sorts of images that do not contain one. Related to this topic, there is a famous Twitter post: If there's ketchup, it's a hotdog.

3.1 Data augmentation

However, 1000 images may be too few for the model to learn properly. To enlarge the dataset, an artificial growth technique known as data augmentation can be used. The 100000 images representing the not hotdog class, on the other hand, are quite a lot. Between the two categories there is a ratio of 100:1 (for every image with a hot dog, there are 100 without).

We think this imbalance is too large. Therefore, in order to perform a more accurate experiment, we will reduce it to a 3:1 ratio (we still want our model to learn a little better what a hot dog looks like). Thus, we will take 750 images from each not hotdog category, 75000 in total. As for the images that represent a hot dog, we will apply various augmentation techniques (blur, rotate, resize) to increase their number to 25000.
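The arithmetic behind this rebalancing can be checked with a few lines of Python. This is only a sketch of the numbers quoted in the text; the variable names are ours and are not taken from the project.

```python
# Numbers quoted in the text: 101 categories with 1000 images each,
# of which 100 categories are "not hotdog".
images_per_category = 1000
not_hotdog_categories = 100

# Keep 750 images from each "not hotdog" category.
not_hotdog_total = not_hotdog_categories * 750

# Augment the 1000 original hot dog images up to 25000.
hotdog_target = 25000
copies_per_original = hotdog_target // images_per_category

ratio = not_hotdog_total / hotdog_target
print(not_hotdog_total, hotdog_target, copies_per_original, ratio)
# prints: 75000 25000 25 3.0
```

In other words, every original hot dog image has to yield 25 images (itself included) for the 3:1 ratio to hold.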

For blurring and resizing, we used the methods offered by cv2, a computer vision library, as follows:

# Blur
image = cv2.GaussianBlur(image, (5, 5), 0)
# Resize
image = cv2.resize(image, image_size)

For rotating, we created a special method, called rotate_image, which receives as input parameters the image and a value representing the rotation angle (in degrees).

def rotate_image(image, angle):
    (rows, cols, _) = image.shape
    # Rotation around the image center, with scale factor 1
    M = cv2.getRotationMatrix2D((cols / 2, rows / 2), angle, 1)
    return cv2.warpAffine(image, M, (cols, rows))
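To see what the rotation matrix used by rotate_image contains, here is a minimal dependency-free sketch that builds the same kind of 2 x 3 affine matrix as documented for cv2.getRotationMatrix2D. The helper names rotation_matrix_2d and apply_affine are hypothetical and exist only for this illustration.

```python
import math

def rotation_matrix_2d(center, angle_deg, scale=1.0):
    """Return the 2 x 3 affine matrix for a rotation around `center`."""
    cx, cy = center
    alpha = scale * math.cos(math.radians(angle_deg))
    beta = scale * math.sin(math.radians(angle_deg))
    return [
        [alpha, beta, (1 - alpha) * cx - beta * cy],
        [-beta, alpha, beta * cx + (1 - alpha) * cy],
    ]

def apply_affine(matrix, point):
    """Map an (x, y) point through a 2 x 3 affine matrix."""
    x, y = point
    return tuple(row[0] * x + row[1] * y + row[2] for row in matrix)

matrix = rotation_matrix_2d((50, 40), 30)
# The rotation center is the only point that stays (numerically) in place.
print(apply_affine(matrix, (50, 40)))
```

The third column of the matrix is what keeps the center of the image fixed, so the picture rotates around its middle rather than around the top-left corner.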

In terms of data generation, Keras offers a ready-made class called ImageDataGenerator, which is able to create augmented images by applying multiple changes (controlled through its parameters) to the original set of images. Check Keras documentation:

3.2 Preparing the data

Let's continue by loading the chosen dataset into memory, to start the image processing stage. If you use Google Colab, you can upload the dataset to Google Drive and use the drive module from google.colab to access its content. How? It is very simple: run the following code in your Colab notebook.

from google.colab import drive
drive.mount('/content/drive')

This code will return a link, where you have to allow Colab to access the Google Drive content. And that's it.

The loading and the distinction between hotdog/not hotdog is made as follows:

import os

# dataset_path_images is the path of the images folder, ending with "/"
hotdogs = []
not_hotdogs = []

for food_category in os.listdir(dataset_path_images):
    if food_category == "hot_dog":
        # Hotdog - class 0
        for hotdog_img in os.listdir(dataset_path_images + "hot_dog/"):
            hotdogs.append(dataset_path_images + "hot_dog/" + hotdog_img)
    else:
        # Not hotdog - class 1
        for not_hotdog_img in os.listdir(dataset_path_images + food_category):
            not_hotdogs.append(dataset_path_images + food_category + "/" + not_hotdog_img)

Nice! Now, let's display an image from each array, using the cv2 library:

import cv2
from google.colab.patches import cv2_imshow  # Colab replacement for cv2.imshow

hotdog_image = cv2.imread(hotdogs[10])
hotdog_image = cv2.resize(hotdog_image, (256, 256))
cv2.normalize(hotdog_image, hotdog_image, 0, 255, cv2.NORM_MINMAX)
cv2_imshow(hotdog_image)

not_hotdog_image = cv2.imread(not_hotdogs[10])
not_hotdog_image = cv2.resize(not_hotdog_image, (256, 256))
cv2.normalize(not_hotdog_image, not_hotdog_image, 0, 255, cv2.NORM_MINMAX)
cv2_imshow(not_hotdog_image)

And the result is:


Great, it looks like we split them right! We have the paths to the images, so we can start the most complex and time-consuming operation: processing and loading the images into memory, along with setting a standard size for each image. Each image is handled by the corresponding methods: load_and_process_image, convert_to_grayscale and post_process_data.

4. Convolutional neural networks

4.1 What are convolutional neural networks (CNNs)?

Before defining and explaining what a convolutional neural network is, let's present how a computer understands an image (Figure 2).


As you can see, the image is "broken" into 3 areas, determined by the 3 basic colors: red, green and blue. Each color channel is then used to build the pixels.
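As a small illustration of this idea, the sketch below represents a tiny 2 x 2 image as nested Python lists and separates it into its three channels. The image values and the split_channels helper are made up for this example.

```python
# A 2 x 2 image; each pixel is an (R, G, B) triple with values in 0..255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def split_channels(img):
    """Return three 2-D grids, one per color channel."""
    return tuple(
        [[pixel[c] for pixel in row] for row in img]
        for c in range(3)
    )

red, green, blue = split_channels(image)
print(red)    # [[255, 0], [0, 255]]
print(green)  # [[0, 255], [0, 255]]
print(blue)   # [[0, 0], [255, 255]]
```

Note that cv2.imread returns the channels in blue, green, red order, so the same pixel would be stored as (B, G, R) when loaded with OpenCV.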

Nice, but what does that have to do with convolutional neural networks? And again, what is a CNN? A convolutional neural network (CNN) follows the architecture of a classical neural network, consisting of artificial neurons connected by synaptic weights, which are used to store information. Each neuron can receive information from other neurons (input), which it processes and transmits as output, based on an activation function. The whole network has a loss (cost) function, and all the properties that we know about neural networks still apply to CNNs.

Pretty straightforward, right? Then, what's different?

The difference is that a convolutional neural network makes an explicit assumption that the input is an image, which allows it to encode certain properties within the architecture. Also, unlike an ordinary network, the layers of a CNN have the neurons arranged in 3 dimensions: width, height and depth.

4.2 How do CNNs work?

In general, convolutional neural networks, also known as ConvNets, use 4 basic notions (3 types of layers and an activation function), namely:

  1. The convolution layer
  2. The pooling layer
  3. Fully connected component
  4. ReLU activation function

Let's take them one at a time.

4.2.1 The convolution layer

A convolution layer is responsible for scanning an input image by applying a filter of a certain size, in order to extract features that may be important for classification. This filter is also called the convolution kernel. At first, the convolution identifies features in the entire image, then identifies sub-features in smaller parts of the image. This produces one or more images, called feature maps, which contain features of the original image. The image below shows how a filter is applied to an input map (Figure 3).
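The scanning operation can be sketched in plain Python. The example assumes no padding and a stride of 1, and the input and kernel values are made up for illustration. As in deep learning frameworks, the kernel is applied without being flipped.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Multiply the window element-wise with the kernel and sum up.
            total = 0
            for di in range(k_rows):
                for dj in range(k_cols):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map

image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
print(convolve2d(image, kernel))  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A 3 x 3 kernel on a 5 x 5 input produces a 3 x 3 feature map. In a Conv2D layer the kernel values are not fixed by hand, they are learned during training.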


4.2.2 The pooling layer

The role of the pooling layer is to reduce the size of the image and compress the results of a convolution layer. Thus, a 2 x 2 size filter can be successfully applied to a 4 x 4 size image (Figure 4), using one of the following strategies:
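Two widely used strategies are max pooling (keep the largest value in each window) and average pooling (keep the mean of each window). The sketch below applies both to a 4 x 4 input with a 2 x 2 window. It assumes non-overlapping windows, and the input values are made up. Max pooling is the variant used later in the Keras model (MaxPooling2D).

```python
def pool2d(image, size=2, reducer=max):
    """Apply non-overlapping size x size pooling with the given reducer."""
    pooled = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            window = [image[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reducer(window))
        pooled.append(row)
    return pooled

def mean(values):
    return sum(values) / len(values)

image = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
print(pool2d(image, reducer=max))   # [[6, 4], [8, 9]]
print(pool2d(image, reducer=mean))  # [[3.75, 2.25], [4.5, 4.0]]
```

In both cases the 4 x 4 input shrinks to 2 x 2, which is exactly the size reduction described above.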


4.2.3 Fully connected component

A fully connected component has the role of producing the classification itself. This component is a classical neural network, at the input of which the feature maps are provided, and for the output the softmax activation function is applied. Read more about softmax function in the following article: Deep AI, Softmax Function.
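For intuition, softmax can be written in a few lines of plain Python. This is a minimal sketch; the two input scores are made up and stand for the raw outputs of the last layer for the hotdog and not hotdog classes.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum does not change the result,
    # but avoids overflow for large scores.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.0])
print(probs)
```

With these scores the first class receives a probability of about 0.88 and the second about 0.12, so the larger score wins but the smaller one is not discarded.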

4.2.4 ReLU activation function

ReLU (Rectified Linear Unit) is the most commonly used activation function in building deep learning models. The function returns 0 if it receives a negative input, and for any positive value, that value is returned (Figure 5).
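Written as code, the function is a one-liner. The sample inputs below are arbitrary.

```python
def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return max(0.0, x)

print([relu(v) for v in (-2.0, -0.5, 0.0, 1.5)])  # [0.0, 0.0, 0.0, 1.5]
```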


4.2.5 CNN's architecture, seen as a whole

Summarizing the whole process, the image from the input contains an entity that must be included in one of the pre-existing classes. The input is passed to several operations of convolution, pooling and rectification, each time generating feature maps. These contain significant features of the original image.

The features are then passed to a classification process through a fully connected neural network, resulting in a series of probabilities, each one expressing how likely it is that the image belongs to a certain category.

The image below contains an example applied to a complete convolutional neural network, which is specialized in classifying vehicles (Figure 6).


There are numerous articles and lessons about convolutional neural networks, being a method actively used in machine learning. We warmly recommend the following articles:

Also, recently (August 12, 2020), Amazon's Machine Learning University announced in an article that it has launched a series of machine learning courses for the general public, available online for free. We managed to view some of these courses, and we can honestly say that they are high quality and deserve to be watched in full. Link to the article:

5. Keras model

5.1 Building the model

For the construction of the convolutional model, a sequence of specialized layers was used, placed in a well-established order.

First, an interface, called CNNInteface was created, to define the basic methods to be used in the process of creating, training, testing and plotting a convolutional neural network. Each method from the interface was implemented by the MainCNNRecognition class. At the initialization of this class (using private init method), the structure of the corresponding neural model was defined, as follows:

# Imports, assuming the standalone keras package
# (with TensorFlow, import from tensorflow.keras instead)
from keras.models import Sequential
from keras.layers import (Conv2D, MaxPooling2D, BatchNormalization,
                          Dropout, Flatten, Dense)

# Sequential model - a linear stack of layers
self.__model = Sequential()

# Convolutional layer
self.__model.add(Conv2D(64, kernel_size=(3, 3), activation='relu',
                        kernel_initializer='he_normal',
                        input_shape=input_shape))
# Pooling layer
self.__model.add(MaxPooling2D((2, 2)))
# Batch normalization - a technique for normalizing the activations of
# nodes in the intermediate layers of a deep neural network
self.__model.add(BatchNormalization())
# Dropout layer - a regularization method for neural networks, in which
# certain randomly selected neurons are ignored during training
self.__model.add(Dropout(0.25))

self.__model.add(Conv2D(64, (3, 3), activation='relu'))
self.__model.add(MaxPooling2D(pool_size=(2, 2)))
self.__model.add(Dropout(0.25))

self.__model.add(Conv2D(128, (3, 3), activation='relu'))
self.__model.add(Dropout(0.3))

# Flatten + Dense - a simple fully connected neural network
self.__model.add(Flatten())
self.__model.add(Dense(128, activation='relu'))
self.__model.add(Dropout(0.3))
self.__model.add(Dense(2, activation='softmax'))

At this point, the architecture of the neural model is defined and prepared. Before starting the training process, the model must be compiled, a step in which the loss function, the optimizer and the evaluation metrics are specified.

5.2 Compile the model

Compiling the model is the configuration step of the training process. At this step, 3 experimental parameters will be set, as follows:

# learning_rate replaces the deprecated lr argument
self.__model.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.Adam(learning_rate=0.0001),
                     metrics=['accuracy'])
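To make the loss parameter less abstract, here is a minimal sketch of how binary cross-entropy can be computed for a single sample. It assumes one-hot labels ([1, 0] for hotdog, [0, 1] for not hotdog), which is our reading of the two-unit softmax output. Keras additionally clips the predictions to avoid log(0), which this sketch leaves out.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean binary cross-entropy over the outputs of one sample."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for t, p in zip(y_true, y_pred)
    ]
    return sum(losses) / len(losses)

# Both samples are real hot dogs (label [1, 0]); only the prediction differs.
confident = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.6, 0.4])
print(confident, unsure)
```

The confident and correct prediction is penalized far less (about 0.105) than the hesitant one (about 0.511). Minimizing this value is what the optimizer does during training.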

5.3 Training (fit) the model

Starting the process of training the convolutional neural model is done by calling the fit() method and specifying some parameters.

First of all, the datasets for training and validation must be specified. Out of a total of 100000 images, 85% were used to train the model (X_train, Y_train), and the remaining 15% for validation and testing (X_test, Y_test). After this step, the number of epochs and the batch size must be specified. Thus, 50 epochs and a batch size of 32 were chosen.

self.__history = self.__model.fit(X_train, Y_train,
                                  batch_size=32,
                                  epochs=50,
                                  validation_data=(X_test, Y_test))
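The text does not show how the 85% / 15% split was produced, so the following is only a sketch of one way to do it with the standard library. The split_dataset helper and the seed value are hypothetical, and integers stand in for the images.

```python
import random

def split_dataset(samples, test_fraction=0.15, seed=42):
    """Shuffle a copy of `samples` and split it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_dataset(range(100000))
print(len(train), len(test))  # 85000 15000
```

Shuffling before splitting matters here, because the images are loaded category by category. Without it, the test part would contain only one or two kinds of food.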

6. Results

After completing the training, the evaluation of the neural model was performed (calling the evaluate() method from the defined interface), in order to calculate the values of the metrics, to create the necessary plots, and to analyze/visualize the results. The table below shows the obtained results:

              Loss        Accuracy
Training      0.005872    99.89%
Validation    0.077377    98.50%

The observation and interpretation of the results highlight the efficiency and good construction of a convolutional neural network, in the context of binary recognition of images that represent hotdogs or not hotdogs. The accuracy is over 98%, with a little overfitting noticeable during training. In order to observe the behavior of the model, we will build and display the confusion matrix. The confusion matrix is a table that measures the performance of a classifier by specifying the number of correct and wrong predictions for each class.

Confusion Matrix (code + table):

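import numpy as np
import sklearn.metrics

def confusion_matrix(self, X_test, Y_test_values):
    # Predict the class index of each test image. predict_classes() was
    # removed from recent Keras releases, so take the argmax of predict().
    y_pred = np.argmax(self.__model.predict(X_test), axis=1)
    # Convert the one-hot labels back to class indices
    Y_test_values = np.argmax(Y_test_values, axis=1)

    # Generate the confusion matrix
    conf_matrix = sklearn.metrics.confusion_matrix(Y_test_values, y_pred)
    print(conf_matrix)

                     Hotdog - predicted    Not Hotdog - predicted
Hotdog - real                      3578                       178
Not Hotdog - real                    46                     11198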
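The four numbers in the table are enough to derive the usual per-class metrics. The sketch below does so in plain Python; the variable names are ours.

```python
# Rows are the real classes, columns the predicted ones (values from the table).
conf_matrix = [
    [3578, 178],    # real hotdog:     predicted hotdog, predicted not hotdog
    [46, 11198],    # real not hotdog: predicted hotdog, predicted not hotdog
]

total = sum(sum(row) for row in conf_matrix)
accuracy = (conf_matrix[0][0] + conf_matrix[1][1]) / total
# Recall: share of the real images of a class that were recognized.
hotdog_recall = conf_matrix[0][0] / sum(conf_matrix[0])
not_hotdog_recall = conf_matrix[1][1] / sum(conf_matrix[1])
# Precision: share of the "hotdog" predictions that were correct.
hotdog_precision = conf_matrix[0][0] / (conf_matrix[0][0] + conf_matrix[1][0])

print(f"total images:      {total}")
print(f"accuracy:          {accuracy:.4f}")
print(f"hotdog recall:     {hotdog_recall:.4f}")
print(f"hotdog precision:  {hotdog_precision:.4f}")
print(f"not hotdog recall: {not_hotdog_recall:.4f}")
```

The matrix covers 15000 images, which is the 15% held out for validation, and its accuracy of about 98.5% matches the validation accuracy reported earlier. About 95.3% of the real hot dogs are recognized, compared to about 99.6% of the not hotdog images.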

What does the confusion matrix tell us? Well, as we can see, the model manages almost perfectly to recognize not hotdogs. It is not bad at recognizing hotdogs either, but it still makes mistakes there. Why?

Well, let's remember how many original images with hotdogs we have: 1000. And how many pictures with not hotdogs? 100000, of which we used only 75000. This may be the problem. Even if we used augmentation algorithms to bring the number of hotdog images to 25000, underneath there are still only 1000 different images with hotdogs. Normally, we would need more pictures with hotdogs.

Secondly, the architecture could also be a problem. There are several well-known, more complex architectures (LeNet, AlexNet, VGG), which behave much better in learning different patterns. The architecture we used is a custom one, quite simple, but under the current conditions it does its job very well (we can say that a learning process does take place with this architecture and the experimental parameters used).

6.1 Plots

For observing the evolution of training and validation processes, and the impact of the whole process on the calculation metrics, two graphs were created, in order to highlight the decrease of the loss and the increase of the accuracy.

[Figures: training and validation loss; training and validation accuracy]

What do the two graphs tell us? As we can see, there is a beginning portion in which the two graphs overlap their increase/decrease. Then, from a certain moment, the line that represents the validation stagnates, without having important increases/decreases, but only fluctuations.

Why did this phenomenon happen? Well, for the same reason stated above. The model reaches its upper limit in terms of learning patterns, given the data provided and its architecture. From that point on, the metrics keep improving only on the training set, which is a sign of overfitting.

This overfitting can be limited by passing an EarlyStopping callback when the model is trained (in the callbacks argument of fit()), which detects when the validation loss stops improving and stops the training process. However, this overfitting is not a serious one, as the validation results are also quite good.
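The idea behind early stopping can be sketched without Keras. The function below is a simplified stand-in for what the EarlyStopping callback does when it monitors the validation loss with a given patience. The function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1

# Made-up validation losses: they improve, then start creeping up.
history = [0.40, 0.25, 0.15, 0.10, 0.08, 0.09, 0.10, 0.12, 0.15]
print(early_stopping_epoch(history, patience=3))  # 7
```

With a patience of 3, training stops at epoch index 7, three epochs after the best validation loss (0.08 at index 4), instead of running through all the remaining epochs.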

7. Conclusion

And that's it! We used Keras and other powerful and useful Python libraries to create a convolutional neural model for classifying images into two categories: hotdogs and not hotdogs. The results obtained are quite good, considering the resources used. Can it be better? Surely! We leave it to you to analyze and find even more spectacular solutions and implementations.

Maybe this proof of concept is not as spectacular as the one made by Jian-Yang, but we like it very much and we are satisfied!

Thank you for reading this article, and we look forward to your feedback.

All the best!