How image manipulations deteriorate deep learning model performances

Removing image manipulation approximations from image-like modelling in Deep Learning

Deep learning for image analysis

With Convolutional Neural Networks (CNNs), the computer vision field has seen significant breakthroughs in recent years in the processing and classification of images. This progress can be leveraged in many different tasks, ranging from medical monitoring to speech recognition (as shown by Microsoft in this article).

These approaches complement other machine learning methods. Where the latter run an in-depth analysis of specific features of objects, CNN solutions survey the general form of the image-like objects to make a prediction. Deep learning doesn't rely on predefined features; it learns them implicitly. Since deep learning methods produce predictions only from the general structure of images, they generalize well and can perform well on images they have never seen before. This gives deep learning a perspective that other techniques lack and makes it a relevant complement to their predictions.

In many tasks, the source objects can be modeled as images, even though they seem very different at first, such as sounds or executable files. Image-based CNN methods cannot be applied to these source objects directly. However, once the objects are represented as images, the CNN field and its techniques become available to solve prediction tasks on that image-like representation.
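To make the idea of an image-like representation concrete, here is a minimal sketch of one common convention: each byte of the source object becomes one grayscale pixel. The function name `bytes_to_image` and the `width` parameter are hypothetical, and this is not necessarily the mapping used in the work described here.

```python
import numpy as np


def bytes_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Map raw bytes to a 2-D grayscale uint8 array, one byte per pixel.

    Hypothetical helper for illustration; the last row is zero-padded
    when the byte count is not a multiple of `width`.
    """
    buf = np.frombuffer(data, dtype=np.uint8)
    height = -(-buf.size // width)  # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: buf.size] = buf
    return padded.reshape(height, width)


# 768 bytes laid out on rows of 64 pixels give a 12 x 64 image.
image = bytes_to_image(bytes(range(256)) * 3, width=64)
print(image.shape, image.dtype)
```

With such a mapping, every pixel value carries exact information about the source object, which is why pixel-level changes matter more here than in a photograph.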

Issue description

Once the source object is cast into an image, it can be treated exactly as such: displayed, reshaped, resized, saved, or loaded.

Many tools exist to perform such operations; in Python, the most popular are arguably OpenCV and Pillow. These libraries make it easy to manipulate images and prepare them as input to a CNN by running the aforementioned operations.

However, this seemingly innocuous processing may lead to massive performance deterioration: just saving and reloading the exact same file could change a CNN's prediction!
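One way a save-and-reload cycle can alter pixels is a lossy file format. The sketch below assumes that scenario (the article does not state which format was involved) and compares a PNG round trip with a JPEG round trip on a random grayscale array using Pillow. The helper name `roundtrip` is hypothetical.

```python
import io

import numpy as np
from PIL import Image

# Random grayscale "image" standing in for an image-like representation.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)


def roundtrip(array: np.ndarray, fmt: str) -> np.ndarray:
    """Save the array to an in-memory file in the given format, then reload it."""
    buffer = io.BytesIO()
    Image.fromarray(array).save(buffer, format=fmt)
    buffer.seek(0)
    return np.asarray(Image.open(buffer))


png_same = np.array_equal(original, roundtrip(original, "PNG"))    # lossless format
jpeg_same = np.array_equal(original, roundtrip(original, "JPEG"))  # lossy format
print("PNG round trip identical:", png_same)
print("JPEG round trip identical:", jpeg_same)
```

A check like this at each save/load step makes it easy to spot where the data stops being bit-identical.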

After resizing the original image with OpenCV and Pillow, the two images here look completely identical, but are they really?

Analyzing the root problem

OpenCV and Pillow are designed to manipulate images with human perception in mind. Manipulating images with them can quickly introduce many different pixel-level approximations. The resulting differences between two images may be invisible to humans, but they represent drastically different information for a CNN.

The absolute difference between the OpenCV and Pillow resize done previously is displayed here. If there were no differences, the result would have been a black image.

It is important to note that the resize with OpenCV and the one with Pillow have been done with the same basic resizing method, bilinear interpolation.

The executable files have various sizes, ranging from a few kilobytes to hundreds of megabytes. This makes it mandatory to resize the image representation of the executable files to provide an input with a uniform size to the CNN.

In computer vision applications with real-world images, the approximated images still retain the important features of the original images.

However, when applying CNNs to images that are only a representation of different source objects, the main features of the original objects are compromised by those approximations. Slight differences in only a few pixels can result in a different output prediction by the CNN.

Here is an example of this output difference:

The model is trained on images resized with OpenCV. At prediction time, the exact same sound is predicted very differently depending on whether the resize is still done with OpenCV or is done with Pillow.

When dealing with image recognition on real-world pictures, this problem does not really occur: even though there are slight differences between the two images above, they both display an owl. When the image is only a representation of another object, however, those slight differences can compromise the very essence of the source object.

Improving model performances

Correctly formatting the image and tracking down the exact modifications done by the different tools used in each step when manipulating the image is critical to the improvement of malware detection with deep learning. Many different computer vision applications can benefit from this practice, by improving the knowledge of the data state and ensuring the reproducibility of the predictions.

More practically, and as a general rule of thumb, only a single image manipulation tool should be selected in a project. One last tip: OpenCV and Pillow should always be used with uint8 arrays; otherwise the output will undergo unknown approximations and transformations that may jeopardize the reproducibility of the model's predictions.
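One simple way to enforce the uint8 rule is to check the dtype explicitly before any array reaches the image library, and to fail loudly rather than let a conversion happen silently. The helper below is a hypothetical sketch, not part of OpenCV or Pillow.

```python
import numpy as np


def require_uint8(array: np.ndarray) -> np.ndarray:
    """Return the array unchanged if it is uint8, otherwise raise.

    Hypothetical guard: it refuses to convert, so any dtype change
    has to be written explicitly by the caller.
    """
    if array.dtype != np.uint8:
        raise TypeError(f"expected uint8 image data, got {array.dtype}")
    return array


pixels = require_uint8(np.zeros((4, 4), dtype=np.uint8))  # passes through untouched

try:
    require_uint8(np.zeros((4, 4), dtype=np.float32))
except TypeError as error:
    print(error)
```

Calling such a guard right before every resize, save, or load keeps the dtype assumption visible in the pipeline.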

About HarfangLab

HarfangLab is a cybersecurity software company, created in 2018 by former members of the Ministry of the Defense, major cybersecurity companies and the National Cybersecurity Agency of France (ANSSI), who have more than 25 years of experience in cyber defense.

HarfangLab was created to protect organizations' IT systems while preserving their digital integrity. To reach that goal, the company has developed a sovereign EDR (Endpoint Detection & Response) designed to protect the computers and servers of an IT system. Today, HarfangLab is the only EDR certified by the National Cybersecurity Agency of France (ANSSI).
