Deep Learning: Who says size doesn't matter?

How to run Deep Learning in under 5MB on Windows or Linux edge devices?

Over the last few years, deep learning methods have been shown to outperform previous state-of-the-art machine learning techniques and other more traditional approaches in a wide range of fields, with computer vision being one of the most notable. As Deep Learning use cases multiplied and performance soared, a new challenge arose: running deep learning on edge devices. Typically, Deep Learning algorithms run on a server, which processes requests from devices and answers them with the algorithm's outputs. This device-to-server interaction adds significant latency to the results, as it requires a round trip over the network. In many use cases, such as autonomous vehicles, this delay is not acceptable. To ensure quick processing of the data, Deep Learning algorithms can instead be run directly on the devices.

Traditional and on Edge Deep Learning methods

This solution also comes with some issues, the most important one being the size of the libraries and of the models needed to run complex Deep Learning algorithms. At HarfangLab, we tackled this issue thanks to two principles:

1. Using lean dependencies that contain only the code that we need.

2. Reducing the size of the model while preserving performance.

In our case, we had to build tflite-runtime (the lean version of tensorflow for prediction) for the desired platforms (Windows 32 and 64 bits), reduce the size of numpy by removing unnecessary code, and shrink the size of our models by a factor of 10.

Reduction steps

Here is how we completed those steps, so that we now successfully run complex Deep Learning neural networks on Windows and Linux devices with Python wheels under 5MB. The custom wheels we built to run such algorithms will also be available in the following repository: LINK TO REPO

Running Deep Learning with small dependencies

Tensorflow and tensorflow-lite


Tensorflow is one of the most prominent deep learning libraries, providing a simple and efficient framework to implement and train elaborate models. Once a model is trained with this framework, it can easily be compressed into a tensorflow-lite model. Inference can then be run using only this model and the tensorflow-lite interpreter (tflite-runtime), discarding all the unnecessary dependencies that come with the complete tensorflow and tensorflow-lite libraries. Hence, by using tflite-runtime, we avoid pulling in the full tensorflow library (1.2 GB) or the whole tensorflow-lite library (50 MB), and instead retrieve only the 1MB interpreter needed to load the model and run inference.

Training and inference with tensorflow and tensorflow-lite
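To make the inference side concrete, here is a minimal sketch of how a model can be run with the tflite-runtime interpreter alone. It assumes a model with a single input and a single output; the function names load_interpreter and run_inference are ours, for illustration only, and the input must already have the shape and dtype the model expects.

```python
def load_interpreter(model_path):
    """Create a TFLite interpreter for the given .tflite file."""
    # Imported lazily so the rest of this sketch can be read and tested
    # without tflite-runtime installed.
    from tflite_runtime.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    return interpreter


def run_inference(interpreter, input_array):
    """Feed one input to a single-input, single-output model and return its output."""
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    interpreter.set_tensor(input_index, input_array)
    interpreter.invoke()
    return interpreter.get_tensor(output_index)
```

On a device, this would be called as run_inference(load_interpreter("model.tflite"), x), where "model.tflite" is a placeholder path and x is a numpy array matching the entry returned by get_input_details().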

Installing and using tflite-runtime is extremely easy when a prebuilt wheel exists, as it can be installed with pip install tflite-runtime. Unfortunately, no 64-bit Windows wheel exists for tensorflow versions above 2.5.0, and none exists at all for 32-bit Windows. Tensorflow has published some guidelines on how to build them.

The wheels we built for tensorflow 2.7.0 are available in our GitHub repository. Here are some extra tips to build them yourself.

Building the wheels

For Linux, you will find the tflite-runtime wheels on PyPI.

To build tflite-runtime for Windows x86 and Windows x64, along with a lot of courage, you will need a Windows x64 virtual machine (or your own computer if it runs Windows x64). The official CMake build should be used for the task. To run the build, you must have Visual Studio Build Tools 2019 installed, as well as 64-bit Python (python -V and python3 -V should not produce any errors) and 32-bit Python (which should be the same version as the one chosen for 64-bit Python). You also need numpy, wheel and pybind11 installed.
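Since the build needs both a 32-bit and a 64-bit Python with the same packages, a quick check run with each interpreter can save a failed build. The following is a small sketch using only the standard library; the helper names are ours, and the package list simply mirrors the requirements above.

```python
import importlib.util
import struct
import sys

# Packages the build needs, as listed above.
REQUIRED_PACKAGES = ("numpy", "wheel", "pybind11")


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of the running Python."""
    return struct.calcsize("P") * 8


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names from `names` that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {version}, {interpreter_bits()}-bit")
    print("Missing packages:", missing_packages() or "none")
```

Running it once with each Python installation should report the same version, one 32-bit and one 64-bit, and no missing packages.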

For x64 and x86, the command to run is tensorflow/lite/tools/pip_package/ windows. We had many issues with this command; here is how we modified the critical lines in the tensorflow/lite/tools/pip_package/ file:

For the x86 build, we added -A Win32 to specify that the build must target Win32.


Now that the tflite-runtime wheels are built, we can easily install them on a Linux or Windows device and run a model using this 1 MB wheel. However, the tflite-runtime wheel has numpy as a dependency. Depending on your device, numpy should be between 10 and 16 MB. It is possible to further reduce the size of the dependencies required to run a model by shrinking numpy. To do so, you can build numpy with specific options, then manually strip the built wheel down to the core functionality you need. On Windows x86, Windows x64 and Linux, run the following steps:

  • CFLAGS="-g0 -Wl,--strip-all -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib" pip install --cache-dir . --compile --global-option=build_ext --global-option="-j 4" numpy
  • Get the built wheel and run wheel unpack name-of-wheel.whl
  • Find and remove the RECORD file in the unpacked folder
  • Remove some files or directories in the unpacked folder that seem unnecessary (numpy comes with tests, docs, data, etc. that significantly increase its size)
  • Pack the folder again with wheel pack name-of-unpacked-folder
  • Install the wheel with pip
  • Install tflite-runtime
  • Check that your model runs an inference without crashing
  • Repeat the deletion step, removing as many files as possible as long as your model can still run an inference
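The first round of deletions can be scripted before moving on to trial and error. The sketch below removes every directory with a given name from an unpacked wheel folder and reports how many bytes were freed. The choice of "tests" and "__pycache__" as removable names is our assumption of a reasonable starting point, not the exact list we ended up with, and the helper names are illustrative.

```python
import shutil
from pathlib import Path

# Directory names assumed safe to drop; adjust after testing your own model.
REMOVABLE_DIR_NAMES = {"tests", "__pycache__"}


def folder_size(root):
    """Total size in bytes of all files under `root`."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def remove_dirs(root, names=REMOVABLE_DIR_NAMES):
    """Delete every directory under `root` whose name is in `names`.

    Returns the number of bytes freed.
    """
    before = folder_size(root)
    # Build the full list first so nothing is deleted while still iterating.
    candidates = [p for p in Path(root).rglob("*") if p.is_dir() and p.name in names]
    for path in candidates:
        if path.exists():  # may already be gone if a parent was removed
            shutil.rmtree(path)
    return before - folder_size(root)
```

After calling remove_dirs on the unpacked folder, the wheel still has to be repacked, installed and validated against a real inference, exactly as in the steps above.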

Using this manual method, we managed to get the numpy-1.22.3 wheel under 3MB.

Reducing the size of deep learning models

After the previous steps, the dependencies required to run Deep Learning algorithms should now total only around 4MB. The only thing still missing to run an algorithm is the model itself. The size of the tflite file containing the model grows quickly with the number of parameters of the model. Here are the techniques we used to reduce the size of the saved model.

Model Quantization

A common and simple way to reduce the size of a model is post-training quantization. This method consists in reducing the number of bits used by the model parameters and operations. By default, the model's parameters are encoded on 32 bits in tensorflow, but they can be reduced to 16-bit floats with little performance loss. More drastically, 32-bit floats can also be converted to 8-bit integers, at a higher risk of performance loss. More variations of such techniques are available and straightforward to implement with tensorflow-lite, as detailed in the documentation.
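As an illustration, here is a sketch of float16 post-training quantization with the tensorflow-lite converter, together with a back-of-the-envelope size estimate. The estimate only counts the weights and ignores the file's metadata, and both function names are ours; the converter settings shown are one common configuration, not necessarily the one we used.

```python
def estimated_weight_size_mb(n_params, bits_per_param):
    """Rough size of the stored weights in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bits_per_param / 8 / 1e6


def convert_to_float16_tflite(keras_model):
    """Convert a trained Keras model to a float16-quantized TFLite flatbuffer."""
    # Imported lazily: the full tensorflow package is only needed at
    # conversion time, never on the edge device.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_types = [tf.float16]
    return converter.convert()  # bytes, ready to be written to a .tflite file
```

For a hypothetical model with one million parameters, estimated_weight_size_mb gives 4.0 MB at 32 bits, 2.0 MB at 16 bits and 1.0 MB at 8 bits.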

Dense and GlobalAveraging layers

In many neural networks (and in convolutional neural networks in particular), the dense layers at the end of the network that output the predictions represent more than 80% of the parameters of the model. These layers can be critical for discriminating between the task's inputs, but the preceding convolutional layers are often sufficient on their own. Simply removing dense layers, either partially or fully by replacing them with a basic global average pooling layer, can considerably reduce the model size without substantially changing the model's predictions.
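A quick parameter count shows where the savings come from. The sketch below compares a dense layer fed by a flattened feature map with the same dense layer fed by a global average pooling output. The dimensions used in the example (a 7x7x512 feature map and 256 units) are hypothetical and chosen only for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def head_params_flatten(height, width, channels, n_units):
    """Dense layer applied to the flattened feature map."""
    return dense_params(height * width * channels, n_units)


def head_params_global_avg(channels, n_units):
    """Dense layer applied after global average pooling (one value per channel)."""
    return dense_params(channels, n_units)
```

With these numbers, head_params_flatten(7, 7, 512, 256) gives 6,422,784 parameters, against 131,328 for head_params_global_avg(512, 256), roughly 49 times fewer.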


Pruning

This last well-known method, pruning, consists in removing the neurons and synapses that contribute least to the model's outputs. It is quick to implement with tensorflow but requires more experimentation and fine-tuning than the previous two methods to optimize the size/performance trade-off.

Pruning method
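To give an idea of what pruning does to the weights, here is a minimal pure-Python sketch of magnitude-based pruning, one common criterion in which the weights with the smallest absolute values are set to zero. It is an assumption that this criterion matches your setup; in practice a library such as the TensorFlow Model Optimization Toolkit (prune_low_magnitude) applies it gradually during fine-tuning. The function name and the flat list of weights are for illustration only.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero.

    `sparsity` is the fraction of weights to remove, between 0 and 1.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_to_prune = int(len(weights) * sparsity)
    # Indices sorted by increasing absolute value; the first ones get pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:n_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]
```

For example, magnitude_prune([0.5, -0.1, 0.05, -2.0], 0.5) returns [0.5, 0.0, 0.0, -2.0]. Note that zeroed weights only shrink the saved file once it is stored in a compressed or sparse form.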

Under 5MB Deep Learning on edge devices

By building the tensorflow-lite interpreter, stripping the unneeded parts of numpy, customizing our models and reducing their size with quantization and pruning, we run Deep Learning algorithms on Windows and Linux edge devices from Python wheels under 5MB!

About HarfangLab

HarfangLab is a cybersecurity software company, created in 2018 by former members of the Ministry of Defense, major cybersecurity companies and the National Cybersecurity Agency of France (ANSSI), who have more than 25 years of experience in cyber defense.

HarfangLab was created to protect organizations' IT systems while preserving their digital integrity. To reach that goal, the company has developed a sovereign EDR (Endpoint Detection & Response) designed to protect the computers and servers of an IT system. Today, HarfangLab is the only EDR certified by the National Cybersecurity Agency of France (ANSSI).
