Machine Learning – Components for Machine Learning on the Edge

11/09/2022 Know-How

Machine learning directly in the device has the potential to revolutionize countless products, whether that means classifying objects from an image sensor, gestures from an accelerometer, or spoken sentences from an audio stream. To achieve this, however, the algorithms must run on embedded components.

Developing applications based on machine learning requires expertise in multiple technical disciplines, but most businesses have only some of these disciplines in-house. Data scientists, machine learning engineers, and software developers are hired to develop, train, fine-tune, and test machine learning models. The catch is that these models usually do not run on embedded hardware or mobile devices, because most machine learning engineers have never deployed models on embedded hardware and are unfamiliar with its resource limitations. For trained models to run on mobile SoCs, FPGAs, and microcontrollers, the model must first be optimized and quantized.
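The quantization step mentioned above can be illustrated with a minimal sketch: a float32 weight is mapped to int8 through a scale and a zero point. The function names and the calibration values are hypothetical; real toolchains such as TensorFlow Lite determine the scale and zero point per tensor or per channel during conversion.

```python
# Illustrative sketch of affine int8 quantization (not a real
# converter). scale and zero_point here are assumed calibration
# results, chosen only for demonstration.

def quantize(values, scale, zero_point):
    """Map float values to int8, clamping to [-128, 127]."""
    q = []
    for v in values:
        i = round(v / scale) + zero_point
        q.append(max(-128, min(127, i)))
    return q

def dequantize(q_values, scale, zero_point):
    """Recover approximate float values from int8."""
    return [(q - zero_point) * scale for q in q_values]

weights = [0.50, -0.25, 0.10, 0.0]
scale, zero_point = 0.005, 0          # hypothetical calibration result
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
```

Storing one byte per value instead of four is what makes the memory footprint of such models fit into the constrained flash and RAM of embedded devices, at the cost of a small rounding error visible in the round trip above.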

Semiconductor manufacturers in turn are faced with the task of developing products that satisfy new requirements in relation to performance, cost and form factor—all with strict time-to-market demands. Flexibility is required for interfaces, inputs, outputs and memory usage so that the products can cater to a variety of applications.


TensorFlow Lite Simplifies Optimization and Quantization

This has become somewhat easier in recent years thanks to Google’s TensorFlow Lite. This open-source platform for machine learning now also includes scripts that optimize and quantize machine learning models into a “FlatBuffers” file (*.tflite), using parameters configured for a specific target environment.

Ideally, an embedded hardware product should be able to import FlatBuffer files directly from TensorFlow without having to use proprietary or hardware-specific optimization methods outside of the TensorFlow ecosystem. This enables software and hardware engineers to easily use the quantized and optimized FlatBuffer file on FPGAs, SoCs, and microcontrollers.
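Because a *.tflite model is an ordinary FlatBuffer, a build script or device firmware can sanity-check a model blob before flashing it: the FlatBuffers file identifier for TensorFlow Lite models, "TFL3", sits at byte offset 4–8 of the file. The helper name below is hypothetical.

```python
# Minimal sketch: verify that a byte blob carries the TensorFlow Lite
# FlatBuffer file identifier before treating it as a model. In a
# FlatBuffer, bytes 0-4 hold the root offset and bytes 4-8 hold the
# optional file identifier ("TFL3" for .tflite files).

TFLITE_FILE_IDENTIFIER = b"TFL3"

def looks_like_tflite(blob: bytes) -> bool:
    """Check the FlatBuffer file identifier at offset 4."""
    return len(blob) >= 8 and blob[4:8] == TFLITE_FILE_IDENTIFIER
```

A check like this only guards against flashing the wrong file; it says nothing about whether the model’s operators and tensor sizes fit the target device.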


A Comparison of SoCs, MCUs, and FPGAs

Embedded hardware platforms have limited resources, offer little convenience during development, and can be complicated to use. In return, they deliver low power consumption, low cost, and small module dimensions. So what do SoCs, microcontrollers, and FPGAs each offer?

SoCs offer the highest performance and many standard interfaces, but also usually have the highest power consumption. The interface-specific inputs and outputs mean that they consume a lot of chip space, which makes them relatively costly.

The advantage of microcontrollers is their very low power consumption and small form factor, but they are often severely limited in machine learning performance and model capacity. Even devices at the high end of a product range usually offer only specialized interfaces, such as those for cameras or digital microphones.

FPGAs cover a broad segment between microcontrollers and SoCs. They are available with a wide selection of packages and flexible inputs and outputs. This enables them to support any interface required for a given application without having to waste chip space. The configuration options also enable cost and power consumption to be scaled with performance and allow additional functions to be integrated. The problem with the use of FPGAs for machine learning is their lack of support and integration for SDK platforms such as TensorFlow Lite.


Machine Learning FPGAs

To overcome this shortcoming, Gowin Semiconductor provides an SDK on its GoAI 2.0 platform that extracts models and coefficients, generates C code for the Arm Cortex-M processor integrated in the FPGAs, and produces the FPGA bitstreams and firmware.

Another challenge lies in the substantial flash memory and RAM requirements of machine learning models. New hybrid µSoC FPGAs such as the Gowin GW1NSR4P address this by embedding an additional 4 to 8 MB of PSRAM. The GW1NSR4P provides a dedicated GoAI 2.0 coprocessor for accelerated processing and storage of convolution and pooling layers. It works in conjunction with the hard Cortex-M IP core, which controls layer parameters, model processing, and output of results.
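A back-of-the-envelope calculation shows why megabytes of extra RAM matter for convolution and pooling layers: an int8 activation tensor needs height × width × channels bytes, and a layer generally has to hold its input and output feature maps at the same time. The figures below are illustrative examples, not Gowin specifications.

```python
# Sketch: memory footprint of one int8 feature map. A convolution
# layer typically needs buffers for both its input and its output
# feature map simultaneously.

def activation_bytes(height, width, channels, bytes_per_value=1):
    """Memory for one feature map (1 byte per value for int8)."""
    return height * width * channels * bytes_per_value

# Example: a 96x96 feature map with 32 channels (hypothetical layer)
fmap = activation_bytes(96, 96, 32)   # 294,912 bytes, i.e. 288 KB
```

Two such buffers already approach 600 KB, which exceeds the on-chip SRAM of most microcontrollers but fits comfortably in 4 to 8 MB of PSRAM.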

Many suppliers of programmable semiconductors also offer design-services programs to flatten the learning curve for customers adopting embedded hardware for machine learning. Gowin is no exception: its GoAI design-services program helps users who are looking for a single-chip classification solution, or who have tested, trained models “off the shelf” but do not know how to get them communicating with the embedded hardware.

Suppliers provide these kinds of programs so that businesses can spend fewer resources on embedded machine learning and its implementation on embedded hardware (TinyML), and concentrate more fully on their own product development.



Local, embedded machine learning is currently a popular and rapidly growing field for many product developers. However, the challenges are considerable, as engineers from a variety of disciplines are required to develop these solutions. Some providers of programmable semiconductors respond to this need by supporting popular ecosystem tools on embedded hardware and by offering devices with flexible interfaces, expanded memory, new software tools, and design services.




To enable machine learning to run on embedded hardware, models must be quantized and optimized.

FPGAs offer the right interface and ease of scalability for every application.

With additional PSRAM, Gowin’s GW1NSR4P supports maximum layer widths and thus enables accelerated processing and storage of convolution and pooling layers.