Introduction to Hardware Video Encoding with STM32


1. Introduction

The VENC IP, integrated into the STM32N6 device, provides hardware acceleration for H.264 and JPEG encoding.

This includes pre-processing for color conversion, cropping, and rotation.

Coupled with the DCMIPP IP, it makes real-time video encoding and streaming possible at a very low CPU load cost.

1.1. H264 supported standard

1.1.1. H264 Input Formats

YCbCr 4:2:0 planar & semi-planar
YCrCb 4:2:0 semi-planar
YCbYCr & CbYCrY 4:2:2 interleaved
RGB & BGR 444, 555, 565, 888

1.1.2. H264 Output Standard

Codecs: H264 (MPEG-4 Part 10/AVC)
Profiles: Baseline/Main/High, up to level 4.1
Output format: byte stream or NAL unit stream
H264 / MVC: Stereo High

The supported image size is from 96x96 to 4080x4080.

Information

Internally, the encoder handles images only in 4:2:0 format.

Input images with more than 12 bits per pixel bring no quality improvement.
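
As a rough illustration of why 4:2:0 is the economical input choice, the sketch below computes the size of one uncompressed frame for a few of the input formats listed above and checks the resolution against the supported 96x96 to 4080x4080 range. The names `BITS_PER_PIXEL` and `frame_bytes` are illustrative only and are not part of any ST API; the bits-per-pixel values are the usual ones for these pixel layouts.

```python
# Usual bits per pixel for some of the input formats listed above.
BITS_PER_PIXEL = {
    "YUV420": 12,  # planar or semi-planar 4:2:0
    "YUV422": 16,  # interleaved 4:2:2
    "RGB565": 16,
}

# Supported image size range, as stated in the text.
MIN_SIZE, MAX_SIZE = 96, 4080


def frame_bytes(width, height, fmt):
    """Return the size in bytes of one uncompressed frame (hypothetical helper)."""
    for dim in (width, height):
        if not MIN_SIZE <= dim <= MAX_SIZE:
            raise ValueError(f"dimension {dim} outside {MIN_SIZE}..{MAX_SIZE}")
    return width * height * BITS_PER_PIXEL[fmt] // 8


if __name__ == "__main__":
    for fmt in BITS_PER_PIXEL:
        size = frame_bytes(1920, 1080, fmt)
        print(f"{fmt}: {size} bytes ({size / 2**20:.2f} MiB)")
```

Since the encoder converts everything to 4:2:0 internally, the extra bytes of a 16 bpp input cost memory and bandwidth without improving the encoded result.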

Information

VENC is multi-instance capable

VENC can interleave frame encoding of several different streams.

This is because it does not keep the internal status for each stream, but just the reference frames located in external memory.

Each instance may run in parallel with different encoding parameters (resolution, for example).

The following pages focus on a classic use case where only one stream captured by a camera is encoded.

1.2. H264 Encoding Data Flow

1.2.1. DCMIPP

The DCMIPP (digital camera interface pixel pipeline) has a crucial role in real-time encoding use cases.

It is connected to a camera through the CSI-2 serial interface and embeds an ISP (Image Signal Processor) block.

It provides three pipelines as output:

Pipe0 ("dump pipe"): Dumps raw data (not connected to ISP).
Pipe1 ("main pipe"): Feeds the encoder after frame processing (downsizing, color conversion, and so on).
Pipe2 ("ancillary pipe"): Used for display (RGB only), AI, and so on.

DCMIPP/Pipe1 may output a VENC-compatible format, typically YUV422 or YUV420.

Coupled with VENC, DCMIPP can be used in two modes:

Frame mode

In this mode, DCMIPP captures a complete frame and stores it in external memory. When the software is signaled for frame completion, it is responsible for calling VENC and providing it with the frame location in memory.

Slices mode / Hardware Handshake

In this mode, DCMIPP captures a given number of lines (typically 32). When done, it directly signals VENC, which in turn encodes this set of lines. This capture/encode sequence is repeated until the frame is fully encoded. The software is signaled once the whole frame has been encoded. This mode does not require storing the complete input frame, which reduces the memory footprint and allows the input buffer to be placed in internal memory. It also optimizes memory bandwidth.
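
The capture/encode sequence of slices mode can be sketched as a simple iteration over groups of lines. This is only a model of the sequencing described above, not driver code: the function names are hypothetical, and the handling of a final partial slice (here simply made shorter) is an assumption, since it is not described on this page.

```python
import math

SLICE_LINES = 32  # typical slice height quoted above


def slices_per_frame(height, slice_lines=SLICE_LINES):
    """Number of capture/encode iterations needed for one frame."""
    return math.ceil(height / slice_lines)


def iter_slices(height, slice_lines=SLICE_LINES):
    """Yield (first_line, line_count) for each slice, in capture order."""
    first = 0
    while first < height:
        # Assumption: the last slice is shortened if height is not a multiple.
        count = min(slice_lines, height - first)
        yield first, count
        first += count


if __name__ == "__main__":
    for name, height in (("1080p", 1080), ("720p", 720), ("480p", 480)):
        print(name, slices_per_frame(height), "slices")
```

In this model, the software would only be notified after the last tuple has been processed, which mirrors the single end-of-frame signal described above.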

Frame mode vs Slices Mode: Pros and Cons

	Pros	Cons
Frame Mode	Real-time capture by the DCMIPP is decoupled from offline encoding by the VENC	Extra bandwidth is required to store the frame in external memory (by the DCMIPP) and to read it back (by the VENC).
Slices Mode	Permits the transfer of the frame using only the internal memory from the DCMIPP to the VENC, thus avoiding the double bandwidth to the external memory	The VENC is coupled with the real-time DCMIPP, so the VENC's pace is mandated to follow the DCMIPP's pace, smoothed only every (typically) 32 lines

Information

Both terms, 'slices mode' and 'Hardware Handshake' mode, are used in the following pages.

The 'Hardware Handshake' refers to the protocol between the DCMIPP and the encoder to perform encoding by slices.

1.2.2. Encode Data Flow Example

Information

The Pipe2 path is given as an example of a parallel use case. Another typical use case could be AI (face recognition, for example).

Camera capture:
- The DCMIPP pipeline supports a maximum resolution of 5 megapixels (after a decimation factor of 1, 2, 4, or 8).
- Therefore, a maximum 40-megapixel raw sensor can be connected (5 megapixels * 8).
- Horizontal and vertical decimation is supported
Uncompressed Frame written in Memory
- Frame format and resolution converted by DCMIPP
- Uncompressed frame may be located in internal or external memory, depending on the encoding mode (HW handshake or frame mode), the frame resolution, and the memory available.
- YUV 4:2:0 is the best format for memory footprint and is the native format of the encoder
Uncompressed Frame Read by VENC
VENC Reads Reference Frame from memory (YUV 4:2:0)
- Note that VENC reads chrominance (UV) data twice, resulting in an average bandwidth of 16 bpp
VENC Writes Reference Frame to memory (YUV 4:2:0)
- Reference Frame in external memory depending on its resolution and memory available
VENC uses its internal memory (VENCRAM) for encode
VENC writes compressed output stream
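
To give a feel for the memory traffic implied by this flow, the following sketch adds up the per-pixel cost of each transfer in frame mode. It assumes a YUV 4:2:0 input (12 bpp), uses the 16 bpp average stated above for reference frame reads, and ignores the compressed output stream and any other traffic (display pipe, CPU, and so on). The names are hypothetical and the result is a rough estimate, not a measured figure.

```python
# Bits per pixel moved at each step of the frame-mode data flow.
TRAFFIC_BPP = {
    "dcmipp_write_input": 12,    # uncompressed frame written by DCMIPP
    "venc_read_input": 12,       # same frame read back by VENC
    "venc_read_reference": 16,   # chroma read twice, per the note above
    "venc_write_reference": 12,  # reference frame written back by VENC
}


def bandwidth_mbytes_per_s(width, height, fps, traffic=TRAFFIC_BPP):
    """Estimated memory traffic in MBytes/s (here 1 MByte = 10**6 bytes)."""
    pixels_per_s = width * height * fps
    return pixels_per_s * sum(traffic.values()) / 8 / 1e6


if __name__ == "__main__":
    for fps in (20, 30):
        estimate = bandwidth_mbytes_per_s(1920, 1080, fps)
        print(f"1080p{fps}: about {estimate:.0f} MBytes/s")
```

Under the same assumptions, slices mode keeps the two input-frame entries in internal memory, so only the reference frame traffic (28 bpp in this model) would reach external memory.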

2. Code Footprints

ro code	ro data	rw data
50 KBytes	62 KBytes	512 Bytes
112 KBytes		512 Bytes

3. VideoBuffers Footprints

The memory footprints have been monitored while running real use cases (they are measured, not calculated).

In Frame mode and Hardware Handshake mode
For 1080p, 720p and 480p
With a YUV 4:2:0 input frame format
Encoding with "Main" profile

3.1. Frame Mode Encode

Buffer Location	Data	IP	1080p	720p	480p
External RAM	Raw Frame(*)	DCMIPP	5.93 MBytes	2.64 MBytes	1.10 MBytes
External RAM	Reference Frame + VENC buffers (**)	VENC (Internal)	6.55 MBytes	2.92 MBytes	1.25 MBytes
Internal SRAM	H264 Stream	VENC (Out)	88.22 KBytes	31.98 KBytes	17.81 KBytes
Total	-	-	12.57 MBytes	5.59 MBytes	2.37 MBytes

(*) Frame mode uses a 'ping-pong' buffer as the input frame.

The total size used for the input raw frame is therefore 2 x Frame Width x Frame Height x 1.5 bytes (12 bits per pixel in YUV 4:2:0).

(**) The encoder may use one or two reference frames:

Single buffer: The encoder uses only one reference buffer. This saves memory, but the encoder can no longer discard coded frames.

Consider using a single buffer if the reference frame has to be placed in internal memory.

Double buffer: This lets the encoder discard a coded frame to fulfill the requirements of the HRD (Hypothetical Reference Decoder). This is the default mode.

The figures above were obtained with a double buffer.
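
The ping-pong input buffer size from note (*) can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes that "MBytes" in the table means 2^20 bytes and that 480p stands for 800x480; neither is stated on this page, but both are consistent with the measured figures.

```python
# Assumed resolutions; 800x480 for "480p" is an inference, not a stated fact.
RESOLUTIONS = {"1080p": (1920, 1080), "720p": (1280, 720), "480p": (800, 480)}


def pingpong_input_bytes(width, height, bytes_per_pixel=1.5, buffers=2):
    """Size of the frame-mode input buffers: buffers x width x height x 1.5."""
    return int(buffers * width * height * bytes_per_pixel)


if __name__ == "__main__":
    for name, (width, height) in RESOLUTIONS.items():
        size = pingpong_input_bytes(width, height)
        print(f"{name}: {size} bytes = {size / 2**20:.2f} MBytes")
```

With these assumptions, the computed sizes round to the 5.93, 2.64 and 1.10 MBytes reported in the "Raw Frame" row.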

3.2. Slices Mode Encode

The following table shows the memory footprints when encoding 32-line slices.

Buffer Location	Data	IP	1080p	720p	480p
External RAM	Reference Frame + VENC Buffers	VENC (Internal)	4.56 MBytes	2.04 MBytes	0.91 MBytes
Internal SRAM	Raw Frame (Slice)	DCMIPP	90.03 KBytes	60.03 KBytes	37.53 KBytes
Internal SRAM	H264 Stream	VENC (Out)	96.04 KBytes	33.54 KBytes	18.14 KBytes
Total	-	-	4.74 MBytes	2.13 MBytes	0.96 MBytes

Please note that the compressed stream bitrate may vary significantly depending on the input stream and encoding parameters. The numbers above were measured while encoding an almost static image. In any case, the compressed stream buffer remains small compared with the uncompressed buffers.
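
The "Raw Frame (Slice)" row in the table above is close to what a single 32-line YUV 4:2:0 slice would occupy. The sketch below computes that theoretical size. It is not taken from the ST software, and it assumes that "KBytes" means 1024 bytes and that 480p stands for an 800-pixel-wide frame. The measured values are about 0.03 KBytes larger than the computed ones; the origin of that difference is not explained on this page.

```python
def slice_buffer_bytes(width, slice_lines=32, bytes_per_pixel=1.5):
    """Theoretical size of one YUV 4:2:0 slice buffer (hypothetical helper)."""
    return int(width * slice_lines * bytes_per_pixel)


if __name__ == "__main__":
    # (name, assumed frame width, measured KBytes from the table above)
    rows = (("1080p", 1920, 90.03), ("720p", 1280, 60.03), ("480p", 800, 37.53))
    for name, width, measured_kb in rows:
        computed_kb = slice_buffer_bytes(width) / 1024
        print(f"{name}: computed {computed_kb:.2f} KBytes, measured {measured_kb}")
```

Compared with the several MBytes of the frame-mode ping-pong buffers, this is what makes it practical to keep the input data in internal SRAM.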

4. Performance

As a standalone peripheral, VENC supports 1080p30 encoding.

Nevertheless, as real-time encoding involves a complete acquisition, encoding, and data transport flow, a bottleneck related to memory bandwidth may be encountered, especially when parallel use cases run simultaneously.

To characterize realistic performance, the following use case has been tested:

Camera acquisition (IMX335 and OV5640)
H.264 encoding (Main profile)
RTSP transmission

The quality of the encoded stream, the frame rate, and the stability were monitored.

The test is considered successful if:

The real frame rate matches the expected frame rate (no frames lost)
The quality of the output stream is good
The use case is stable (overnight or 24-hour tests)
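
One possible way to check the first criterion (no frames lost) is to compare frame timestamps against the nominal frame period. The helper below is a hypothetical sketch and does not describe how the tests on this page were actually run; it assumes timestamps in milliseconds, one per received frame.

```python
def count_lost_frames(timestamps_ms, expected_fps):
    """Estimate how many frames are missing from a list of frame timestamps."""
    period_ms = 1000.0 / expected_fps
    lost = 0
    for previous, current in zip(timestamps_ms, timestamps_ms[1:]):
        gap = current - previous
        # A gap of about N periods means N - 1 frames were skipped.
        lost += max(0, round(gap / period_ms) - 1)
    return lost


if __name__ == "__main__":
    # 20 fps stream (50 ms period) with one frame missing between 100 and 200 ms.
    stamps = [0, 50, 100, 200, 250]
    print("lost frames:", count_lost_frames(stamps, expected_fps=20))
```

Rounding the gap to the nearest number of periods tolerates moderate jitter, but the approach assumes timestamps are taken close to the capture point.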

4.1. OV5640

Encoding Mode	1080p20(*)	720p30	480p30
Frame	Ok	Ok	Ok
Slices	Ok	Ok	Ok

4.2. IMX335

Encoding Mode	1080p20(*)	720p30	480p30
Frame	Ok	Ok	Ok
Slices	Not OK (**)	Ok	Ok

(*) 1080p20 is the highest achievable frame rate.

(**) Please refer to chapter H264 Hardware Handshake encoding for an explanation of the specific issues related to encoding in 'Hardware Handshake' mode.

5. Low Power

5.1. Power saving strategy

Switch off unused memories & peripherals
Enter Sleep() mode between frame capture/processing steps
- DCMIPP keeps working in Sleep() mode
- A DCMIPP interrupt wakes up the device, which performs frame encoding and pushes the result to transport before going back to Sleep() mode
Decrease CPU/SYS clocks frequency as much as possible
No dynamic change of clock frequencies (could be done for optimization).

5.2. Power consumption

To measure realistic performance, the following use case was tested:

Camera acquisition (IMX335)
H.264 encoding (Main profile)
RTSP transmission

The quality of the encoded stream, the frame rate, and the stability were monitored.

The measure is considered valid if:

The real frame rate matches the expected frame rate (no frame lost)
The quality of the output stream is good
The use case is stable

The measurements were done using STLINK-V3 on STM32N6570-DK, reworked for power injection through external SMPS (see H264 power consumption setup).

They are related to VddCore only.

Measurements have been done on a single board and may slightly vary from one board to another.

CPU: 800 MHz / AXI: 400 MHz


Resolution / Frame rate	Frame	Slices
480p30	98.37 mW	111.03 mW
720p30	111.32 mW	116.93 mW
1080p15	107.58 mW	112.45 mW
1080p20	117.99 mW	N/A

CPU: 600 MHz / AXI: 400 MHz


Resolution / Frame rate	Frame	Slices
480p30	74.04 mW	84.41 mW
720p30	84.24 mW	89.25 mW
1080p15	80.97 mW	84.79 mW
1080p20	88.83 mW	N/A

CPU: 25 MHz / AXI: 200 MHz


Resolution / Frame rate	Frame	Slices
480p30	50.53 mW	54.05 mW
720p20	51.52 mW	54.22 mW
720p30	58.80 mW	N/A

CPU: 12.5 MHz / AXI: 400 MHz


Resolution / Frame rate	Frame	Slices
480p20	56.98 mW	65.94 mW
480p30	61.80 mW	N/A
720p20	63.45 mW	68.81 mW

CPU: 12.5 MHz / AXI: 200 MHz


Resolution / Frame rate	Frame	Slices
480p20	46.00 mW	50.21 mW
480p30	50.32 mW	N/A
720p20	51.36 mW	N/A
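
The tables above can be compared more easily by deriving two figures from them: the relative VddCore power overhead of slices mode over frame mode, and the energy spent per encoded frame. The sketch below does this for the CPU 800 MHz / AXI 400 MHz table. The function names are hypothetical, the input values are copied from that table, and the derived figures are simple arithmetic rather than additional measurements.

```python
# (use case, frame rate in fps, frame-mode mW, slices-mode mW)
# Values copied from the CPU 800 MHz / AXI 400 MHz table above.
MEASUREMENTS_800MHZ = [
    ("480p30", 30, 98.37, 111.03),
    ("720p30", 30, 111.32, 116.93),
    ("1080p15", 15, 107.58, 112.45),
]


def slices_overhead_percent(frame_mw, slices_mw):
    """Relative extra power of slices mode compared with frame mode, in %."""
    return (slices_mw - frame_mw) / frame_mw * 100.0


def energy_per_frame_mj(power_mw, fps):
    """Average energy per frame in millijoules (mW divided by frames/s)."""
    return power_mw / fps


if __name__ == "__main__":
    for name, fps, frame_mw, slices_mw in MEASUREMENTS_800MHZ:
        overhead = slices_overhead_percent(frame_mw, slices_mw)
        energy = energy_per_frame_mj(frame_mw, fps)
        print(f"{name}: slices +{overhead:.1f} %, frame mode {energy:.2f} mJ/frame")
```

Keep in mind that these figures relate to VddCore only, so they do not capture any difference in external memory power between the two modes.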