1. Introduction
The VENC IP, integrated into the STM32N6 device, provides hardware acceleration for H.264 and JPEG encoding.
This includes pre-processing for color conversion, cropping, and rotation.
Coupled with the DCMIPP IP, it enables real-time video encoding and streaming at a very low CPU load.
1.1. H264 supported standard
1.1.1. H264 Input Formats
- YCbCr 4:2:0 planar & semi-planar
- YCrCb 4:2:0 semi-planar
- YCbYCr & CbYCrY 4:2:2 interleaved
- RGB & BGR 444, 555, 565, 888
1.1.2. H264 Output Standard
- Codecs: H264 (MPEG4_Part10/AVC)
- Profiles: Baseline/Main/High, up to level 4.1
- Output format: byte unit stream or NAL unit stream
- H264 / MVC: Stereo High
The supported image size is from 96x96 to 4080x4080.
Input images with more than 12 bits per pixel bring no quality improvement.
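As a minimal sketch, the documented size range can be checked before configuring an encode session. The function name is hypothetical and is not part of any ST API. Only the 96..4080 range stated above is checked; any alignment constraint the hardware may impose is not covered.

```python
# Limits taken from the text above; the names are illustrative only.
MIN_SIDE = 96
MAX_SIDE = 4080


def is_supported_size(width: int, height: int) -> bool:
    """Return True if both dimensions fall inside the documented 96..4080 range."""
    return MIN_SIDE <= width <= MAX_SIDE and MIN_SIDE <= height <= MAX_SIDE


if __name__ == "__main__":
    for size in ((1920, 1080), (64, 64), (4096, 2160)):
        print(size, is_supported_size(*size))
```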
VENC can interleave the encoding of frames from several different streams.
This is possible because it keeps no internal state per stream: only the reference frames, located in external memory, are specific to each stream.
Each instance may run in parallel with different encoding parameters (resolution, for example).
The following pages focus on a classic use case where only one stream captured by a camera is encoded.
1.2. H264 Encoding Data Flow
1.2.1. DCMIPP
The DCMIPP (Digital Camera Memory Image Pixel Processor) has a crucial role in real-time encoding use cases.
It is connected to a camera through the CSI-2 serial interface and embeds an ISP (Image Signal Processor) block.
It provides three pipelines as output:
- Pipe0 ("dump pipe"): Dumps raw data (not connected to ISP).
- Pipe1 ("main pipe"): Feeds the encoder after frame processing (downsizing, color conversion, etc.)
- Pipe2 ("ancillary pipe"): Used for display (RGB only), AI, etc.
DCMIPP/Pipe1 may output a VENC-compatible format, typically YUV422 or YUV420.
Coupled with VENC, DCMIPP can be used in two modes:
- Frame mode
In this mode, DCMIPP captures a complete frame and stores it in external memory. When the software is signaled for frame completion, it is responsible for calling VENC and providing it with the frame location in memory.
- Slices mode / Hardware Handshake
In this mode, DCMIPP captures a certain number of lines (typically 32 lines). When done, it directly signals VENC, which in turn encodes this set of lines. This capture/encode sequence is reiterated until the frame is fully encoded. The software is signaled when the whole frame is finished being encoded. This mode does not require storing the complete input frame. This reduces the memory footprint and allows the input frame to be placed in internal memory. In addition to this reduction in footprint, it results in optimized memory bandwidth.
- Frame mode vs Slices Mode: Pros and Cons
Mode | Pros | Cons |
---|---|---|
Frame Mode | Real-time capture by the DCMIPP is decoupled from offline encoding by the VENC | Extra bandwidth is required to store the frame in external memory (by the DCMIPP) and to read it back (by the VENC). |
Slices Mode | The frame is transferred from the DCMIPP to the VENC using only internal memory, which avoids the double bandwidth to external memory | The VENC is coupled with the real-time DCMIPP, so the VENC must follow the DCMIPP's pace, smoothed only every (typically) 32 lines |
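To make the slice granularity concrete, the sketch below counts how many capture/encode iterations one frame needs with the typical 32-line slice. The helper name is hypothetical. How the hardware handles a final slice shorter than 32 lines is not described here, so the "last slice" value is only the arithmetic remainder.

```python
import math

SLICE_LINES = 32  # typical value quoted in the text


def slice_plan(frame_height: int, slice_lines: int = SLICE_LINES):
    """Return (number_of_slices, lines_in_last_slice) for one frame."""
    count = math.ceil(frame_height / slice_lines)
    last = frame_height - (count - 1) * slice_lines
    return count, last


if __name__ == "__main__":
    for height in (1080, 720, 480):
        count, last = slice_plan(height)
        print(f"{height} lines: {count} slices, last slice has {last} lines")
```

A 1080-line frame therefore needs 34 handshakes between DCMIPP and VENC, the last one covering 24 lines.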
1.2.2. Encode Data Flow Example
- Camera capture:
- The DCMIPP pipeline supports a maximum resolution of 5 megapixels (after a decimation factor of 1, 2, 4, or 8).
- Therefore, a maximum 40-megapixel raw sensor can be connected (5 megapixels * 8).
- Horizontal and vertical decimation is supported
- Uncompressed Frame written in Memory
- Frame format and resolution converted by DCMIPP
- Uncompressed frame may be located in internal or external memory depending on the encoding mode (HW handshake), the frame resolution, and the memory available.
- YUV 4:2:0 is the best format for memory footprint and is the native format of the encoder
- Uncompressed Frame Read by VENC
- VENC Reads Reference Frame from memory (YUV 4:2:0)
- Note that VENC reads chrominance (UV) data twice, resulting in an average bandwidth of 16 bpp
- VENC Writes Reference Frame to memory (YUV 4:2:0)
- Reference Frame in external memory depending on its resolution and memory available
- VENC uses its internal memory (VENCRAM) for encode
- VENC writes compressed output stream
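The data flow above can be turned into a rough bandwidth estimate. The sketch below is a back-of-the-envelope model, not a measurement: it assumes each stage touches every pixel once per frame, that the input frame is read once at 12 bpp, and it ignores the compressed output stream. The 16 bpp reference read follows from YUV 4:2:0 (8 bits luma + 4 bits chroma) with chroma fetched twice. The 1080p30 operating point and all names are illustrative.

```python
def stream_bandwidth_mb_s(width, height, fps, bits_per_pixel):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) for one full-frame pass per frame."""
    return width * height * fps * bits_per_pixel / 8 / 1e6


YUV420_BPP = 12           # 8 bits luma + 4 bits chroma per pixel
REF_READ_BPP = 8 + 2 * 4  # chroma read twice, giving 16 bpp on average

# Assumed stage list, derived from the bullet list above.
STAGES = {
    "DCMIPP writes input frame": YUV420_BPP,
    "VENC reads input frame": YUV420_BPP,
    "VENC reads reference frame": REF_READ_BPP,
    "VENC writes reference frame": YUV420_BPP,
}

if __name__ == "__main__":
    w, h, fps = 1920, 1080, 30  # example operating point
    total = 0.0
    for name, bpp in STAGES.items():
        bw = stream_bandwidth_mb_s(w, h, fps, bpp)
        total += bw
        print(f"{name}: {bw:.1f} MB/s")
    print(f"Total (model only): {total:.1f} MB/s")
```

In Slices mode the first two entries would target internal memory rather than external memory, which is where the bandwidth saving described earlier comes from.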
2. Code Footprints
ro code | ro data | rw data |
---|---|---|
50 KBytes | 62 KBytes | 512 Bytes |
Total read-only footprint (ro code + ro data): 112 KBytes
3. VideoBuffers Footprints
The memory footprints have been monitored while running real use cases (they are measured, not calculated).
- In Frame mode and Hardware Handshake mode
- For 1080p, 720p and 480p
- With a YUV 4:2:0 input frames format
- Encoding with "Main" profile
3.1. Frame Mode Encode
Buffer Location | Data | IP | 1080p | 720p | 480p |
---|---|---|---|---|---|
External RAM | Raw Frame(*) | DCMIPP | 5.93 MBytes | 2.64 MBytes | 1.10 MBytes |
External RAM | Reference Frame + VENC buffers (**) | VENC (Internal) | 6.55 MBytes | 2.92 MBytes | 1.25 MBytes |
Internal SRAM | H264 Stream | VENC (Out) | 88.22 KBytes | 31.98 KBytes | 17.81 KBytes |
Total | - | - | 12.57 MBytes | 5.59 MBytes | 2.37 MBytes |
(*) Frame mode uses a 'ping-pong' buffer as the input frame.
The total size used for the input raw frame is therefore 2 x Frame Height x Frame Width x 1.5 bytes (12 bits per pixel in YUV 4:2:0).
(**) The encoder may use one or two reference frames:
Single buffer: The encoder uses only one reference buffer. This saves memory but prevents the encoder from discarding coded frames.
Consider a single buffer if the reference frame must be placed in internal memory.
Double buffer: The encoder can discard a coded frame to meet the requirements of the HRD (Hypothetical Reference Decoder). This is the default mode.
The figures above were measured with double buffering.
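The ping-pong input buffer size from note (*) can be checked numerically, where the two frame dimensions are the height and width in pixels. The sketch assumes 1920x1080 and 1280x720 frames and that "MBytes" in the table means 2^20 bytes; neither assumption is stated in the text. Under these assumptions the results round to the 5.93 and 2.64 MBytes listed in the Raw Frame row.

```python
MIB = 1024 * 1024  # assumption: the table's "MBytes" are 2**20 bytes


def frame_mode_input_bytes(width, height, buffers=2, bytes_per_pixel=1.5):
    """Size of the input buffers: two full YUV 4:2:0 frames by default."""
    return int(buffers * width * height * bytes_per_pixel)


if __name__ == "__main__":
    # Assumed pixel dimensions for the two resolutions.
    for name, (w, h) in {"1080p": (1920, 1080), "720p": (1280, 720)}.items():
        size = frame_mode_input_bytes(w, h)
        print(f"{name}: {size} bytes = {size / MIB:.2f} MBytes")
```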
3.2. Slices Mode Encode
The following table shows the memory footprints when encoding 32-line slices.
Buffer Location | Data | IP | 1080p | 720p | 480p |
---|---|---|---|---|---|
External RAM | Reference Frame + VENC Buffers | VENC (Internal) | 4.56 MBytes | 2.04 MBytes | 0.91 MBytes |
Internal SRAM | Raw Frame (Slice) | DCMIPP | 90.03 KBytes | 60.03 KBytes | 37.53 KBytes |
Internal SRAM | H264 Stream | VENC (Out) | 96.04 KBytes | 33.54 KBytes | 18.14 KBytes |
Total | - | - | 4.74 MBytes | 2.13 MBytes | 0.96 MBytes |
Note that the compressed stream bitrate may vary significantly depending on the input stream and encoding parameters. The numbers above were measured while encoding an almost static image. In any case, the compressed stream buffer remains small compared with the uncompressed buffers.
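The Raw Frame (Slice) row can be approximated from the slice geometry. The formula below (one 32-line slice of YUV 4:2:0 data) is an assumption inferred from the figures, not something the text states, and the frame widths of 1920 and 1280 pixels are also assumed. It yields 90.00 and 60.00 KBytes (1 KByte = 1024 bytes), close to the measured 90.03 and 60.03 KBytes; the small remainder is not explained here.

```python
def slice_buffer_bytes(width, slice_lines=32, bytes_per_pixel=1.5):
    """Approximate size of one YUV 4:2:0 slice buffer (assumed formula)."""
    return int(width * slice_lines * bytes_per_pixel)


if __name__ == "__main__":
    # Assumed frame widths for the two resolutions.
    for name, width in {"1080p": 1920, "720p": 1280}.items():
        size = slice_buffer_bytes(width)
        print(f"{name}: {size} bytes = {size / 1024:.2f} KBytes")
```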
4. Performances
As a standalone peripheral, VENC supports 1080p30 encoding.
Nevertheless, as real-time encoding involves a complete acquisition, encoding, and data transport flow, a bottleneck related to memory bandwidth may be encountered, especially when parallel use cases run simultaneously.
To characterize realistic performance, the following use case has been tested:
- Camera acquisition (IMX335 and OV5640)
- H.264 encoding (Main profile)
- RTSP transmission
The quality of the encoded stream, the frame rate, and the stability were monitored.
The test is considered successful if:
- The real frame rate matches the expected frame rate (no frames lost)
- The quality of the output stream is good
- The use case is stable (overnight or 24-hour tests)
4.1. OV5640
Encoding Mode | 1080p20(*) | 720p30 | 480p30 |
---|---|---|---|
Frame | Ok | Ok | Ok |
Slices | Ok | Ok | Ok |
4.2. IMX335
Encoding Mode | 1080p20(*) | 720p30 | 480p30 |
---|---|---|---|
Frame | Ok | Ok | Ok |
Slices | Not Ok (**) | Ok | Ok |
(*) 1080p20 is the highest achievable frame rate
(**) Please refer to chapter H264 Hardware Handshake encoding for an explanation of the specific issues related to encoding in 'Hardware Handshake' mode.
5. Low Power
5.1. Power saving strategy
- Switch off unused memories & peripherals
- Enter Sleep mode between frame capture/processing phases
- DCMIPP keeps working in Sleep mode
- DCMIPP interrupts wake up the device, which encodes the frame and pushes it to the transport layer before going back to Sleep mode
- Decrease the CPU/SYS clock frequencies as much as possible
- Clock frequencies are not changed dynamically (this could be done as a further optimization).
5.2. Power consumption
To measure realistic performance, the following use case was tested:
- Camera acquisition (IMX335)
- H.264 encoding (Main profile)
- RTSP transmission
The quality of the encoded stream, the frame rate, and the stability were monitored.
The measure is considered valid if:
- The real frame rate matches the expected frame rate (no frame lost)
- The quality of the output stream is good
- The use case is stable
The measurements were done using STLINK-V3 on STM32N6570-DK, reworked for power injection through external SMPS (see H264 power consumption setup).
They are related to VddCore only.
Measurements have been done on a single board and may slightly vary from one board to another.
- CPU: 800 MHz / AXI: 400 MHz
Resolution / Frame rate | Frame | Slices |
---|---|---|
480p30 | 98.37 mW | 111.03 mW |
720p30 | 111.32 mW | 116.93 mW |
1080p15 | 107.58 mW | 112.45 mW |
1080p20 | 117.99 mW | N/A |
- CPU: 600 MHz / AXI: 400 MHz
Resolution / Frame rate | Frame | Slices |
---|---|---|
480p30 | 74.04 mW | 84.41 mW |
720p30 | 84.24 mW | 89.25 mW |
1080p15 | 80.97 mW | 84.79 mW |
1080p20 | 88.83 mW | N/A |
- CPU: 25 MHz / AXI: 200 MHz
Resolution / Frame rate | Frame | Slices |
---|---|---|
480p30 | 50.53 mW | 54.05 mW |
720p20 | 51.52 mW | 54.22 mW |
720p30 | 58.80 mW | N/A |
- CPU: 12.5 MHz / AXI: 400 MHz
Resolution / Frame rate | Frame | Slices |
---|---|---|
480p20 | 56.98 mW | 65.94 mW |
480p30 | 61.80 mW | N/A |
720p20 | 63.45 mW | 68.81 mW |
- CPU: 12.5 MHz / AXI: 200 MHz
Resolution / Frame rate | Frame | Slices |
---|---|---|
480p20 | 46.00 mW | 50.21 mW |
480p30 | 50.32 mW | N/A |
720p20 | 51.36 mW | N/A |
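To compare the two modes, the relative VddCore overhead of Slices mode over Frame mode can be derived from the measurements. The sketch below does this for the two highest clock configurations only, using values copied from the tables above. The dictionary layout and function name are illustrative.

```python
# (frame_mW, slices_mW) pairs copied from the tables above.
MEASUREMENTS = {
    "CPU 800 MHz / AXI 400 MHz": {
        "480p30": (98.37, 111.03),
        "720p30": (111.32, 116.93),
        "1080p15": (107.58, 112.45),
    },
    "CPU 600 MHz / AXI 400 MHz": {
        "480p30": (74.04, 84.41),
        "720p30": (84.24, 89.25),
        "1080p15": (80.97, 84.79),
    },
}


def overhead_percent(frame_mw, slices_mw):
    """Extra power drawn in Slices mode, as a percentage of Frame mode."""
    return (slices_mw - frame_mw) / frame_mw * 100.0


if __name__ == "__main__":
    for config, rows in MEASUREMENTS.items():
        print(config)
        for mode, (frame_mw, slices_mw) in rows.items():
            print(f"  {mode}: +{overhead_percent(frame_mw, slices_mw):.1f} %")
```

In these two configurations Slices mode draws roughly 4 % to 14 % more than Frame mode, with the largest gap at 480p30. As noted above, the values come from a single board.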