
  • 2022-09-23 10:07:24

Performance Analysis of NHWC Data Format in EdgeBoard Embedded AI Solution

EdgeBoard is an embedded AI solution developed by Baidu on FPGA chips. Its high-performance acceleration engine provides 3.6 TOPS of computing power, and the complete embedded reference design makes hardware integration easy and convenient. At present, EdgeBoard offers two forms for hardware integration, an FPGA soft core and a computing-card module, as well as two basic hardware products for project deployment, a capture camera and a computing box. EdgeBoard is deeply integrated with Baidu Brain's model resources and tool platforms (EasyDL/AI Studio), which greatly lowers the threshold for development verification, product integration, research and teaching, and project deployment. It is suitable for security monitoring, industrial quality inspection, medical diagnosis, crop growth monitoring, autonomous driving, unmanned retail, and other scenarios.


Logical vs. Physical Representation of Data Formats

In deep learning, the NCHW, NHWC and CHWN data formats are often used to represent data, where N, H, W, and C are defined as follows:

N: the number of pictures in a batch, i.e. how many pictures are processed at one time

H: the number of pixels in the vertical direction, i.e. the height of the picture

W: the number of pixels in the horizontal direction, i.e. the width of the picture

C: the number of channels, e.g. 1 for grayscale images and 3 for color RGB images

The following figure shows the data arrangement for N=2, C=16, H=5, W=4, where the left side is the logical representation and the right side is the physical representation.

[Figure: logical (left) and physical (right) representations of N=2, C=16, H=5, W=4 data]

Taking NCHW as an example, its logical representation is shown in the upper-left figure. For n=0, the three axes mark the C, H, and W directions. The first element is 000; the next elements run along the W direction, i.e. 001, 002, 003; then along the H direction, i.e. 004, 005, 006, 007, and so on up to 019; then along the C direction, 020, 021, 022, ... up to 319; finally along the N direction, i.e. n=1, repeating the W, H, and C directions again.

According to the above division of NCHW, the physical address representation is defined as follows (as shown in the upper right figure):

[a:0] indicates the W direction: within one row, from left to right

[a:1] indicates the H direction: row by row, from top to bottom

[a:2] indicates the C direction: from one channel to the next

[a:3] indicates the N direction: from n=0 to n=1

The physical distribution (the one-dimensional layout in memory) of the NCHW data format is therefore 000 001 002 003 004 ... 018 019 020 ... 318 319 320 ... 637 638 639. It can be understood as laying out all the pixels of one channel row by row, then the next channel, and arranging n=1 after n=0 is complete.
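As a quick check, a small NumPy sketch (the element labels are the NCHW logical indices 0..639 used above) reproduces this order:

```python
import numpy as np

# Label every element with its NCHW logical index 0..639
N, C, H, W = 2, 16, 5, 4
nchw = np.arange(N * C * H * W).reshape(N, C, H, W)

# Flattening an NCHW array in C (row-major) order walks W, then H,
# then C, then N -- exactly the physical order described above.
flat_nchw = nchw.flatten()
print(flat_nchw[:6])   # 000 001 002 003 004 005 ...
print(flat_nchw[-3:])  # ... 637 638 639
```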

Similarly, NHWC runs first along the C direction, then the W direction, then the H direction, and finally the N direction. The storage order in memory is: the first element is 000, and the next elements run along the C direction, i.e. 020, 040, 060, ... up to 300; then switch to the W direction: 001, 021, 041, 061, ... 301, and so on until 303; then switch to the H direction: 004, 024, ... 304, eventually reaching 319; finally switch to the N direction: 320, 340, ... all the way up to 639.

[b:0] indicates the C direction: the first pixel runs from one channel to the next

[b:1] indicates the W direction: from the first pixel of the last channel back to the second pixel of the first channel

[b:2] indicates the H direction: from the last pixel of the first row of the last channel back to the first pixel of the second row of the first channel

[b:3] indicates the N direction: from n=0 to n=1

The physical representation of NHWC is 000 020 ... 300 001 021 ... 283 303 004 ... 319 320 340 ... 339 359 ... 639. It can be understood as laying out all the channels of one pixel first, then the next pixel, and arranging n=1 after n=0 is complete.
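The same NumPy sketch confirms the NHWC order (again labeling elements with their NCHW logical indices, as in the figures):

```python
import numpy as np

# Elements labelled with their NCHW logical indices, as in the figures
N, C, H, W = 2, 16, 5, 4
x = np.arange(N * C * H * W).reshape(N, C, H, W)

# Rearranging to NHWC and flattening reproduces the order above:
# all 16 channel values of pixel (0,0) first, then pixel (0,1), ...
flat_nhwc = x.transpose(0, 2, 3, 1).flatten()
print(flat_nhwc[:3])   # 0 20 40 -- first pixel, channels 0, 1, 2
print(flat_nhwc[16])   # 1       -- second pixel, channel 0
```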

In the same way, CHWN's logical representation runs first along the N direction, then the W direction, then the H direction, and finally the C direction.

[c:0] indicates the N direction: from the first pixel of n=0 to the first pixel of n=1

[c:1] indicates the W direction: from the first pixel of n=1 back to the second pixel of n=0

[c:2] indicates the H direction: from the last pixel of the first row of n=1 back to the first pixel of the second row of n=0

[c:3] indicates the C direction: from the last pixel of the first channel of n=1 back to the first pixel of the second channel of n=0

The physical representation of CHWN is 000 320 001 321 ... 003 323 004 324 ... 019 339 020 340 ... It can be understood as laying out the first pixel of the first channel of all N pictures in the batch, then the second pixel, and then moving on to the second channel, the third channel, and so on.
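A one-line check with the same labelled array verifies the CHWN interleaving of the two pictures:

```python
import numpy as np

# Elements labelled with their NCHW logical indices
N, C, H, W = 2, 16, 5, 4
x = np.arange(N * C * H * W).reshape(N, C, H, W)

# CHWN: the innermost axis is N, so the two pictures interleave
# element by element: 000 320 001 321 002 322 ...
flat_chwn = x.transpose(1, 2, 3, 0).flatten()
print(flat_chwn[:6])  # 0 320 1 321 2 322
```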

The offset address of the data in memory

A large amount of data calculation is involved in deep learning, and the calculation needs to fetch data from memory, so it is necessary to calculate the offset address of the data for fetching. With the above logical and physical representations, a formula for mapping the 4-dimensional logical representation (n, c, h, w) to an offset address in 1-dimensional memory can be derived.

Define the position (n, c, h, w) to represent the wth column of the hth row of the cth channel of the nth batch. The offset address of this position in memory then depends on the data format:

NCHW: offset_nchw(n, c, h, w) = n * CHW + c * HW + h * W + w

NHWC: offset_nhwc(n, c, h, w) = n * HWC + h * WC + w * C + c

CHWN: offset_chwn(n, c, h, w) = c * HWN + h * WN + w * N + n

where N, C, H, W are constants and n, c, h, w are variables.
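The three formulas translate directly into Python; as a sketch, the constants below use the example dimensions N=2, C=16, H=5, W=4 from earlier:

```python
# Offset formulas from the text, with the example dimensions
N, C, H, W = 2, 16, 5, 4

def offset_nchw(n, c, h, w):
    return n * C * H * W + c * H * W + h * W + w

def offset_nhwc(n, c, h, w):
    return n * H * W * C + h * W * C + w * C + c

def offset_chwn(n, c, h, w):
    return c * H * W * N + h * W * N + w * N + n

# e.g. position (n=1, c=1, h=0, w=1) in NCHW maps to offset 341
print(offset_nchw(1, 1, 0, 1))  # 341
```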

In NCHW, CHW = C*H*W is the size of one picture in the batch; it can be understood as one BGR 3-channel picture, a cube of data. HW = H*W is a plane: one channel of that BGR picture (a grayscale image consists of a single such channel). W is one row of a channel.

[Figure: computing the memory offset of element 341, at position (n=1, c=1, h=0, w=1)]

Taking the figure above as an example, suppose we want to locate the green circle, i.e. element 341 at position (n=1, c=1, h=0, w=1). We first skip all the data of n=0 (CHW elements, the blue box pointed to by arrow 1 in the figure); then skip the first channel of n=1 (HW elements, the blue box pointed to by arrow 2). Now inside the second channel of n=1, we skip h rows (here 0*W elements); finally we skip w elements to reach the offset position.

Why EdgeBoard uses NHWC

Let's analyze the reasons why EdgeBoard chose the NHWC data format.

[Figure: the calculation process of convolution]

The figure above shows the calculation process of convolution. By the nature of the convolution operation, the values of all channels at the same window position are multiplied by the convolution parameters and then accumulated. This can be computed in two orders:

[Figure: the two calculation orders for convolution]

Pixel first, then channel: multiply one channel's sliding window by the convolution parameters and accumulate, then proceed to the next channel, until all channels have been multiplied and accumulated. Channel first, then pixel: multiply all channel values at one position of the window by the corresponding parameters and accumulate, then move to the next position. For example, the calculation formula for the first sliding window:

[Figure: calculation formula for the first sliding window]

It can be seen that the results of the two methods are the same.
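A small NumPy sketch (with made-up random values, purely to illustrate the two accumulation orders) confirms that they produce the same result:

```python
import numpy as np

# A 3x3 window with 3 channels and matching kernel weights
# (random values; only the order of accumulation matters here).
rng = np.random.default_rng(0)
window = rng.standard_normal((3, 3, 3))  # (channel, row, col)
kernel = rng.standard_normal((3, 3, 3))

# Pixel first, then channel: finish one whole channel, then the next
per_channel = sum((window[c] * kernel[c]).sum() for c in range(3))

# Channel first, then pixel: finish all channels of one position first
per_pixel = sum((window[:, i, j] * kernel[:, i, j]).sum()
                for i in range(3) for j in range(3))

print(np.isclose(per_channel, per_pixel))  # True
```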

For the NHWC format, i.e. channel first and then pixel, the data of all channels of one pixel are stored together. In the figure above, the 3 channel values of the first pixel, the 3 channel values of the second pixel, and the 3 channel values of the third pixel are all consecutive in memory, so the numbers needed for one row of the kernel can be fetched in a single read; a 3x3 kernel needs only three fetches.

For the NCHW format, i.e. pixel first and then channel, all the pixels of one channel are stored in order. For a 3*3 convolution kernel, after reading 3 consecutive numbers we must jump ahead to the next row before reading the next 3. One channel requires 3 fetches, and 3 channels require 9.
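The difference in fetch patterns can be sketched with a rough counting model, assuming each unbroken run of addresses costs one read (a simplification, not a model of real DDR behavior):

```python
# Count contiguous memory reads needed to gather one KxK window,
# assuming each unbroken run of addresses costs a single read.
def reads_nhwc(K, C):
    # Each kernel row covers K adjacent pixels x C channels, which is
    # one contiguous span in NHWC -> one read per kernel row.
    return K

def reads_nchw(K, C):
    # In NCHW each row of each channel is a separate span:
    # K rows per channel, C channels.
    return K * C

K, C = 3, 64
print(reads_nhwc(K, C), reads_nchw(K, C))  # 3 vs 192
```

With many channels, as in real networks, the gap between the two formats grows linearly with C.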

In a real network, the number of channels is usually far larger than the convolution kernel size (not just the 3 channels of the figure above; typically dozens or hundreds of channels). For the NHWC format, the number of fetches is therefore much smaller than for NCHW. For EdgeBoard, in order to broaden the range of networks it supports and relax the restrictions on networks with large input sizes and large weight storage, the NHWC format allows Feature Map and Weight data to be read into the FPGA's on-chip cache in batches. For example, with a 3x3 Kernel, we can read only three lines (3WC) of Feature Map data into the FPGA, compute one line of output, and transfer it to the off-chip large-capacity DDR cache without depending on the next 3WC of Feature Map input; the input and output transfers of each batch can thus be completed independently.

As another example, the Weight data can be divided into N parts according to the size of the FPGA's on-chip cache; each part is sent to the FPGA for a convolution operation, and the results are transferred back to DDR and spliced together, which is equivalent to performing one large convolution. The advantage is that the computation can be matched to FPGA devices of different capacities, which greatly improves the hardware adaptability of the code. In addition, since the data correlation along the C dimension is weak, the NHWC format can better exploit the high parallelism of FPGAs and make full use of their computing power.
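The splitting-and-splicing idea can be illustrated with a 1x1 convolution, which reduces to a matrix multiply. The sizes below are arbitrary, and this is only a numerical sketch of the splicing, not EdgeBoard's actual implementation:

```python
import numpy as np

# A 1x1 convolution is a matrix multiply: out = weights @ pixels.
rng = np.random.default_rng(1)
weights = rng.standard_normal((64, 16))  # 64 output ch, 16 input ch
pixels = rng.standard_normal((16, 100))  # 16 channels, 100 pixels

full = weights @ pixels

# Split the weights into 4 parts, compute each partial result
# separately (as a capacity-limited on-chip cache would require),
# then splice the outputs back together.
parts = [w_part @ pixels for w_part in np.split(weights, 4, axis=0)]
spliced = np.vstack(parts)

print(np.allclose(full, spliced))  # True
```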

The following table shows the network performance of EdgeBoard using NHWC data format:

[Table: network performance of EdgeBoard using the NHWC data format]