NanoDet: a 1.8 MB ultra-lightweight object detection model that runs faster than YOLO and passed 200 GitHub stars in two days

How do you port an anchor-free model to mobile or embedded devices? In this project, the author makes all three modules of a single-stage detector lightweight, obtaining NanoDet-m, a model that is only 1.8 MB in size and extremely fast.

Object detection has long been a major problem in computer vision. Its goal is to find all regions of interest in an image and to determine their locations and categories. Deep-learning detection methods have been developed for many years and currently fall into two main families: two-stage and single-stage detectors. Two-stage frameworks first generate candidate regions and then classify them into object categories, e.g. R-CNN and Fast R-CNN. Single-stage frameworks treat detection as a unified end-to-end regression problem, with representative models such as MultiBox, YOLO, and SSD; these frameworks are usually simpler in structure and faster at inference.

Detection methods can also be divided into anchor-based and anchor-free approaches. Anchor-free methods have flourished this year with many new attempts, but on mobile devices anchor-based models such as the YOLO series and SSD have remained dominant. Recently, a project named NanoDet appeared on GitHub: an open-source, real-time, anchor-free detection model for mobile devices that aims to deliver performance no worse than the YOLO series while remaining easy to train and to port. Only two days after launch, the project passed 200 stars.

NanoDet is an ultra-fast, lightweight anchor-free object detection model.
The model has the following advantages. Although it is very lightweight, its performance should not be underestimated. For comparison with other models, the project author uses COCO mAP, which accounts for both classification and localization accuracy, as the evaluation metric, tests on the 5,000 images of COCO val, and does not use test-time augmentation. Under this setting, NanoDet reaches 20.6 mAP with 320×320 input, 4 points higher than YOLOv3-tiny and only 1 point lower than YOLOv4-tiny. When the input resolution is matched to YOLO's (416×416), NanoDet and YOLOv4-tiny score the same. Detailed results are shown in the table below.

In addition, the author deployed the model with ncnn on a phone and ran a benchmark: the model's forward pass takes only about 10 ms, while YOLOv3-tiny and YOLOv4-tiny are on the order of 30 ms. In an Android camera demo app, including image preprocessing, detection-box post-processing, and box drawing, NanoDet easily runs at 40+ FPS.

According to the author, the main purpose of the project is to open-source a real-time anchor-free detection model for mobile devices that offers performance no worse than the YOLO series and is easy to train and port. To this end, he referred to the following research.

The author originally wanted to implement an FCOS-style anchor-free detector, but when slimming FCOS down, the centerness branch proved hard to train to convergence on a lightweight model, so the results fell short of expectations. NanoDet therefore adopts the Generalized Focal Loss proposed by Li Xiang et al.
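Generalized Focal Loss merges the classification score and a localization-quality target (e.g. the IoU of the predicted box) into one soft label, which is what allows the separate centerness branch to be dropped. A minimal NumPy sketch of its Quality Focal Loss component, following the formula from Li et al. (the function name and the default β=2 are illustrative, not NanoDet's actual code):

```python
import numpy as np

def quality_focal_loss(logits, quality_targets, beta=2.0):
    """Quality Focal Loss (Li et al., Generalized Focal Loss).

    The one-hot class label is replaced by a soft target y in [0, 1]
    (the localization quality), and the focal modulator |y - sigma|^beta
    down-weights well-estimated examples.
    """
    sigma = 1.0 / (1.0 + np.exp(-logits))  # sigmoid classification score
    # binary cross-entropy against the soft quality target
    ce = -(quality_targets * np.log(sigma + 1e-12)
           + (1.0 - quality_targets) * np.log(1.0 - sigma + 1e-12))
    modulator = np.abs(quality_targets - sigma) ** beta
    return modulator * ce
```

When the predicted score matches the quality target, the modulator drives the loss toward zero; a confident wrong prediction is penalized heavily.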
This loss function removes FCOS's centerness branch, along with the many convolutions on that branch, thereby reducing the computational overhead of the detection head; this makes it very well suited to lightweight mobile deployment.

The FCOS series uses one shared set of convolutions to predict detection boxes on the multi-scale feature maps from the FPN, then scales the predicted boxes on each level with a learnable per-level scale coefficient. The advantage of this approach is that the detection head has only 1/5 as many parameters as a head without weight sharing. That matters a great deal for large models with hundreds of convolution channels, but for lightweight models a shared-weight head brings little benefit: mobile inference runs on the CPU, so sharing weights does not speed it up, and when the head is already very light, sharing weights further reduces detection ability. The project author therefore considers it appropriate to use a separate set of convolutions for each feature level.

The FCOS series also uses Group Normalization in the detection head. GN has many advantages over BN, but one drawback: at inference time, BN's normalization parameters can be folded directly into the preceding convolution, skipping the normalization step entirely, while GN's cannot. To save the cost of the normalization operation, the author replaces GN with BN.

The FCOS head uses four 256-channel convolutions per branch, i.e. eight convolutions with C = 256 across the box-regression and classification branches. To lighten this, the author first replaces the ordinary convolutions with depthwise separable convolutions and reduces the number of stacked convolutions from four to two. For the channel count, 256 dimensions are compressed to 96.
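The BN-into-conv fusion mentioned above is a standard inference-time optimization: BatchNorm's affine transform collapses into a per-output-channel rescale of the convolution's weights and bias. A small NumPy sketch (function name and array shapes are illustrative):

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into the preceding conv for inference.

    w: conv weights, shape (out_c, in_c, kh, kw)
    b: conv bias, shape (out_c,)
    gamma, beta, mean, var: BN scale, shift, running mean/variance, (out_c,)
    """
    scale = gamma / np.sqrt(var + eps)          # per-output-channel scale
    w_folded = w * scale[:, None, None, None]   # rescale conv weights
    b_folded = (b - mean) * scale + beta        # fold mean/shift into bias
    return w_folded, b_folded
```

After folding, `conv(x, w_folded) + b_folded` equals `bn(conv(x, w) + b)`, so the normalization costs nothing at inference, which is exactly the property GN lacks.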
The reason 96 is chosen is that the channel count should be kept a multiple of 8 or 16 to benefit from the parallel acceleration of most inference frameworks. Finally, borrowing from the YOLO series, the author computes box regression and classification with the same set of convolutions and then splits the output into two parts. The resulting lightweight detection head is shown in the figure below.

There are many improvements on FPN, such as the BiFPN used in EfficientDet, the PAN used in YOLOv4 and YOLOv5, and Balanced FPN. BiFPN is powerful, but its stacked feature-fusion operations slow inference down, whereas PAN has only two paths, top-down and bottom-up; its simplicity makes it a good choice for feature fusion in lightweight models.

The original PAN, and the PAN in the YOLO series, use stride-2 convolutions to downscale from large-scale to small-scale feature maps. For the sake of lightness, this project removes all convolutions in PAN, keeping only the 1×1 convolutions that align the channel dimensions of the features extracted from the backbone; both upsampling and downsampling are done by interpolation. Unlike the concatenation used by YOLO, the author fuses the multi-scale feature maps by elementwise addition, which makes the whole feature-fusion module extremely cheap to compute.

For the backbone, the author chooses ShuffleNetV2 1.0x, removes the network's last convolution layer, and feeds the features downsampled 8×, 16×, and 32× into PAN for multi-scale fusion. The backbone uses the code provided by torchvision, so it can directly load torchvision's ImageNet-pretrained weights, which greatly helps speed up model convergence.