Huawei Noah's Ark Lab (Canada) proposes BANet, a bidirectional visual attention mechanism for monocular depth estimation

This is a work that applies a bidirectional attention mechanism to monocular depth estimation. The main innovation is to add forward and backward attention modules on top of a visual attention mechanism, which effectively integrate local and global information to eliminate ambiguity. The paper also broadens the range of applications of visual attention mechanisms and is worth studying.

The paper proposes the bidirectional attention network (BANet), an end-to-end framework for monocular depth estimation that addresses the limited ability of convolutional neural networks to integrate local and global information. The structure of the mechanism draws on the strong conceptual grounding of neural machine translation, and a lightweight adaptive computation control similar to the dynamic behavior of recurrent neural networks is proposed. Bidirectional attention modules are introduced that use the feedforward feature maps together with the global context to filter out ambiguity.

Extensive experiments show the strength of the bidirectional attention model over feedforward baselines and other advanced methods for monocular depth estimation on two challenging datasets, KITTI and DIODE. The authors show that their method performs better than, or at least on par with, state-of-the-art monocular depth estimation methods, while using less memory and computation.

Classical computer vision methods estimate depth with multi-view stereo geometry algorithms. Recently, deep-learning-based methods have recast monocular depth estimation (MDE) as a dense, pixel-level continuous regression problem. Current state-of-the-art MDE models are built from modules such as a pretrained convolutional backbone with upsampling and skip-connection modules, a global context module, a log-space discretization module for ordinal regression, and a coefficient-learner module for upsampling under a local planar assumption. All of these design choices are directly or indirectly constrained by the spatial downsampling operations in the backbone architecture, a limitation that has already been observed in pixel-level tasks.

Because MDE is a single-plane estimation problem, the paper adopts the idea of a depth-to-space (D2S) transformation as a remedy for the downsampling operations at the decoding stage (a minimal sketch of this operation is given below). However, directly applying a D2S transformation to the final feature map can suffer from the lack of global context needed for reliable estimation. The method therefore also injects global context information into the single-channel feature maps obtained after the D2S transformation of each backbone stage. In addition, a bidirectional attention module effectively aggregates information from all stages of the backbone.

To address these outstanding problems in MDE, the paper proposes a novel and effective way to estimate a continuous depth map from a single image: the bidirectional attention network. Although the architecture contains more connections than state-of-the-art models, most of the interactions are computed on single-channel, D2S-transformed features, so the computational complexity and the number of parameters are lower than in other recent models.
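To make the depth-to-space idea concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' released code) of such a head: a 1×1 convolution projects a low-resolution stage feature map to r² channels, and a parameter-free PixelShuffle rearranges those channels into a single-channel map at r times the spatial resolution. The module name, channel counts, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class D2SHead(nn.Module):
    """Minimal depth-to-space head (illustrative): project a stage feature map to
    r*r channels with a 1x1 convolution, then rearrange those channels into a
    single-channel map at r-times higher resolution with a parameter-free PixelShuffle."""
    def __init__(self, in_channels: int, upscale: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, upscale * upscale, kernel_size=1)
        self.d2s = nn.PixelShuffle(upscale)  # depth-to-space rearrangement

    def forward(self, stage_feat: torch.Tensor) -> torch.Tensor:
        return self.d2s(self.proj(stage_feat))  # (N, 1, H*r, W*r)

# Example: a stage feature map at 1/8 of a 352x1216 input resolution
feat = torch.randn(1, 256, 44, 152)
print(D2SHead(256, 8)(feat).shape)  # torch.Size([1, 1, 352, 1216])
```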
Figure 1: An example of BANet prediction on the KITTI validation set. BANet improves the overall depth estimate by generating per-stage attention weights and reducing ambiguity in the network.

The contributions are as follows.

1. This is the first work to apply the concept of a bidirectional attention mechanism to the monocular depth estimation task. The method can be combined with any existing CNN.

3. Extensive experiments were carried out on two different MDE datasets. They demonstrate the effectiveness and efficiency of the proposed method, and several variants of the proposed mechanism are compared against recent state-of-the-art network architectures.

The idea of bidirectional attention was first introduced in the field of neural machine translation. Although recent work has applied channel-wise and spatial attention to CNNs for various computer vision tasks, the idea of applying attention in a forward and backward manner, to obtain the properties of a bidirectional RNN, has not been widely explored.

The overall architecture of the proposed method is shown in Figure 2. In NMT terms, the stage-wise feedforward features in BANet are analogous to the individual words of a source sentence. A bidirectional RNN, however, processes the source sentence dynamically, word by word, and therefore inherently produces forward and backward hidden states.

Because a CNN processes its input image statically, the method instead introduces two distinct attention modules, denoted the forward and backward attention submodules. The bidirectional attention module takes the stage-wise feature maps as input and filters out ambiguity by merging in the global context.

Because the task of MDE is a single-plane estimation problem, the authors use a 1×1 convolution followed by an efficient, parameter-free depth-to-space operation to bring the feature maps of each stage to the required spatial resolution. In the forward and backward attention operations of the network, the superscripts F and B denote the forward- and backward-attention operations respectively, and the subscript i denotes the associated stage of the backbone feature maps. The 9×9 convolution of the forward attention for stage i has access to the features of the stages up to and including i, while the 9×9 convolution of the backward attention has access to the features from stage i onward; together they emulate the forward and backward passes of a bidirectional RNN. Next, all forward and backward attention maps are concatenated along the channel dimension and processed by a 3×3 convolution and a softmax to produce pixel-level attention weights A for each stage. The D2S module computes the feature representation f_i from the stage-wise feature map s_i. Then, a Hadamard (element-wise) product and a pixel-wise summation over the parallel features F and the attention maps A give the unnormalized prediction, and the normalized prediction is produced by a sigmoid (σ) function. A sketch of these operations, assembled from this description, is given below.

In the D2S module, the global context is incorporated by applying average pooling with a relatively large kernel, followed by a fully connected layer and bilinear upsampling. This operation combines pixel-level, local information with image-level, global information to extract better monocular cues from the whole image.
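As a rough summary of the pipeline just described, and assuming n backbone stages with stage-wise feature maps s_i, the operations can be sketched as follows. The notation (a_i^F and a_i^B for the forward and backward attention maps, A for the pixel-level stage weights, f_i for the D2S features, ŷ for the prediction) is pieced together from the description above rather than copied from the paper, and the global-context pooling inside the D2S module is omitted for brevity:

```latex
a_i^{F} = \operatorname{conv}^{F}_{9\times 9}\big([\,s_1,\dots,s_i\,]\big), \qquad
a_i^{B} = \operatorname{conv}^{B}_{9\times 9}\big([\,s_i,\dots,s_n\,]\big)

A = \operatorname{softmax}\!\Big(\operatorname{conv}_{3\times 3}\big([\,a_1^{F}, a_1^{B}, \dots, a_n^{F}, a_n^{B}\,]\big)\Big)

f_i = \operatorname{D2S}\!\big(\operatorname{conv}_{1\times 1}(s_i)\big)

\hat{y} = \sigma\Big(\textstyle\sum_{i=1}^{n} A_i \odot f_i\Big)
```

Here [·, ·] denotes channel-wise concatenation and ⊙ the Hadamard product; the softmax is assumed to act per pixel across the stage dimension, so that the stage weights at each pixel sum to one.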
This aggregation of global context in the D2S module helps to resolve the ambiguity around thinner objects in more challenging situations. In addition, several alternative instantiations of the proposed architecture are detailed:

BANet-Vanilla: only the backbone, followed by a 1×1 convolution, a single D2S operation, and a sigmoid to produce the final depth prediction. This is very similar to the depth prediction model used in the RefinedMPL network for monocular 3D detection.

BANet-Markov: this variant follows the Markov assumption, i.e. the features at each time step depend only on the immediately preceding or following features. All edges in Figure 2 are therefore disabled except the immediately preceding and following edges into the 9×9 convolutions. In addition, an experiment with a structure without this temporal dependence was carried out by simply concatenating the different stage features and applying similar post-processing; this naive implementation performed much worse than the time-dependent implementations above, so it is excluded from further experimental analysis.

Evaluation metrics: in the MDE literature, accuracy and error metrics are used to compare methods, but there is a lack of consistency in the metrics used across datasets. In this work, a unified set of metrics is used for the experiments across all subsets of the different datasets.

For the error metrics, the paper mainly follows the KITTI leaderboard. In addition, because the traditional accuracy metric is a threshold measure of the relative prediction within an interval, the authors modify it to obtain a stricter measure: the set of thresholds is expanded, with more thresholds placed between the low end of the range and the existing smallest threshold. The values of k are {5, 10, 15, 25, 56, 95}. This stricter set of metrics provides better insight for autonomous driving applications, where highly accurate depth estimation is crucial; a small illustration of this threshold metric is given at the end of the article. The results are compared with existing state-of-the-art methods. Note that DORN performs particularly poorly due to the granularity imposed by its discretization, i.e. the number of ordinal levels. Setting this hyperparameter to a higher value may improve its accuracy, but increasing it leads to an exponential increase in memory consumption during training, which makes it difficult to train on large datasets.

In general, the BANet variants perform better than, or close to, the existing state of the art, in particular BTS. It is worth noting that the heaviest BANet variant has 25% fewer parameters than BTS.

Figures 4 and 5 show a qualitative comparison of the different state-of-the-art methods. The DORN prediction maps clearly show the semi-discrete nature of its output. For the indoor and outdoor images in Figure 4, the high-intensity regions of BTS and DenseDepth surround the window frame at the bottom right of the left image and the tree trunk in the right image.

The BANet variants perform better in these regions where the other methods are weak. In particular, BANet-Full, with its global context aggregation, performs better than its local counterpart. Figure 5 shows a similar situation, but this time in the dark.
From these visualizations, it is clear that global context aggregation is important for resolving the potential ambiguity caused by unfavorable lighting.
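As an illustration of the stricter accuracy metric described earlier, here is a small, hypothetical NumPy sketch. It assumes the standard MDE threshold ratio δ = max(pred/gt, gt/pred) and interprets the k values as percentage thresholds, i.e. a pixel counts as correct when δ < 1 + k/100 (so k = 25, 56, 95 roughly correspond to the traditional 1.25, 1.25², 1.25³ thresholds); the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def threshold_accuracy(pred: np.ndarray, gt: np.ndarray, k: float) -> float:
    """Fraction of valid pixels whose relative error ratio
    delta = max(pred/gt, gt/pred) is below 1 + k/100."""
    valid = gt > 0                                    # ignore pixels without ground truth
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < 1.0 + k / 100.0))

# Toy example on synthetic depths; k = 5, 10, 15 are the stricter added thresholds,
# while k = 25, 56, 95 roughly match the traditional 1.25, 1.25^2, 1.25^3 thresholds.
rng = np.random.default_rng(0)
pred = rng.uniform(1.0, 80.0, size=(375, 1242))
gt = pred * rng.uniform(0.8, 1.2, size=pred.shape)    # synthetic ground truth
for k in (5, 10, 15, 25, 56, 95):
    print(f"delta < {1 + k / 100:.4f}: {threshold_accuracy(pred, gt, k):.3f}")
```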