Facebook paper shares details of Oculus Quest hand tracking technology

At the 2020 Conference on Computer Vision and Pattern Recognition (CVPR), held in June, Michael Abrash, chief scientist of Facebook Reality Labs, introduced the team's research and related progress via video. Among other things, Abrash demonstrated an optimized hand tracking capability and said that the system developed by Facebook can track rapid movements of hands and fingers with considerable accuracy. He also pointed out that optical hand and finger tracking will become an important component of the spatial computing paradigm.

Most previous studies on hand tracking have relied on external depth cameras or RGB cameras. A depth camera can provide a 2.5D point-cloud image of hand geometry, but it requires additional hardware and power. In contrast, RGB cameras are easier to integrate, and with the progress of deep learning their practicality keeps improving. Using a single RGB camera and a neural network to predict hand pose has therefore become a hot research topic.

Facebook Reality Labs proposes a real-time hand tracking system for driving virtual reality and augmented reality experiences. Using four fisheye monochrome cameras, the system generates accurate, low-jitter 3D hand poses. The researchers use neural networks to detect the hands and to estimate the positions of keypoints on each hand. The hand detection network handles a wide variety of real-world environments reliably, while the keypoint estimation network uses tracking history to generate spatiotemporally consistent poses. The team also designed a scalable semi-automatic mechanism to collect a large amount of diverse ground-truth data through a combination of manual annotation and automatic tracking. In addition, the researchers introduced a tracking-based detection method to reduce computational cost and improve smoothness.
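The tracking-based detection idea can be sketched roughly as follows: run the full-image detector only when the hand is lost, and otherwise derive the next search box from the previously tracked keypoints. The function names (`run_detnet`, `run_keynet`) and the confidence threshold are illustrative placeholders, not the paper's actual API.

```python
import numpy as np

def keypoints_to_box(keypoints, margin=1.5):
    """Derive a square search region from the previous frame's 2D keypoints.

    keypoints: (21, 2) array of image coordinates.
    Returns (center_x, center_y, radius)."""
    center = keypoints.mean(axis=0)
    radius = margin * np.abs(keypoints - center).max()
    return center[0], center[1], radius

def track(frames, run_detnet, run_keynet, min_confidence=0.5):
    """Tracking-based detection: invoke the (expensive) detector only when
    the keypoint tracker loses the hand; otherwise predict the next box
    from the tracked keypoints."""
    box = None
    poses = []
    for frame in frames:
        if box is None:
            box = run_detnet(frame)            # full-image detection (costly)
        keypoints, confidence = run_keynet(frame, box)
        if confidence < min_confidence:
            box = None                         # tracking lost; re-detect next frame
            poses.append(None)
        else:
            box = keypoints_to_box(keypoints)  # cheap prediction of next search region
            poses.append(keypoints)
    return poses
```

Because the detector runs only on re-acquisition, the steady-state cost per frame is dominated by the much cheaper keypoint network, which is what makes the mobile frame rate feasible.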
The optimized system runs at 60 Hz on a PC and at 30 Hz on a mobile processor.

The figure below outlines the hand tracking system developed by Facebook. The team starts with images from the four monochrome cameras, detects the left and right hands in each image, and generates a set of bounding boxes. Each bounding box is then cropped from its image and passed to a network that detects 21 keypoints. The associated hand model has two parts: a hand skeleton S and a mesh model M. The skeleton S has 26 degrees of freedom, of which 6 represent the global transformation and the remaining 20 (4 per finger) represent the finger joints.

The task of hand detection is to find the bounding box of each hand in each input image. A key challenge is robustness to a wide variety of real-world environments. To meet this challenge, the team collected a large and diverse hand detection dataset using a semi-automatic labeling method, and proposed a simple and efficient CNN architecture, DetNet. Because each input has a fixed number of outputs, the team designed DetNet to directly regress the 2D center and scalar radius of each hand from the VGA-resolution input image, from which the corresponding bounding box is derived.

The keypoint estimation network, KeyNet, then predicts the 21 hand keypoints from the image crop given by the bounding box predicted in the detection step. Previous studies on keypoint estimation usually process each image independently, which has several disadvantages for a real-time multi-camera system. First, when the hand moves between overlapping camera views, prediction quality degrades because each view is processed independently; second, the keypoints are prone to jitter because temporal consistency is not enforced.
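Since DetNet, as described, regresses a 2D center and a scalar radius per hand rather than box corners, an explicit conversion step is needed before cropping. A minimal sketch of that conversion, assuming a VGA image and an illustrative padding factor (neither value is specified by the paper):

```python
def center_radius_to_box(cx, cy, radius, img_w=640, img_h=480, expand=1.2):
    """Convert a (center, radius) hand prediction into a square,
    image-clamped bounding box (x0, y0, x1, y1).

    `expand` pads the crop so fingertips near the boundary of the
    predicted circle are not cut off; 1.2 is an assumed value."""
    r = radius * expand
    x0 = max(0.0, cx - r)
    y0 = max(0.0, cy - r)
    x1 = min(float(img_w), cx + r)
    y1 = min(float(img_h), cy + r)
    return x0, y0, x1, y1
```

The center-plus-radius parameterization is a natural fit here because the output count per image is fixed, so no anchor boxes or non-maximum suppression are required.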
To solve these two problems, the researchers designed the network to take previously inferred keypoints as an additional network input. Four VGA synchronized global-shutter cameras drive the hand tracking system; each camera has a field of view of roughly 150 degrees horizontally, 120 degrees vertically, and 175 degrees diagonally. The central area is covered by two or more cameras to ensure the most accurate tracking within that region.

To generate keypoint labels for KeyNet training, the researchers used a depth-based hand tracking system to produce ground-truth keypoints and projected them into several calibrated monochrome views. As shown above, six 60 Hz monochrome fisheye cameras are mounted on a rigid frame, while a 50 Hz depth camera is used to capture and label hand motion. The cameras are registered to each other in space and time, so the keypoints generated by the hand tracker can be reprojected and interpolated into the monochrome views. Because the capture rig is mobile, it can quickly capture changes in lighting and environment.

Bounding-box labels are crucial for training an accurate detection network. To maximize labeling throughput and efficiency, the researchers used a semi-automatic scheme to label the bounding boxes. After an annotator marks a hand's bounding box in an initial frame, a trained KeyNet and a tracking pipeline propagate the pose forward. When the tracker fails, the annotator simply marks a new box and the tracked hand is updated automatically.

The researchers compared generic, calibrated, and scanned versions of the default hand model. The middle section of Table 1 uses the hand model obtained with a scanning system: the MKPE produced by the team's KeyNet is similar to that of the baseline KeyNet-S, but the MKA is significantly reduced in both stereo and monocular settings.
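The two metrics compared above can be sketched as follows. MKPE (mean keypoint position error) measures accuracy against ground truth, while MKA (mean keypoint acceleration) serves as a jitter proxy by averaging the second temporal difference of each keypoint trajectory. These are rough illustrative definitions; the paper's exact formulations may differ.

```python
import numpy as np

def mkpe(pred, gt):
    """Mean keypoint position error: average Euclidean distance between
    predicted and ground-truth keypoints. pred, gt: (T, 21, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mka(pred):
    """Mean keypoint acceleration, a jitter proxy: average magnitude of
    the second temporal difference of each keypoint trajectory. A smooth
    constant-velocity trajectory scores zero."""
    accel = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]
    return np.linalg.norm(accel, axis=-1).mean()
```

This is why two trackers can have similar MKPE yet very different MKA: per-frame errors of the same size contribute equally to MKPE, but temporally uncorrelated errors inflate the second difference and thus MKA.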
This shows that the KeyNet architecture effectively improves temporal smoothness.

The bottom section of Table 1 illustrates the importance of resolving hand scale. Compared with the hand model obtained with the scanning system, the generic hand model is much less accurate. The problem is worse when the tracker runs in monocular mode, because resolving depth ambiguity from a single view depends heavily on the accuracy of the hand model's scale. With the hand-scale decomposition method proposed by the team, tracking accuracy approaches that of the 3D-scanned hand model.

Of course, the system can still fail, for example during complex hand-hand and hand-object interactions, or when the hand is seen from unusual viewpoints. The researchers acknowledge that these failures reflect design limitations of the system. Looking ahead, the team will continue working to improve the accuracy and robustness of the tracking system.
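The monocular depth ambiguity mentioned above follows directly from the pinhole camera model: scaling a hand's size and its distance by the same factor leaves the 2D projection unchanged, which is why an accurate per-user hand scale is needed to recover depth from a single view. A small numerical illustration with made-up values:

```python
import numpy as np

def project(points, focal=450.0):
    """Pinhole projection of 3D camera-space points (N, 3) to 2D pixels.
    The focal length here is an arbitrary illustrative value."""
    return focal * points[:, :2] / points[:, 2:3]

hand = np.array([[0.00, 0.00, 0.40],
                 [0.03, 0.01, 0.42],
                 [0.05, 0.02, 0.45]])   # toy 3D "hand" points (meters)

scaled = 1.5 * hand                     # bigger hand, proportionally farther away
# Both hands produce identical 2D observations:
assert np.allclose(project(hand), project(scaled))
```

With multiple calibrated cameras the ambiguity disappears, since the same 3D point must reproject consistently into every view; in monocular mode, only a known hand scale breaks the tie.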