An industry first! MessyTable: a large-scale multi-camera dataset of common objects in cluttered scenes

As the saying goes, even the cleverest cook cannot make a meal without rice. In today's deep-learning era, large, high-quality datasets are the raw jade waiting for algorithms to carve.

Today we introduce MessyTable, a large-scale multi-camera dataset of common-object scenes, jointly produced by SenseTime and Nanyang Technological University, Singapore. MessyTable contains over 5,500 manually designed scenes, more than 50,000 images, and 1.2 million densely annotated bounding boxes. The corresponding paper has been accepted at ECCV 2020.

Real-world multi-camera systems face difficulties such as similar objects, dense occlusion, and large viewpoint differences. We therefore designed a large number of realistic, interesting, and challenging scenes: cameras are deployed at multiple viewpoints around a messy dining table, and the task is to associate the instances across camera views. This seemingly simple task requires an algorithm to distinguish subtle appearance differences, gather clues from neighboring regions, and make skillful use of geometric constraints. We also propose a new algorithm that exploits information from an instance's surroundings. We hope MessyTable serves not only as a challenging benchmark that points the direction for follow-up research, but also as a highly realistic pre-training source that paves the way for deploying such algorithms in practice.

Question 1: How does MessyTable relate to existing ReID and tracking tasks?
Question 2: What are the challenges in MessyTable?
Question 3: How large is MessyTable?
Question 4: What design considerations went into MessyTable?
Question 5: How do various algorithms perform on MessyTable?
Question 6: What problems in multi-camera association remain unsolved, and what are the future research directions?
Question 7: How can I use MessyTable?
ReID and tracking are, in essence, instance association, and they typically rely on appearance information. Although MessyTable is mainly designed for studying instance association across multiple cameras, its challenges — resolving subtle appearance differences, dense occlusion, and large viewpoint differences — are common to other instance-association tasks. Beyond serving multi-camera research, we hope MessyTable becomes a common dataset and a proving ground for new instance-association algorithms.

The challenges in MessyTable:

1. Large angle differences between cameras, so the appearance of an instance varies greatly across views.

2. Similar or identical objects, so purely appearance-based algorithms in the style of traditional ReID are insufficient.

3. Objects stacked as densely and chaotically as in real life, which cannot be handled by traditional methods such as homography projection.

Figure 2: challenges in MessyTable: a) partial occlusion; b) complete occlusion; c) similar objects; d) identical objects; e) and f) complex stacking.

Table 1 compares the scale of MessyTable with other similar multi-camera datasets. MessyTable contains over 5,500 manually designed scenes, more than 50,000 images, and 1.2 million densely annotated bounding boxes, each with an instance ID.

Scene difficulty levels: MessyTable scenes are divided into three difficulty levels. The more difficult a scene, the more occlusion and similar objects it contains, and the more objects lie outside the shared field of view. See Figure 3 for details.
Figure 3: a) example scenes at the three difficulty levels; b) more difficult scenes contain more instances; c) more difficult scenes have fewer instances in the shared field of view; d) more difficult scenes contain more identical objects.

Multi-camera setup: to study how the relative angle between cameras affects association performance, we set up 9 cameras in 567 different camera configurations, generating more than 20,000 pairs of relative camera positions. See Figure 4 for details.

Figure 4: a) cameras distributed uniformly in space; b) camera arrangement during acquisition; c) great diversity in the distribution of relative camera angles.

Choice of common objects: we selected 120 kinds of objects commonly found on a dining table: 60 supermarket goods, 23 fruits and vegetables, 13 pastries, and 24 pieces of tableware, covering a variety of sizes, colors, textures, and materials. Figure 5 shows the frequency of these objects, and Figure 6 gives the complete list.

We tested a variety of baseline algorithms. Unsurprisingly, homography projection performs very poorly, because its assumption that all objects lie in the same plane does not hold in complex scenes. Traditional methods based on SIFT keypoint extraction also fare poorly, because texture-less objects yield few keypoints. Deep-learning patch-matching methods such as MatchNet, DeepCompare, and DeepDesc achieve only mediocre results, while baselines built on a triplet structure improve performance substantially, though they remain limited by their inability to distinguish similar and identical objects.

Table 2: baseline performance on MessyTable. The algorithm combining appearance information, surrounding information, and geometric information achieves the best results.
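To make the triplet-style baselines concrete, here is a minimal sketch (not the paper's implementation) of how such a baseline associates instances across two views: each detected instance is mapped to an embedding vector, and instances are matched greedily by smallest embedding distance, with a threshold for instances that have no counterpart in the other view. All names here are illustrative.

```python
def l2(a, b):
    # Euclidean distance between two embedding vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def associate(view_a, view_b, max_dist=1.0):
    """Greedily match embeddings in view_a to embeddings in view_b.

    view_a, view_b: dicts mapping instance id -> embedding (list of floats).
    Returns a list of (id_a, id_b) pairs; unmatched ids are left out.
    """
    # All candidate pairs sorted by distance: closest embeddings first.
    candidates = sorted(
        ((l2(ea, eb), ia, ib) for ia, ea in view_a.items()
                              for ib, eb in view_b.items()),
        key=lambda t: t[0],
    )
    pairs, used_a, used_b = [], set(), set()
    for d, ia, ib in candidates:
        if d > max_dist:
            break  # remaining pairs are even farther apart
        if ia in used_a or ib in used_b:
            continue
        pairs.append((ia, ib))
        used_a.add(ia)
        used_b.add(ib)
    return pairs

# Toy embeddings: two instances in view A, three in view B.
A = {1: [0.0, 0.0], 2: [1.0, 0.0]}
B = {"x": [0.1, 0.0], "y": [0.9, 0.1], "z": [5.0, 5.0]}
print(associate(A, B))  # "z" stays unmatched: farther than max_dist
```

Such a matcher is only as good as its embeddings, which is exactly why identical objects defeat it: their embeddings are indistinguishable, so the greedy choice between them is arbitrary.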
We find that, in addition to appearance information, surrounding information is very important. We therefore propose including information from outside the detection box, an operation we call zoom-out. However, naively adding zoom-out to the triplet network does not help, so we took a cue from human behavior: a person seeks clues from the surroundings only when an object's own appearance is not distinctive enough. We therefore propose ASNet, which has an appearance-feature branch and a surrounding-feature branch, balanced by a coefficient lambda. When objects look alike, the design of lambda makes the network assign more weight to the surrounding-feature branch.

ASNet significantly improves association performance. The feature-map visualization in Figure 8 shows that ASNet has learned to take cues from around the instance, whereas naive zoom-out still focuses on the instance itself.

Figure 8: naive zoom-out still focuses on the instance itself, while ASNet has learned to gather clues from around the instance.

We also found that adding a soft constraint based on epipolar geometry on top of ASNet further improves performance, showing that geometric information is complementary to appearance and surrounding information.

It should be pointed out that even with appearance, surrounding, and geometric information combined, current algorithms remain unsatisfactory in complex scenes and under unfavorable camera angles.

In Figure 9 we compare four strong algorithms under different camera-angle differences, and find that all three metrics deteriorate rapidly as the angle difference grows.
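The two-branch idea above can be illustrated with a toy score-fusion sketch. This is not the released ASNet implementation: in ASNet the balancing coefficient is produced by the network itself, whereas here we pass an explicit "ambiguity" value and use our own function names.

```python
def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def fused_score(app_a, app_b, surr_a, surr_b, ambiguity):
    """Blend appearance and surrounding similarity.

    ambiguity in [0, 1]: how indistinguishable the appearance features
    are across candidates (e.g. many identical cans on the table).
    High ambiguity shifts weight to the surrounding branch.
    """
    lam = 1.0 - ambiguity            # weight for the appearance branch
    s_app = cosine(app_a, app_b)     # appearance-feature similarity
    s_surr = cosine(surr_a, surr_b)  # surrounding-feature similarity
    return lam * s_app + (1.0 - lam) * s_surr
```

The epipolar soft constraint described above would then enter as an additional penalty subtracted from this fused score when a candidate lies far from the epipolar line, though its exact form is given in the paper.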
Figure 9: the larger the camera-angle difference, the worse the association performance. Metrics: a) AP; b) FPR-95; c) IPAA-80.

In Table 3 we test the models on three sub-datasets of increasing difficulty. The harder the subset, the more occlusion and identical objects it contains and the fewer objects appear in the shared field of view, so model performance drops accordingly.

Further failure cases include identical objects placed or stacked together, whose surrounding information is also similar, as well as cases wrongly penalized by the geometric soft constraint.

MessyTable has two main uses: as a highly discriminative benchmark and as a pre-training source for instance association. For the former, algorithms that perform better on MessyTable also perform better on other multi-camera datasets; for the latter, models pre-trained on MessyTable outperform models pre-trained on ImageNet when transferred to other datasets. Notably, the three other datasets we tested even include vehicles, pedestrians, and other categories that differ greatly from the common objects in MessyTable. See Table 4 for details.

We hope MessyTable promotes research on novel algorithms and uncovers new problems in the field of instance association. See our project homepage for more details.
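As a closing illustration, the IPAA-80 metric reported above can be sketched as follows. This is our hedged reading of the metric (IPAA-X as the fraction of image pairs in which at least X of the instances are associated correctly); the authoritative definition is in the paper. The per-pair accuracies here are supplied directly rather than computed from predictions.

```python
def ipaa(per_pair_accuracy, x=0.8):
    """IPAA-X: fraction of image pairs whose instance-association
    accuracy reaches at least x.

    per_pair_accuracy: one correctness fraction per image pair,
    e.g. from comparing predicted associations with ground truth.
    """
    ok = sum(1 for acc in per_pair_accuracy if acc >= x)
    return ok / len(per_pair_accuracy)

# Four image pairs; only the first two reach 80% correct associations.
print(ipaa([1.0, 0.9, 0.5, 0.75], x=0.8))  # -> 0.5
```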