We propose an effective method for the weak-supervision problem in large-scale image localization. The goal is to use self-supervised learning to fully mine hard samples during representation learning, and to refine image-level supervision into region-level supervision so that the complex relationship between images and regions is modeled better. Models trained with this algorithm are robust and generalize well. The method is validated on several image localization datasets, where its accuracy is 5.7% higher than the state of the art, and the code and models have been released.

Image localization aims to estimate the geographic location of a target image without GPS or other auxiliary information. The technology is widely used in SLAM, AR/VR, mobile photo geolocation, and similar scenarios. Current research on image localization falls into three directions: image retrieval based algorithms, 2D-3D matching algorithms, and classification algorithms based on geographic location. Among them, the retrieval-based scheme is the more feasible one for large-scale, long-term image localization. Retrieval-based image localization identifies the reference image most similar to the target image in a city-scale database, then estimates the target image's geographic location from that of the reference image. It is also called place recognition.

There are currently two main kinds of image localization datasets. The first is crawled directly from street-view maps as images with their corresponding GPS tags; it needs no manual labeling, has no annotation cost, and is easy to collect and scale up. The second comes with 6-DoF camera poses and is usually collected by autonomous vehicles, so its collection costs are high.
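
To make the retrieval step concrete, here is a minimal Python sketch of retrieval-based localization: find the database feature most similar to the query feature and return its GPS tag. The names `cosine_similarity`, `localize`, and `database`, the choice of cosine similarity, and the toy 3-D features are illustrative assumptions, not part of the released code. A real system would use features from a trained network and usually an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def localize(query_feature, database):
    """Return the GPS tag of the reference image most similar to the query.

    `database` is a list of (feature, gps) pairs. Both are hypothetical
    stand-ins for the output of a trained feature extractor and its GPS tag.
    """
    _, best_gps = max(
        database,
        key=lambda entry: cosine_similarity(query_feature, entry[0]),
    )
    return best_gps


# Toy database: 3-D features with (latitude, longitude) tags.
database = [
    ([1.0, 0.0, 0.0], (35.6595, 139.7005)),
    ([0.0, 1.0, 0.0], (35.6586, 139.7454)),
    ([0.0, 0.0, 1.0], (35.7101, 139.8107)),
]

# The query is closest in direction to the second reference feature.
estimated_gps = localize([0.1, 0.9, 0.2], database)
```

The query inherits the location of its nearest reference image, so the quality of the estimate depends entirely on how discriminative the features are.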
This work builds on previous research: it studies retrieval-based image localization when only GPS tags are available. The key to image retrieval is learning discriminative image features. Training requires both positive and negative samples: the model must learn to pull the features of the target image toward the positive samples and push them away from the negative samples. In a GPS-only image localization dataset, candidates can first be filtered by GPS; for example, images whose GPS tags lie within 10 m of the target image are potential positive samples.

However, as shown in the figure below, images taken close to each other but facing different directions do not capture the same scene, so the potential positives filtered by GPS alone still contain many false positives. Model training is therefore defined as a weakly supervised learning problem.

If the target image is pulled toward a false positive during training, errors are amplified severely and the model may even collapse. Therefore, as shown in the figure below, existing training algorithms pull the target image only toward the potential positive with the smallest feature distance, also known as the top-1 or most similar image.

Although this effectively lowers the probability of selecting a false positive, learning only from the most similar positive leaves the trained model unable to adapt to varied conditions and reduces its robustness. In our view, hard positive samples are indispensable in representation learning. However, simply using the top-k images as positives introduces considerable noise: as shown in the figure below, the top-k images inevitably contain some false positives.
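
The GPS filtering and top-1 selection described above can be sketched roughly as follows. This is a minimal illustration, assuming camera positions are already expressed in meters in a planar coordinate system (for example UTM) and that features are compared by Euclidean distance. The function names and the toy gallery are hypothetical.

```python
import math


def potential_positives(query_pos, gallery, radius_m=10.0):
    """Indices of gallery images located within radius_m of the query."""
    return [
        i for i, (pos, _) in enumerate(gallery)
        if math.dist(query_pos, pos) <= radius_m
    ]


def rank_by_feature_distance(query_feat, gallery, candidates):
    """Sort candidate indices by feature distance to the query, closest first."""
    return sorted(
        candidates,
        key=lambda i: math.dist(query_feat, gallery[i][1]),
    )


# Each gallery entry is (position in meters, feature vector); values are toy data.
gallery = [
    ((0.0, 3.0), [0.9, 0.1]),   # nearby and visually similar
    ((4.0, 0.0), [0.1, 0.9]),   # nearby but visually different
    ((6.0, 6.0), [0.7, 0.3]),   # nearby, moderately similar
    ((50.0, 0.0), [1.0, 0.0]),  # visually similar but too far away
]
query_pos, query_feat = (0.0, 0.0), [1.0, 0.0]

candidates = potential_positives(query_pos, gallery)
ranked = rank_by_feature_distance(query_feat, gallery, candidates)
top1 = ranked[0]    # the single positive used by existing training schemes
top_k = ranked[:3]  # a larger set that may include false positives
```

In this toy gallery the image at index 1 passes the GPS filter even though its feature is far from the query, which is the kind of candidate that makes naive top-k training noisy.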
In comparative experiments, we also found that simply training on the top-k images performs worse than the existing approach of learning from the top-1 image only. The key question is therefore how to use the top-k images sensibly, mining hard positives while reducing the interference of false positives on training.

Specifically, we want to assign smaller similarity labels to false positives and to positives that overlap little with the target image, and larger similarity labels to positives that overlap substantially with it. Supervised by these similarity labels, the model can learn the distance relationships between the target image and its different matching images, and so learn representations in a targeted way.

How, then, can the similarity labels be obtained? Predicting them directly with the model that is currently being trained is not feasible: it is like trying to climb by standing on your own feet, which gets you no higher and only leaves you unsteady. We therefore propose an iterative training scheme in which the output of the first-generation model supervises the second-generation model, and so on. Note that a “generation” refers to the whole process of training a model from initialization to convergence.

As shown in the figure below, the first-generation model is trained with the same scheme as the existing algorithms. Once it converges, a second-generation model is created and initialized, and the frozen first-generation model estimates the similarity labels used to train it. The accuracy of the predicted similarity labels and the discriminative power of the model improve from generation to generation, forming a self-supervised process. This idea of iterative training is related to self-distillation.
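
One plausible way to turn the output of a frozen previous-generation model into similarity labels is sketched below. It assumes the labels are a temperature-scaled softmax over feature dot products between the target image and its top-k candidates, and that the next generation is trained with a soft cross-entropy against them. The text above does not give these formulas, so the softmax form, the temperature values, and all function names are assumptions for illustration only.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores divided by temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def similarity_distribution(query_feat, candidate_feats, temperature):
    """Distribution over candidates from query-to-candidate similarities."""
    scores = [dot(query_feat, c) for c in candidate_feats]
    return softmax(scores, temperature)


def soft_cross_entropy(target_probs, predicted_probs):
    """Cross-entropy of the predicted distribution against soft targets."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs))


# Features from the frozen previous-generation model (toy values).
teacher_query = [1.0, 0.0]
teacher_candidates = [[0.9, 0.1], [0.6, 0.4], [0.1, 0.9]]
# Assumed temperature below 1 so that the labels are sharper.
labels = similarity_distribution(teacher_query, teacher_candidates, 0.1)

# Features from the generation currently being trained (toy values).
student_query = [1.0, 0.0]
student_candidates = [[0.8, 0.2], [0.7, 0.3], [0.3, 0.7]]
predictions = similarity_distribution(student_query, student_candidates, 1.0)

# The loss the current generation would minimize under these assumptions.
loss = soft_cross_entropy(labels, predictions)
```

Because the labels come from a model that is no longer updated, they stay fixed while the current generation trains, which avoids the circularity of a model supervising itself.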
The difference is that self-distillation mainly targets classification problems, distilling predictions over a fixed number of classes. We instead apply iterative training to image retrieval, and pass information between generations through the proposed similarity labels during representation learning.

Above, we discussed how to mine hard positives sensibly while reducing the interference of false positives on training. However, even a true positive usually overlaps the target image only in part, so some of its regions do not match the target at all.

As shown on the left of the following figure, image-level supervision alone pushes all local features of the target image and the positive image to become similar, which harms the discriminative learning of local features. We therefore propose that the ideal supervision is at the region level, as shown on the right of the following figure: the positive regions of a positive sample are pulled toward the target image, while its negative regions are pushed away.

To realize region-level supervision, we decompose each matched positive sample into four half regions and four quarter regions, refining image-to-image similarity supervision into image-to-region similarity supervision. The iterative training described above is used for learning: the image-to-region similarity labels predicted by the first-generation model supervise the image-to-region learning of the second-generation model.

The following figure shows the experimental results. Our model is trained on only one dataset yet generalizes well to different test sets. For example, it achieves state-of-the-art accuracy on the Tokyo 24/7 and Pitts250k test sets.
Of these, Tokyo 24/7 is the most difficult, because its images vary widely in illumination, viewpoint, capture device, and other conditions. On this dataset, the accuracy is 5.7% higher than that of the state-of-the-art SARE algorithm. We have also open-sourced PyTorch reimplementations of NetVLAD and SARE to support follow-up research and development.
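
The decomposition of a positive sample into four half regions and four quarter regions, described earlier, can be illustrated with a small Python sketch. It assumes the half regions are the top, bottom, left, and right halves and the quarter regions are the four corners, which the text does not spell out. `decompose_regions` and `crop` are hypothetical helper names, and a nested list stands in for a feature map.

```python
def decompose_regions(height, width):
    """Return (top, left, bottom, right) boxes for four halves and four quarters."""
    mid_y, mid_x = height // 2, width // 2
    return {
        "top_half": (0, 0, mid_y, width),
        "bottom_half": (mid_y, 0, height, width),
        "left_half": (0, 0, height, mid_x),
        "right_half": (0, mid_x, height, width),
        "top_left_quarter": (0, 0, mid_y, mid_x),
        "top_right_quarter": (0, mid_x, mid_y, width),
        "bottom_left_quarter": (mid_y, 0, height, mid_x),
        "bottom_right_quarter": (mid_y, mid_x, height, width),
    }


def crop(feature_map, box):
    """Extract the sub-grid covered by a (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right] for row in feature_map[top:bottom]]


# A 4x4 toy feature map whose cells are numbered 0..15 row by row.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
regions = decompose_regions(4, 4)

top_left = crop(feature_map, regions["top_left_quarter"])
right_half = crop(feature_map, regions["right_half"])
```

Each of the eight crops would then be compared with the target image separately, so that a region with no overlap can receive a low similarity label even when the image as a whole is a true positive.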