As shown in Figure 1, existing three-stage RGB-T single-target tracking networks usually use two independent feature extraction branches, each responsible for extracting the features of one modality. However, mutually independent branches lack effective cross-modal interaction during feature extraction. As a result, once offline training is complete, the network can only extract fixed features from each modal image and cannot adjust dynamically according to the actual modal state to extract more targeted features. This limitation restricts the network's ability to adapt to the diverse bimodal appearances of targets and to the dynamic correspondence between the two modal appearances. As shown in Figure 2, such feature extraction is poorly suited to practical RGB-T single-target tracking, especially in complex environments: the arbitrariness of the tracked target leads to diverse bimodal appearances, and the dynamic relationship between the two modalities changes as the tracking environment changes. Three-stage fusion tracking cannot adapt well to this situation and also suffers from an obvious speed bottleneck.
As for Transformer-based RGB-T single-target tracking networks, they typically combine the features of the two modal search regions by direct element-wise addition or concatenation and feed the result to the prediction head to output the final result. However, the video frames in current RGB-T single-target tracking datasets are not perfectly aligned, and not every modal search region can provide effective information: for example, the RGB search region in dark-night scenarios and the infrared search region in thermal-crossover scenarios cannot provide effective target appearance information and instead contain a large amount of background noise. Direct element-wise addition or concatenation therefore ignores the problem of how to merge search-region features of unequal quality. To solve this problem, this paper proposes a Fusion Feature Selection Module (FFSM). The FFSM is used to select the search-region features that carry effective target appearance information. Specifically, it first learns a weight for each search-region feature through an attention mechanism, then computes a weighted sum of the search-region features based on these weights to obtain the final fused features. This mechanism can effectively filter out invalid background noise and emphasize the more important target appearance information, thereby improving RGB-T single-target tracking performance. To verify the effectiveness of the FFSM, we conducted experiments in scenarios with heavy background noise. The results show that the RGB-T single-target tracking network equipped with the FFSM outperforms direct element-wise addition or concatenation. In dark-night and thermal-crossover scenarios, the FFSM accurately selects effective target appearance information, improving tracking accuracy and robustness. In short, the FFSM effectively solves the problem of direct feature fusion, improves the performance of the RGB-T single-target tracking network, and can be widely applied in scenarios with heavy background noise.
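The weighted fusion described above can be sketched in a few lines of PyTorch. The snippet below is only an illustrative sketch of an FFSM-style fusion; the module name, feature shapes and the simple pooled-feature scoring branch standing in for the attention mechanism are assumptions, not the paper's released code:

```python
import torch
import torch.nn as nn

class FusionFeatureSelection(nn.Module):
    """Illustrative sketch of an FFSM-style module: score each modality's
    search-region features, then fuse them by a weighted sum so that a
    noisy modality contributes less to the final fused features."""

    def __init__(self, dim: int):
        super().__init__()
        # One scalar weight per modality, predicted from pooled features
        # (a stand-in for the attention-based weighting in the text).
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.ReLU(inplace=True),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, feat_rgb: torch.Tensor, feat_tir: torch.Tensor) -> torch.Tensor:
        # feat_*: (B, N, C) token features of the RGB / thermal search regions.
        w_rgb = self.score(feat_rgb.mean(dim=1))                           # (B, 1)
        w_tir = self.score(feat_tir.mean(dim=1))                           # (B, 1)
        weights = torch.softmax(torch.cat([w_rgb, w_tir], dim=1), dim=1)   # (B, 2)
        fused = weights[:, 0, None, None] * feat_rgb + weights[:, 1, None, None] * feat_tir
        return fused


# Example usage with random features (batch of 2, 256 tokens, 768 channels).
ffsm = FusionFeatureSelection(dim=768)
fused = ffsm(torch.randn(2, 256, 768), torch.randn(2, 256, 768))
print(fused.shape)  # torch.Size([2, 256, 768])
```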
This article introduces USTrack, an efficient Transformer-based single-stage RGB-T single-target tracking network. Its core idea is a joint feature extraction, fusion and relation modeling method that unifies the three functional stages of the three-stage fusion tracking paradigm into a single ViT backbone executed simultaneously. In this way, the fused features of the target template and the search region are extracted directly under modal interaction, and relation modeling between the two fused features is constructed at the same time, greatly improving tracking speed and accuracy. In addition, USTrack designs a feature selection mechanism based on modal reliability, which reduces the interference of invalid modalities by directly discarding the fused features they generate, thereby reducing the impact of noise information on the final tracking results. Ultimately, USTrack achieves the fastest speed in current RGB-T single-target tracking at 84.2 FPS, while mitigating the impact of the target's position deviation between the two modal images and of invalid modal information on the final prediction results.
The contributions of this article are as follows:
(1) To address the lack of modal interaction in the feature extraction stage of current three-stage fusion tracking networks, this article proposes a joint feature extraction, fusion and relation modeling method. The method directly extracts the fused features of the target template and the search region under modal interaction, and simultaneously performs relation modeling between the two fused features, providing for the first time an efficient and concise single-stage fusion tracking paradigm for the design of short-term RGB-T single-target tracking networks.
(2) For the first time, a feature selection mechanism based on modal reliability is proposed. The mechanism evaluates the reliability of the different modal images according to the actual tracking environment and, based on these reliability scores, discards the fused features generated by invalid modalities, reducing the impact of noise information on the final prediction results and further improving tracking performance.
(3) This article reports results on three mainstream RGB-T single-target tracking benchmark datasets. Extensive experiments show that the proposed method not only achieves new SoTA performance but also sets the fastest tracking speed at 84.2 FPS. In particular, on the VTUAV short-term and long-term tracking datasets, USTrack outperforms the best existing methods by 11.1%/11.7% and 11.3%/9.7% on the MPR/MSR metrics, respectively.
As shown in Figure 3, the overall architecture of USTrack consists of three parts: dual embedding layers, a ViT backbone, and a feature selection mechanism based on modal reliability. The dual embedding layers are two independent embedding layers. This design accounts for the fact that the attention mechanism obtains global information based on similarity, while the inherent properties of the two modalities may give them different feature representations for the same pattern. If the inputs were mapped directly through attention, this heterogeneity could limit the network's ability to model the information shared across modalities and thus harm the subsequent feature fusion. USTrack therefore uses two learnable embedding layers to map the inputs of the two modalities into a space conducive to fusion, aligning the modalities to some extent and reducing the impact of modal heterogeneity on feature fusion. The outputs of the dual embedding layers are then jointly fed into the ViT backbone and passed directly through its attention layers, which perform feature extraction, modality fusion and template-search relation modeling simultaneously. This unifies the three functional stages of RGB-T tracking and provides an efficient single-stage tracking paradigm.
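To make the single-stage idea concrete, the following sketch shows dual patch-embedding layers feeding one shared Transformer encoder, so that extraction, fusion and template-search relation modeling all happen inside the same attention layers. All names, input sizes and hyper-parameters are assumptions for illustration (positional embeddings and the prediction head are omitted); this is not the released USTrack implementation:

```python
import torch
import torch.nn as nn

class SingleStageBackboneSketch(nn.Module):
    """Minimal sketch: two separate patch-embedding layers map RGB and
    thermal inputs into a shared space, then all template/search tokens
    go through one ViT-style encoder whose self-attention jointly performs
    feature extraction, modality fusion and relation modeling."""

    def __init__(self, dim=768, depth=12, heads=12, patch=16):
        super().__init__()
        self.embed_rgb = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.embed_tir = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    @staticmethod
    def _tokens(embed, img):
        # (B, 3, H, W) -> (B, N, C) patch tokens.
        return embed(img).flatten(2).transpose(1, 2)

    def forward(self, z_rgb, z_tir, x_rgb, x_tir):
        # z_*: template crops, x_*: search-region crops; the thermal input
        # is assumed to be replicated to 3 channels.
        tokens = torch.cat([
            self._tokens(self.embed_rgb, z_rgb),
            self._tokens(self.embed_tir, z_tir),
            self._tokens(self.embed_rgb, x_rgb),
            self._tokens(self.embed_tir, x_tir),
        ], dim=1)
        # One pass of joint attention over all tokens of both modalities.
        return self.encoder(tokens)


model = SingleStageBackboneSketch(dim=256, depth=2, heads=8)
z = torch.randn(1, 3, 128, 128)   # template size is an assumption
x = torch.randn(1, 3, 256, 256)   # search-region size is an assumption
out = model(z, z, x, x)
print(out.shape)  # torch.Size([1, 640, 256])
```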
The feature selection mechanism based on modal reliability consists of two prediction heads and two reliability evaluation modules. The two prediction heads output separate results, and the modal reliability scores help the network select the search region corresponding to the modality better suited to the current tracking scenario, so that the final prediction is less affected by the noise information generated by invalid modalities.
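A minimal sketch of such a selection head is given below, assuming two box-regression heads and two scalar reliability scores, with the final output taken from the modality judged more reliable; the concrete head structure and score computation are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class ReliabilitySelectHead(nn.Module):
    """Illustrative sketch of reliability-based selection: each modality's
    fused search-region features get their own prediction head plus a
    reliability score, and the final output comes from the modality judged
    more reliable, discarding the features of the invalid one."""

    def __init__(self, dim: int):
        super().__init__()
        self.head_rgb = nn.Linear(dim, 4)   # stand-in box head (cx, cy, w, h)
        self.head_tir = nn.Linear(dim, 4)
        self.rel_rgb = nn.Linear(dim, 1)    # reliability evaluation modules
        self.rel_tir = nn.Linear(dim, 1)

    def forward(self, fused_rgb: torch.Tensor, fused_tir: torch.Tensor):
        # fused_*: (B, N, C) fused search-region tokens for each modality.
        pooled_rgb, pooled_tir = fused_rgb.mean(dim=1), fused_tir.mean(dim=1)
        box_rgb, box_tir = self.head_rgb(pooled_rgb), self.head_tir(pooled_tir)
        rel = torch.cat([self.rel_rgb(pooled_rgb), self.rel_tir(pooled_tir)], dim=1)
        # Keep only the prediction from the more reliable modality.
        choose_rgb = (rel[:, 0] >= rel[:, 1]).unsqueeze(1)
        return torch.where(choose_rgb, box_rgb, box_tir), rel


head = ReliabilitySelectHead(dim=256)
boxes, rel = head(torch.randn(2, 256, 256), torch.randn(2, 256, 256))
print(boxes.shape, rel.shape)  # torch.Size([2, 4]) torch.Size([2, 2])
```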
USTrack uses the GTOT, RGBT234 and VTUAV datasets as test benchmarks; the results are shown in Figure 4. We also used VTUAV as a benchmark to analyze the performance of USTrack in different challenge scenarios. As shown in Figure 5, this article selects the six challenge attributes with the most obvious performance improvements: deformation (DEF), scale variation (SV), full occlusion (FO), partial occlusion (PO), thermal crossover (TC) and extreme illumination (EI). Specifically, deformation (DEF) and scale variation (SV) reflect the diversity of the target's appearance during tracking, while full occlusion (FO), partial occlusion (PO), thermal crossover (TC) and extreme illumination (EI) can cause the appearance in the corresponding modality to change or disappear, reflecting the dynamic relationship between the modalities in different challenge scenarios. USTrack achieves its most significant performance improvements in tracking scenarios with these challenge attributes, which indicates that the joint feature extraction, fusion and relation modeling method effectively alleviates the lack of modal interaction in the feature extraction stage of the three-stage fusion tracking paradigm and better adapts to the diverse appearances of the target and the dynamic relationship between the modalities during tracking.
As shown in Figure 6, to verify the effectiveness of the feature selection mechanism based on modal reliability, we conducted comparative experiments on the RGBT234 benchmark between the dual-prediction-head structure with the feature selection mechanism and several common prediction head structures, and visualized the modal reliability scores, which correspond well to the actual tracking scenes.
This article proposes USTrack, an efficient Transformer-based single-stage short-term RGB-T single-target tracking network. The core of USTrack is a joint feature extraction, fusion and relation modeling method that solves the lack of modal interaction in the feature extraction stage of traditional three-stage fusion tracking networks, enhancing the network's adaptability to diverse bimodal target appearances and the dynamic correspondence between modal appearances. On this basis, a feature selection mechanism based on modal reliability is further proposed, which reduces the impact of noise information on the final prediction by directly discarding the fused features generated by invalid modalities, achieving better tracking performance. USTrack achieves SoTA performance on three mainstream datasets and sets a new record for the fastest RGB-T tracking inference speed at 84.2 FPS. Notably, on VTUAV, currently the largest RGB-T single-target tracking benchmark, the method improves the MPR/MSR metrics by 11.1%/11.7% and 11.3%/9.7% over the existing SoTA method, a major performance breakthrough that adds a strong new baseline to this benchmark.
1. Xia Jianqiang
Master's student at the Institute of National Defense Science and Technology Innovation, Academy of Military Sciences. His research interests include visual image processing, object detection and single-target tracking. He has published a first-author paper at a CCF Class A conference and won the Huawei first prize at the 2022 "Huawei Cup" Fourth China Graduate Artificial Intelligence Innovation Competition.
2. Zhao Jian
Zhao Jian is head of the Multimedia Cognitive Learning Laboratory (EVOL Lab) of the China Telecom Artificial Intelligence Research Institute, a young scientist, and a researcher at the Institute of Optoelectronics and Intelligence, Northwestern Polytechnical University. He received his Ph.D. from the National University of Singapore. His research interests include multimedia analysis, local security, and embodied intelligence.
He has published 32 CCF-A papers on unconstrained visual perception and understanding, 31 of them as first/corresponding author in international authoritative journals and conferences such as T-PAMI and CVPR, including T-PAMI×2 (IF: 24.314) and IJCV×3 (IF: 13.369), and holds 5 authorized national invention patents as first inventor. Related technological achievements have been applied by six leading technology companies, including Baidu, Ant Financial and Qihoo 360, producing significant benefits. He was selected for the "Young Talent Promotion Project" of the China Association for Science and Technology and the Beijing Association for Science and Technology, and has led 6 projects including a National Natural Science Foundation Youth Fund project. He won the Wu Wenjun Artificial Intelligence Outstanding Youth Award (2023), the first prize of the Wu Wenjun Artificial Intelligence Natural Science Award (2/5, 2022), the Lee Hwee Kuan Award of the Pattern Recognition and Machine Intelligence Association (PREMIA) of Singapore, and the only Best Student Paper Award of ACM Multimedia (first author, 1/208, CCF-A conference, 2018), and has won championships 7 times in important international science and technology competitions.
He serves as a director of the Beijing Society of Image and Graphics, an editorial board member of the international journals "Artificial Intelligence Advances" and "IET Computer Vision", a guest editor of special issues of "Pattern Recognition Letters" and "Electronics", a VALSE senior area chair, chair of an ACM Multimedia 2021 sub-forum, area chair of CICAI 2022/2023, forum chair of CCBR 2024, a senior member of the Chinese Association for Artificial Intelligence and the China Society of Image and Graphics, a judge of the "Challenge Cup" College Student Science and Technology Works Competition, and a member of the Expert Committee of the China Artificial Intelligence Competition, among others.
Homepage: https://zhaoj9014.github.io
Paper link
https://arxiv.org/abs/2308.13764
Code link
https://github.com/xiajianqiang