SOTVerse is a user-defined task space for single object tracking (SOT). It lets users customize SOT tasks to suit their research purposes, which makes research more targeted and can significantly improve research efficiency. VLTVerse is the first fine-grained evaluation framework for VLT trackers that comprehensively considers multiple challenge factors and diverse semantic information, with the aim of revealing the role of language in VLT.
The 3E paradigm describes computer vision tasks by three components: environment, evaluation, and executor. We combine the environment and evaluation to form SOTVerse, a user-defined single object tracking task space, and conduct experiments in this space to assess executors' tracking ability. The paradigm can also be extended to describe other visual tasks comprehensively and help users improve their research efficiency.
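As a rough illustration of how the three components fit together, the sketch below models an environment as a list of sequences, an evaluation as a scoring function, and an executor as a tracker callable. All names here (`TrackingSequence`, `run_task`, `mean_iou`, `static_executor`) are hypothetical and are not part of the SOTVerse code base. Mean IoU is only a stand-in for the evaluation scheme, and the executor is simplified to see the initial box rather than image frames.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class TrackingSequence:
    """One environment item: a named sequence with per-frame ground truth."""
    name: str
    ground_truth: List[Box]


# Hypothetical signatures (assumptions, not the SOTVerse API):
# an executor maps (initial box, frame count) to predicted boxes;
# an evaluation maps (predictions, ground truth) to a single score.
Executor = Callable[[Box, int], List[Box]]
Evaluation = Callable[[List[Box], List[Box]], float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predictions: List[Box], ground_truth: List[Box]) -> float:
    """Stand-in evaluation: average IoU over all frames."""
    scores = [iou(p, g) for p, g in zip(predictions, ground_truth)]
    return sum(scores) / len(scores) if scores else 0.0


def static_executor(initial_box: Box, num_frames: int) -> List[Box]:
    """Toy executor that never moves from the initial box."""
    return [initial_box] * num_frames


def run_task(environment: List[TrackingSequence],
             evaluation: Evaluation,
             executor: Executor) -> Dict[str, float]:
    """Run one executor over an environment and score every sequence."""
    results: Dict[str, float] = {}
    for seq in environment:
        if not seq.ground_truth:
            continue  # skip empty sequences
        predictions = executor(seq.ground_truth[0], len(seq.ground_truth))
        results[seq.name] = evaluation(predictions, seq.ground_truth)
    return results
```

Calling `run_task(environment, mean_iou, static_executor)` returns one score per sequence name. Swapping the environment or the evaluation function changes the task without touching the executor, which is the separation the paradigm describes.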
We organize existing benchmarks to form the environment of SOTVerse, which includes 12.56 million frames and frame-level challenging attribute labels to model the real world. In addition, an environment generation method helps researchers efficiently form their own task spaces.
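One way frame-level attribute labels could be used to carve out a custom task space is sketched below. It is not the SOTVerse environment generation method. The function `select_subsequences`, the per-frame dictionary layout, and the `"occlusion"` label name are all assumptions made for illustration. The idea is simply to keep contiguous runs of frames in which a chosen attribute is flagged.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def select_subsequences(labels: List[Dict[str, bool]],
                        attribute: str,
                        min_length: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame ranges where `attribute` is flagged.

    `labels` holds one dict per frame mapping attribute names to booleans
    (a hypothetical layout). Runs shorter than `min_length` are dropped.
    """
    ranges: List[Tuple[int, int]] = []
    index = 0
    # groupby yields consecutive runs of frames sharing the same flag value.
    for flagged, run in groupby(labels, key=lambda f: bool(f.get(attribute, False))):
        length = sum(1 for _ in run)
        if flagged and length >= min_length:
            ranges.append((index, index + length - 1))
        index += length
    return ranges


# Example: five frames with a hypothetical "occlusion" label.
frame_labels = [
    {"occlusion": False},
    {"occlusion": True},
    {"occlusion": True},
    {"occlusion": False},
    {"occlusion": True},
]
occluded_ranges = select_subsequences(frame_labels, "occlusion", min_length=2)
```

The `min_length` threshold is a design choice in this sketch: very short runs are often too brief to form a meaningful tracking sub-sequence, so they can be filtered out.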
We first identify the limitations of existing evaluation systems and metrics through detailed analysis, and then design a new evaluation scheme for SOTVerse, which includes two mechanisms and new metrics to suit various tasks.
VLTVerse introduces 10 sequence-level challenge labels and 6 types of multi-granularity semantic information, creating a flexible and multi-dimensional evaluation space for VLT.
We conduct extensive experiments in SOTVerse and VLTVerse and analyze the performance of various executors. The experimental results reveal the shortcomings of existing work and verify the effectiveness of the evaluation schemes in SOTVerse and VLTVerse.
Please cite our paper if SOTVerse helps your research.
Please contact us if you have any problems or suggestions.