# ByteDance launches Vi-PRoM, a visual pre-training scheme to improve robot manipulation success rates
In recent years, visual pre-training on large-scale real-world data has made significant progress, showing great potential for robot learning from pixel observations. However, these studies differ in their pre-training data, methods, and models, so which types of data, pre-training methods, and models best support robot control remains an open question.
Based on this, researchers from the ByteDance Research team comprehensively studied the impact of visual pre-training strategies on robot manipulation tasks from three basic perspectives: the pre-training dataset, the model architecture, and the training method, and reported several experimental findings that benefit robot learning. In addition, they proposed a visual pre-training scheme for robot manipulation called Vi-PRoM, which combines self-supervised and supervised learning. The former uses contrastive learning to extract latent patterns from large-scale unlabeled data, while the latter aims to learn visual semantics and temporal dynamics. Extensive robot manipulation experiments in multiple simulation environments and on real robots demonstrate the superiority of this scheme.
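To illustrate how these two kinds of objectives might be combined, the sketch below shows a hypothetical PyTorch training objective with a shared encoder, a semantic classification head, and a temporal-order head. The names, head designs, and equal loss weights are assumptions for exposition, not ByteDance's released code.

```python
# Minimal sketch of a Vi-PRoM-style combined objective (hypothetical names;
# the exact heads, label sources, and loss weights are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ViPRoMSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()                      # expose the 2048-d pooled feature
        self.encoder = backbone
        self.semantic_head = nn.Linear(2048, num_classes)   # visual-semantic pseudo-labels
        self.temporal_head = nn.Linear(2048 * 2, 2)          # frame order: before / after

    def forward(self, frame_a, frame_b, pseudo_label, order_label, contrastive_loss):
        feat_a, feat_b = self.encoder(frame_a), self.encoder(frame_b)
        # 1) self-supervised contrastive term (e.g., MoCo-style), computed elsewhere
        loss_con = contrastive_loss
        # 2) supervised visual-semantic term on pseudo-labels
        loss_sem = F.cross_entropy(self.semantic_head(feat_a), pseudo_label)
        # 3) supervised temporal-dynamics term: which frame comes first in the video
        pair = torch.cat([feat_a, feat_b], dim=1)
        loss_tmp = F.cross_entropy(self.temporal_head(pair), order_label)
        return loss_con + loss_sem + loss_tmp            # equal weights assumed for illustration
```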
## Pre-training data
EgoNet is more powerful than ImageNet. The researchers pre-trained visual encoders on different datasets (ImageNet and EgoNet) with contrastive learning and compared their performance on robot manipulation tasks. As Table 1 below shows, the model pre-trained on EgoNet achieved better performance on robot manipulation tasks. Evidently, for manipulation tasks robots benefit more from the interaction knowledge and temporal relationships contained in videos. In addition, the egocentric natural images in EgoNet carry more global context about the world, which means richer visual features can be learned.

## Model structure
ResNet-50 performs better. As Table 2 below shows, ResNet-50 and ResNet-101 outperform ResNet-34 on robot manipulation tasks. However, performance does not improve further as the model grows from ResNet-50 to ResNet-101.
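For reference, backbones of different depths can be swapped behind the same interface. The following minimal sketch (the `build_encoder` helper is hypothetical) shows how ResNet-34/50/101 could be instantiated as visual encoders whose pooled features feed a manipulation policy.

```python
# Hypothetical sketch: swapping backbone capacity for the visual encoder.
import torch.nn as nn
import torchvision

def build_encoder(name: str = "resnet50") -> nn.Module:
    # ResNet-34 / 50 / 101 differ in depth and pooled feature width (512-d vs 2048-d).
    model = getattr(torchvision.models, name)(weights=None)
    model.fc = nn.Identity()   # expose the pooled feature for the downstream policy
    return model

encoder = build_encoder("resnet50")   # the best capacity/performance trade-off per Table 2
```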
## Pre-training method
Contrastive learning is the preferred pre-training method. As Table 3 below shows, MoCo-v3 outperforms MAE on both the ImageNet and EgoNet datasets, indicating that contrastive learning is more effective than masked image modeling. In addition, the visual semantics obtained through contrastive learning are more important for robot manipulation than the structural information learned through masked image modeling.
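To make the contrastive idea concrete, here is a minimal InfoNCE-style loss sketch. MoCo-v3 itself additionally uses a momentum encoder, projection and prediction heads, and specific augmentations that are omitted here, so this is illustrative rather than a faithful reimplementation.

```python
# Minimal InfoNCE sketch of the contrastive objective behind MoCo-style pre-training.
import torch
import torch.nn.functional as F

def info_nce(query: torch.Tensor, key: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """query, key: (N, D) features of two augmented views of the same N images."""
    q = F.normalize(query, dim=1)
    k = F.normalize(key, dim=1)
    logits = q @ k.t() / temperature                      # (N, N) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```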
## Algorithm introduction
This work conducted extensive experiments in two simulation environments (Franka Kitchen and MetaWorld). The results show that the proposed pre-training scheme outperforms previous state-of-the-art methods on robot manipulation. The ablation results in the table below confirm the importance of visual semantic learning and temporal dynamics learning for robot manipulation. Furthermore, when both learning objectives are removed, the success rate of Vi-PRoM drops significantly, demonstrating the effectiveness of combining visual semantic learning with temporal dynamics learning.
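Evaluations of this kind typically freeze the pre-trained encoder and train a small policy head by behavior cloning on expert demonstrations. Below is a rough sketch of such a downstream setup; the feature, proprioception, and action dimensions are illustrative assumptions rather than the exact Franka Kitchen or MetaWorld configuration.

```python
# Rough sketch of a frozen-encoder behavior-cloning policy for simulated manipulation
# (feature, proprioception, and action dimensions are illustrative assumptions).
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim=2048, proprio_dim=9, action_dim=9):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)            # the pre-trained visual representation stays frozen
        self.head = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, proprio):
        with torch.no_grad():
            feat = self.encoder(image)          # pre-trained visual feature
        return self.head(torch.cat([feat, proprio], dim=1))

# Per-batch training step (sketch): loss = F.mse_loss(policy(image, proprio), expert_action)
```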
This work also investigates the scalability of Vi-PRoM. As shown on the left of the figure below, in the Franka Kitchen and MetaWorld simulation environments the success rate of Vi-PRoM steadily improves as the amount of demonstration data increases. Trained on a larger expert demonstration dataset, the Vi-PRoM model thus shows good scalability on robot manipulation tasks.
Thanks to Vi-PRoM's powerful visual representations, a real robot can successfully open drawers and cabinet doors.
The experimental results on Franka Kitchen show that Vi-PRoM achieves a higher success rate than R3M across five tasks and completes the actions to a higher degree.
R3M:
On MetaWorld, because Vi-PRoM's visual representation learns good semantic and dynamic features that can be better used for action prediction, Vi-PRoM requires fewer steps than R3M to complete the tasks.
R3M:
Vi-PRoM: