# ByteDance launches Vi-PRoM, a visual pre-training scheme to improve robot manipulation success rates
In recent years, visual pre-training on large-scale real-world data has made significant progress, showing great potential for robot learning from pixel observations. However, these studies differ in their pre-training data, methods, and models, so which types of data, pre-training methods, and models best support robot control remains an open question.
Based on this, researchers from the ByteDance Research team comprehensively studied the impact of visual pre-training strategies on robot manipulation tasks from three basic perspectives: the pre-training dataset, the model architecture, and the training method, and reported several experimental findings that benefit robot learning. In addition, they proposed a visual pre-training scheme for robot manipulation called Vi-PRoM, which combines self-supervised and supervised learning. The former uses contrastive learning to extract latent patterns from large-scale unlabeled data, while the latter aims to learn visual semantics and temporal dynamics. Extensive robot manipulation experiments in multiple simulation environments and on real robots demonstrate the superiority of this scheme.
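To illustrate how these two kinds of objectives might be combined, the sketch below shows a hypothetical PyTorch training objective with a shared encoder, a semantic classification head, and a temporal-order head. The names, head designs, and equal loss weights are assumptions for exposition, not ByteDance's released code.

```python
# Minimal sketch of a Vi-PRoM-style combined objective (hypothetical names;
# the exact heads, label sources, and loss weights are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ViPRoMSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()                      # expose the 2048-d pooled feature
        self.encoder = backbone
        self.semantic_head = nn.Linear(2048, num_classes)   # visual-semantic pseudo-labels
        self.temporal_head = nn.Linear(2048 * 2, 2)          # frame order: before / after

    def forward(self, frame_a, frame_b, pseudo_label, order_label, contrastive_loss):
        feat_a, feat_b = self.encoder(frame_a), self.encoder(frame_b)
        # 1) self-supervised contrastive term (e.g., MoCo-style), computed elsewhere
        loss_con = contrastive_loss
        # 2) supervised visual-semantic term on pseudo-labels
        loss_sem = F.cross_entropy(self.semantic_head(feat_a), pseudo_label)
        # 3) supervised temporal-dynamics term: which frame comes first in the video
        pair = torch.cat([feat_a, feat_b], dim=1)
        loss_tmp = F.cross_entropy(self.temporal_head(pair), order_label)
        return loss_con + loss_sem + loss_tmp            # equal weights assumed for illustration
```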
## Pre-training data
EgoNet is more powerful than ImageNet. The researchers pre-trained visual encoders on different datasets (ImageNet and EgoNet) with contrastive learning and compared their performance on robot manipulation tasks. As Table 1 below shows, the model pre-trained on EgoNet achieved better performance on robot manipulation tasks. Evidently, for manipulation tasks robots benefit more from the interaction knowledge and temporal relationships contained in videos. In addition, the egocentric natural images in EgoNet carry more global context about the world, which means richer visual features can be learned.

## Model structure
ResNet-50 performs better. As Table 2 below shows, ResNet-50 and ResNet-101 outperform ResNet-34 on robot manipulation tasks. However, performance does not improve further as the model grows from ResNet-50 to ResNet-101.
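For reference, backbones of different depths can be swapped behind the same interface. The following minimal sketch (the `build_encoder` helper is hypothetical) shows how ResNet-34/50/101 could be instantiated as visual encoders whose pooled features feed a manipulation policy.

```python
# Hypothetical sketch: swapping backbone capacity for the visual encoder.
import torch.nn as nn
import torchvision

def build_encoder(name: str = "resnet50") -> nn.Module:
    # ResNet-34 / 50 / 101 differ in depth and pooled feature width (512-d vs 2048-d).
    model = getattr(torchvision.models, name)(weights=None)
    model.fc = nn.Identity()   # expose the pooled feature for the downstream policy
    return model

encoder = build_encoder("resnet50")   # the best capacity/performance trade-off per Table 2
```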
## Pre-training method
Contrastive learning is the preferred pre-training method. As Table 3 below shows, MoCo-v3 outperforms MAE on both the ImageNet and EgoNet datasets, indicating that contrastive learning is more effective than masked image modeling. In addition, the visual semantics obtained through contrastive learning are more important for robot manipulation than the structural information learned through masked image modeling.
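To make the contrastive idea concrete, here is a minimal InfoNCE-style loss sketch. MoCo-v3 itself additionally uses a momentum encoder, projection and prediction heads, and specific augmentations that are omitted here, so this is illustrative rather than a faithful reimplementation.

```python
# Minimal InfoNCE sketch of the contrastive objective behind MoCo-style pre-training.
import torch
import torch.nn.functional as F

def info_nce(query: torch.Tensor, key: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """query, key: (N, D) features of two augmented views of the same N images."""
    q = F.normalize(query, dim=1)
    k = F.normalize(key, dim=1)
    logits = q @ k.t() / temperature                      # (N, N) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```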
## Algorithm introduction
This work conducted extensive experiments in two simulation environments (Franka Kitchen and MetaWorld). The results show that the proposed pre-training scheme outperforms previous state-of-the-art methods on robot manipulation. The ablation results in the table below confirm the importance of visual semantic learning and temporal dynamics learning for robot manipulation. Furthermore, when both learning objectives are removed, the success rate of Vi-PRoM drops significantly, demonstrating the effectiveness of combining visual semantic learning with temporal dynamics learning.
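Evaluations of this kind typically freeze the pre-trained encoder and train a small policy head by behavior cloning on expert demonstrations. Below is a rough sketch of such a downstream setup; the feature, proprioception, and action dimensions are illustrative assumptions rather than the exact Franka Kitchen or MetaWorld configuration.

```python
# Rough sketch of a frozen-encoder behavior-cloning policy for simulated manipulation
# (feature, proprioception, and action dimensions are illustrative assumptions).
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim=2048, proprio_dim=9, action_dim=9):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)            # the pre-trained visual representation stays frozen
        self.head = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, proprio):
        with torch.no_grad():
            feat = self.encoder(image)          # pre-trained visual feature
        return self.head(torch.cat([feat, proprio], dim=1))

# Per-batch training step (sketch): loss = F.mse_loss(policy(image, proprio), expert_action)
```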
This work also investigates the scalability of Vi-PRoM. As shown on the left of the figure below, in the Franka Kitchen and MetaWorld simulation environments the success rate of Vi-PRoM steadily improves as the amount of demonstration data increases. Trained on a larger expert demonstration dataset, the Vi-PRoM model thus shows good scalability on robot manipulation tasks.
Thanks to Vi-PRoM's powerful visual representations, a real robot can successfully open drawers and cabinet doors.
The experimental results on Franka Kitchen show that Vi-PRoM achieves a higher success rate than R3M across five tasks and completes the actions to a higher degree.
R3M:
On MetaWorld, because Vi-PRoM's visual representation learns good semantic and dynamic features that can be better used for action prediction, Vi-PRoM requires fewer steps than R3M to complete the tasks.
R3M:
Vi-PRoM: