
BridgeTower

This repo is the official PyTorch implementation of the AAAI 2023 paper "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning".

Updates

Abstract

Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.

Architecture

(Figure: overall BridgeTower architecture.)
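
To make the bridge-layer idea concrete, here is a minimal PyTorch sketch of one bridge design discussed in the paper (element-wise add followed by LayerNorm). The names used here (BridgeLayer, dim, cross_modal_states, uni_modal_states) are our own illustration, not the repo's actual modules.

    import torch
    import torch.nn as nn

    class BridgeLayer(nn.Module):
        # Illustrative bridge layer: fuses the representation from a top
        # uni-modal encoder layer into one cross-modal encoder layer via
        # add & LayerNorm (one of the bridge designs discussed in the paper).
        def __init__(self, dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(dim)

        def forward(self, cross_modal_states, uni_modal_states):
            # Element-wise sum of the two streams, then LayerNorm.
            return self.norm(cross_modal_states + uni_modal_states)

    # Toy usage: one bridge per cross-modal layer and per modality.
    dim, seq_len = 768, 16
    bridge = BridgeLayer(dim)
    cross = torch.randn(1, seq_len, dim)  # input to a cross-modal layer
    uni = torch.randn(1, seq_len, dim)    # output of a top uni-modal layer
    fused = bridge(cross, uni)            # fed into the cross-modal layer

In BridgeTower, one such bridge connects each of the top uni-modal encoder layers to the corresponding cross-modal layer for both the visual and the textual stream, which is what enables the bottom-up, multi-level alignment and fusion described in the abstract.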

Main Results

(Figures: main results on downstream tasks.)

Deployment

  • Run setup.sh to set up the environment.
  • [Optional] We use wandb to track experiments. Remember to run wandb login and paste your API token before running the scripts.

Dataset Preparation

Checkpoints

  • Pre-trained checkpoints on 4M data: BASE and LARGE

  • Fine-tuned checkpoints for the downstream VL tasks

  • Here is an example of downloading a checkpoint (the pre-trained base model).

    # download azcopy
    wget https://aka.ms/downloadazcopy-v10-linux
    tar -xvf downloadazcopy-v10-linux
    sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
    sudo chmod -R 777 /usr/bin/azcopy
    # azcopy copy [remote path] [local path]
    azcopy copy "https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_pt_base.ckpt?sv=2020-10-02&st=2022-11-24T12%3A18%3A49Z&se=2027-11-25T12%3A18%3A00Z&sr=b&sp=r&sig=BJigddAMHfNUtQuTGH8bJUrzAO3LfaeSm48AXUqZngY%3D" "./BridgeTower_pt_base.ckpt"
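
Once downloaded, the checkpoint can be sanity-checked with plain PyTorch. The sketch below assumes the .ckpt is a standard PyTorch Lightning-style checkpoint whose weights live under a "state_dict" key; the file name is the one from the example above.

    import torch

    # Load the downloaded checkpoint on CPU and peek at its contents.
    # Assumes a PyTorch Lightning-style .ckpt with a "state_dict" entry.
    ckpt = torch.load("BridgeTower_pt_base.ckpt", map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)

    print(f"number of tensors: {len(state_dict)}")
    for name in list(state_dict)[:5]:
        # Show a few parameter names and shapes.
        print(name, tuple(state_dict[name].shape))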

Pre-training on Image-Text Datasets

# Pre-train BridgeTower Base Model
bash scripts/pre_train.sh
# Pre-train BridgeTower Large Model
bash scripts/pre_train_large.sh

Fine-tuning on Downstream VL Tasks

  • For VQAv2 evaluation, submit the JSON file generated in the logs/ directory to the eval.ai evaluation server to get the test-dev and/or test-std scores (a quick format check is sketched after the commands below).
# Base Model on VQAv2 without VLP
bash scripts/ftfs_base_vqa.sh
# Large Model on VQAv2 without VLP
bash scripts/ftfs_large_vqa.sh
# Base Model on VQAv2 with VLP
bash scripts/ftfpt_base_vqa.sh
# Large Model on VQAv2 with VLP
bash scripts/ftfpt_large_vqa.sh
# Base Model on IRTR-Flickr30K with VLP (directly use ITM with multiple false texts)
bash scripts/ftfpt_base_irtr_f30k.sh
# Base Model on IRTR-Flickr30K with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_f30k.sh
# Base Model on SNLI-VE with VLP
bash scripts/ftfpt_base_snlive.sh
# Base Model on NLVR^2 with VLP
bash scripts/ftfpt_base_nlvr2.sh
# Base Model on IRTR-MSCOCO with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_coco.sh
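
Before uploading to eval.ai, it can help to sanity-check the generated JSON. The sketch below assumes the standard VQA submission format (a list of records with "question_id" and "answer" fields) and uses a hypothetical file name under logs/; point it at the file your run actually writes.

    import json
    from pathlib import Path

    # Hypothetical path: replace with the JSON your VQAv2 run wrote under logs/.
    submission_path = Path("logs") / "vqa_submit.json"

    with open(submission_path) as f:
        predictions = json.load(f)

    # The eval.ai VQA server expects a list of {"question_id": int, "answer": str} records.
    assert isinstance(predictions, list)
    for record in predictions[:3]:
        assert "question_id" in record and "answer" in record
        print(record["question_id"], "->", record["answer"])
    print(f"total predictions: {len(predictions)}")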

Fine-tuning on Uni-Modal Tasks

# Base Model on CIFAR with VLP
bash scripts/ftfpt_base_cifar.sh
# Base Model on GLUE with VLP
bash scripts/ftfpt_base_glue.sh

Citation

@article{xu2022bridge,
 title={BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning},
 author={Xu, Xiao and Wu, Chenfei and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
 journal={arXiv preprint arXiv:2206.08657},
 year={2022}
}

Acknowledgement

Our code is partly based on the public code of ViLT and METER; we are highly grateful to their authors for releasing it.
