Cloud-CV/vilbert-multi-task


12-in-1: Multi-Task Vision and Language Representation Learning Web Demo

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets, often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model trained on 12 datasets from four broad categories of task: visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of jointly training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.

Arxiv Paper Link: https://arxiv.org/abs/1912.02315

Demo Link: https://vilbert.cloudcv.org/

If you have further questions about the project, you can email us at team@cloudcv.org

Built & Maintained by

Rishabh Jain

Acknowledgements

We thank Jiasen Lu for his help.
