These models generate text descriptions and captions from videos. They use large multimodal transformers trained on vast datasets that include both video content and corresponding text, such as captions, titles, and descriptions.
Key capabilities: video captioning, scene description, summarization, and conversational question answering about video content.
Featured models
Google’s hybrid "thinking" AI model optimized for speed and cost-efficiency
Updated 2 weeks, 1 day ago
489.7K runs
Generate TikTok-style captions powered by Whisper (GPU)
Updated 1 year, 1 month ago
203.9K runs
Automatically add captions to a video
Updated 2 years ago
72.3K runs
Recommended Models
If you’re after quick turnaround for short clips, lucataco/qwen2-vl-7b-instruct is a strong choice—it’s designed to process short videos efficiently while maintaining descriptive accuracy.
Another practical option for speed is fictions-ai/autocaption, which is optimized for adding captions to videos and handles quick runs well, as long as ultra-low latency isn’t critical.
If you want good quality without excessive compute, lucataco/qwen2-vl-7b-instruct strikes a great balance. It supports detailed video understanding and performs well for most captioning and summarization tasks.
For more complex videos that require deeper reasoning or multiple scenes, lucataco/apollo-7b offers a richer understanding with slightly higher compute tradeoffs.
For social-style captioning—bold overlays, subtitles, and engaging visuals—fictions-ai/autocaption is purpose-built. It lets you upload a video and receive an output with clean, readable captions.
You can customize font, color, and subtitle placement, making it ideal for short-form content like Reels or TikToks.
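As a rough sketch, here is how a call to fictions-ai/autocaption might look with Replicate’s Python client. The input parameter names (video_file_input, font, color, subs_position) are illustrative assumptions; check the model’s API tab for the exact schema before running.

```python
import replicate  # assumes REPLICATE_API_TOKEN is set in the environment

# Minimal sketch: add social-style captions to a local clip.
# NOTE: input names below are assumptions for illustration; confirm them
# on the model's API tab.
output = replicate.run(
    "fictions-ai/autocaption",
    input={
        "video_file_input": open("clip.mp4", "rb"),  # local video to caption
        "font": "Poppins-ExtraBold",                 # assumed font option
        "color": "white",                            # assumed caption color
        "subs_position": "bottom75",                 # assumed placement option
    },
)
print(output)  # typically a URL or file object pointing to the captioned video
```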
If your goal is to generate textual descriptions of what’s happening in a video (instead of just overlaying captions), lucataco/qwen2-vl-7b-instruct supports video input and produces detailed visual reasoning outputs.
This makes it useful for accessibility captions, summaries, or content indexing.
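A minimal sketch of generating a description with Replicate’s Python client is shown below; the media and prompt input names are assumptions, so verify them on the model’s API tab.

```python
import replicate  # assumes REPLICATE_API_TOKEN is set in the environment

# Sketch: ask a vision-language model to describe a video.
# The "media"/"prompt" input names are assumptions -- check the API tab.
description = replicate.run(
    "lucataco/qwen2-vl-7b-instruct",
    input={
        "media": "https://example.com/clip.mp4",  # video URL or uploaded file
        "prompt": "Describe what happens in this video, scene by scene.",
    },
)
print(description)  # plain-text description you can index, summarize, or display
```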
There are two main types of models here:
Overlay caption models typically output a video with subtitles and sometimes an optional transcript file.
Vision-language models usually output text responses—scene descriptions, summaries, or even conversational answers about the video content.
Many captioning and vision-language models are open source and can be self-hosted using Cog or Docker.
To publish your own model, package it with Cog: a cog.yaml file defines the environment and a predict.py defines its inputs and outputs. Push it to Replicate with cog push, and it will run automatically on managed GPUs.
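Here is a minimal, hypothetical predict.py sketch for a video-captioning model; the model-loading helper is a placeholder, and cog.yaml would point to the class via predict: "predict.py:Predictor".

```python
# predict.py -- minimal Cog predictor sketch for a hypothetical captioning model.
# cog.yaml references this class with:  predict: "predict.py:Predictor"
# Publish with:  cog push r8.im/<your-username>/<model-name>
from cog import BasePredictor, Input, Path


class Predictor(BasePredictor):
    def setup(self):
        # Load weights once per container start (placeholder helper).
        self.model = load_my_captioning_model()  # hypothetical function

    def predict(
        self,
        video: Path = Input(description="Video to caption"),
        prompt: str = Input(default="Describe this video."),
    ) -> str:
        # Run inference and return a text caption (placeholder logic).
        return self.model.caption(str(video), prompt)
```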
Yes—most models in this collection allow commercial use, but always check the License section on the model’s page for specific terms.
If you’re adding captions to copyrighted content, ensure you have the right to modify and distribute that media.
Go to a model’s page on Replicate, upload your video, and click Run.
Models like fictions-ai/autocaption return a captioned video, while lucataco/qwen2-vl-7b-instruct and lucataco/apollo-7b generate text outputs that you can format or display however you like.
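The same flow works through the API. A rough sketch of handling both output shapes follows, assuming the overlay model returns a single downloadable video URL and the vision-language model returns text; input names and output shapes vary by model and client version, so treat these as placeholders.

```python
import replicate
import urllib.request

# Overlay model: output is typically a URL/file for the captioned video.
captioned = replicate.run(
    "fictions-ai/autocaption",
    input={"video_file_input": open("clip.mp4", "rb")},  # assumed input name
)
# If the model returns a list, take the first element before downloading.
urllib.request.urlretrieve(str(captioned), "captioned.mp4")

# Vision-language model: output is text you can store or display directly.
summary = replicate.run(
    "lucataco/apollo-7b",
    input={"video": open("clip.mp4", "rb"), "prompt": "Summarize this video."},  # assumed input names
)
print(summary)
```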
If you need a separate transcript file (such as .srt or .vtt), confirm that the model supports transcript output.
Recommended Models
MiniCPM-V 4.0 has strong image and video understanding performance
Updated 4 months, 2 weeks ago
278 runs
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Updated 8 months, 4 weeks ago
31.4K runs
VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding
Updated 10 months, 2 weeks ago
21.5K runs
Latest model in the Qwen family for chatting about videos and images
Updated 1 year ago
323.5K runs
Apollo 7B - An Exploration of Video Understanding in Large Multimodal Models
Updated 1 year ago
122.6K runs
Apollo 3B - An Exploration of Video Understanding in Large Multimodal Models
Updated 1 year ago
146 runs
Video Preprocessing tool for captioning multiple videos using GPT, Claude or Gemini
Updated 1 year ago
179 runs
CogVLM2: Visual Language Models for Image and Video Understanding
Updated 1 year, 3 months ago
671.8K runs
SOTA open-source model for chatting with videos and the newest model in the Qwen family
Updated 1 year, 4 months ago
606 runs
A multimodal LLM-based AI assistant trained with alignment techniques. Qwen-VL-Chat supports flexible interaction, such as multi-round question answering, and creative capabilities.
Updated 2 years, 2 months ago
825.6K runs