Cloudinary AI Vision
Last updated: Aug-31-2025
Cloudinary is a cloud-based service that provides solutions for image and video management. These include server or client-side upload, on-the-fly image and video transformations, fast CDN delivery, and a variety of asset management options.
The Cloudinary AI Vision add-on combines LLM (Large Language Model) capabilities, specialized models, advanced algorithms, prompt engineering, and Cloudinary's domain knowledge to interpret and respond to queries about visual content. It answers questions (e.g., "Are there flowers?") and requests (e.g., "Describe this image") about an image's content. By integrating visual and textual data, AI Vision provides a holistic, adaptable understanding of content, enabling businesses to tailor solutions that align closely with their brand and customer expectations.
AI Vision is designed to serve a variety of needs across different industries, providing a powerful tool that automates the analysis, tagging, moderation, and classification of visual content.
Getting started
Before you can use the Cloudinary AI Vision add-on:
You must have a Cloudinary account. If you don't already have one, you can sign up for a free account.
Register for the add-on: make sure you're logged in to your account and then go to the Add-ons page. For more information about add-on registrations, see Registering for add-ons.
Keep in mind that many of the examples on this page use our SDKs. For SDK installation and configuration details, see the relevant SDK guide.
If you're new to Cloudinary, you may want to take a look at the Developer Kickstart for a hands-on, step-by-step introduction to a variety of features.
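For reference, a minimal SDK configuration in Python looks like the following. The credential values shown are placeholders; use the values from your own Cloudinary Console.

```python
import cloudinary

# Replace the placeholders with the credentials from your Cloudinary Console.
# Alternatively, set the CLOUDINARY_URL environment variable instead.
cloudinary.config(
    cloud_name="<your_cloud_name>",
    api_key="<your_api_key>",
    api_secret="<your_api_secret>",
    secure=True,
)
```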
Overview
AI Vision scales to large volumes of media assets and is designed to work out of the box, so you can integrate it without complex customization or prompt engineering. The add-on supports the following modes:
- Tagging - Automatically tag images based on provided definitions.
- Moderation - Evaluate images against specific moderation questions.
- General - Gain insights from images by asking open-ended questions.
Tagging mode
The Tagging mode accepts a list of tag names along with their corresponding descriptions. If the image matches the description, which may encompass various elements, the response will be appropriately tagged. This approach enables customers to align with their own brand taxonomy, offering a dynamic, flexible, and open method for image classification.
To return the tags for an image based on provided definitions, call the ai_vision_tagging method with the following parameters:
- `source`: The image to be analyzed. Either a `uri` or an `asset_id` can be specified.
- `tag_definitions`: A list of tag definitions containing names and descriptions (max 10).
Example Request:
Example Response:
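The original request and response samples aren't reproduced on this page. As an illustrative sketch only (the analyze endpoint path, the `analysis_type` value, and the JSON wrapping are assumptions; `source` and `tag_definitions` come from the parameter list above), a tagging request payload could be built like this:

```python
import json

# Hypothetical placeholders -- substitute your own cloud name and credentials.
CLOUD_NAME = "<cloud_name>"
# Endpoint path is an assumption based on Cloudinary's v2 analysis API.
url = f"https://api.cloudinary.com/v2/analysis/{CLOUD_NAME}/analyze"

payload = {
    "source": {"uri": "https://res.cloudinary.com/demo/image/upload/sample.jpg"},
    "analysis_type": "ai_vision_tagging",
    "parameters": {
        "tag_definitions": [  # up to 10 definitions
            {"name": "flowers", "description": "The image contains flowers."},
            {"name": "outdoors", "description": "The photo was taken outdoors."},
        ]
    },
}

print(json.dumps(payload, indent=2))
# Send with any HTTP client using Basic auth (API key/secret), e.g.:
# requests.post(url, json=payload, auth=(API_KEY, API_SECRET))
```

The response would list the tags whose descriptions matched the image.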
Moderation mode
The Moderation mode accepts multiple questions about an image and returns a concise answer of "yes," "no," or "unknown" for each. This functionality allows for a nuanced evaluation of whether the image adheres to specific content policies, creative specs, or aesthetic criteria.
To evaluate images against specific moderation questions, call the ai_vision_moderation method with the following parameters:
- `source`: The image to be analyzed. Either a `uri` or an `asset_id` can be specified.
- `rejection_questions`: A list of yes/no questions to ask (max 10).
Example Request:
Example Response:
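As with tagging, the original samples aren't reproduced here. The following sketch builds a hypothetical moderation request (the endpoint path, `analysis_type` value, and JSON wrapping are assumptions; `source` and `rejection_questions` come from the parameter list above). Each question would be answered "yes," "no," or "unknown" in the response:

```python
import json

CLOUD_NAME = "<cloud_name>"  # hypothetical placeholder
url = f"https://api.cloudinary.com/v2/analysis/{CLOUD_NAME}/analyze"  # assumed path

payload = {
    # An asset already in your account can be referenced by asset_id instead of a uri.
    "source": {"asset_id": "<asset_id>"},
    "analysis_type": "ai_vision_moderation",
    "parameters": {
        "rejection_questions": [  # up to 10 yes/no questions
            "Does the image contain any visible text?",
            "Is the main subject out of focus?",
        ]
    },
}

print(json.dumps(payload, indent=2))
```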
General mode
The General mode serves a wide array of applications by providing detailed answers to diverse questions about an image. Users can inquire about any aspect of an image, such as identifying objects, understanding scenes, or interpreting text within the image.
To ask general questions, call the ai_vision_general method with the following parameters:
- `source`: The image to be analyzed. Either a `uri` or an `asset_id` can be specified.
- `prompts`: A list of questions or requests to ask (max 10).
Example Request:
Example Response:
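Again as a sketch only (the endpoint path, `analysis_type` value, and JSON wrapping are assumptions; `source` and `prompts` come from the parameter list above), a general-mode request with open-ended prompts could look like this:

```python
import json

CLOUD_NAME = "<cloud_name>"  # hypothetical placeholder
url = f"https://api.cloudinary.com/v2/analysis/{CLOUD_NAME}/analyze"  # assumed path

payload = {
    "source": {"uri": "https://res.cloudinary.com/demo/image/upload/sample.jpg"},
    "analysis_type": "ai_vision_general",
    "parameters": {
        "prompts": [  # up to 10 open-ended questions or requests
            "Describe this image in detail.",
            "What text, if any, appears in the image?",
        ]
    },
}

print(json.dumps(payload, indent=2))
```

The response would contain a free-text answer for each prompt.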
Tokens
Your AI Vision Add-on quota is based on tokens. A token is a unit of measurement, similar to a word, used to quantify the processing required. Tokens can represent both text and images, with pricing based on the number of tokens processed.
- Input tokens: Data sent to AI Vision, like text or images. Images are treated as input and converted into tokens.
- Output tokens: Data generated by AI Vision in response, like text descriptions.
Consolidating both into a single token count gives a clear picture of the total tokens used.
Every response also includes a limits node with the number of tokens used by the operation. For example:
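The original example isn't reproduced on this page. As a hypothetical illustration only (the field names inside `limits` and the response wrapping are assumptions; check an actual response for the exact shape), the usage could be read like this:

```python
# Hypothetical response fragment -- the "limits" field names are assumptions.
response = {
    "data": {
        "analysis": {
            "responses": [
                {"prompt": "Describe this image", "value": "A red bicycle leaning against a wall."}
            ]
        }
    },
    "limits": {"usage": {"count": 350}},  # tokens consumed by this operation (assumed field names)
}

tokens_used = response["limits"]["usage"]["count"]
print(f"This operation used {tokens_used} tokens.")
```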