gemini-2.5-flash: This was the primary model for all text and structured data generation. I used it with a strictly defined responseSchema to ensure the AI's output was always in a predictable JSON format. This model was responsible for:
- Generating the costume's name, description, materials, and detailed text instructions.
- Powering the search feature by creating five distinct costume concepts from a single prompt.
- Handling the conversational "Refine" feature, where it would modify a costume based on follow-up user input.
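To illustrate the structured-output idea, here is a minimal sketch of what a schema and a defensive parse step could look like. The field names (`name`, `description`, `materials`, `instructions`) and the helper `parse_costume` are assumptions for illustration, not the app's actual schema or code. The uppercase type names follow the OpenAPI-style subset that responseSchema accepts, as I understand it. No API call is made here; the sketch only checks a JSON string that a model might return.

```python
import json

# Hypothetical schema: the field names are illustrative assumptions,
# not the app's real responseSchema.
COSTUME_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "name": {"type": "STRING"},
        "description": {"type": "STRING"},
        "materials": {"type": "ARRAY", "items": {"type": "STRING"}},
        "instructions": {"type": "ARRAY", "items": {"type": "STRING"}},
    },
    "required": ["name", "description", "materials", "instructions"],
}

# Maps schema type names to the Python types json.loads produces.
_PY_TYPES = {"STRING": str, "ARRAY": list, "OBJECT": dict}


def parse_costume(raw_json: str, schema: dict = COSTUME_SCHEMA) -> dict:
    """Parse model output and check its top-level fields against the schema."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in schema["required"]:
        if key not in data:
            raise ValueError(f"missing required field: {key}")
    for key, spec in schema["properties"].items():
        if key in data and not isinstance(data[key], _PY_TYPES[spec["type"]]):
            raise ValueError(f"field {key!r} should be {spec['type']}")
    return data
```

Even with a schema enforced on the model side, a small check like this on the client keeps the rest of the app from handling malformed data if a response is truncated or a field is missing.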
imagen-4.0-generate-001: This image generation model created the first image for each set of instructions, which serves as the visual foundation for the rest of the step-by-step guide.
gemini-2.5-flash-image-preview: This image editing model powers the app's most distinctive feature. It generated every subsequent instruction image by taking the previous step's image as input and adding the new details described in the current step's text.
Multimodal Features
The app is built around two core multimodal features.
Vision Understanding: Image to Costume Idea
A user can upload an image and receive a costume idea based on it. This goes beyond text-only prompts by supplying visual context: the user can upload a picture of their pet, a favorite object, or a friend, and the model interprets that image to generate a personalized and often unexpected costume concept. This makes the brainstorming process more personal and engaging.
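As a rough sketch of how an image and a text instruction can travel together in one request, the helper below builds a request body with an inline base64 image part followed by a text part. The `contents` / `parts` / `inline_data` layout mirrors the Gemini REST request shape as I understand it; the function name `build_image_prompt` and the prompt wording are hypothetical, and nothing is sent over the network.

```python
import base64


def build_image_prompt(image_bytes: bytes, mime_type: str, instruction: str) -> dict:
    """Build a multimodal request body: one inline image part plus one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "contents": [
            {
                "parts": [
                    # The uploaded picture, embedded as base64 data.
                    {"inline_data": {"mime_type": mime_type, "data": encoded}},
                    # The text instruction that tells the model what to do with it.
                    {"text": instruction},
                ]
            }
        ]
    }


# Hypothetical usage with placeholder bytes standing in for a real upload.
payload = build_image_prompt(
    b"abc",
    "image/png",
    "Suggest a costume idea inspired by this picture.",
)
```

In a real app the bytes would come from the user's upload, and the payload would be sent to the text model together with the structured-output configuration.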
Additive Image Generation: A Cohesive Visual Guide
The app's standout feature is its ability to create a set of instruction images that build upon one another. Instead of generating a new, disconnected image for each step, the system uses an iterative, multimodal process:
- Step 1: Generate a base image from a text prompt.
- Step 2+: Feed the image from the previous step plus the text for the current step into the image editing model (gemini-2.5-flash-image-preview).
This creates a coherent visual narrative: the user can watch the costume come together from one image to the next. Because each image extends the previous one, the instructions are easier to follow than a series of isolated diagrams. It turns the app from a simple idea generator into a step-by-step visual crafting guide.
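The iterative process described above can be sketched as a short loop. This is a minimal illustration, not the app's actual code: `generate_base` and `edit_image` are hypothetical callables standing in for the image generation model and the image editing model, and images are represented as raw bytes.

```python
from typing import Callable, List


def build_instruction_images(
    steps: List[str],
    generate_base: Callable[[str], bytes],
    edit_image: Callable[[bytes, str], bytes],
) -> List[bytes]:
    """Return one image per step, each built on the image from the step before."""
    if not steps:
        return []
    # Step 1: generate the base image from the first step's text alone.
    images = [generate_base(steps[0])]
    # Step 2+: pass the previous image plus the current step's text to the editor.
    for step_text in steps[1:]:
        images.append(edit_image(images[-1], step_text))
    return images


# Hypothetical stand-ins that make the chaining visible without calling any model.
def fake_base(text: str) -> bytes:
    return ("base:" + text).encode()


def fake_edit(previous: bytes, text: str) -> bytes:
    return previous + ("|" + text).encode()


result = build_instruction_images(["cut box", "paint box", "add straps"], fake_base, fake_edit)
```

Passing the two model calls in as parameters keeps the chaining logic separate from any particular SDK, and makes it easy to check that each step really receives the previous step's output.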