# Multimodal LLM Node

### Overview

The Multimodal LLM Node enables your flow to process and analyze both text and images using Large Language Models with multimodal capabilities. This node can handle complex tasks such as image description, visual question answering, and combined text-image analysis, making it ideal for applications that require understanding of both visual and textual content.

Usage cost: 1 credit

### Configuration Settings

1. **Model Selection**
   * Primary Model\*: Select the main multimodal LLM model
   * Fallback Model: Optional backup model used if the primary model fails
   * Temperature (0-1): Controls response randomness and creativity
     * Lower values (closer to 0): More focused, deterministic responses
     * Higher values (closer to 1): More creative, varied responses
2. **Prompts**
   * System Prompt: Instructions/context for the model's behavior
   * Prompt\*: The main instruction or question for the model
   * Images\*: One or more image inputs for visual analysis
   * Past Message History: Optional chat history for context
3. **Output Format**

   * **Output as JSON** (Toggle):
     * **Off (Default):** The node outputs the model's response as a plain string.
     * **On:** Instructs the model to format its response as JSON and attempts to parse the output string. If parsing succeeds, the `response` output port will contain a JSON object/array. If it fails, `response` will contain the original string.

   *Note:* JSON structure and parsing success depend heavily on the model's ability to follow instructions. Clearly prompting for JSON format is recommended when this is enabled.
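
The parse-or-fall-back behavior described above can be sketched as follows. This is a minimal illustration, not the node's actual implementation: `parse_response` is a hypothetical helper name, and treating bare scalars such as `42` as unparsed strings is an assumption made here to match the "JSON object/array" wording.

```python
import json
from typing import Any


def parse_response(raw: str) -> Any:
    """Return parsed JSON when `raw` is a JSON object or array, else the original string.

    Hypothetical helper: mirrors the documented behavior of the toggle,
    but is not taken from the node's source.
    """
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Parsing failed: keep the model's original text.
        return raw
    # Assumption: only objects and arrays count as structured output.
    return parsed if isinstance(parsed, (dict, list)) else raw
```

Downstream steps that consume `response` may want to check its type in the same way, since a model that ignores the JSON instruction still yields a plain string.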

### Output Ports

* `response` (string, or a JSON object/array when **Output as JSON** is on and parsing succeeds): The model's generated response based on both text and image inputs

### Best Practices

1. **Model Selection**
   * Choose models that support multimodal processing
   * Ensure fallback models also have multimodal capabilities
   * Consider model-specific limitations for image processing
2. **Image Handling**
   * Provide clear, high-quality images
   * Use appropriate image formats supported by the model
3. **Prompt Engineering**
   * Be specific about what aspects of the images to analyze
   * Structure prompts to guide the model's attention
   * Include clear instructions for combining text and image analysis
   * Examples:
     * "Describe the main elements in this image and their relationship"
     * "Compare these two images and explain the differences"
     * "Based on the image and context, answer the following question..."
4. **Temperature Settings**
   * Use lower temperatures (0.1-0.3) for:
     * Factual image descriptions
     * Technical analysis
     * Precise measurements or details
   * Use higher temperatures (0.6-0.9) for:
     * Creative interpretations
     * Brainstorming based on visual inputs
     * Generating varied descriptions
5. **Performance Optimization**
   * Optimize image sizes before processing
   * Limit the number of images per request
   * Consider token limitations when combining images and text
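
As one way to act on the image-size advice above, the sketch below computes downscaled dimensions that preserve aspect ratio before an image is resized with whatever imaging tool you use. The function name `fit_within` and the default of 1024 pixels are assumptions for illustration, not limits documented for this node or for any particular model.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1024) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most `max_side`.

    Hypothetical helper; `max_side=1024` is an illustrative default only.
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("width, height, and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        # Already small enough: never upscale.
        return width, height
    # Scale both sides by the same factor to keep the aspect ratio.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height
```

For example, a 4000x3000 photo maps to 1024x768 with the default setting, while an 800x600 image is left unchanged.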

### Common Issues

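* Image processing timeouts with large or complex images: optimize image sizes before processing
* Token limit exceeded when processing multiple images: limit the number of images per request
* Inconsistent responses with high temperature settings: lower the temperature for factual or technical tasks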
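
Failures such as timeouts are the case the optional Fallback Model setting is meant to cover. The sketch below shows the general primary-then-fallback pattern under stated assumptions: `run_with_fallback` and the two callables are hypothetical stand-ins, and the node's real retry and error-handling rules are not documented on this page.

```python
from typing import Callable, Optional


def run_with_fallback(
    primary: Callable[[str], str],
    fallback: Optional[Callable[[str], str]],
    prompt: str,
) -> str:
    """Call `primary`; if it raises and a fallback is configured, call `fallback`.

    Hypothetical sketch of the pattern, not the node's implementation.
    """
    try:
        return primary(prompt)
    except Exception:
        if fallback is None:
            # No fallback configured: surface the original error.
            raise
        return fallback(prompt)
```

Because the fallback receives the same prompt and images, it only helps if that model also supports multimodal input, as noted under Best Practices.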

