Multimodal LLM Node

Overview

The Multimodal LLM Node enables your flow to process and analyze both text and images using Large Language Models with multimodal capabilities. This node can handle complex tasks such as image description, visual question answering, and combined text-image analysis, making it ideal for applications that require understanding of both visual and textual content.

Usage cost: 1 credit

Configuration Settings

  1. Model Selection

    • Primary Model*: Select the main multimodal LLM to process text and images

    • Fallback Model: Optional backup model used if the primary model fails

    • Temperature (0-1): Controls response randomness and creativity

      • Lower values (closer to 0): More focused, deterministic responses

      • Higher values (closer to 1): More creative, varied responses

  2. Prompts

    • System Prompt: Instructions/context for the model's behavior

    • Prompt*: The main instruction or question for the model

    • Images*: One or more image inputs for visual analysis

    • Past Message History: Optional chat history for context

  3. Output Format

    • Output as JSON (Toggle):

      • Off (Default): The node outputs the model's response as a plain string.

      • On: Instructs the model to format its response as JSON and attempts to parse the output string. If parsing succeeds, the response output port contains a JSON object or array; if parsing fails, it contains the original string.

    Note: JSON structure and parsing success depend heavily on the model's ability to follow instructions. Clearly prompting for JSON format is recommended when this is enabled.
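
For illustration, the parse-and-fallback behavior described above works roughly like the following sketch. This is not Waterflai's actual implementation; the function name and signature are hypothetical.

```python
import json

def resolve_response(raw_output: str, output_as_json: bool):
    """Mimic the Output as JSON toggle: try to parse the model's
    output as JSON, falling back to the original string on failure."""
    if not output_as_json:
        return raw_output  # Toggle off: always a plain string
    try:
        return json.loads(raw_output)  # Success: a dict or list
    except json.JSONDecodeError:
        return raw_output  # Failure: the original string passes through
```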

Output Ports

  • response (string, or a JSON object/array when Output as JSON is enabled and parsing succeeds): The model's generated response based on the text and image inputs
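
Because the port can carry either a plain string or parsed JSON, downstream code that consumes the flow's output (for example, through the API integration) may want to handle both cases. A minimal sketch; the caption field name is purely hypothetical and depends on what your prompt asks for:

```python
def extract_text(response) -> str:
    """Defensively read the response port, which may be a parsed JSON
    value (Output as JSON on, parsing succeeded) or a plain string
    (toggle off, or parsing failed)."""
    if isinstance(response, dict):
        return str(response.get("caption", ""))  # Hypothetical field name
    if isinstance(response, list):
        return " ".join(str(item) for item in response)
    return response  # Plain string
```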

Best Practices

  1. Model Selection

    • Choose models that support multimodal processing

    • Ensure fallback models also have multimodal capabilities

    • Consider model-specific limitations for image processing

  2. Image Handling

    • Provide clear, high-quality images

    • Use appropriate image formats supported by the model

  3. Prompt Engineering

    • Be specific about what aspects of the images to analyze

    • Structure prompts to guide the model's attention

    • Include clear instructions for combining text and image analysis

    • Examples:

      • "Describe the main elements in this image and their relationship"

      • "Compare these two images and explain the differences"

      • "Based on the image and context, answer the following question..."

  4. Temperature Settings

    • Use lower temperatures (0.1-0.3) for:

      • Factual image descriptions

      • Technical analysis

      • Precise measurements or details

    • Use higher temperatures (0.6-0.9) for:

      • Creative interpretations

      • Brainstorming based on visual inputs

      • Generating varied descriptions

  5. Performance Optimization

    • Optimize image sizes before processing (see the resizing sketch after this list)

    • Limit the number of images per request

    • Consider token limitations when combining images and text
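
As suggested in the Image Handling and Performance Optimization items above, downsizing images before they reach the node can reduce processing timeouts and token usage. Below is a minimal preprocessing sketch using Pillow; the 1024-pixel limit and JPEG output are assumptions to adjust against your model's documented limits:

```python
from io import BytesIO

from PIL import Image

MAX_SIDE = 1024  # Assumed limit; check your model's image-size guidance

def prepare_image(path: str) -> bytes:
    """Downscale an image so its longest side is at most MAX_SIDE,
    then re-encode it as JPEG before sending it to the node."""
    with Image.open(path) as img:
        img = img.convert("RGB")  # JPEG has no alpha channel
        img.thumbnail((MAX_SIDE, MAX_SIDE))  # Preserves aspect ratio
        buffer = BytesIO()
        img.save(buffer, format="JPEG", quality=85)
        return buffer.getvalue()
```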

Common Issues

  • Image processing timeouts with large or complex images

  • Token limit exceeded when processing multiple images

  • Inconsistent responses with high temperature settings
