Speech-to-Text (STT) Node

Overview

The Speech-to-Text Node enables your flow to convert audio into text using advanced AI models. This node supports OpenAI, Azure OpenAI, and Groq providers, making it ideal for applications requiring audio transcription, voice command processing, or content accessibility features.

Usage cost: 2 credits

Configuration Settings

  1. Model Selection

    • Model*: Select from available OpenAI, Azure OpenAI, or Groq STT models

    • Note: Only OpenAI, Azure OpenAI, and Groq providers are currently supported

  2. Audio Input

    • Audio File*: Select an audio input from available audio variables

    • Supported Formats:

      • MP3 (.mp3)

      • MP4 (.mp4)

      • MPEG (.mpeg, .mpga)

      • M4A (.m4a)

      • WAV (.wav)

      • WebM (.webm)

      • FLAC (.flac)

      • OGG (.ogg, .oga)

  3. Transcription Settings

    • Language: Optional language specification

      • Auto-detect (default)

      • More than 60 supported languages, including English, Spanish, French, and German

    • Temperature: Controls transcription variability (0.0 to 1.0); the sketch after this list shows how these settings map onto a provider call

      • Lower values (0.0): More focused and deterministic

      • Higher values (1.0): More variety in word choice
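
The node performs the provider call for you, but it can help to see how the settings above correspond to a typical transcription request. The sketch below is illustrative only, not Waterflai code; it uses the OpenAI Python SDK and a hypothetical file name as assumptions:

```python
# Illustrative sketch: how the node's settings (model, audio file, language,
# temperature) map onto a provider transcription call via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("meeting.mp3", "rb") as audio_file:       # Audio File* input
    transcription = client.audio.transcriptions.create(
        model="whisper-1",        # Model* selection (example model id)
        file=audio_file,
        language="en",            # optional; omit to auto-detect
        temperature=0.0,          # 0.0 = most deterministic transcription
    )

print(transcription.text)         # corresponds to the node's `text` output
```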

Outputs

  • text (string): Transcribed text from the audio input

Best Practices

  1. Audio Preparation

    • Use clear, high-quality audio recordings

    • Minimize background noise and interference

    • Ensure the recording uses a supported format (a conversion sketch appears at the end of this section)

    • Keep file sizes reasonable for processing

  2. Language Selection

    • Use auto-detect for general transcription

    • Specify language for:

      • Improved accuracy with known language

      • Regional accent consideration

    • Consider target audience when selecting language

  3. Temperature Optimization

    • Use 0.0 for:

      • Technical content

      • Legal documents

      • Precise transcriptions

    • Use higher values for:

      • Creative content

      • Casual conversations

      • Multiple interpretation scenarios

  4. Format Selection

    • Use MP3 for:

      • General-purpose transcription

      • Balanced quality and file size

      • Wide compatibility

    • Use WAV or FLAC for:

      • High-fidelity requirements

      • Professional audio

      • Critical accuracy needs
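
If a recording arrives in an unusual container or at a very high bitrate, converting it before upload is often easier than troubleshooting afterwards. A minimal sketch, assuming pydub (which relies on ffmpeg) is available; neither tool is part of Waterflai:

```python
# Illustrative sketch: normalize an arbitrary recording to a supported format
# before passing it to the STT Node.
from pydub import AudioSegment

audio = AudioSegment.from_file("raw_recording.m4a")

# Mono at 16 kHz is usually sufficient for speech and keeps files small.
audio = audio.set_channels(1).set_frame_rate(16000)

# Export as MP3 for general use, or "wav"/"flac" when fidelity matters.
audio.export("prepared_recording.mp3", format="mp3", bitrate="64k")
```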

Common Issues

  1. File Format Issues

    • Unsupported audio formats

    • Corrupted audio files

    • Invalid file extensions

  2. Transcription Quality

    • Poor audio quality affecting accuracy

    • Background noise interference

    • Overlapping speech from multiple speakers

    • Heavy accents or dialects

  3. API Limitations

    • File size restrictions

    • Rate limiting (a pre-flight check and retry sketch follows below)
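
When you run into provider limits, a pre-flight size check and a simple retry with backoff usually resolve the problem. A minimal sketch, not Waterflai code, assuming a 25 MB ceiling (OpenAI's documented audio upload limit; other providers may differ):

```python
# Illustrative sketch: pre-check file size and retry a transcription call with
# exponential backoff when the provider rejects a request (e.g. rate limiting).
import os
import time

MAX_BYTES = 25 * 1024 * 1024  # assumed ceiling; adjust to your provider's limit


def transcribe_with_retry(path: str, transcribe, retries: int = 3):
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("Audio file exceeds the upload limit; split or re-encode it first.")
    for attempt in range(retries):
        try:
            return transcribe(path)     # any callable that performs the STT request
        except Exception:               # e.g. a rate-limit or transient API error
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)    # back off: 1 s, 2 s, 4 s, ...
```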
